infinite-dao / glean-cetaf-rdfs

Collect and glean RDF data in parallel of stable identifiers of the Consortium of European Taxonomic Facilities (CETAF) and prepare them for import into a SPARQL endpoint
GNU General Public License v3.0

Import Errors Royal Botanic Garden Kew (RBGK) #2

Open infinite-dao opened 1 year ago

infinite-dao commented 1 year ago

Summary of the import started on May 23rd, 2022:

| urilist | date time | notes |
| --- | --- | --- |
| urilist_RBGK_20220523_per_01x250000.txt | 20220523-1650 | Done. 249999 jobs took 2d 60h:5m:57s using 10 parallel connections, having URI-Errors: 1470 |
| urilist_RBGK_20220523_per_02x250000.txt | 20220526-0457 | Done. 250000 jobs took 0d 14h:41m:32s using 10 parallel connections, having URI-Errors: 349 |
| urilist_RBGK_20220523_per_03x250000.txt | 20220526-1939 | Done. 250000 jobs took 0d 19h:41m:04s using 10 parallel connections, having URI-Errors: 642 |
| urilist_RBGK_20220523_per_04x250000.txt | 20220527-1520 | Done. 164355 jobs took 0d 19h:18m:50s using 10 parallel connections, having URI-Errors: 643 |
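
The four runs can be totalled with a few lines of awk. This is only a minimal sketch and not part of the repository: summarize_import is a hypothetical helper name, and it assumes every summary line contains the tokens "<n> jobs" and "URI-Errors: <n>" as in the notes above.

```shell
#!/bin/sh
# summarize_import (hypothetical helper): sum jobs and URI errors from
# summary lines read on stdin and print the overall error rate.
summarize_import() {
  awk '
    {
      jobs = 0; errors = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "jobs") jobs = $(i - 1)            # the number precedes the word "jobs"
        if ($i == "URI-Errors:") errors = $(i + 1)   # the number follows "URI-Errors:"
      }
      total_jobs += jobs
      total_errors += errors
    }
    END {
      if (total_jobs > 0)
        printf "jobs=%d errors=%d rate=%.2f%%\n", total_jobs, total_errors, 100 * total_errors / total_jobs
    }'
}

# The four summary lines from this import:
summarize_import <<'EOF'
urilist_RBGK_20220523_per_01x250000.txt 20220523-1650 Done. 249999 jobs took 2d 60h:5m:57s using 10 parallel connections, having URI-Errors: 1470
urilist_RBGK_20220523_per_02x250000.txt 20220526-0457 Done. 250000 jobs took 0d 14h:41m:32s using 10 parallel connections, having URI-Errors: 349
urilist_RBGK_20220523_per_03x250000.txt 20220526-1939 Done. 250000 jobs took 0d 19h:41m:04s using 10 parallel connections, having URI-Errors: 642
urilist_RBGK_20220523_per_04x250000.txt 20220527-1520 Done. 164355 jobs took 0d 19h:18m:50s using 10 parallel connections, having URI-Errors: 643
EOF
```

For these four lists it prints jobs=914354 errors=3104 rate=0.34%, so roughly 1 URI in 300 was counted as a URI error.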

Cases of Error-Codes in detail: Thread-XX_specimens.kew.org_20220523_all_error.log

Count CSPP-Domain-Pattern Error Codes
2 http://specimens.kew.org/herbarium/CETAF-ID... Codes: ERROR: 404 Not Found;
1 http://specimens.kew.org/herbarium/CETAF-ID... Codes: ERROR: Read error (Connection reset by peer) in headers.;OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
1 http://specimens.kew.org/herbarium/CETAF-ID... Codes: ERROR: Read error (Connection reset by peer) in headers.;OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
15 http://specimens.kew.org/herbarium/CETAF-ID... Codes: ERROR: Read error (Connection reset by peer) in headers.;OK: 303 See Other;OK: 200 OK;
4 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: 404 /herbcat/rdfQuery.do;
223 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: 500 Internal Server Error;
744 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: 503 Service Unavailable;
6 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: No data received.;ERROR: No data received.;OK: 200 OK;
8 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: No data received.;OK: 200 OK;
4 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection reset by peer) in headers.;ERROR: Read error (Connection reset by peer) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
2 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
1 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
8 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
9 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
3 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
10 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
10 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
1 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;
3 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
2 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
1 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
1 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
7 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
79 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
1959 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
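
Because each row lists the whole chain of attempts, it helps to group the rows by the last code in the chain, which shows whether a URI eventually succeeded. A minimal sketch, assuming rows shaped like "count URI Codes: a;b;c;" as in the table above; final_code_totals is a hypothetical helper name, not a script from the repository.

```shell
#!/bin/sh
# final_code_totals (hypothetical helper): read "count URI Codes: a;b;c;" rows
# on stdin and sum the counts per final code of each chain.
final_code_totals() {
  awk '
    {
      count = $1
      line = $0
      sub(/^.*Codes: /, "", line)   # keep only the code chain
      sub(/;[ \t]*$/, "", line)     # drop the trailing semicolon
      n = split(line, codes, ";")
      totals[codes[n]] += count     # the last attempt decides the outcome
    }
    END { for (f in totals) printf "%d %s\n", totals[f], f }
  ' | LC_ALL=C sort -k2
}

# Five rows copied from the table above:
final_code_totals <<'EOF'
2 http://specimens.kew.org/herbarium/CETAF-ID... Codes: ERROR: 404 Not Found;
223 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: 500 Internal Server Error;
744 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: 503 Service Unavailable;
2 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;ERROR: 503 Service Unavailable;
1959 http://specimens.kew.org/herbarium/CETAF-ID... Codes: OK: 303 See Other;ERROR: Read error (Connection timed out) in headers.;OK: 200 OK;
EOF
```

Summing all 25 rows of the table this way gives 2078 URIs that eventually returned 200 OK, 796 that ended in 503 Service Unavailable, 223 in 500 Internal Server Error, 6 in a 404 (2 plus 4 across the two 404 variants), and 1 that never got past connection timeouts. Together that is 3104, matching the URI-Errors reported in the import summary.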

Summary 2022-06-01

Decision:

Check the URIs that returned 500 Internal Server Error.

Some tested:

infinite-dao commented 1 year ago

Checking for http://specimens.kew.org/herbarium/K000993285
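
One way to check a single identifier is to request it with the RDF Accept header and print only the chain of HTTP status lines, which mirrors the "Codes:" column of the error log. A minimal sketch, assuming GNU wget; status_chain and check_uri are hypothetical helper names, and the headers in the offline demo are made up for illustration, not captured output.

```shell
#!/bin/sh
# status_chain (hypothetical helper): reduce `wget --server-response` output
# read on stdin to the status lines, joined as "code;code;...".
status_chain() {
  tr -d '\r' |
    sed -n 's|^[[:space:]]*HTTP/[0-9.]*[[:space:]]\{1,\}\([0-9]\{3\}.*\)$|\1|p' |
    paste -sd ';' -
}

# check_uri (hypothetical helper): fetch one URI as RDF/XML, discard the body,
# and print the status chain. wget writes response headers to stderr, hence 2>&1.
check_uri() {
  wget --server-response --max-redirect 4 --tries=1 \
    --header='Accept: application/rdf+xml' -O /dev/null "$1" 2>&1 | status_chain
}

# Offline demo with made-up headers shaped like wget's output:
status_chain <<'EOF'
  HTTP/1.1 303 See Other
  Location: http://apps.kew.org/herbcat/rdfQuery.do?barcode=K000993285
  HTTP/1.1 200 OK
  Content-Type: application/rdf+xml
EOF
```

Against the live server the call would be check_uri "http://specimens.kew.org/herbarium/K000993285"; a chain ending in 500 Internal Server Error would correspond to the 223 cases in the table.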

infinite-dao commented 1 year ago

The current error message for those URIs (e.g. http://apps.kew.org/herbcat/rdfQuery.do?barcode=K000993285) is:

HTTP Status 500 -
type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

javax.servlet.ServletException: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
  org.apache.struts.action.RequestProcessor.processException(RequestProcessor.java:535)
  org.apache.struts.action.RequestProcessor.processActionPerform(RequestProcessor.java:433)
  org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:236)
  org.apache.struts.action.ActionServlet.process(ActionServlet.java:1196)
  org.apache.struts.action.ActionServlet.doGet(ActionServlet.java:414)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
root cause

java.lang.RuntimeException: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
  org.kew.herbcat.data.RDFHandler.format(RDFHandler.java:56)
  org.kew.herbcat.data.RDFHandler.createRDF(RDFHandler.java:35)
  org.kew.herbcat.actions.RDFAction.downloadRDF(RDFAction.java:166)
  org.kew.herbcat.actions.RDFAction.execute(RDFAction.java:155)
  org.apache.struts.action.RequestProcessor.processActionPerform(RequestProcessor.java:431)
  org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:236)
  org.apache.struts.action.ActionServlet.process(ActionServlet.java:1196)
  org.apache.struts.action.ActionServlet.doGet(ActionServlet.java:414)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
root cause

javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
  org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:501)
  org.kew.herbcat.data.RDFHandler.format(RDFHandler.java:52)
  org.kew.herbcat.data.RDFHandler.createRDF(RDFHandler.java:35)
  org.kew.herbcat.actions.RDFAction.downloadRDF(RDFAction.java:166)
  org.kew.herbcat.actions.RDFAction.execute(RDFAction.java:155)
  org.apache.struts.action.RequestProcessor.processActionPerform(RequestProcessor.java:431)
  org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:236)
  org.apache.struts.action.ActionServlet.process(ActionServlet.java:1196)
  org.apache.struts.action.ActionServlet.doGet(ActionServlet.java:414)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
root cause

org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
  org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
  org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:484)
  org.kew.herbcat.data.RDFHandler.format(RDFHandler.java:52)
  org.kew.herbcat.data.RDFHandler.createRDF(RDFHandler.java:35)
  org.kew.herbcat.actions.RDFAction.downloadRDF(RDFAction.java:166)
  org.kew.herbcat.actions.RDFAction.execute(RDFAction.java:155)
  org.apache.struts.action.RequestProcessor.processActionPerform(RequestProcessor.java:431)
  org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:236)
  org.apache.struts.action.ActionServlet.process(ActionServlet.java:1196)
  org.apache.struts.action.ActionServlet.doGet(ActionServlet.java:414)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)
  org.kew.servlet.filter.TagSwapFilter.doFilter(Unknown Source)

infinite-dao commented 1 year ago

http://apps.kew.org/herbcat/rdfQuery.do?barcode=K000993285 works again, e.g.:

wget --header='Accept: application/rdf+xml' \
  --no-check-certificate \
  --max-redirect 4 \
  -O - "http://specimens.kew.org/herbarium/K000993285" \
  > "K000993285.rdf"
head K000993285.rdf
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dwc="http://rs.tdwg.org/dwc/terms/" xmlns:dwcc="http://rs.tdwg.org/dwc/curatorial/" xmlns:dc="http://purl.org/dc/terms/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:dwciri="http://rs.tdwg.org/dwc/iri/">
<!--This is metadata about this metadata document-->
<rdf:Description rdf:about="http://specimens.kew.org/herbarium/K000993285">
<dc:creator>RBG Kew Science</dc:creator>
<dc:created>2023-01-25 14:22:47+0000</dc:created>
<dc:hasVersion rdf:resource="http://apps.kew.org/herbcat/detailsQuery.do?barcode=K000993285"/>
</rdf:Description>
<!--This is metadata about this specimen-->
<rdf:Description rdf:about="http://specimens.kew.org/herbarium/K000993285">

Converting the file to TriG with the Apache Jena trig command gives a more compact, prefix-based rendering that is easier to read:

/opt/jena-fuseki/import-sandbox/bin/apache-jena-4.4.0/bin/trig --formatted=trig K000993285.rdf
@prefix dc:     <http://purl.org/dc/terms/> .
@prefix dwc:    <http://rs.tdwg.org/dwc/terms/> .
@prefix dwcc:   <http://rs.tdwg.org/dwc/curatorial/> .
@prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
@prefix geo:    <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xs:     <http://www.w3.org/2001/XMLSchema> .
@prefix xsi:    <http://www.w3.org/2001/XMLSchema-instance> .

<http://specimens.kew.org/herbarium/K000993285>
        rdf:type               dwc:PreservedSpecimen ;
        dc:created             "2023-01-25 14:22:47+0000" , "12/1917" ;
        dc:creator             "RBG Kew Science" , "Buchtien, O." ;
        dc:description         "A herbarium specimen of Anemia australis (Mickel) M.Kessler & A.R.Sm. collected by Buchtien, O. #794" ;
        dc:hasVersion          <http://apps.kew.org/herbcat/detailsQuery.do?barcode=K000993285> ;
        dc:identifier          "K000993285" ;
        dc:license             "https://creativecommons.org/licenses/by/4.0/" ;
        dc:publisher           <https://www.kew.org> ;
        dc:title               "Anemia australis (Mickel) M.Kessler & A.R.Sm.Buchtien, O.794" ;
        dc:type                "PhysicalObject" ;
        dwciri:inCollection    <http://biocol.org/urn:lsid:biocol.org:col:15867> ;
        dwc:basisOfRecord      "PreservedSpecimen" ;
        dwc:catalogNumber      "K000993285" ;
        dwc:collectionCode     "K" ;
        dwc:collectionDate     "19171200" ;
        dwc:country            "Bolivia" ;
        dwc:family             "Schizaeaceae" ;
        dwc:genus              "Anemia" ;
        dwc:institutionCode    "http://biocol.org/urn:lsid:biocol.org:col:15867" ;
        dwc:locationRemarks    "Yungas" ;
        dwc:recordNumber       "794" ;
        dwc:recordedBy         "Buchtien, O." ;
        dwc:sampleID           "http://specimens.kew.org/herbarium/K000993285" ;
        dwc:scientificName     "Anemia australis (Mickel) M.Kessler & A.R.Sm." ;
        dwc:specificEpithet    "australis" ;
        dwc:verbatimElevation  "1300.0 m" ;
        owl:sameAs             <http://specimens.kew.org/herbarium/K000993285> .