infinite-dao / glean-cetaf-rdfs

Collect and glean RDF data in parallel of stable identifiers of the Consortium of European Taxonomic Facilities (CETAF) and prepare them for import into a SPARQL endpoint
GNU General Public License v3.0

PHP Notice in RDF fields blocks Apache Jena’s processing #11

Open infinite-dao opened 1 year ago

infinite-dao commented 1 year ago

Hej-hej,

Some PHP Notice messages embedded in the RDF output block Apache Jena’s processing, and with it the import, e.g.:

https://data.rbge.org.uk/service/rdf/herb.php?barcode=E00004459

40          <dc:title><br />
41  <b>Notice</b>:  Undefined property: stdClass::$current_name_plain_ni in <b>/var/www/html/service/rdf/herb.php</b> on line <b>72</b><br />
42  Chagas E Silva, F. #01423 </dc:title>

Running ${apache_jena_bin}/rdfxml --validate reported problems caused by the inserted PHP Notice messages, e.g. in this log excerpt:

Validate Thread-1_data.rbge.org.uk_20221107-1729_modified.rdf.gz :: 
17:34:42 WARN  riot :: [line: 40, col: 25] {W104} Unqualified typed nodes are not allowed. Type treated as a relative URI.
17:34:42 WARN  riot :: [line: 40, col: 25] {W136} Relative URIs are not permitted in RDF: specifically <br>
17:34:42 ERROR riot :: [line: 41, col: 4 ] {E201} Multiple children of property element
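Before running the validation, the harvested files could be screened for such messages. The following is only a minimal sketch: the function name `count_php_messages` is hypothetical, the pattern mirrors the layout of the Notice shown above, and it assumes GNU gzip, where `-cdf` passes uncompressed input through unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): count the lines that carry
# a formatted PHP message such as "<b>Notice</b>: ... on line <b>72</b>".
# Works on plain or gzip-compressed files (assumes GNU gzip for -cdf).
count_php_messages() {
  # The exit status follows grep: non-zero when no line matched.
  gzip -cdf -- "$1" | grep -c -E '<b>[^<>]+</b>:.*on line <b>[0-9]+</b>'
}
```

Calling `count_php_messages Thread-1_data.rbge.org.uk_20221107-1729_modified.rdf.gz` would then print the number of affected lines, so that files with a count above zero can be set aside before validation.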

One solution (edit) would be to simply remove these messages, e.g. with sed, although this unfortunately does not restore the missing information:

# remove formatted PHP messages (Notice, Warning, and so on)
sed --regexp-extended \
 --null-data \
 's@<br */*>[\n\r]*<b>[^<>\n]+</b>:[^\n]*on line <b>[[:digit:]]+</b><br />[\n\r]*@@g' \
 RDF-file.rdf

# replace formatted PHP messages (Notice, Warning, and so on) with ??
sed --regexp-extended \
 --null-data \
 's@<br */*>[\n\r]*<b>[^<>\n]+</b>:[^\n]*on line <b>[[:digit:]]+</b><br />[\n\r]*@??@g' \
 RDF-file.rdf

infinite-dao commented 1 year ago

Compared with previously harvested data, the missing piece is most likely the specimen name in the title. So the data cannot really be restored by cleaning: this piece of information is lost through the failure, and it would be good to have the functionality restored.

rogerhyam commented 1 year ago

This should be fixed now. Let me know of other errors.

infinite-dao commented 1 year ago

Right now I get no http_code that indicates an error. Good so far.

Is it almost fixed? … ( ;-) … It seems that every <http://purl.org/dc/terms/description> still contains a stray ?>, e.g. https://data.rbge.org.uk/service/rdf/herb.php?barcode=E00018110

<dc:description>A herbarium specimen of indet.  ?></dc:description>

This does not throw any error, but it looks like something left over from programming.

Edit: another example http://data.rbge.org.uk/herb/E00968698 (https://data.rbge.org.uk/service/rdf/herb.php?barcode=E00968698)

<dc:title> Sanguisorba minor subsp. muricata (Spach) Briq.</dc:title> 
<dc:description>A herbarium specimen of Sanguisorba minor subsp. muricata (Spach) Briq.  ?></dc:description>
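Until this is fixed at the source, the stray `?>` could be stripped on the harvesting side. This is only a sketch under assumptions: the function name `strip_stray_php_close` is hypothetical, and the pattern relies on the `?>` sitting directly before the closing `</dc:description>` tag, as in the two examples above.

```shell
#!/bin/sh
# Hypothetical helper: drop a stray "?>" (plus surrounding whitespace) that sits
# directly before </dc:description>. Other "?>" occurrences, such as the one
# ending the XML declaration, are left untouched.
strip_stray_php_close() {
  # Reads the given files, or standard input when no file is given.
  sed -E 's@[[:space:]]*\?>[[:space:]]*(</dc:description>)@\1@' "$@"
}
```

For example, `strip_stray_php_close RDF-file.rdf` writes the cleaned RDF to standard output and leaves the input file unchanged.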

Whether a URI appears in the GBIF occurrence data can be checked from the browser’s JavaScript console:

// e.g. in JavaScript
var cspp_uri_encoded = "http://data.rbge.org.uk/herb/E00968698".replace(/\//g,"~2F"), 
  // you can try the advanced search
  url = 'https://www.gbif.org/occurrence/search?occurrence_id=' + cspp_uri_encoded + '&advanced=1';

url // return url OR try to open it
// 'https://www.gbif.org/occurrence/search?occurrence_id=http:~2F~2Fdata.rbge.org.uk~2Fherb~2FE00968698&advanced=1'
window.open(url, '_blank');

infinite-dao commented 1 year ago

All right, after personal communication: we will also wait until the field dwciri:recordedBy is present again. It needs time to be curated properly, perhaps a few months. dwciri:recordedBy [^1] is the link that brings the collectors of all the different institutes together in one query, which is why we wait until it is there again.

[^1]: see also https://dwc.tdwg.org/rdf/#25-terms-in-the-dwciri-namespace-normative