plazi / treatments-rdf

The treatments as RDF in Turtle

spaces and or illegal characters in IRIs or values #8

Closed jhpoelen closed 7 months ago

jhpoelen commented 1 year ago

As I was indexing a versioned copy of treatments-rdf using a Nomer development version today, I noticed the following warning messages that suggest some of the IRIs produced by Plazi's treatments-rdf may need some attention.

You can find the provenance of the treatments-rdf copy via:

$ nomer properties | grep preston
nomer.preston.dir=
nomer.preston.remotes=https://zenodo.org/record/7196029/files
nomer.preston.version=hash://sha256/b3742bf43d9da0a8ed5522659199f47d68d31aaf46c90381190f324c1ac143f2

or

Poelen, Jorrit H. (2022). Nomer Corpus of Taxonomic Resources hash://sha256/b3742bf43d9da0a8ed5522659199f47d68d31aaf46c90381190f324c1ac143f2 hash://md5/26a9b6c796567b3985e8bfe750ea2341 (0.7) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7196029

[main] WARN org.apache.jena.riot - [line: 58, col: 1 ] Bad IRI: <http://taxon-concept.plazi.org/id/Animalia/Tatargina_picta_Walker_[1865] 1864> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 213, col: 30] Bad IRI: <http://taxon-concept.plazi.org/id/Animalia/Tatargina_picta_Walker_[1865] 1864> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 24, col: 1 ] Bad IRI: <http://taxon-concept.plazi.org/id/Animalia/Aphyocharacinae]_Eigenmann_1909> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 25, col: 22] Bad IRI: <http://taxon-name.plazi.org/id/Animalia/Aphyocharacinae]> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 180, col: 1 ] Bad IRI: <http://taxon-name.plazi.org/id/Animalia/Aphyocharacinae]> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 320, col: 16] Bad IRI: <http://taxon-concept.plazi.org/id/Animalia/Aphyocharacinae]_Eigenmann_1909> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
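
One way to avoid warnings like these would be to percent-encode each label before it is placed in the IRI path. The following is a minimal sketch, not the generator's actual code: `encode_segment` and `taxon_concept_iri` are hypothetical helper names, and the base IRI is simply copied from the warnings above.

```python
from urllib.parse import quote


def encode_segment(label: str) -> str:
    """Percent-encode one IRI path segment.

    With safe="" only ASCII letters, digits and "_.-~" are left untouched;
    spaces, brackets, quotes, commas and "/" are all percent-encoded.
    """
    return quote(label, safe="")


def taxon_concept_iri(kingdom: str, label: str,
                      base: str = "http://taxon-concept.plazi.org/id/") -> str:
    # Hypothetical helper: the base IRI is taken from the log output above,
    # and splitting the name into kingdom and label is an assumption.
    return base + encode_segment(kingdom) + "/" + encode_segment(label)


if __name__ == "__main__":
    print(taxon_concept_iri("Animalia", "Tatargina_picta_Walker_[1865] 1864"))
    print(taxon_concept_iri("Animalia", "Aphyocharacinae]_Eigenmann_1909"))
```

Note that encoding changes the identifier itself, so whether to encode or to clean up the label upstream is a design decision for the data publisher.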

jhpoelen commented 1 year ago

Here are some more warnings:

[main] WARN org.apache.jena.riot - [line: 323, col: 23] Bad IRI: <http://taxon-name.plazi.org/id/Animalia/[unassigned]_Caenogastropoda> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 332, col: 1 ] Bad IRI: <http://taxon-name.plazi.org/id/Animalia/[unassigned]_Caenogastropoda> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 107, col: 245] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358984> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 107, col: 341] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358985> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 107, col: 437] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358987> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 107, col: 533] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358988> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 107, col: 629] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/SMF 358986> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 134, col: 1 ] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358984> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 142, col: 1 ] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358985> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 145, col: 25] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 146, col: 26] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 150, col: 1 ] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358987> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 158, col: 1 ] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/INDEX19, SMF 358988> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 166, col: 1 ] Bad IRI: <http://treatment.plazi.org/id/03D08794FFD1FFEBECE2968258F6FF38/SMF 358986> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 169, col: 25] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 170, col: 26] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 36, col: 1 ] Bad IRI: <http://taxon-concept.plazi.org/id/Animalia/Indolestes_sp_"o"_Fraser_1922> Code: 4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
[main] WARN org.apache.jena.riot - [line: 37, col: 22] Bad IRI: <http://taxon-name.plazi.org/id/Animalia/Indolestes_sp_"o"> Code: 4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
[main] WARN org.apache.jena.riot - [line: 84, col: 1 ] Bad IRI: <http://taxon-name.plazi.org/id/Animalia/Indolestes_sp_"o"> Code: 4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
[main] WARN org.apache.jena.riot - [line: 125, col: 20] Bad IRI: <http://taxon-concept.plazi.org/id/Animalia/Indolestes_sp_"o"_Fraser_1922> Code: 4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
[main] WARN org.apache.jena.riot - [line: 115, col: 23] Bad IRI: <http://treatment.plazi.org/id/03D2AB06FFCE5139F8EC2395DCE9E30A/MHNC 13906> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 115, col: 105] Bad IRI: <http://treatment.plazi.org/id/03D2AB06FFCE5139F8EC2395DCE9E30A/MHNC 13947, MHNC 8270, MHNC 13933, MHNC 13935> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 118, col: 1 ] Bad IRI: <http://treatment.plazi.org/id/03D2AB06FFCE5139F8EC2395DCE9E30A/MHNC 13906> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 121, col: 25] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 122, col: 26] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 126, col: 1 ] Bad IRI: <http://treatment.plazi.org/id/03D2AB06FFCE5139F8EC2395DCE9E30A/MHNC 13947, MHNC 8270, MHNC 13933, MHNC 13935> Spaces are not legal in URIs/IRIs.
[main] WARN org.apache.jena.riot - [line: 129, col: 25] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 130, col: 26] Lexical form '' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 132, col: 25] Lexical form '28.6160000°' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 133, col: 26] Lexical form '032.2931667°' not valid for datatype XSD decimal

jhpoelen commented 1 year ago

Some more . . .

[main] WARN org.apache.jena.riot - [line: 135, col: 25] Lexical form '−21.671' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 149, col: 25] Lexical form '−12.068' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 159, col: 25] Lexical form '−29.550' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 293, col: 25] Lexical form '−21,668' not valid for datatype XSD decimal
[main] WARN org.apache.jena.riot - [line: 294, col: 26] Lexical form '34,847' not valid for datatype XSD decimal

It looks like the Unicode minus sign "−" (U+2212) is used instead of the ASCII hyphen-minus "-". Also, commas, which appear to be used here as decimal separators, are not allowed in XSD decimal.
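
To illustrate, here is a minimal sketch of how the lexical forms reported in this thread could be normalized before being typed as xsd:decimal. The function name `normalize_decimal` is hypothetical, and treating a lone comma as a decimal separator is an assumption that seems reasonable for coordinates but would be wrong for thousands separators.

```python
import re
from typing import Optional

# Lexical space of xsd:decimal: optional sign, ASCII digits, optional fraction.
XSD_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def normalize_decimal(raw: str) -> Optional[str]:
    """Return a valid xsd:decimal lexical form, or None if none can be derived."""
    text = raw.strip()
    # Replace the Unicode minus sign (U+2212) with the ASCII hyphen-minus.
    text = text.replace("\u2212", "-")
    # Drop a trailing degree sign, as seen in '28.6160000°'.
    text = text.rstrip("°").strip()
    # Assumption: a single comma with no dot is a decimal comma ('34,847').
    if text.count(",") == 1 and "." not in text:
        text = text.replace(",", ".")
    return text if XSD_DECIMAL.fullmatch(text) else None


if __name__ == "__main__":
    for value in ["−21.671", "−21,668", "34,847", "032.2931667°", ""]:
        print(repr(value), "->", normalize_decimal(value))
```

When the function returns None, as it does for the empty string, it would probably be better to omit the triple than to emit an empty literal.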

nleanba commented 7 months ago

There has been a major overhaul of the XML-to-RDF generator (https://github.com/plazi/gg2rdf).

I'm pretty sure that it should have resolved this issue; reopen it if this is not the case.

jhpoelen commented 7 months ago

@nleanba I went through the effort of giving you a detailed description of the non-conforming Plazi RDF. I suggest you check rather than assume that the issue has been resolved. Please re-open this issue and close it only when you have evidence that the issue no longer occurs.

fyi @myrmoteras

nleanba commented 7 months ago

I have now run

find '.' -type f -name '*.ttl' -exec bash -c '~/Downloads/apache-jena-5.0.0-rc1/bin/riot --validate "$0" || echo "$0"' '{}' \;

on a current checkout of this repository.

It has not found anything so far, but it is also progressing very slowly through the files. I will let it run overnight and report if it finds anything.

(It does report an error if I manually add one of the above-mentioned malformed literals to one of the files, so the command seems to work)
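
If much of the time goes into starting one riot process per file (an assumption, not something measured here), validating files in batches might help. The sketch below assumes that `riot --validate` accepts several files per invocation and exits non-zero when any of them fails, which is the same exit-code behavior the command above relies on. `riot_ok` and `find_invalid` are hypothetical helper names.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence


def riot_ok(files: Sequence[str], riot: str = "riot") -> bool:
    """Return True if `riot --validate` exits with status 0 for the given files.

    Assumes the riot executable is on PATH and accepts multiple file arguments.
    """
    result = subprocess.run([riot, "--validate", *files], capture_output=True)
    return result.returncode == 0


def find_invalid(files: Iterable[str], batch_size: int = 200,
                 check: Callable[[Sequence[str]], bool] = riot_ok) -> List[str]:
    """Validate files in batches; re-check a failing batch file by file."""
    files = list(files)
    invalid: List[str] = []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        if check(batch):
            continue  # the whole batch passed in a single invocation
        # At least one file failed: check each file alone to find which.
        invalid.extend(f for f in batch if not check([f]))
    return invalid
```

A possible call would be `find_invalid(str(p) for p in Path("data").rglob("*.ttl"))` with `from pathlib import Path`. The expected speed-up comes from clean batches needing only one process, so it depends on errors being rare.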

jhpoelen commented 7 months ago

@nleanba thanks for making the effort to check that the illegal characters and spaces no longer occur.

I am looking forward to hearing your results.

nleanba commented 7 months ago

The process of validating every file is very slow. It is only at data/0C/62/.. by now (going alphabetically: 00, 0A, 0B, 0C, ..!)

It has so far found one (different) error, where there was a single \ in a string somewhere. I have now updated gg2rdf to better escape string literals.

nleanba commented 7 months ago

The validation process is now running on the gg2rdf server. See https://gg2rdf.ld.plazi.org/workdir/log/validate/current for the file it is currently at and https://gg2rdf.ld.plazi.org/workdir/log/validate/log for all files it has found errors in so far (none yet).

jhpoelen commented 7 months ago

@nleanba curious to hear the final result

nleanba commented 7 months ago

Well, it'll be done in approximately 2.5 months if the current speed holds. We'll see then whether it found anything.

jhpoelen commented 7 months ago

@nleanba interesting - thanks for sharing the verification rates. If you can't quickly verify the data products produced, what is your long-term plan to ensure the integrity of the data?