ozekik / lightrdf

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3
Apache License 2.0

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92 #9

Open plasticfist opened 2 years ago

plasticfist commented 2 years ago

First, I have to say nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata ttl.

Details

wikidata's file latest-all.ttl

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

--- nearby lines from file, including problematic line, if my sed is correct

sed -n '3135042,3135046p;3135047q' latest-all.ttl

ref:8f38c16e1f141b68f172b65f48b0982234890e56 a wikibase:Reference ;
    pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;
    pr:P813 "2020-03-12T00:00:00Z"^^xsd:dateTime ;
    prv:P813 v:ae909ef12942e232eea24326bdd78c8e ;

To reproduce: wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz, gunzip it, and parse the file.

Note: these are big files. That is another strength of lightrdf: it can process large files without requiring the entire data set to fit in RAM.


Parse the file `revisions_lang=en_uris.ttl`.
For each .ttl in this dataset, I'm simply dumping the triples to a file like this:
import lightrdf

# input_file (path to the .ttl) and f_triples (an open output file) are defined elsewhere
parser = lightrdf.Parser()
triples = parser.parse(str(input_file), format="ttl", base_iri=None)
for (s, p, o) in triples:
    f_triples.write(f"{s}\t{p}\t{o}\n")


Other notes:
- If there is a way to use try/except to move past the offending line, I would like to know how that is done.
- Environment: Ubuntu 20.04, conda env, pip install lightrdf

ozekik commented 2 years ago

Thank you for reporting!

The problem is that #Crew?oldid=2476206#Command_crew in <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> is, strictly speaking, an invalid IRI part: the fragment that starts at the first # contains a second, unescaped # (and therefore the document is, in a precise sense, invalid RDF). Some libraries, such as rdflib, just ignore it, but Rio (the Rust RDF library behind lightrdf) is strict and raises an exception.

As the resume-after-exception feature is WIP in Rio, I think a possible workaround for now is to fix invalid IRIs before parsing, like:

sed -r 's/([^#]*)#/\1%23/2g' latest-all.ttl

(This replaces the second and every later # on each line with %23. Use -i to replace in place, and gsed on Mac.)
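One caveat with a per-line substitution is that it counts every # on the line, so a line holding several IRIs that each contain a single # would be altered too. A minimal Python sketch of the same idea applied per IRI is below. It assumes that IRIs appear as `<...>` with no whitespace inside, and it would also touch `<...>` sequences inside string literals, so treat it as a starting point rather than a complete fix. The names `IRIREF` and `escape_extra_hashes` are made up for this illustration.

```python
import re

# Matches an IRI reference in angle brackets, as written in Turtle and N-Triples.
# Assumption: no whitespace or nested angle brackets inside the IRI.
IRIREF = re.compile(r"<([^<>\s]*)>")


def escape_extra_hashes(line: str) -> str:
    """Percent-encode every '#' after the first one inside each <...> IRI."""

    def fix(match):
        # Split at the first '#'; anything after it is the fragment.
        head, sep, fragment = match.group(1).partition("#")
        return "<" + head + sep + fragment.replace("#", "%23") + ">"

    return IRIREF.sub(fix, line)


line = "    pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;"
print(escape_extra_hashes(line))
```

To process a whole dump without loading it into RAM, the function can be applied line by line while copying from an input file object to an output file object.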

plasticfist commented 2 years ago

Thank you for the quick response, this is very helpful. I'm usually hesitant to manually patch source files, but I agree this might be the best fix for the moment (thank you for the sed as well). I'm still looking at the DBpedia ttls; lightrdf throws an error with that dataset as well, which I can't make sense of. At first I thought the problem was that their .ttl files weren't actually in Turtle format, but as I start to review the spec, maybe they are Turtle (just a bare, lazy dump with no prefixes)? I'm still looking, and trying to convert back and forth to other formats (e.g. with rapper).

ozekik commented 2 years ago

As I understand it, huge RDF datasets tend to be more or less malformed. In my opinion, if an N-Triples file is available, it is easier than Turtle for finding and "patching" problems and tracking the changes.
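Because N-Triples puts one triple on each line, suspicious lines can be set aside without touching the rest of the file. The sketch below illustrates that idea only. Its two checks (more than one # in an IRI, or a % not followed by two hex digits) are heuristics picked from the errors in this thread, not a full IRI validator, and `looks_suspicious` and `split_lines` are hypothetical helper names. The sample IRIs use example.org and are invented.

```python
import re

# IRI references in angle brackets; assumes no whitespace inside the IRI.
IRIREF = re.compile(r"<([^<>\s]*)>")
# A '%' that does not start a valid two-hex-digit escape.
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def looks_suspicious(iri: str) -> bool:
    """Heuristic check only: repeated '#' or a malformed percent escape."""
    return iri.count("#") > 1 or BAD_PERCENT.search(iri) is not None


def split_lines(lines):
    """Return (kept, rejected); rejected holds (line_number, line) pairs."""
    kept, rejected = [], []
    for number, line in enumerate(lines, start=1):
        if any(looks_suspicious(iri) for iri in IRIREF.findall(line)):
            rejected.append((number, line))
        else:
            kept.append(line)
    return kept, rejected


sample = [
    "<http://example.org/s> <http://example.org/p#q> <http://example.org/o> .",
    "<http://example.org/s> <http://example.org/p> <http://example.org/o#a#b> .",
    "<http://example.org/s> <http://example.org/p> <http://example.org/o#34.7%> .",
]
kept, rejected = split_lines(sample)
print(len(kept), "kept;", [number for number, _ in rejected], "set aside")
```

Writing the rejected lines with their line numbers to a separate file gives a record of what was removed or needs patching, which is the "track the changes" part.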

plasticfist commented 2 years ago

Here is the (first) DBpedia issue (a .ttl file, but is it Turtle?), for reference:

../dbpedia/ttl/revisions_lang=en_uris.ttl lightrdf.Error: error while parsing IRI 'http://dbpedia.org/resource/󠄀': Invalid IRI code point '󠄀' on line 19841225 at position 35

$ sed -n '19841223,19841227p;19841228q' revisions_lang=en_uris.ttl
<http://dbpedia.org/resource/𨳒> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𨳒?oldid=786024110&ns=0> .
<http://dbpedia.org/resource/𩧢> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𩧢?oldid=951071761&ns=0> .
<http://dbpedia.org/resource/󠄀> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄀?oldid=949255578&ns=0> .
<http://dbpedia.org/resource/󠄁> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄁?oldid=949255580&ns=0> .
<http://dbpedia.org/resource/󠄂> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄂?oldid=949255609&ns=0> .

I'm including a screen capture, because the terminal seems to give more information about the characters in these 5 lines.

[image]
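When characters are invisible or render differently between tools, listing their code points can be more reliable than a screenshot. Below is a small sketch using only Python's standard `unicodedata` module. The helper name `describe_non_ascii` is made up, and the sample string is an assumption: it uses an explicit escape for a variation selector so that the character is visible in the source, and it is not copied from the DBpedia file.

```python
import unicodedata


def describe_non_ascii(text: str):
    """Return (code point, Unicode name) for each non-ASCII character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# Assumed sample, written with an escape so the invisible character can be seen here.
sample = "http://dbpedia.org/resource/\U000E0100"
for code_point, name in describe_non_ascii(sample):
    print(code_point, name)
```

Running the same helper on the IRI quoted in the error message should show which code point the parser is rejecting.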

djstrong commented 1 year ago

I have tried this sed solution while parsing Wikidata, but:

lightrdf.Error: error while parsing IRI 'http://archive.is/EKEWo#34.7%': Invalid IRI percent encoding '%' on line 49533684 at position 41

Another:

lightrdf.Error: error while parsing language tag 'zh-classical': A subtag may be eight characters in length at maximum on line 59030363 at position 69

:(
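For the percent-encoding error, one possible pre-processing step in the same spirit as the sed workaround is to encode any % that does not start a valid two-hex-digit escape as %25. This is only a sketch under the assumption that IRIs appear as `<...>` with no whitespace inside; `escape_stray_percents` is a hypothetical helper, and it does nothing about the language tag error.

```python
import re

# IRI references in angle brackets; assumes no whitespace inside the IRI.
IRIREF = re.compile(r"<([^<>\s]*)>")
# A '%' that is not followed by exactly the two hex digits an escape needs.
STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_stray_percents(line: str) -> str:
    """Rewrite stray '%' characters inside each <...> IRI as '%25'."""
    return IRIREF.sub(
        lambda m: "<" + STRAY_PERCENT.sub("%25", m.group(1)) + ">", line
    )


print(escape_stray_percents("<http://archive.is/EKEWo#34.7%>"))
```

Valid escapes such as %20 are left untouched, so the step can be run more than once without changing the result.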