rdfhdt / hdt-cpp

HDT C++ Library and Tools

rdf2hdt parse error (despite the file parses with serdi) #209

Open AxelPolleres opened 5 years ago

AxelPolleres commented 5 years ago

When I tried to create an HDT from the latest Wikidata dump, I stumbled over the following triple:

----------- oneliner.nt ------------

<http://www.wikidata.org/reference/250da9edffc9625b588245400ab612129878c232> <http://www.wikidata.org/prop/reference/P854> <www.stat.gov.pl/broker/access/performSearch.jspa?searchString=Janowo&level=miejsc&wojewodztwo=2222&powiat=6381&gmina=&miejscowosc=&advanced=true> .


When trying rdf2hdt here, I get the following:

$ rdf2hdt oneliner.nt onliner.hdt
error: oneliner.nt:1:140: bad IRI scheme char `2F'
Catch exception load: Error parsing input.
ERROR: Error parsing input.

even though serd seems to swallow it:

$ serdi oneliner.nt

works...

Any ideas where in the code I could try looking for a solution?

Thanks, Axel

AxelPolleres commented 5 years ago

As a quick fix: is there any option in the N-Triples parser that could be used to skip to the next line and ignore parse errors per line for .nt input? Or where/how could I add such an option?

wouterbeek commented 5 years ago

Hi Axel, could you be using an older version of Serd than the one that was compiled with HDT? The correct parsing of the IRI scheme component was added relatively recently to Serd. What's your output for serdi -v?

Notice that there is an option in Serd/Serdi to use lax parsing (-l), but it is probably not exposed through the HDT library ATM. Still, you can try lax parsing to a temporary file, and generate an HDT out of that temporary file as a workaround.

(And don't forget to email the Wikidata maintainers to explain to them that publishing an absolute IRI with no valid scheme component is not the Pedantic Way.)

AxelPolleres commented 5 years ago

Hi Axel, could you be using an older version of Serd than the one that was compiled with HDT? The correct parsing of the IRI scheme component was added relatively recently to Serd. What's your output for serdi -v?

0.28

On 22.07.2019, at 16:44, Wouter Beek notifications@github.com wrote:


Notice that there is an option in Serd/Serdi to use lax parsing (-l), but it is probably not exposed through the HDT library ATM.

That could be a workaround. It would be nice to expose this in rdf2hdt as well, if that works. Any idea where I need to start looking to add that?

Still, you can try lax parsing to a temporary file, and generate an HDT out of that temporary file as a workaround.

... hmmm, I thought I tried that, but need to check again.

(And don't forget to email the Wikidata maintainers to explain to them that publishing an absolute IRI with no valid scheme component is not the Pedantic Way.)

good point...

Axel


AxelPolleres commented 5 years ago

P.S.: as suspected by Wouter, lax parsing would remedy this issue (just tested on a local machine with serdi 0.30), so I'd seriously opt for adding the lax parsing option to rdf2hdt.

AxelPolleres commented 5 years ago

Looking at http://drobilla.net/docs/serd/, it says that non-strict parsing is set by default, so I am a bit confused now, as I cannot find anywhere in the code a call to

serd_reader_set_strict( ... )

Hmmm, any help or hints welcome. I have to admit I don't really understand the Serd interface and how HDT calls it. I suspect the place is libhdt/src/rdf/RDFParserSerd.cpp, but even if I add

serd_reader_set_strict( reader, false );

there, it doesn't change anything. Or does that interfere with the call to

serd_reader_set_error_sink

??

drobilla commented 5 years ago

The character is the / since that "URI" does indeed have no scheme (so this isn't valid NTriples), but I'm not sure why you would be seeing different behaviour here. I get a failure with serdi (current master), lax or not:

$ serdi -l ./test.nt 
error: ./test.nt:1:140: bad IRI scheme char `2F'

drobilla commented 5 years ago

Note that lax parsing is not a free lunch: it can drop triples. So enabling it by default might not be the best idea. Surely the web has suffered enough under that philosophy? :)

AxelPolleres commented 5 years ago

FWIW, I wasn't arguing for enabling lax parsing by default, but it still might be worthwhile to have the option.

(Anyway, I managed to create a new Wikidata HDT dump in the meantime, but am still looking for somewhere to host it (88GB HDT) ;-))

On 01.08.2019, at 21:23, David Robillard notifications@github.com wrote:

Note that lax parsing is not a free lunch: it can drop triples. So enabling it by default might not be the best idea. Surely the web has suffered enough under that philosophy? :)


mielvds commented 4 years ago

So after a short talk with @AxelPolleres, I was curious and gave the oneliner.ttl a try with serd 0.30.1. I just want to confirm that

So how to proceed? If we go for strict by default, we need to add serd_reader_set_strict( reader, true ); in libhdt/src/rdf/RDFParserSerd.cpp, for Serd versions >= 0.30 only.

However, what I don't understand is why I don't need -l when using serdi: it parses just fine without any bad IRI scheme char `2F' error (which will give you trouble in HDT afterwards anyway). Lax parsing also seems to have been around for much longer (~0.21.1 -> https://github.com/drobilla/serd/commit/d51be9b8d97791bff796d046d10fe16fd4e41311). So it seems there are two things going on here:

  1. the serd_reader_set_strict added in https://github.com/drobilla/serd/commit/d51be9b8d97791bff796d046d10fe16fd4e41311 protects against invalid characters, which are not what causes the error in oneliner.ttl, and that's why it doesn't change anything
  2. between 0.28 and 0.30, there was a change that allowed URIs without protocols to be parsed as correct.

@drobilla could you provide some insight here?

drobilla commented 4 years ago

@mielvds I think the discrepancy is because you are parsing it as .ttl there (serdi deduces the type from the extension if you don't provide it explicitly). This is different for Turtle and NTriples since it could be a URIRef in Turtle, but in NTriples it must be a URI (with a scheme).
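The reported error position can be reproduced with a small check of the scheme grammar from RFC 3986, where a scheme is an ASCII letter followed by letters, digits, `+`, `-`, or `.`, and is terminated by `:`. The sketch below is an illustration only and not Serd's actual code; `first_bad_scheme_char` is a hypothetical name. It returns the index of the first character that breaks the scheme, which for the object IRI in this issue is the `/` (0x2F) right after `www.stat.gov.pl`.

```cpp
#include <cstddef>
#include <string>

// ASCII-only character classes, as required by the RFC 3986 scheme grammar.
static bool is_ascii_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool is_ascii_digit(char c) {
    return c >= '0' && c <= '9';
}

// Hypothetical helper, not part of Serd.
// Returns std::string::npos if iri starts with a valid scheme followed by ':'.
// Otherwise returns the index of the first character that cannot be part of
// a scheme, or iri.size() if the string ends before any ':' is seen.
std::size_t first_bad_scheme_char(const std::string& iri) {
    if (iri.empty() || !is_ascii_alpha(iri[0])) {
        return 0;  // a scheme must start with a letter
    }
    for (std::size_t i = 1; i < iri.size(); ++i) {
        const char c = iri[i];
        if (c == ':') {
            return std::string::npos;  // scheme terminated correctly
        }
        if (!is_ascii_alpha(c) && !is_ascii_digit(c) &&
            c != '+' && c != '-' && c != '.') {
            return i;  // e.g. '/' (0x2F) appearing before any ':'
        }
    }
    return iri.size();
}
```

A check like this only applies where an absolute IRI is required, as in N-Triples. In Turtle the same string can be read as a relative reference, which matches the different behaviour seen for the .ttl file.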