Parsing nquads crashes for literals with language annotation

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. data = '<http://www.telegraphis.net/ontology/measurement/metric#SquareKilometre> <http://www.w3.org/2000/01/rdf-schema#label> "square kilometer"@en-US <http://telegraphis.net/ontology/measurement/metric> .'
2. g = ConjunctiveGraph()
3. g.parse(data=data, format='nquads')

What is the expected output? What do you see instead?
The data should parse, but I get the following error:

....
 File "build/bdist.macosx-10.4-x86_64/egg/rdflib_nquads.py", line 118, in parse
rdflib.plugins.parsers.ntriples.ParseError: Invalid line: '-US <http://telegraphis.net/ontology/measurement/metric> .'

Please use labels and text to provide additional information.

There seems to be a parser error. It should be relatively easy to fix (I 
suspect, but could be wrong). Alternatively, an "ignore bad lines" option could 
be supported. 

Cheers,
Cosmin

Original issue reported on code.google.com by cosmin.b...@gmail.com on 16 Jul 2011 at 7:43

GoogleCodeExporter commented 8 years ago

It seems that the language annotation 'en-US' is causing the problem; perhaps 
the '-' is the tricky part :]. Cosmin

Original comment by cosmin.b...@gmail.com on 16 Jul 2011 at 7:44

GoogleCodeExporter commented 8 years ago

I am confused - rdflib does not support nquads :)

But benosteen seems to have written a parser: 
https://github.com/benosteen/RDFLib-NQuads-parser/blob/master/rdflib_nquads.py

Merging this into rdflib would probably make sense.

Original comment by gromgull on 19 Aug 2011 at 12:21

GoogleCodeExporter commented 8 years ago

I added benosteen's nquads parser in an nquads branch, and also a serializer. 

http://code.google.com/p/rdflib/source/browse/?name=nquads

I haven't actually checked yet whether this fixes your issue - will do soon.

Original comment by gromgull on 20 Aug 2011 at 7:50

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

It doesn't fix the issue but, by my reading of the RDF runes, RDFLib is correct 
(pace a less-than-useful error message) in only permitting lowercase language 
tags because uppercase would seem to be contraindicated ...

The W3C's RDF testcases document 
(http://www.w3.org/TR/rdf-testcases/#langString) presents this BNF spec, to 
which the RDFLib ntriples parser currently conforms, and some clarifying rubric:

"""

literal         ::= langString | datatypeString
langString      ::= '"' string '"' ( '@' language )?
datatypeString  ::= '"' string '"' '^^' uriref
language        ::= [a-z]+ ('-' [a-z0-9]+ )*    (encoding a language tag)

...

optionally a language tag as defined by [RFC-3066], normalized to lowercase.

Note: The case normalization of language tags is part of the description of the 
abstract syntax, and consequently the abstract behaviour of RDF applications. 
It does not constrain an RDF implementation to actually normalize the case. 
Crucially, the result of comparing two language tags should not be sensitive to 
the case of the original input.

"""
To have the quad successfully parsed by the current code, the OP should simply 
lowercase the second subtag ("US") of the language tag:

"square kilometer"@en-us
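
If the input cannot be edited by hand, the same workaround could be applied as a preprocessing step before the data is handed to the parser. The following is only a minimal sketch: `lowercase_language_tags` is a hypothetical helper, not part of RDFLib, and it assumes one statement per line with double-quoted literals that use backslash escapes.

```python
import re

# A double-quoted literal (backslash escapes allowed) followed by "@" and a
# language tag in any letter case. Hypothetical pattern, for illustration only.
_LANG_LITERAL = re.compile(
    r'("(?:[^"\\]|\\.)*")@([A-Za-z]+(?:-[A-Za-z0-9]+)*)'
)


def lowercase_language_tags(line):
    """Return the line with every literal's language tag lowercased."""
    return _LANG_LITERAL.sub(
        lambda m: m.group(1) + "@" + m.group(2).lower(), line
    )


line = (
    '<http://www.telegraphis.net/ontology/measurement/metric#SquareKilometre> '
    '<http://www.w3.org/2000/01/rdf-schema#label> '
    '"square kilometer"@en-US '
    '<http://telegraphis.net/ontology/measurement/metric> .'
)
fixed = lowercase_language_tags(line)
```

Only the tag is touched; the literal's text and the surrounding URIs are left as they were.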

For RDFLib to accept uppercase language subtags, this small change to the 
ntriples.py "litinfo" regex would be needed ...

-litinfo = r'(?:@([a-z]+(?:-[a-z0-9]+)*)|\^\^' + uriref + r')?'
+litinfo = r'(?:@([a-z]+(?:-[A-Za-z0-9]+)*)|\^\^' + uriref + r')?'

I'm unsure what the ramifications of this latter change would be for any 
existing RDFLib-provided support for language tag comparisons.
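
To make the difference between the two patterns concrete, here is a small standalone sketch using only Python's `re` module. The `uriref` fragment below is a stand-in assumed for illustration (the real definition in ntriples.py may differ), and `split_suffix` and `same_language` are hypothetical helpers, not RDFLib functions. It suggests why the reported error starts at '-US': the lowercase-only pattern stops matching after '@en' and leaves the rest of the line unconsumed.

```python
import re

# Assumed stand-in for the parser's `uriref` fragment; the actual
# definition in ntriples.py may differ.
uriref = r'<([^:]+:[^\s"<>]+)>'

# The two patterns from the diff above.
litinfo_old = r'(?:@([a-z]+(?:-[a-z0-9]+)*)|\^\^' + uriref + r')?'
litinfo_new = r'(?:@([a-z]+(?:-[A-Za-z0-9]+)*)|\^\^' + uriref + r')?'


def split_suffix(pattern, text):
    """Return (language tag or None, unconsumed remainder) for the text
    that follows a literal's closing quote."""
    match = re.match(pattern, text)  # always matches: the group is optional
    return match.group(1), text[match.end():]


def same_language(tag_a, tag_b):
    """Compare two language tags without regard to letter case."""
    return tag_a.lower() == tag_b.lower()


suffix = '@en-US <http://telegraphis.net/ontology/measurement/metric> .'
old_tag, old_rest = split_suffix(litinfo_old, suffix)
new_tag, new_rest = split_suffix(litinfo_new, suffix)
```

With the old pattern the tag comes back as 'en' and the remainder begins with '-US'; with the new pattern the whole of 'en-US' is captured. If mixed-case tags are stored as written, any comparison would presumably need to fold case, as `same_language` does, to stay in line with the note quoted above.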

Original comment by gjhigg...@gmail.com on 24 Oct 2011 at 3:26

Parsing nquads crashes for literals with language annotation #176