rdfhdt / hdt-java

HDT Java library and tools.

TTL files as input to rdf2hdt produce invalid blank node IDs #210

Open GregHanson opened 2 months ago

GregHanson commented 2 months ago

Using an input TTL file from the W3C SPARQL 1.0 Test Suite (i18n), I ran it through rdf2hdt and dumped the contents using hdtSearch:

./bin/rdf2hdt.sh sample.ttl sample.hdt
[INFO] Scanning for projects...
[INFO] Inspecting build with total of 1 modules...
[INFO] Installing Nexus Staging features:
[INFO]   ... total of 1 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO]
[INFO] ----------------------< org.rdfhdt:hdt-java-cli >-----------------------
[INFO] Building HDT Java Command line Tools 3.0.10
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ hdt-java-cli ---
[WARN] base uri not specified, using 'file:///path/to/sample.ttl'
[INFO] Converting path/to/sample.ttl to path/to/sample.hdt as TURTLE
File converted in ..... 524 ms 808 us
Total Triples ......... 9
Different subjects .... 4
Different predicates .. 5
Different objects ..... 9
Common Subject/Object . 0
HDT saved to file in .. 7 ms 942 us
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.314 s
[INFO] Finished at: 2024-05-06T16:36:52-04:00
[INFO] ------------------------------------------------------------------------

./bin/hdtSearch.sh sample.hdt
[INFO] Scanning for projects...
[INFO] Inspecting build with total of 1 modules...
[INFO] Installing Nexus Staging features:
[INFO]   ... total of 1 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO]
[INFO] ----------------------< org.rdfhdt:hdt-java-cli >-----------------------
[INFO] Building HDT Java Command line Tools 3.0.10
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ hdt-java-cli ---
>> ? ? ?
Query: |?| |?| |?|
_:@0 http://www.w3.org/2001/sw/DataAccess/tests/data/i18n/normalization.ttl#resumé "Alice's normalized resumé"
_:@0 http://xmlns.com/foaf/0.1/name "Alice"
_:@1 http://www.w3.org/2001/sw/DataAccess/tests/data/i18n/normalization.ttl#resumé "Bob's non-normalized resumé"
_:@1 http://xmlns.com/foaf/0.1/name "Bob"
_:@2 http://www.w3.org/2001/sw/DataAccess/tests/data/i18n/normalization.ttl#resumé "Eve's non-normalized resumé"
_:@2 http://www.w3.org/2001/sw/DataAccess/tests/data/i18n/normalization.ttl#resumé "Eve's normalized resumé"
_:@2 http://xmlns.com/foaf/0.1/name "Eve"
file:///path/to/sample.ttl http://www.w3.org/2000/01/rdf-schema#comment "Normalized and non-normalized IRIs"
file:///path/to/sample.ttl http://www.w3.org/2002/07/owl#versionInfo "$Id: normalization-01.ttl,v 1.1 2005/10/25 09:38:08 aseaborne Exp $"
Iterated 9 triples in 22 ms 504 us

I cannot find @ called out explicitly in the Turtle or N-Triples spec, but when I use @ at the start of blank node labels in the examples from the docs, the riot CLI reports validation errors:

cat <<EOF > blanknode.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

_:@123 foaf:knows _:@1234 .
_:@1234 foaf:knows _:@123 .
EOF

cat <<EOF > blanknode.nt
_:@123 <http://xmlns.com/foaf/0.1/knows> _:bob .
_:bob <http://xmlns.com/foaf/0.1/knows> _:@123.
EOF

riot --validate --time blanknode.ttl
17:02:00 ERROR riot            :: [line: 3, col: 3 ] Blank node label does not start with alphabetic or _ : '@'
blanknode.ttl :  (No Output) : 1 errors : 0 warnings
riot --validate --time blanknode.nt
17:02:05 ERROR riot            :: [line: 1, col: 3 ] Blank node label does not start with alphabetic or _ : '@'
blanknode.nt :  (No Output) : 1 errors : 0 warnings
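
As a stopgap while the labels are emitted this way, a dump like the hdtSearch output above could be post-processed before handing it to a strict parser. The sketch below is only an illustration, not part of hdt-java: the function name `relabel_at_blank_nodes` and the replacement prefix `b` are hypothetical, and it assumes no existing label in the data already uses that prefix with the same suffix.

```python
import re

# Matches a blank node label that starts with '@' (e.g. "_:@0").
# The lookbehind restricts matches to the start of a line or to a position
# right after whitespace. Caveat: a string literal containing " _:@1" would
# also be rewritten, so this is not safe for arbitrary data.
AT_LABEL = re.compile(r"(?<!\S)_:@(\w+)")


def relabel_at_blank_nodes(line: str, prefix: str = "b") -> str:
    """Rewrite labels of the form _:@N to _:<prefix>N (hypothetical helper)."""
    return AT_LABEL.sub(lambda m: f"_:{prefix}{m.group(1)}", line)


if __name__ == "__main__":
    print(relabel_at_blank_nodes('_:@0 http://xmlns.com/foaf/0.1/name "Alice"'))
```

This is text-level rewriting only; it does not change the labels stored in the HDT file itself.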

GregHanson commented 1 month ago

Actually, the spec does list the valid characters:

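RDF blank nodes in Turtle are expressed as _: followed by a blank node label which is a series of name characters. The characters in the label are built upon PN_CHARS_BASE, liberalized as follows:

The characters _ and digits may appear anywhere in a blank node label.
The character . may appear anywhere except the first or last character.
The characters -, U+00B7, U+0300 to U+036F and U+203F to U+2040 are permitted anywhere except the first character.

Where PN_CHARS_BASE is the following list: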

[A-Z] 
[a-z] 
[#x00C0-#x00D6] 
[#x00D8-#x00F6] 
[#x00F8-#x02FF] 
[#x0370-#x037D] 
[#x037F-#x1FFF] 
[#x200C-#x200D] 
[#x2070-#x218F] 
[#x2C00-#x2FEF]
[#x3001-#xD7FF] 
[#xF900-#xFDCF] 
[#xFDF0-#xFFFD] 
[#x10000-#xEFFFF]

This does not include #x0040 (@), so labels such as _:@0 are not valid Turtle blank node labels.
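
To make the check concrete, here is a minimal sketch of a label validator built from the ranges listed above. It follows my reading of the Turtle grammar production `BLANK_NODE_LABEL ::= '_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')* PN_CHARS)?`; the helper names (`_cls`, `is_valid_blank_node_label`) are made up for illustration, and this is not how riot or hdt-java implement the check.

```python
import re

# PN_CHARS_BASE ranges, copied from the list above.
PN_CHARS_BASE = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def _cls(ranges):
    """Render code point ranges as the inside of a regex character class."""
    return "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)


# PN_CHARS_U adds '_'; PN_CHARS adds '-', digits, U+00B7,
# U+0300..U+036F and U+203F..U+2040 (Turtle grammar).
PN_CHARS_U = _cls(PN_CHARS_BASE) + "_"
PN_CHARS = (PN_CHARS_U + "0-9\\-" + "\\U000000B7"
            + _cls([(0x300, 0x36F), (0x203F, 0x2040)]))

BLANK_NODE_LABEL = re.compile(
    f"_:[{PN_CHARS_U}0-9](?:[{PN_CHARS}.]*[{PN_CHARS}])?"
)


def is_valid_blank_node_label(label: str) -> bool:
    """Return True if the whole string matches BLANK_NODE_LABEL."""
    return BLANK_NODE_LABEL.fullmatch(label) is not None


if __name__ == "__main__":
    for label in ["_:@0", "_:b0", "_:0", "_:a.b", "_:a.", "_:-a"]:
        print(label, is_valid_blank_node_label(label))
```

With this pattern, `_:@0` is rejected because `@` is neither in PN_CHARS_U nor a digit, while `_:b0` and `_:0` are accepted.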