rdfhdt / hdt-java

HDT Java library and tools.

W3C SPARQL 1.0 i18n normalization-02 test case fails #203

Open donpellegrino opened 7 months ago

donpellegrino commented 7 months ago

See https://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0096 for details on the test case. Note the emphasis on the presence of "." and ".." in the URLs. Test case definition:

:normalization-2 rdf:type mf:QueryEvaluationTest ;
    mf:name    "normalization-02" ;
    dawgt:approval dawgt:Approved ;
    dawgt:approvedBy <http://lists.w3.org/Archives/Public/public-rdf-dawg/2007JulSep/att-0047/31-dawg-minutes> ;
    rdfs:comment
        "Example 1 from http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0096" ;
    mf:action
        [ qt:data   <normalization-02.ttl> ;
          qt:query  <normalization-02.rq> ] ;
    mf:result  <normalization-02-results.ttl>
    .

Defined in https://github.com/w3c/rdf-tests/blob/main/sparql/sparql10/i18n/manifest.ttl

Using hdt-c++:

user@dunx4:~/projects/oxigraph/oxhdt-sys/tests/resources/rdf-tests/sparql/sparql10/i18n$ rdf2hdt /home/user/projects/oxigraph/testsuite/rdf-tests/sparql/sparql10/i18n/normalization-02.ttl normalization-02.hdt
user@dunx4:~/projects/oxigraph/oxhdt-sys/tests/resources/rdf-tests/sparql/sparql10/i18n$ hdtSearch normalization-02.hdt
Predicate Bitmap in 59 usp: 0 % / 14.86 %
Count predicates in 8 usferences: 0 % / 16.075 %
Count Objects in 4 us Max was: 1: 0 % / 34.3 %
Bitmap in 14 us bitmap: 0 % / 45.64 %
Bitmap bits: 2 Ones: 2
Object references in 50 usces: 0 % / 48.475 %
Sort lists in 8 usblists: 0 % / 68.32 %
Index generated in 196 us
>> ? ? ?                                          %
http://example/vocab#s1 http://example/vocab#p example://a/b/c/%7Bfoo%7D#xyz
http://example/vocab#s2 http://example/vocab#p eXAMPLE://a/./b/../b/%63/%7bfoo%7d#xyz
2 results in 72 us

Using hdt-java:

user@dunx4:~/projects/oxigraph/oxhdt-sys/tests/resources/rdf-tests/sparql/sparql10/i18n$ rdf2hdt.sh /home/user/projects/oxigraph/testsuite/rdf-tests/sparql/sparql10/i18n/normalization-02.ttl normalization-02.hdt
[WARN] base uri not specified, using 'file:///home/user/projects/oxigraph/testsuite/rdf-tests/sparql/sparql10/i18n/normalization-02.ttl'
[INFO] Converting /home/user/projects/oxigraph/testsuite/rdf-tests/sparql/sparql10/i18n/normalization-02.ttl to normalization-02.hdt as TURTLE
[line: 7, col: 8 ] Not advised IRI: <eXAMPLE://a/b/%63/%7bfoo%7d#xyz> Code: 11/LOWERCASE_PREFERRED in SCHEME: lowercase is preferred in this component
File converted in ..... 517 ms 227 us
Total Triples ......... 2
Different subjects .... 2
Different predicates .. 1
Different objects ..... 2
Common Subject/Object . 0
HDT saved to file in .. 3 ms 2 us
user@dunx4:~/projects/oxigraph/oxhdt-sys/tests/resources/rdf-tests/sparql/sparql10/i18n$ hdtSearch.sh normalization-02.hdt
Count Objects in 25 us Max was: 1
Bitmap in 106 us
Object references in 11 ms 581 us
Sort object sublists in 17 us
Count predicates in 22 us
Index generated in 14 ms 129 us
[main] . [          ] 0.00  Creating Predicate bitmap 0 / 2
[main] . [          ] 0.00  Generating predicate references
Count predicates in 216 us
Index generated and saved in 62 ms 416 us
>> ? ? ?
Query: |?| |?| |?|
http://example/vocab#s1 http://example/vocab#p example://a/b/c/%7Bfoo%7D#xyz
http://example/vocab#s2 http://example/vocab#p eXAMPLE://a/b/%63/%7bfoo%7d#xyz
Iterated 2 triples in 10 ms 428 us
>>
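
Comparing the two listings, the only difference is the object of s2: hdt-c++ keeps the "/./b/../" segments as written, while hdt-java stores the IRI without them. The scheme case and the percent-encoded characters are unchanged in both. That pattern matches dot-segment removal as described in RFC 3986. The following minimal sketch reproduces the same string change with the JDK's java.net.URI.normalize(). It is only an illustration of the transformation; the class and method names are made up for this example, and it does not claim that hdt-java or Jena call this API internally.

```java
import java.net.URI;

public class DotSegmentDemo {

    // Hypothetical helper: applies java.net.URI.normalize(), which removes
    // "." and ".." path segments but does not change the scheme case or
    // the percent-encoded characters.
    static String removeDotSegments(String iri) {
        return URI.create(iri).normalize().toString();
    }

    public static void main(String[] args) {
        // Object IRI of :s2 exactly as written in normalization-02.ttl
        String asWritten = "eXAMPLE://a/./b/../b/%63/%7bfoo%7d#xyz";
        String normalized = removeDotSegments(asWritten);

        System.out.println(asWritten);  // the form printed by hdt-c++
        System.out.println(normalized); // eXAMPLE://a/b/%63/%7bfoo%7d#xyz, the form printed by hdt-java

        // As plain strings the two forms differ, so a comparison that does
        // not normalize IRIs treats them as different terms.
        System.out.println(asWritten.equals(normalized)); // false
    }
}
```

The object of s1 contains no dot segments, so the same call leaves it unchanged.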

Note that this test case is referenced multiple times in the hdt-java codebase (hdt-jena/testing/DAWG-Final/i18n and hdt-jena/testing/DAWG). However, it is unclear to me where these tests are run, so I could not check their status.

Test system

hdt-c++

rdf2hdt -V
v1.1.2

hdt-java: hdt-java-package-3.0.10

java -version
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

ate47 commented 7 months ago

I think it's due to how RIOT (Jena's parser) handles the Turtle input:

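    // Imports needed by this snippet (Jena RIOT, hdt-java, JUnit 4):
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFParser;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.sparql.core.Quad;
    import org.junit.Test;
    import org.rdfhdt.hdt.exceptions.ParserException;
    import org.rdfhdt.hdt.util.string.ByteStringUtil;

    @Test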
    public void i18nTest() throws IOException, ParserException {
        // https://github.com/rdfhdt/hdt-java/issues/203

        String data = "@prefix : <http://example/vocab#>.\n" +
                "\n" +
                "  :s1 :p <example://a/b/c/%7Bfoo%7D#xyz>.\n" +
                "  :s2 :p <eXAMPLE://a/./b/../b/%63/%7bfoo%7d#xyz>.\n";

        try (InputStream is = new ByteArrayInputStream(data.getBytes(ByteStringUtil.STRING_ENCODING))) {

            RDFParser build = RDFParser.source(is).lang(Lang.TURTLE).build();

            build.parse(new StreamRDF() {

                @Override
                public void triple(Triple triple) {
                    System.out.println(triple);
                }
                @Override
                public void start() {}
                @Override
                public void quad(Quad quad) {}
                @Override
                public void base(String s) {}
                @Override
                public void prefix(String s, String s1) { }
                @Override
                public void finish() {}
            });

        }
    }

returns

http://example/vocab#s1 @http://example/vocab#p example://a/b/c/%7Bfoo%7D#xyz
http://example/vocab#s2 @http://example/vocab#p eXAMPLE://a/b/%63/%7bfoo%7d#xyz

The parser itself is configured in the org.rdfhdt.hdt.rdf.parsers.RDFParserRIOT#parse() method if you want to take a look.

donpellegrino commented 7 months ago

Should this issue be submitted upstream against Jena's RIOT instead of here in hdt-java?

ate47 commented 7 months ago

I don't know; it might be due to a missing configuration on our side. It would be better to check that first.