Test suite - Githubissues

VladimirAlexiev commented 5 years ago

The sparql 1.1 test suite is useful, but

there are bugs that can't be fixed because the process is over
There are some harnesses, but no harness as a service that any dev can easily use to test his implementation, continuously.
The Implementation Reports are generated from EARL rdf test results, which is great. But afaik these are submitted by devs and taken at face value.

@borderCloud (Karima Rafes) has been running http://sparqlscore.com/ valiantly for 4 years (see documentation), added some tests and fixed some; and given up on others because of ambiguities in the spec.

She proposed and I support that whatever 1.2 features are standardized by this group, should have tests. I also put forward that this group should try to fix 1.1 test suite problems, and help w3c host a continuous testing harness.

The biggest improvements needed on this testing site are

more flexible result comparison by the test runner. Eg using jsonld c14n to make comparison easier
logistical issues eg what do you use as counterparty server for Federated queries

Karima please add more from recent emails

kasei commented 5 years ago

Are there "test suite problems" that are not captured by the updates made as part of the rdf-tests CG?

afs commented 5 years ago

See RDF tests issue 51 -- w3c/rdf-tests#51 -- for previous discussions.

The draft charter for the SPARQL 1.1 CG specifically recognizes liaison with the "RDF Test Curation Community Group".

BorderCloud commented 5 years ago

I do not know if this is the right time to discuss the test suite, but the first thing to do is to define exactly the same minimal API for all SPARQL services (#27). Once the minimal API is accepted, we can imagine a new solution to fully test each SPARQL implementation in parallel with the work of a future WG.

In this new solution for testing SPARQL implementations, I would like to:

Simulate all communications of a SPARQL service with a SPARQL client and other services (federated query)
Allow anyone to define a new test (public or private)
Allow anyone to reproduce for free the result of each test (ie. in GitHub with Travis Ci)
Generate customized test reports to help WG members test their software (in private during development)
Allow the WG to select which tests will be part of the recommendation, depending on the test results
After the recommendation of SPARQL 1.2, offer a service allowing users to see the SPARQL 1.2 official features supported or not by each solution on the market.

I think it's time to industrialise SPARQL.

If the CG officially asks me to participate in the WG to consolidate the next version of SPARQL, I can start to propose a new research project to develop this new platform.

VladimirAlexiev commented 5 years ago

@kasei Good question. Nearly all files at https://github.com/w3c/rdf-tests/tree/gh-pages/sparql11 are last updated 3-4 years ago. See TFT-tests/issues: I believe the following are bugs in the tests: https://github.com/BorderCloud/TFT-tests/issues/18, https://github.com/BorderCloud/TFT-tests/issues/15, https://github.com/BorderCloud/TFT-tests/issues/20, https://github.com/BorderCloud/TFT-tests/issues/2. We even had some absurd discussions like

Fix the name of tvs02 to tsv02
you can ask to fix it in the project of w3c/rdf-tests. After, I will pull their new "official name"

Sure enough, tvs02 is still not fixed.

https://github.com/BorderCloud/TFT-tests/issues/4 is pervasive: many tests use relative URLs but fix no base. @jeenbroekstra said that's not a bug and Karima's runner adds some base, but I think it is a bug.

The other issues are more important: we need a flexible test result comparator (perhaps based on c14n), else there are many false negatives.

@afs the W3C Tests CG site doesn't have any posts since 3.5y ago (2015). I asked a couple days ago "Is the activity of this group closed? TFT-tests runs continuous tests over some RDF repos, and tries to fix some of the tests. The biggest improvements needed in this suite is more flexible result comparison by the test runner". The comment is still awaiting moderation: I think that group is closed and gone.

Karima replied "I finished my thesis (in French): Karima Rafes. Le Linked Data à l'université : la plateforme LinkedWiki. Université Paris-Saclay, 2019. Français. Chapter 5 is the conclusion of this work. I developed the simplest. There are still tests that are difficult or useless to code because several parts of the SPARQL 1.1 specifications are too fuzzy. I did my maximum. The next step for me is to consolidate/change the specifications, otherwise SPARQL will never be totally interoperable.

So, the TFT project is on standby and will disappear when W3C offers all tests with a tool such as TFT to validate compliance with SPARQL. If the tests and the tools to run them become a prerequisite for validating the specifications, there will be fewer functionalities but SPARQL 1.2 will not have the interoperability problems of SPARQL 1.1. When the CG works on the tests needed for SPARQL 1.2, I will try to work with it (if I have the time)."

Maybe I should have pressed with w3c/rdf-tests. But I had these exchanges with Karima in 2018 (I was trying to get GraphDB to perfect score), while the last activity I see in the W3C Tests CG and their github is 2015.

Ergo the point of this issue: SPARQL test suite activity needs to be restarted, and kept continuous for 3-4 years. Every SPARQL 1.2 feature must come with tests, and there should be a continuous-testing framework in place. Else there is a risk that users won't know which repo implements what and how well, and the new features won't be used much.

VladimirAlexiev commented 5 years ago

@afs SPARQL 1.1 CG specifically recognizes liaison with the "RDF Test Curation Community Group"

If another group can take over testing that would be great. But it seems to me the W3C Tests CG is disbanded/passive. I think that together with forming this SPARQL 1.2 CG, the Tests CG must be restarted. @iherman and @gkellogg, please comment?

gkellogg commented 5 years ago

CG is not disbanded, it has been quiescent for a long time. It makes sense to have this CG to drive SPARQL tests, but may want to work out of the RDF tests CG repo.

afs commented 5 years ago

@VladimirAlexiev

Nearly all files at https://github.com/w3c/rdf-tests/tree/gh-pages/sparql11 are last updated 3-4 years ago.

because there have been no fixes needed. https://github.com/w3c/rdf-tests/commits/gh-pages and https://github.com/w3c/rdf-tests/pulls?q=is%3Apr+is%3Aclosed show recent activity.

Moving the work across CGs does not change the fact that someone has to do the work. Change happens when pull requests are sent.

Is there a barrier to contributing to RDF test CG?

VladimirAlexiev commented 5 years ago

@afs then please move this task to rdf-tests (but change the title to something more descriptive).

@gkellogg and @kasei and whoever else was active in rdf-tests, you'll be the best people to continue leading this work! I've long marveled at EARL and how EARL reports are used to generate Implementation Report htmls, a work of beauty. But do you agree with the more ambitious goals that Karima and I have proposed above?

A continuous testing framework is better than taking those EARL reports at face value.
A more flexible comparator (perhaps based on c14n) will eliminate false negatives and let vendors focus on the true discrepancies. @gkellogg you and Manu would be the best people to pull this off.

Is there a barrier to contributing to RDF test CG?

Truth be told I never tried, I didn't know it was active. I (or a QA at ONTO) would love to work with rdf-test to eliminate false negatives. I posted some issues to Karima but she basically threw her hands in the air for some of them, saying "it's out of my hands". Or fixed stuff locally, eg look at https://github.com/BorderCloud/TFT-tests/issues/4: she added some base to all queries, but maybe it's better to specify the base explicitly.

afs commented 5 years ago

If that community wish to take the issue, then fine. I do not believe pushing it at them is productive. There is RDF tests issue 51 -- w3c/rdf-tests#51 -- for previous discussions.

Work on a test runner does not need any permission from anyone but the idea of changing SPARQL to fit one particular runner seems a bad idea.

Base URI handling is explained in the SPARQL test suite. RFC 3986 section 5.1 explains the general mechanism that applies to all URI resolution.

VladimirAlexiev commented 5 years ago

Work on a test runner does not need any permission from anyone

I'm not seeking permission, I seek willingness for collaboration on this important topic. Do you think it'd be important to run a centralized continuous test runner for everyone's benefit?

changing SPARQL to fit one particular runner seems a bad idea

Don't know what gave you that idea. I think that using relative URLs in tests without base leaves them underspecified, and is one issue that needs fixing in the tests.

kasei commented 5 years ago

changing SPARQL to fit one particular runner seems a bad idea

Don't know what gave you that idea. I think that using relative URLs in tests without base leaves them underspecified, and is one issue that needs fixing in the tests.

Base URL resolution is well defined.

Beyond this issue, there have been other suggestions (e.g. in #27) to make backwards incompatible changes for the benefit of testing. I strongly agree with @afs that this sort of thing would be a bad idea.

gkellogg commented 5 years ago

@gkellogg and @kasei and whoever else was active in rdf-tests, you'll be the best people to continue leading this work! I've long marveled at EARL and how EARL reports are used to generate Implementation Report htmls, a work of beauty. But do you agree with the more ambitious goals that Karima and I have proposed above?

A continuous testing framework is better than taking those EARL reports at face value.

RDFa did something like that, which was a pain. Every implementation must maintain a service to respond to test queries. In reality, it was a lot of work. Today, you might use containerized apps, but it might be better to define a CI best practice for implementations to use to run the tests, and potentially send an updated report. Conceivably, the implementation report could be automatically updated, but it’s required a lot of hand holding in the past.

A more flexible comparator (perhaps based on c14n) will eliminate false negatives and let vendors focus on the true discrepancies. @gkellogg you and Manu would be the best people to pull this off.

I don’t see that it would eliminate false negatives, as C14N and Isomorphism effectively allow equivalent comparisons. C14N might generate more useful diffs when results don’t compare.

Is there a barrier to contributing to RDF test CG?

Truth be told I never tried, I didn't know it was active. I (or a QA at ONTO) would love to work with rdf-test to eliminate false negatives.

Consider joining the CG.

BorderCloud commented 5 years ago

@gkellogg

A continuous testing framework is better than taking those EARL reports at face value.

RDFa did something like that, which was a pain. Every implementation must maintain a service to respond to test queries. In reality, it was a lot of work. Today, you might use containerized apps, but it might be better to define a CI best practice for implementations to use to run the tests, and potentially send an updated report. Conceivably, the implementation report could be automatically updated, but it’s required a lot of hand holding in the past.

My continuous testing framework works already via Travis CI and the results of tests are collected in a RDF database via a SPARQL service. The CG can already use it to evaluate the compliance with SPARQL 1.1... (and enable my tests about the protocol)

But for the federated query protocol, my first implementation is insufficient. We have to imagine another method in the future.

VladimirAlexiev commented 5 years ago

@kasei

Base URL resolution is well defined.

But when a test doesn't define a base and the test SPARQL can be located at different URLs, what is the result of that resolution? Would you agree with me that a test that uses relative URLs and doesn't specify base is under-specified?

there have been other suggestions (e.g. in #27) to make backwards incompatible changes

I myself don't know what Karima means by #27. But don't throw away the baby with the bath water. Have you looked at http://sparqlscore.com and what do you think of it?

@gkellogg

Every implementation must maintain a service to respond to test queries

Most vendors (and I speak for one) have eval or free versions, that's what Karima used for her service. Vendors have an interest in perfecting their score. Karima's done a good job, but she needs the support of the RDF Test CG to keep it going and to improve it.

the implementation report could be automatically updated, but it’s required a lot of hand holding in the past.

Have you considered the reproducibility of the Implementation Report? If I want to check all claimed results, what am I to do?

Consider joining the CG.

I'll speak to colleagues at ONTO.

I don’t see that it would eliminate false negatives, as C14N and Isomorphism effectively allow equivalent comparisons.

It's easier to compare two c14n-ed result sets (the etalon, i.e. the expected reference result, and the response of the SUT (system under test)). The SUT response can often include extra triples, which the comparator must allow.

@BorderCloud

federated query protocol, my first implementation is insufficient.

Yes, what to use as the counterparty server for Federated queries is a difficult question.

Eg if you use Virtuoso, presumably this gives an unfair advantage to Virtuoso as SUT (because presumably two Virtuoso instances will federate more smoothly than Virtuoso and another SUT).
The uptime of this counterparty system is important, else it'll fail the federated tests of other SUTs

afs commented 5 years ago

@VladimirAlexiev,

The tests run from manifest files, which are Turtle. Suppose the manifest file is http://example/manifest.ttl.

    mf:action
         [ qt:query  <agg01.rq> ;
           qt:data   <agg01.ttl> ] ;

When that is read by a Turtle parser, the RDF term for <agg01.rq> is http://example/agg01.rq. When reading the query, the base URI is therefore http://example/agg01.rq. A query can change this with BASE, but it starts out being http://example/agg01.rq. This is not a feature of SPARQL; it is part of RFC 3986.
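The resolution described above can be reproduced with a minimal sketch using Python's standard `urllib.parse.urljoin`, which implements RFC 3986 reference resolution. The second manifest location and the function name `resolve_test_files` are made up for illustration; they only show that the same relative references resolve consistently wherever the suite resides.

```python
from urllib.parse import urljoin

# Hypothetical locations of the same manifest; any URL would do.
MANIFEST_LOCATIONS = [
    "http://example/manifest.ttl",
    "http://localhost:8080/suite/manifest.ttl",
]


def resolve_test_files(manifest_url, query_ref="agg01.rq", data_ref="agg01.ttl"):
    """Resolve the manifest's relative references against the manifest URL."""
    query_url = urljoin(manifest_url, query_ref)
    data_url = urljoin(manifest_url, data_ref)
    # The query's initial base URI is the URL the query was read from;
    # a BASE declaration inside the query could still override it.
    return {"query": query_url, "data": data_url, "query_base": query_url}


for location in MANIFEST_LOCATIONS:
    print(location, "->", resolve_test_files(location))
```

The absolute URLs differ per location, but the relationship between manifest, query and data stays the same, which is what makes a downloaded copy of the suite runnable locally.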

VladimirAlexiev commented 5 years ago

@afs Exactly my point: what is the actual value of http://example? It is not defined by the test suite.

@kasei comment on Protocol validation: https://github.com/w3c/sparql-12/issues/1#issuecomment-480322762. Would be great to include protocol tests in the suite.

afs commented 5 years ago

It is wherever the test suite resides. It is not fixed and does not need to be.

This allows people to download the suite and run it locally as they have done. (After all, it is mostly the test suite for query engines.)

This has been discussed at length before. What is the problem you are facing with relative URI resolution to make the test suite portable?

BorderCloud commented 5 years ago

@afs

After all, it is mostly the test suite for query engines.

That's wrong. It is mostly the test suite for SPARQL clients, because it is the SPARQL clients that are the victims of your different protocols.

A unique and reproducible test suite is not an optional tool when we want to build real interoperability for the Semantic Web.

I demonstrated that it is possible to use the same protocol test suite to evaluate our interoperability. It's free and reproducible online by anybody. It's excellent news for the next version of SPARQL, isn't it?

It's time to use the same test suite to build a real interoperability for SPARQL 1.1 and 1.2 and 2.0...

VladimirAlexiev commented 5 years ago

Andy's comment "mostly the test suite for query engines" applies to the question of whether queries should specify their BASE.

On the other hand, I believe that protocol tests are definitely fair game for such a test suite.

afs commented 5 years ago

Please update sparqlscore to work with RDF 1.1.

BorderCloud commented 5 years ago

@afs

Please update sparqlscore to work with RDF 1.1.

I would like... but the test suite is implemented in RDF 1.0 (Turtle 1.0). https://github.com/w3c/rdf-tests/blob/gh-pages/sparql11/data-sparql11/manifest-all.ttl

I'm not sure I understood the meaning of the sentence. Sparqlscore loads the turtle 1.0 of the official test suite (compliant in theory with 1.1).

afs commented 5 years ago

The issue for sparqlscore seems to be in the comparison of results. In RDF1.1, simple strings and xsd:string are the same thing and there is a preference for omitting the datatype. For running tests, it is the comparison that can handle that even if up until then a mix of simple strings and xsd:string happens.
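As a rough sketch of a comparison that handles this, a test runner could normalize each term before comparing, so that a literal without a datatype and the same literal typed as xsd:string compare equal. The dictionaries below follow the term layout of the SPARQL 1.1 JSON results format (`type`, `value`, optional `datatype` or `xml:lang`); the function name `normalize_term` is an assumption for illustration and is not taken from sparqlscore or TFT.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def normalize_term(term):
    """Return a comparable tuple for one term of a SPARQL JSON result binding."""
    if term["type"] != "literal":
        # IRIs and blank nodes: compare by kind and value only.
        return (term["type"], term["value"])
    lang = term.get("xml:lang")
    if lang is not None:
        # Language tags are case-insensitive, so compare them lower-cased.
        return ("literal", term["value"], "lang", lang.lower())
    # RDF 1.1: a literal without a datatype is an xsd:string literal.
    return ("literal", term["value"], "datatype", term.get("datatype", XSD_STRING))


plain = {"type": "literal", "value": "abc"}
typed = {"type": "literal", "value": "abc", "datatype": XSD_STRING}
print(normalize_term(plain) == normalize_term(typed))
```

With this normalization, an engine that emits the datatype and one that omits it are treated alike, while literals with any other datatype still compare as different.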

BorderCloud commented 5 years ago

@afs A fix for one query engine in sparqlscore may be a new issue for another query engine. For the moment, I am waiting for the next version of SPARQL before changing TFT and SparqlScore.

I dream... In the future version of test suite, the SPARQL results should be strictly the same for the same query on the same data for any query engines (and of course with the same protocol).

kasei commented 5 years ago

@BorderCloud surely it's better to support the current standard than keep outdated implementations appearing to pass while ensuring new implementations appear to fail? sparqlscore.com says:

SPARQLScore is an attempt to evaluate the conformance of triplestores to the W3C standards.

(Emphasis added.) I read that as implying the current standards, so if that's not what you're choosing to do, you might want to explicitly state as much.

I dream... In the future version of test suite, the SPARQL results should be strictly the same for the same query on the same data for any query engines (and of course with the same protocol).

The nature of scheduling different working groups and their related standards will make your dream very difficult to achieve in practice. In practice, however, I think there is already broad consensus around the test suite and what counts as a conforming implementation.

afs commented 5 years ago

@BorderCloud It will not invalidate a result from an RDF 1.0 based engine.

kasei commented 5 years ago

@afs

@BorderCloud It will not invalidate a result from an RDF 1.0 based engine.

I think that's true for everything except two tests. This rdf-tests commit explains the reasoning, and removes the old tests from the manifest list.

BorderCloud commented 5 years ago

@afs @kasei I checked the specifications of SPARQL result 1.1 with XML. https://www.w3.org/2007/SPARQL/result.xsd

The attribute "datatype" seems required (for RDF 1.0 or 1.1). There is no default type when the attribute "datatype" does not exist.

kasei commented 5 years ago

The attribute "datatype" seems required (for RDF 1.0 or 1.1). There is no default type when the attribute "datatype" does not exist.

I'm not sure what the problem is. Could you provide some more context?

Possibly helpful to this discussion, I'll point out that the RDF 1.1 Concepts and Abstract Syntax has this to say about literals:

Please note that concrete syntaxes may support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string. Similarly, most concrete syntaxes represent language-tagged strings without the datatype IRI because it always equals http://www.w3.org/1999/02/22-rdf-syntax-ns#langString.

BorderCloud commented 5 years ago

I'm not sure this is the best place for this discussion... This is only one of the problems that still need to be explicitly specified in the next version.

afs commented 5 years ago

The datatype attribute was not required at SPARQL 1.0. 2.3.1. Variable Binding Results

RDF Literal S: <binding><literal>S</literal></binding>

You are right there is no default datatype because in RDF 1.0 plain strings didn't have a datatype.

VladimirAlexiev commented 5 years ago

@borderCloud I dream... In the future version of test suite, the SPARQL results should be strictly the same for the same query on the same data for any query engines

This dream is neither realistic nor necessary. Query engines are allowed some flexibility eg

return extra triples for wildcard queries (eg GDB returns system ontology axiomatic triples, depending on installed ruleset)
vary result order, if ordering is not specified, or for all CONSTRUCT queries
vary the names of blank nodes
return or omit xsd:string, because it's the default

We need a more flexible comparator
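One possible shape for such a comparator, for CONSTRUCT-style results, is sketched below: it ignores order and tolerates extra triples by checking only that every expected triple is present. It assumes ground triples given as plain tuples of strings; blank nodes are deliberately not handled here and would need isomorphism checking or canonicalization. The name `missing_triples` and the sample data are hypothetical.

```python
def missing_triples(expected, actual):
    """Return the expected triples that the system under test did not return.

    Both arguments are iterables of (subject, predicate, object) tuples of
    ground terms. Order is ignored and extra triples in `actual` are allowed.
    """
    return set(expected) - set(actual)


expected = [("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:p", "ex:c")]
actual = [
    ("ex:sys", "rdf:type", "owl:Ontology"),  # extra triple added by the engine
    ("ex:b", "ex:p", "ex:c"),
    ("ex:a", "ex:p", "ex:b"),
]

# An empty set means the test passes under this tolerant definition.
print(missing_triples(expected, actual))
```

Whether extra triples should count as a pass is a policy decision for the test suite; the sketch only shows that the check itself is simple once results are parsed.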

jindrichmynarz commented 5 years ago

We need a more flexible comparator

Comparing serialized results via byte-by-byte equality is brittle. Using a canonical serialization or testing result graph isomorphism helps, but as you mention above, there are still cases, in which we want to give query engines more leeway. In such cases, we can define looser tests via invariants (e.g., ASK queries on results expected to be true/false) or metamorphic relations (some input data permutations produce the same results).

afs commented 5 years ago

Comparing two results sets: https://lists.w3.org/Archives/Public/public-sparql-dev/2014JulSep/0030.html

jindrichmynarz commented 5 years ago

Unordered SELECT results can be parsed as sets of hash-maps (I've done this here). Such data structure provides more fitting equality semantics.
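A minimal sketch of that idea in Python (not the code linked above) could look like the following. It assumes results in the SPARQL 1.1 JSON results layout and uses a `Counter` rather than a plain set, so that duplicate rows still count; the function name `as_multiset` is made up for illustration, and blank node labels are compared literally, which a real runner would need to relax.

```python
from collections import Counter


def as_multiset(results):
    """Convert SPARQL JSON results into an order-independent multiset of rows."""
    rows = results["results"]["bindings"]
    return Counter(
        # Each row becomes a hashable set of (variable, term) pairs.
        frozenset((var, tuple(sorted(term.items()))) for var, term in row.items())
        for row in rows
    )


row_a = {"s": {"type": "uri", "value": "http://example/a"}}
row_b = {"s": {"type": "uri", "value": "http://example/b"}}

first = {"head": {"vars": ["s"]}, "results": {"bindings": [row_a, row_b]}}
second = {"head": {"vars": ["s"]}, "results": {"bindings": [row_b, row_a]}}

# Same rows in a different order compare equal.
print(as_multiset(first) == as_multiset(second))
```

Equality on the resulting structures then ignores row order and the order of variables within a row, but still distinguishes results that differ in how often a row occurs.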

afs commented 5 years ago

Yes - trying to avoid parsing the results in some way becomes more trouble than it's worth, effectively becoming a parser eventually. After all, XML and JSON allow layout variations and engines need room to deliver implementation choices and optimizations.

Sounds to me like something to be written up as a "Practice and Experience" note.

VladimirAlexiev commented 3 years ago

I had a chat with Nikolay Kolev, one of our leading testers.

We adopted all of the SPARQL conformance tests in our regression testing
We had to "fix" a number of the expected results to fit legitimate GDB behavior. We could provide the changes, but they can't be adopted as standard because that will cause false negatives in other repos
We adopted several of the TFT additions (eg https://github.com/BorderCloud/TFT-tests/tree/b0fc7769c72905bd8954d116b113aa116914a5dd/GO3/ERT-ART) in our regression testing, but corrected or eliminated some. Eg q05 asks for "dateTime - dateTime = X" (which we return as duration) and then "integer - X" (which we return as null): we eliminated this one.

BorderCloud commented 3 years ago

@VladimirAlexiev Don't forget to insert also the tests about the protocol. https://github.com/BorderCloud/rdf-tests/tree/withJmeter/sparql11/data-sparql11/protocol

namedgraph commented 3 years ago

We have a rather basic test suite based on bash and curl: https://github.com/AtomGraph/Processor/tree/master/http-tests

BorderCloud commented 3 years ago

@namedgraph Great !

gkellogg commented 3 years ago

Note that the RDF Test Suite Curation CG has taken on curation of RDF and SPARQL test suites, and there have been a number of additions and corrections.

https://github.com/w3c/rdf-tests/tree/gh-pages/sparql11.

Issues and PRs are welcome there. Of course, this is not official, as there is no active WG, but it has proven to be a useful resource for the community.

BorderCloud commented 3 years ago

Hello

I updated my work on the SPARQL test suite (maybe for the last time?).

You can find here a draft report for SPARQL 1.1. All results were produced using only GitHub and Travis CI. I used only 3 databases (I do not have any sponsors to pay for the time to fix the SPARQL protocol of other SPARQL services). I now use docker-compose to simplify the deployment of multiple SPARQL services simultaneously for the tests with federated queries. JMeter is stable and is the best solution for the moment to develop/debug the necessary tests for when SPARQL services (one day) have an error protocol to respect during a transaction. With Varnish, I can ignore the different protocols of SPARQL services, so I have disabled all my protocol tests for the moment and check only the language. Without sponsors, I cannot properly check the tests about "Entailment Regimes" (see details).

With sponsors, I can develop all the tests on the protocols (query, update and error messages) and I can generate all the possible combinations between all the SPARQL services that really want to share the same protocol.

With this approach, the SPARQL 1.2 working group can remove phrases like "should be", "want to be", etc. from the specification and simply specify the tests for each functionality in the official W3C repository. No "bullshit"... only tests for everything and a report generated automatically by an independent entity. In my opinion, if we can't test something, that thing shouldn't be in the final SPARQL 1.2 specification.

I have proven that it is possible to automate tests with SPARQL protocol. It's time to recommend only one protocol at SPARQL 1.2.

Hope my work helps you build a better SPARQL.

w3c / sparql-dev

Test suite #38