w3c / json-ld-api

JSON-LD 1.1 Processing Algorithms and API Specification
https://w3c.github.io/json-ld-api/
Other
76 stars 29 forks source link

Invalid '@base' IRI is accepted by the context processing algorithm in deserialization test suite. #533

Open timothee-haudebourg opened 3 years ago

timothee-haudebourg commented 3 years ago

From the step 5.7 of the context processing algorithm it is clear that a @base value is accepted iff it is null (step 5.7.2) or an IRI (5.7.3) or a relative IRI reference (5.7.4). In particular, if it is an invalid IRI reference, then an invalid base IRI error is always returned.

In the deserialization test #tli12, an invalid IRI reference is given as context @base, but the expansion algorithm is expected to succeed. This seems in contradiction with step 5.7.

My implementation strictly follows step 5.7, so it fails at processing the context defined in this test input, which in turns makes any deserialization algorithm based on my implementation to fail.

gkellogg commented 3 years ago

The IRI validation for output is actually done in the toRdf algorithm, and this is exclusively a toRdf test (https://w3c.github.io/json-ld-api/tests/toRdf-manifest.html#tli12). If you were to just run the expansion algorithm on the input, I would expect the following result:

[{
  "foo:bar": [{
    "@list": [{
      "@id": "http://invalid/<>/test"
    }]
  }]
}]

Step on of the Object to RDF Conversion algorithm says to return null if the value of @id is not a valid IRI. This is what effectively results in an empty list.

It's not unreasonable for you to bail out at 5.7.3 in the context processing algorithm, as it specifically says if it is an IRI, and this is arguably not an IRI, as it is not valid, but other discussions of IRIs in the algorithms do distinguish between IRIs and valid IRIs, and this does not specifically call for IRI validation.

timothee-haudebourg commented 3 years ago

I see, however I am not very comfortable with the blurry concept of an "invalid IRI". Something is or is not an IRI, and since my implementation is statically typed, this allows me to use a dedicated IRI type that provides many advantages for the users. This means that every time there is a condition of the form "If x is an IRI" I will try to parse x into an IRI using RFC3987, validating it on the way.

In #517 caused by another invalid IRI, you said

it would be better if the examples used valid IRIs

I agree with that, at least when the JSON-LD syntax specification requires IRI/IRI references. The concept of "invalid IRIs" (I also read "well-formed IRI") seems to exist only in the "Processing Algorithms and API" document. The syntax specification of context definitions is pretty clear:

If the context definition has an @base key, its value MUST be an IRI reference, or null.

where "IRI reference" directly links to RFC3987. This contrasts with keys in node objects that can, as far as I understand, take any string value (where I was in the past wrongly expecting an IRI):

Keys in a node object that are not keywords MAY expand to an IRI using the active context.

gkellogg commented 3 years ago

I'm not sure why you find "invalid IRI" a blurry concept, as RFC3987 describes the ABNF to validate an IRI ("well-formed IRI" might be considered a synonym). But, this has to take into consider the resolution protocol, so that an IRI reference may be turned into an IRI, which is a necessary part of validation. You might say that an invalid IRI is not an IRI, by definition, which is strictly correct. But data is full of things that look like IRIs but end up failing validation.

If the context definition has an @base key, its value MUST be an IRI reference, or null.

From this you could infer that the API MUST validate everything treated as an IRI and take appropriate action if it is not. As I mentioned, this is entirely reasonable, but might have consequences for existing implementations. It might rise to the point of requiring an Edited Recommendation, after going through the appropriate processes to correct. Certainly, the first step would be a PR that addressed the algorithms and affected test results, which would be necessary to consider further action. It would be great if you were to put something like this together. Even if we don't actually try to go through the process, getting it in the Editor's Draft would be a good start.

Keys in a node object that are not keywords MAY expand to an IRI using the active context.

Yes, this is to deal specifically with the need for being able to process any JSON data, and ignore keys that don't expand to IRIs, so it's specifically not an error for keys to not expand to IRIs (or blank nodes, oddly).

timothee-haudebourg commented 3 years ago

Maybe I should say "not formally defined" instead of "blurry". I don't have a strong opinion on whether or not the described API should validate everything. Giving some wiggle room to implementations is not necessarily a bad idea. However I think the tests should strictly follow the syntax specification and not assume that all the implementations will inevitably use the wiggle room they are given. My implementation tries to precisely follow the syntax specification, which means that in this particular test case, the @base IRI will be rejected, whereas the test is written in a way that assumes a loose interpretation of the specification that accepts it.

gkellogg commented 3 years ago

If you could do a PR with some specific test changes, it would help. We can evaluate the potential algorithmic changes once that is settled.