n-triples documents that cannot be converted to RDF graphs

w3c / rdf-n-triples

https://w3c.github.io/rdf-n-triples/

n-triples documents that cannot be converted to RDF graphs #33

Open pfps opened 1 year ago

pfps commented 1 year ago

The RDF N-Triples specification defines language tags as

LANGTAG ::= "@" [a-zA-Z]+ ("-" [a-zA-Z0-9]+)*

But RDF Concepts specifies language tags as follows:

> if and only if the [datatype IRI](https://w3c.github.io/rdf-concepts/spec/#dfn-datatype-iri) is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, a non-empty language tag as defined by [[BCP47](https://w3c.github.io/rdf-concepts/spec/#bib-bcp47)]. The language tag MUST be well-formed according to [section 2.2.9](https://www.rfc-editor.org/rfc/rfc5646#section-2.2.9) of [[BCP47](https://w3c.github.io/rdf-concepts/spec/#bib-bcp47)].

The pointer into BCP47 ends up at a grammar that is considerably more restrictive.
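To make the mismatch concrete, here is a minimal Python sketch. It transcribes the LANGTAG production quoted above into a regular expression and applies a single necessary condition from the BCP47 grammar (no subtag is longer than eight characters). The names `NT_LANGTAG`, `matches_nt_grammar`, and `has_overlong_subtag` are illustrative and do not come from either specification; the length test is only one of several BCP47 constraints, not a full well-formedness check.

```python
import re

# N-Triples terminal: LANGTAG ::= "@" [a-zA-Z]+ ("-" [a-zA-Z0-9]+)*
NT_LANGTAG = re.compile(r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")


def matches_nt_grammar(token: str) -> bool:
    """True if the whole token is accepted by the N-Triples LANGTAG terminal."""
    return NT_LANGTAG.fullmatch(token) is not None


def has_overlong_subtag(token: str) -> bool:
    """True if any subtag exceeds 8 characters.

    Every subtag in the BCP47 (RFC 5646) ABNF is at most 8 characters long,
    so an overlong subtag is sufficient to show the tag is not well-formed.
    """
    return any(len(subtag) > 8 for subtag in token[1:].split("-"))


if __name__ == "__main__":
    for token in ["@en-US", "@abcdefghijkl"]:
        print(token, matches_nt_grammar(token), has_overlong_subtag(token))
```

Under these assumptions, `@abcdefghijkl` is a legal LANGTAG token for an N-Triples parser, yet it cannot be a well-formed BCP47 tag.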

What happens if the language tag in an n-triples document does not conform to this grammar?

This problem might affect other surface syntaxes for RDF.

gkellogg commented 1 year ago

RDF EBNF grammars have always used a simple terminal production for matching LANGTAG; this does not change the requirement in RDF Concepts that language tags be valid according to BCP47. There are other cases where the EBNF terminal productions are permissive (e.g., IRIREF) and can accept tokens that would not be valid when interpreted according to the requirements of RDF Concepts. IIRC, the SPARQL grammar has the same provisions.

There are tests that look for bad IRIs and languages, but they could be better. In particular, a test that had a language tag that was accepted by the grammar but was invalid according to BCP47 would be good to have. There are IRI (URI) tests that pass through the grammar, but are expected to be detected as illegal.

afs commented 1 year ago

BCP47 is not immutable. It tracks the latest RFC.

> What happens if the language tag in an n-triples document does not conform to this grammar?

Like any syntax deviation - it is out of scope.

A similar situation occurs with the IRIREF rule, or with illegal Unicode sequences in strings. This is a known design choice. These external standards are not fixed and do change, with restrictions as well as additions.

Replicating the full grammars would make the specs unwieldy even when possible. An implementation is expected to apply secondary checks to conform.

There are practical difficulties with that. Programming language libraries do not always track the latest specs, preferring backwards compatibility.

URI is a moving target: cf. RFC 6874, or the URN changes in RFC 8141, which invalidate syntax that was legal under RFC 2141.

For language tags, there are also the special cases allowed by RFC 3066 and carried over into RFC 4646 and RFC 5646.
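As an illustration of the kind of secondary check described in this comment, the following Python sketch first applies the permissive N-Triples LANGTAG terminal and then a transcription of the RFC 5646 section 2.1 ABNF. It covers well-formedness only, not validity against the IANA registry. The function names and the three result labels are hypothetical, the regular expression is a hand transcription that may lag behind future revisions of BCP47, and the irregular grandfathered tags (the special cases mentioned above) are listed explicitly as they appear in RFC 5646.

```python
import re

# Permissive terminal from the N-Triples grammar.
NT_LANGTAG = re.compile(r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")

# Hand transcription of the "langtag" production in RFC 5646, section 2.1.
BCP47_LANGTAG = re.compile(
    r"""
    (?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}   # language, optional extlang subtags
      |[A-Za-z]{4}                          # reserved for future use
      |[A-Za-z]{5,8})                       # registered language subtag
    (?:-[A-Za-z]{4})?                       # script
    (?:-(?:[A-Za-z]{2}|[0-9]{3}))?          # region
    (?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*  # variants
    (?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*   # extensions
    (?:-[xX](?:-[A-Za-z0-9]{1,8})+)?        # private use suffix
    """,
    re.VERBOSE,
)

# A tag consisting only of a private-use sequence.
BCP47_PRIVATEUSE = re.compile(r"[xX](?:-[A-Za-z0-9]{1,8})+")

# Irregular grandfathered tags: well-formed by enumeration, not by pattern.
IRREGULAR = {
    "en-gb-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao", "i-tay",
    "i-tsu", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
}


def is_well_formed(tag: str) -> bool:
    """Well-formedness check for a tag given without the leading '@'."""
    if tag.lower() in IRREGULAR:
        return True
    return (BCP47_LANGTAG.fullmatch(tag) is not None
            or BCP47_PRIVATEUSE.fullmatch(tag) is not None)


def check_language_token(token: str) -> str:
    """Classify a raw '@...' token as read from an N-Triples document."""
    if NT_LANGTAG.fullmatch(token) is None:
        return "syntax-error"        # rejected by the N-Triples grammar itself
    if not is_well_formed(token[1:]):
        return "not-well-formed"     # passes the grammar, fails the BCP47 ABNF
    return "ok"


if __name__ == "__main__":
    for token in ["@en-US", "@i-klingon", "@en-a", "@abcdefghijkl", "@en_US"]:
        print(token, check_language_token(token))
```

In this sketch `@en-a` and `@abcdefghijkl` land in the middle category: the parser's grammar accepts them, and only the secondary check flags them. What an implementation does with such tokens (reject, warn, or pass through) is left open here, as it is in the discussion.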

afs commented 1 year ago

> IIRC, the SPARQL grammar has the same provisions.

https://www.w3.org/TR/sparql10-query/#rLANGTAG