Representation of Language Tags in the Abstract Syntax

gkellogg commented 1 year ago

Provide sufficient information so that a member of the working group's Use Case Task Force can contact you and enhance your description so that it can be used by the working group to guide their activities. You do not have to fill out all the information requested.

** Contact information

Your name: Gregg Kellogg
How to contact you: @gkellogg

** Brief Description of your use case:

As an aggregator of RDF information, I want to have a predictable number of triples when parsing triples whose literals may differ only in the case of the language tag. I would also like the serialized (possibly canonicalized) form to use the BCP47 formatting recommendations, so that the language tag en-us might canonically be represented as en-US.

[ISO639-1] recommends that language codes be written in lowercase ('mn' Mongolian).
[ISO15924] recommends that script codes use lowercase with the initial letter capitalized ('Cyrl' Cyrillic).
[ISO3166-1] recommends that country codes be capitalized ('MN' Mongolia).
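
To make the three conventions above concrete, here is a minimal sketch of the case formatting as I understand it from BCP47 (RFC 5646, section 2.1.1): everything is lower case, except that two-letter subtags are upper case and four-letter subtags are title case when they are neither the first subtag nor directly after a singleton. The function name `format_language_tag` is hypothetical, and the sketch does not validate that the tag is well-formed.

```python
def format_language_tag(tag: str) -> str:
    """Apply BCP47-style case conventions to a language tag (no validation)."""
    subtags = tag.split("-")
    formatted = []
    for i, subtag in enumerate(subtags):
        # A singleton is a one-character subtag such as "x" or "u".
        after_singleton = i > 0 and len(subtags[i - 1]) == 1
        if i == 0 or after_singleton:
            formatted.append(subtag.lower())
        elif len(subtag) == 2:
            # Region position, e.g. "MN"
            formatted.append(subtag.upper())
        elif len(subtag) == 4:
            # Script position, e.g. "Cyrl"
            formatted.append(subtag.capitalize())
        else:
            formatted.append(subtag.lower())
    return "-".join(formatted)


print(format_language_tag("en-us"))       # en-US
print(format_language_tag("MN-cyrl-mn"))  # mn-Cyrl-MN
```

Because the comparison of language tags is case-insensitive, this only changes how a tag is written, not which language it identifies.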

When aggregating data, input can be combined from different documents that use different conventions for formatting language tags, leading to the potential duplication of data.

*** What you want to be able to do:

When parsing a document that may be composed of several overlapping triples, I would like the resulting graph to have a unique abstract representation for otherwise equal language tags. As it is, the following Turtle can generate either one or two triples in the abstract representation, depending on whether the implementation chooses to normalize language tags, e.g., to lower case.

_:a rdf:value "foo"@en-us, "foo"@en-US .

Implementations that normalize language tags will produce a single triple; those that do not will produce two triples.
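
The difference can be illustrated with a toy model (not a real RDF parser): a graph is a Python set of triples, and a language-tagged literal is a `(lexical form, language tag)` pair. The helper `build_graph` and its `normalize` flag are hypothetical names used only for this sketch.

```python
def build_graph(literals, normalize):
    """Collect one triple per literal; a set drops exact duplicates."""
    graph = set()
    for lexical, lang in literals:
        if normalize:
            # Assumed normalization: lower-case the language tag.
            lang = lang.lower()
        graph.add(("_:a", "rdf:value", (lexical, lang)))
    return graph


# The two objects from the Turtle example above.
literals = [("foo", "en-us"), ("foo", "en-US")]

print(len(build_graph(literals, normalize=False)))  # 2
print(len(build_graph(literals, normalize=True)))   # 1
```

The same input therefore yields graphs of different sizes depending only on an implementation choice, which is the unpredictability described in this use case.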

*** What is the role of RDF-star quoted triples in your use case:

Not related to quoted triples.

*** Why it is hard or impossible to do what you want to do without quoted triples:

Not related to quoted triples.

*** How you want quoted triples to behave in your use case: (For example, do you want the precise syntax of subjects, predicates, and objects in quoted triples to be important?)

From the start, RDF should have mandated a normalized form for language tags in literals, ideally based on BCP47 formatting. It would also be acceptable if all parsers normalized language tags to lower case for the abstract representation. Concrete syntaxes which can perform canonicalization could then require a particular form for language tags without danger of potentially serializing different graphs, depending on how they were parsed on input.

*** An example RDF graph that shows part of your use case:

_:a rdf:value "foo"@en-us, "foo"@en-US .

If changed to require normalizing to lower case, this would be the same as the following:

_:a rdf:value "foo"@en-us .

N-Triples/N-Quads canonicalization could then either represent using that lower case form, or use BCP47 formatting.

pfps commented 1 year ago

This use case for RDF 1.2 places constraints on how language tags are handled. As it doesn't have implications for the RDF-star semantics it can be just tracked here, without creating a wiki page for it.

lisp commented 11 months ago

unique abstract reputation

is "representation" intended?

gkellogg commented 11 months ago

Thanks, fixed. This UC can probably be marked as addressed at this point.

w3c / rdf-ucr

Representation of Language Tags in the Abstract Syntax #22