w3c / i18n-discuss

A place to hold discussions on i18n topics, and to put documents that summarise, support or initiate those discussions.

RDF URIs for language tags and / or language subtags #13

Open fsasaki opened 4 years ago

fsasaki commented 4 years ago

Over the years, the RDF community has developed several concrete sets of URIs for identifying languages. Examples:

The URIs in these sets are based on ISO 639, often extended with further URIs, e.g. to identify languages or language variants that are not part of ISO 639, such as under-resourced or historic languages.

There are various groups that provide such URIs or the underlying values, e.g. the two efforts mentioned above, or the Library of Congress.

Some arguments for providing URIs for language (sub) tags, taken from this thread: https://lists.w3.org/Archives/Public/public-ontolex/2020Apr/0006.html

Some open questions:

The above is just a summary of what I read from the thread. Below is an observation.

The RDF community "likes" to provide information as URIs - that is a "selling point" of RDF itself. At the moment, the URI "providers" for language information are scattered across organizations and research groups. Also, there are open questions such as the validation of language tags, which is solved in BCP 47 but not in the URI version(s) of language tags. A lot of this discussion has to do with understanding about

Since the RDF community does not have one accepted provider of URIs, it is hard to get the right stakeholders at the table.

A next step for the BCP 47 community could be to fill a gap: provide URIs for the entries of the language subtag registry. In that way, more understanding of BCP 47 could be brought to the RDF community, and W3C and / or IETF could be recognized as the proper stakeholder for this task.
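To make the proposal concrete, here is a minimal sketch of what minting one URI per registry entry could look like. The base URI `https://example.org/bcp47/subtag/` and the function name `subtag_uri` are purely hypothetical; neither W3C nor IETF/IANA has defined such a namespace.

```python
from urllib.parse import quote

# Hypothetical namespace, used only for illustration. No such URI space exists.
BASE = "https://example.org/bcp47/subtag/"


def subtag_uri(subtag_type: str, subtag: str) -> str:
    """Build an illustrative URI for one registry entry, keyed by type and subtag."""
    # Percent-encode both parts so the result is always a valid path segment.
    return f"{BASE}{quote(subtag_type, safe='')}/{quote(subtag, safe='')}"


print(subtag_uri("language", "pt"))
print(subtag_uri("region", "BR"))
```

The type is part of the path because the subtag string alone is not unique: language tags are case-insensitive, so the language subtag `cr` (Cree) and the region subtag `CR` (Costa Rica) would otherwise collide.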

niklasl commented 2 years ago

This ought to be coordinated with the i18n namespace defined in JSON-LD 1.1.

fsasaki commented 2 years ago

@aphillips, the latest comment from @niklasl is an interesting input to our discussion with John Klensin.

jonquet commented 5 months ago

Can I get a reference / link to the Library of Congress set of URIs for ISO 639?

We are facing an issue with the Lexvo ones:

https://github.com/agroportal/project-management/issues/507

aphillips commented 5 months ago

@jonquet I think you might be confused by the distinction between what 639 does and how language tags are composed.

The Library of Congress is a reference for ISO-639-1. This is not the only part of ISO 639: it's only the 2-letter codes. The RA is the Summer Institute of Linguistics (SIL), who maintain ISO-639-3 (parts -1 and -2 are derived from this; note that I'm simplifying a lot). However...

Language Tags are defined by IETF BCP47. These tags draw on multiple standards, including ISO 639 for languages, ISO 15924 for scripts, and ISO 3166 for countries/regions. These codes (called "subtags") can be composed to form complete language tags such as pt-BR, zh-Hans-CN, etc. Our WG maintains an introductory article here.
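As an illustration of how subtags compose, the sketch below splits a simple tag into its language, script and region parts based only on the shape of each subtag. It is a simplification, not a BCP 47 parser: it assumes tags of the form language[-script][-region], ignores extlang, variant, extension and private-use subtags, and does not check the subtags against the registry. The name `split_simple_tag` is made up for this example.

```python
import re

# Simplified shape only: language[-script][-region].
# Real BCP 47 tags can carry more subtag kinds, which this sketch ignores.
SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"              # 2- or 3-letter language subtag
    r"(?:-(?P<script>[A-Za-z]{4}))?"            # optional 4-letter script subtag
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"   # optional 2-letter or 3-digit region
)


def split_simple_tag(tag: str) -> dict:
    """Return the subtags present in a simple language tag, keyed by kind."""
    match = SIMPLE_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a simple language[-script][-region] tag: {tag!r}")
    # Drop the optional parts that were not present in the tag.
    return {kind: value for kind, value in match.groupdict().items() if value is not None}


print(split_simple_tag("pt-BR"))       # {'language': 'pt', 'region': 'BR'}
print(split_simple_tag("zh-Hans-CN"))  # {'language': 'zh', 'script': 'Hans', 'region': 'CN'}
```

Matching by shape says nothing about validity: a string like `xx-Yyyy-ZZ` passes this pattern, and only a lookup in the registry can tell whether each subtag actually exists.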

There is a registry of valid subtags maintained by IANA here: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry. This registry tracks all of the parts of ISO 639 as well as the other standards that are used in language tags. However, it is one large "record-jar" format file with all of the subtags in it.
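For readers who have not seen the file, here is a rough sketch of reading that format: records are separated by lines containing `%%`, each field is a `Name: value` line, a field name may repeat within a record, and lines starting with whitespace continue the previous value. The function name `parse_record_jar` is invented for this example, and `SAMPLE` is a hand-written excerpt in the style of the registry (the `File-Date` value is a placeholder), not a copy of the real file.

```python
def parse_record_jar(text: str) -> list[dict[str, list[str]]]:
    """Parse record-jar text into records mapping field names to lists of values."""
    records: list[dict[str, list[str]]] = []
    current: dict[str, list[str]] = {}
    last_field = None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store the finished record and start a new one.
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[:1].isspace() and last_field is not None:
            # Folded line: append it to the most recent value.
            current[last_field][-1] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last_field = name.strip()
            # Fields such as Description may occur several times, so keep a list.
            current.setdefault(last_field, []).append(value.strip())
    if current:
        records.append(current)
    return records


# Hand-written sample in the registry's style; the date is a placeholder.
SAMPLE = """\
File-Date: 2000-01-01
%%
Type: language
Subtag: pt
Description: Portuguese
Suppress-Script: Latn
%%
Type: region
Subtag: BR
Description: Brazil
"""

for record in parse_record_jar(SAMPLE):
    print(record)
```

Turning the file into one record per subtag like this is the step any URI scheme for subtags would have to start from.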

This issue, where we're discussing this, reflects a known gap for RDF: there is no URL reference for composed language tags. This WG investigated what would be required to create one in 2020/2021. It would be possible to do this at IETF/IANA, but no one wrote the Internet-Draft to carry the work forward. cf. action result

andjc commented 5 months ago

There are also the T and U extensions to BCP47.

The T extension would at a minimum have to be ticked off. The Library of Congress' increasing use of Bibframe and their current preference for romanised data mean that most of their linked data will require T extensions as part of the language tag.

jonquet commented 4 months ago

Thanks @aphillips for the detailed info. Indeed this confirms the way the 'pt-BR' code is built ... and that there is no URI yet to identify those subtags. It's a pity, as a machine would not automatically know the semantics of the code without a semantic representation of it (for which there would be a URI). On our side (https://github.com/agroportal/project-management/issues/507) we will make the use of URIs not required, to handle the cases where ontologies use subtags.