Language-tagged strings

dbooth-boston commented 5 years ago

They currently have a special status in RDF. "RDF 1.1 Concepts and Abstract Syntax currently contains many caveats to accommodate the idiosyncratic nature of language-tagged strings" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html

"It is a real pain to create these 3 component literals and to query for different languages and datatypes in SPARQL. And worse still, if you want to query for strings that may or may not have language tags on, you need to do some real messing about." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html

"Using a general way to make statements about literals sounds good to me. For geographical data I also see too many statements being squashed into a single literal. It is difficult to process and to store. . . . Why have a standard provision for indicating the language of a text string and not its pronunciation for example?" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0102.html

"language codes do matter, but are pretty inconvenient for multiple reasons:

comparability with untyped/plain strings (of course, and most obviously and counter-intuitive to RDF novices),
complexity (BCP47 defines (a) complex selection rules among ISO 639 language tags, and (b) complex rules for composition, e.g., with script and region codes), and
confusability (having 2-letter codes aside with 3-letter codes for the same language can let people used to work with 3-letter codes choose 2-letter codes, which is an easy error to make, but can result in failure to compare, e.g., "cat"@eng and "cat"@en. Not sure what should happen when you compare "рука"@sr-Cyrl with "рука"@sr. Both are identical, the first is just more explicit in stating that this is Cyrillic.)
coverage (for many applications, ISO639 simply isn't fine-grained or well-defined enough, and its extension is slow, bureaucratic and doubtful)." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0116.html
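
The comparison problem described in this quote can be made concrete with a small sketch. It assumes literals are modelled as plain (lexical form, language tag) tuples, which is a simplification for illustration and not any library's API. `rdf_term_equal` follows RDF 1.1 term equality (tags compared case-insensitively), and `lang_matches` follows the basic filtering scheme of RFC 4647 that SPARQL's `langMatches` relies on.

```python
def rdf_term_equal(a, b):
    """RDF 1.1 term equality for language-tagged strings: the lexical
    forms must be equal and the language tags must be equal (tags are
    compared case-insensitively)."""
    (lex_a, tag_a), (lex_b, tag_b) = a, b
    return lex_a == lex_b and tag_a.lower() == tag_b.lower()


def lang_matches(tag, lang_range):
    """RFC 4647 basic filtering: the range matches a tag if they are equal
    or the range is a prefix of the tag followed by '-'. The range '*'
    matches any non-empty tag."""
    if lang_range == "*":
        return tag != ""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


# Neither pair from the quote is term-equal:
print(rdf_term_equal(("cat", "eng"), ("cat", "en")))          # False
print(rdf_term_equal(("рука", "sr-Cyrl"), ("рука", "sr")))   # False

# Prefix matching rescues the script case but not the 2- vs 3-letter case:
print(lang_matches("sr-Cyrl", "sr"))  # True
print(lang_matches("eng", "en"))      # False
```

Under these assumptions, the "sr-Cyrl" versus "sr" case is recoverable at query time, while "eng" versus "en" is not, because nothing in the tag itself says the two codes name the same language.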

"RDF seems to violate its own doctrine by having separate systems for data types and languages of literals." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0143.html

IDEA: Eliminate the special status of language-tagged strings

"would it be possible to do away with the special status of language-tagged strings? . . . Would it be possible to define a regular lexical space, e.g., containing "hello@en"^^rdf:langString, together with a value-2-lexical and a lexical-2-value mapping? The N3 and SPARQL notation "hello"@en will of course still be available, and will be syntactic sugar for "hello@en"^^rdf:langString." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html
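
A minimal sketch of what such a pair of mappings might look like, assuming the lexical form is split at its last "@" and the tag is checked against the shape Turtle allows for language tags. The names `LangString`, `lexical_to_value` and `value_to_lexical` are hypothetical and chosen only for this illustration.

```python
import re
from typing import NamedTuple


class LangString(NamedTuple):
    """Value-space element: a (text, language tag) pair."""
    text: str
    lang: str


# Shape of a language tag as allowed by the Turtle grammar (without the "@").
_LANGTAG = re.compile(r"[A-Za-z]+(-[A-Za-z0-9]+)*")


def lexical_to_value(lexical: str) -> LangString:
    """Map a lexical form such as 'hello@en' to its value.

    The split is on the LAST '@', because a language tag cannot contain
    '@' while the text itself may (e.g. an e-mail address)."""
    text, sep, lang = lexical.rpartition("@")
    if not sep or not _LANGTAG.fullmatch(lang):
        raise ValueError(f"not in the lexical space: {lexical!r}")
    return LangString(text, lang.lower())


def value_to_lexical(value: LangString) -> str:
    """Map a value back to its canonical lexical form."""
    return f"{value.text}@{value.lang}"


print(lexical_to_value("hello@en"))             # LangString(text='hello', lang='en')
print(lexical_to_value("user@example.com@en"))  # text keeps its own '@'
```

One consequence of this design is that a string with no trailing tag, such as "hello", falls outside the lexical space and would be an ill-typed literal.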

"Surely languages and datatypes should simply be RDF properties of Literals, which are 1 component things? Much easier to explain to developers, and for them to use." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html

"That also fits in nicely with making it easier to represent property graphs." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0101.html

"it would be much more efficient to declare the language used only once, at the class and/or metadata level. Using plain properties to indicate language enables doing that." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html

CONCERN: "The RDF 1.1 WG did spend some time [on language tags] - both on putting the langtag into the lexical space and putting the lang tag into the datatype. Both are not so easy; in the end the rdf:langString at least meant all literals had a datatype." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0097.html

CONCERN: "chat"@en and "chat"@fr are different. "chat" rdf:lang "en" . "chat" rdf:lang "fr" . makes every use of "chat" both @en and @fr. https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0148.html
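
The concern can be reproduced in a toy model. This sketch assumes triples are plain Python tuples and a graph is a set of them; `ex:doc1`, `ex:doc2` and `ex:title` are hypothetical names, and `rdf:lang` is the hypothetical property from the example above. Because a literal is identified only by its value, both documents point at the same term.

```python
# A literal term, identified only by its lexical form and datatype.
CHAT = ("literal", "chat", "xsd:string")

graph = {
    ("ex:doc1", "ex:title", CHAT),  # the author meant the English word
    ("ex:doc2", "ex:title", CHAT),  # the author meant the French word
    (CHAT, "rdf:lang", "en"),
    (CHAT, "rdf:lang", "fr"),
}


def languages_of(graph, subject, predicate):
    """Collect every rdf:lang value attached to the objects of
    (subject, predicate, ?o) triples."""
    result = set()
    for s, p, o in graph:
        if s == subject and p == predicate:
            result |= {lang for s2, p2, lang in graph
                       if s2 == o and p2 == "rdf:lang"}
    return result


# Both documents report both languages, since they share one literal term.
print(languages_of(graph, "ex:doc1", "ex:title") == {"en", "fr"})  # True
print(languages_of(graph, "ex:doc2", "ex:title") == {"en", "fr"})  # True
```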

"I think the only way to avoid this would be if subject literals are taken as a notational short-hand for a blank node that carries the literal as an rdf:value. (And, in a separate step, a problem-specific bnode skolemization routine could be provided to give it a proper URI.)" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0156.html

"I really don't have a problem with every instance of "chat"^^xsd:string being both en and fr if someone has asserted that using rdf:lang. . . . Basically I think language tags are trying to avoid having to say in RDF what should be in the RDF." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0164.html

IDEA: Use W3C OntoLex / Lemon as a basis for language tagging

"[It] is possible already [to declare language only once, at the class and/or metadata level] (using the pointers to ISO639 URIs in my earlier mail), and it is recommended practice to do so in OntoLex/lemon . . . . OntoLex is . . . a W3C community group report, but it would be the most suitable basis for future standardization efforts in this direction." https://www.w3.org/2016/05/ontolex/#lexicon-and-lexicon-metadata https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html

IDEA: Use URIs to identify language

"A much more convenient solution would be to identify the language by means of a URI. This can be an ISO 639 category (see under http://id.loc.gov/vocabulary/iso639-2.html and http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf. http://www.lexvo.org/), or provided by another authority (e.g., https://glottolog.org/). Other properties (e.g., xsd datatypes) could also be stated about a literal. Two strings could be considered identical if the values are the same and the properties of one are a proper subset of the properties of the other." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0116.html
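
The subset-based comparison in the last sentence of this quote might be sketched as follows. The representation (a string paired with a frozenset of property/value pairs) and the property names "language" and "script" are assumptions made for illustration; the proposal itself would use URIs. The sketch also uses subset-or-equal rather than a strictly proper subset, on the assumption that two strings with the same properties should compare as identical too.

```python
from typing import FrozenSet, Tuple

Annotated = Tuple[str, FrozenSet[Tuple[str, str]]]


def same_string(a: Annotated, b: Annotated) -> bool:
    """True if the values match and one property set is contained
    in the other (subset-or-equal in either direction)."""
    value_a, props_a = a
    value_b, props_b = b
    if value_a != value_b:
        return False
    return props_a <= props_b or props_b <= props_a


sr_cyrl = ("рука", frozenset({("language", "sr"), ("script", "Cyrl")}))
sr = ("рука", frozenset({("language", "sr")}))
ru = ("рука", frozenset({("language", "ru")}))
plain = ("рука", frozenset())

print(same_string(sr_cyrl, sr))  # True: {sr} is contained in {sr, Cyrl}
print(same_string(sr, ru))       # False: neither set contains the other
print(same_string(plain, sr), same_string(plain, ru))  # True True
```

Note that, as written, this relation is not transitive: the unannotated string matches both the Serbian and the Russian one, which do not match each other. Any real design along these lines would need to decide whether that is acceptable.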

"a downward-compatible notation is possible:

take @ as a short-hand for ^^xsd:string, with language identifiers following
if the language identifier is not a URI, it must be BCP47
BCP47 codes can be decomposed in the background into their sub-properties
permit multiple language URIs/BCP47 codes (if you want to provide both a BCP47 code [indicating region and script] and a URI [unambiguously identifying the language])
let plain literals be untyped" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0119.html
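
The "decomposed in the background" step could look roughly like the sketch below. It handles only the simple language[-script][-region] shape and deliberately ignores the rest of BCP47 (extended language subtags, variants, extensions, private use, grandfathered tags). The function name `decompose` and the keys of the returned dictionary are hypothetical.

```python
import re

# Simplified shape: 2-3 letter language, optional 4-letter script,
# optional 2-letter or 3-digit region. Not a full BCP47 parser.
_SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def decompose(tag: str) -> dict:
    """Split a simple BCP47 tag into its sub-properties, using the
    conventional casing (language lower, script Title, region UPPER)."""
    match = _SIMPLE_TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    parts = {k: v for k, v in match.groupdict().items() if v is not None}
    parts["language"] = parts["language"].lower()
    if "script" in parts:
        parts["script"] = parts["script"].title()
    if "region" in parts:
        parts["region"] = parts["region"].upper()
    return parts


print(decompose("sr-Cyrl"))     # {'language': 'sr', 'script': 'Cyrl'}
print(decompose("en-gb"))       # {'language': 'en', 'region': 'GB'}
print(decompose("zh-Hant-TW"))  # {'language': 'zh', 'script': 'Hant', 'region': 'TW'}
```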

CONCERN: "No. All literals MUST have a type, so that queries can have a unique response when they ask for the type or specify the type. The RDF 1.1 WG spent a lot of time and effort on this. Allowing untyped plain literals in RDF 2004 was a bug. Please do not screw this up again. Plain literals are syntactically legal (to preserve backward compatibility) but they now have type xsd:string." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0149.html

"But this only means that "рука" entails [a xsd:string] . . . . As far as comparisons between strings are concerned, this makes no difference to the example, as the subset relation between the (implicit) properties of "рука"@sr and "рука" still holds" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0152.html

kasei commented 5 years ago

I'm concerned that a lot of the proposed solutions would bring with them just as many drawbacks. If a big part of the issue here is the challenge of querying language data, perhaps we should be looking first at the ergonomics of using SPARQL on this data and how that could be improved.

iherman commented 5 years ago

I think that looking at language tags in isolation may not be the right approach. To be really international in nature, there are a number of things one may want to "say" about the text, and the natural language is only one of those. For example:

base direction of the text (see, eg, https://w3c.github.io/bp-i18n-specdev/#text_direction)
pronunciation hints
issues around translations (see also https://www.w3.org/TR/its20/)

etc. The experience I have with other specifications that rely on RDF literals (through serializations like JSON-LD) is that these issues come and bite you all the time.

Yes, this all may converge towards the separate issue on literal as subject (#21), and may force us to fundamentally re-think how RDF treats literals.

draggett commented 5 years ago

A given word may be applicable to multiple languages, especially for loan words where one language borrows from another. Together with @iherman's points about other related kinds of properties, this suggests that we need a means to model the combination of a string literal with a given set of properties.

I suspect this also fits in with the desire to be able to model property graphs where nodes and links can be associated with sets of property-value pairs, where the values can themselves be sets of property-values and so forth recursively.

Once you have that, it is straightforward to model a value as being a word in a given language with a given pronunciation, writing direction and so forth. We would still need a small set of core data types, e.g. string, number, boolean, ID, link, but others could be layered on top with properties as annotations. A node that is used for a natural language word or phrase could have one property for the string value, another for the language, and another for the pronunciation.

I will expand on this further in another issue.

HughGlaser commented 5 years ago

Hi @iherman, If I understand you correctly, I think I agree. The more knowledge we put into the representation of literals directly, rather than as properties, the harder it is to process it in RDF/SPARQL. A couple of comments on your bullets:

base direction - I would find it strange to add that to a Literal. It seems to me that the direction is an aspect of the script. So the IANA guidelines say that the scripts are suppressed for Latn, Hebr etc., but they do have scripts, which I assume have default direction. Adding anything so that we can have right-to-left Latn or left-to-right Arab seems a bit over the top.
Pronunciation hints - I really find that indigestible :-). It is bad enough (in my view!) deciding that a Literal (collection of symbols) such as "chat" has a specific language associated with it at all. It is much worse to say not only is it associated with a particular language, but that it is even associated with pronouncing that symbol collection in a particular way. What about "read"@en-gb (reed & red)? Or "potato"@en? This sort of stuff really should be attached to a URI that is the word, as in Wordnet or whatever.

iherman commented 5 years ago

Hi @HughGlaser

On the base direction: what I have learnt is that bidirectional text is sometimes insanely complicated, and that is one of the reasons why HTML keeps the dir attribute... It is worth looking at the tutorials prepared by the I18N activity at W3C (there are links from that page above).
For the pronunciation: think of a text written in Japanese Kanji and then a version of the text in Hiragana to make it easier to see the pronunciation. Or the usage of bopomofo in Traditional Chinese or ruby in Japanese... These may all be necessary for a piece of text, and currently it is a mess to add them in a consistent way when RDF/JSON-LD is used for metadata.

I do not want to present myself as an i18n expert, I am very very far from it. I just think that RDF literals may have serious i18n issues, and if we want to review literals in general, we will have to seriously look at that, too...

HughGlaser commented 5 years ago

Yeah @iherman no problem with any of that. The language issues are big, and should be worried about. My worry is if people want to push things into the language tag world, rather than a more comprehensive/cleaner solution where the knowledge about the symbols is represented in RDF. Possibly just to avoid a triple or two that actually says the right thing. (I've never tried to do a lot in JSON-LD, so can't comment on those issues - I tend to prefer N3 or whatever.)

dbooth-boston commented 5 years ago

My personal hope is that we:

define a standard form of n-ary relation to represent language-tagged strings as a molecule of triples, consistent with features that i18n experts have found important;
offer a convenient syntactic sugar for writing them in a higher-level RDF language; and
define a mapping to/from the existing RDF 1.1 language-tagging mechanism, to retain backward compatibility.
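
As a rough illustration of the first and third points, the sketch below expands a language-tagged string into a small "molecule" of triples around a fresh blank node and maps it back again. Everything here is an assumption made for illustration: triples are plain tuples, `rdf:value` carries the text, and `ex:language` is a hypothetical property, not part of any existing vocabulary.

```python
import itertools

_ids = itertools.count()  # source of fresh blank-node labels


def to_molecule(text, lang):
    """Expand the RDF 1.1 literal "text"@lang into triples about a
    fresh blank node. Returns (node, set_of_triples)."""
    node = f"_:b{next(_ids)}"
    return node, {
        (node, "rdf:value", text),
        (node, "ex:language", lang.lower()),  # hypothetical property
    }


def from_molecule(node, triples):
    """Collapse the molecule around `node` back into (text, lang),
    i.e. the existing RDF 1.1 language-tagged string."""
    props = {p: o for s, p, o in triples if s == node}
    return props["rdf:value"], props["ex:language"]


en_node, en_triples = to_molecule("chat", "en")
fr_node, fr_triples = to_molecule("chat", "fr")
graph = en_triples | fr_triples

print(from_molecule(en_node, graph))  # ('chat', 'en')
print(from_molecule(fr_node, graph))  # ('chat', 'fr')
```

Because each occurrence gets its own node, the English and French "chat" stay distinct even when both sit in the same graph, and further properties (base direction, pronunciation) could be attached to the node without changing the literal itself.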

w3c / EasierRDF