w3c / EasierRDF

Making RDF easy enough for most developers
Language-tagged strings #22

Open dbooth-boston opened 5 years ago

dbooth-boston commented 5 years ago

They currently have a special status in RDF. "RDF 1.1 Concepts and Abstract Syntax currently contains many caveats to accommodate the idiosyncratic nature of language-tagged strings" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html

"It is a real pain to create these 3 component literals and to query for different languages and datatypes in SPARQL. And worse still, if you want to query for strings that may or may not have language tags on, you need to do some real messing about." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html

"Using a general way to make statements about literals sounds good to me. For geographical data I also see too many statements being squashed into a single literal. It is difficult to process and to store. . . . Why have a standard provision for indicating the language of a text string and not its pronunciation for example?" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0102.html

"language codes do matter, but are pretty inconvenient for multiple reasons:

"RDF seems to violate its own doctrine by having separate systems for data types and languages of literals." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0143.html

IDEA: Eliminate the special status of language-tagged strings

"would it be possible to do away with the special status of language-tagged strings? . . . Would it be possible to define a regular lexical space, e.g., containing "hello@en"^^rdf:langString, together with a value-2-lexical and a lexical-2-value mapping? The N3 and SPARQL notation "hello"@en will of course still be available, and will be syntactic sugar for "hello@en"^^rdf:langString." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html
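A minimal sketch of what such a pair of mappings might look like, assuming the lexical form is simply the text, an "@", and the language tag. The names `LangString`, `lexical_to_value`, `value_to_lexical` and `expand_sugar` are illustrative and not taken from any specification. Splitting on the last "@" relies on the fact that language tags never contain "@", while the text itself may.

```python
from typing import NamedTuple


class LangString(NamedTuple):
    """A value in the proposed rdf:langString value space (hypothetical type)."""
    text: str
    lang: str  # stored in lowercase


def lexical_to_value(lexical: str) -> LangString:
    """Map a lexical form such as 'hello@en' to a (text, lang) pair."""
    # Split on the LAST '@': the tag cannot contain '@', but the text can.
    text, sep, lang = lexical.rpartition("@")
    if not sep or not lang:
        raise ValueError(f"no language tag in {lexical!r}")
    return LangString(text, lang.lower())


def value_to_lexical(value: LangString) -> str:
    """Map a (text, lang) pair back to a lexical form."""
    return f"{value.text}@{value.lang}"


def expand_sugar(text: str, lang: str) -> str:
    """Rewrite the sugar "text"@lang as "text@lang"^^rdf:langString.

    Assumption: escaping of quotes inside the text is ignored here.
    """
    return f'"{text}@{lang}"^^rdf:langString'
```

For example, `lexical_to_value("user@example.org@EN-gb")` keeps the first "@" as part of the text and yields the tag `en-gb`.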

"Surely languages and datatypes should simply be RDF properties of Literals, which are 1 component things? Much easier to explain to developers, and for them to use." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html

"That also fits in nicely with making it easier to represent property graphs." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0101.html

"it would be much more efficient to declare the language used only once, at the class and/or metadata level. Using plain properties to indicate language enables doing that." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html

CONCERN: "The RDF 1.1 WG did spend some time [on language tags] - both on putting the langtag into the lexical space and putting the lang tag into the datatype. Both are not so easy; in the end the rdf:langString at least meant all literals had a datatype." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0097.html

CONCERN: "chat"@en and "chat"@fr are different. "chat" rdf:lang "en" . "chat" rdf:lang "fr" . makes every use of "chat" both @en and @fr. https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0148.html

"I think the only way to avoid this would be if subject literals are taken as a notational short-hand for a blank node that carries the literal as an rdf:value. (And, in a separate step, a problem-specific bnode skolemization routine could be provided to give it a proper URI.)" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0156.html
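The difference between the three modellings discussed here can be sketched with triples held in a plain Python set. This is only an illustration: the tuple encoding of terms, the blank node ids and the helper `langs_of` are assumptions, and `rdf:lang` is the property name used in the quoted example, not an existing standard property.

```python
RDF_LANG = "rdf:lang"    # property name taken from the quoted example
RDF_VALUE = "rdf:value"

# (a) Language-tagged literals: the tag is part of the term,
#     so "chat"@en and "chat"@fr are two different literals.
chat_en = ("lit", "chat", "en")
chat_fr = ("lit", "chat", "fr")
assert chat_en != chat_fr

# (b) Language as a property of a bare literal: there is only one
#     "chat", so both statements attach to the same subject.
bare_chat = ("lit", "chat", None)
graph_b = {
    (bare_chat, RDF_LANG, "en"),
    (bare_chat, RDF_LANG, "fr"),
}

# (c) Subject literal as shorthand for a blank node carrying rdf:value:
#     each occurrence gets its own node, so the languages stay apart.
b1, b2 = ("bnode", "b1"), ("bnode", "b2")
graph_c = {
    (b1, RDF_VALUE, "chat"), (b1, RDF_LANG, "en"),
    (b2, RDF_VALUE, "chat"), (b2, RDF_LANG, "fr"),
}


def langs_of(graph, subject):
    """Return every language asserted for the given subject."""
    return {o for s, p, o in graph if s == subject and p == RDF_LANG}
```

In (b), `langs_of(graph_b, bare_chat)` returns both languages, which is the concern raised above; in (c), `langs_of(graph_c, b1)` returns only `en`.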

"I really don't have a problem with every instance of "chat"^^xsd:string being both en and fr if someone has asserted that using rdf:lang. . . . Basically I think language tags are trying to avoid having to say in RDF what should be in the RDF." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0164.html

IDEA: Use W3C OntoLex / Lemon as a basis for language tagging

"[It] is possible already [to declare language only once, at the class and/or metadata level] (using the pointers to ISO639 URIs in my earlier mail), and it is recommended practice to do so in OntoLex/lemon . . . . OntoLex is . . . a W3C community group report, but it would be the most suitable basis for future standardization efforts in this direction." https://www.w3.org/2016/05/ontolex/#lexicon-and-lexicon-metadata https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html

IDEA: Use URIs to identify language

"A much more convenient solution would be to identify the language by means of a URI. This can be an ISO 639 category (see under http://id.loc.gov/vocabulary/iso639-2.html and http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf. http://www.lexvo.org/), or provided by another authority (e.g., https://glottolog.org/). Other properties (e.g., xsd datatypes) could also be stated about a literal. Two strings could be considered identical if the values are the same and the properties of one are a proper subset of the properties of the other." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0116.html
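The comparison rule in the last sentence of that quote might be sketched as follows. This is one possible reading, not a definition from the thread: `AnnotatedString`, `considered_identical`, the property name `ex:language` and the `iso639-1:` values are placeholders, and equal property sets are also treated as identical, which the quote (speaking of a proper subset) leaves implicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedString:
    """A string value plus a set of (property, value) pairs stated about it."""
    value: str
    properties: frozenset = frozenset()


def considered_identical(a: AnnotatedString, b: AnnotatedString) -> bool:
    """Same value, and one property set is contained in the other."""
    if a.value != b.value:
        return False
    return a.properties <= b.properties or b.properties <= a.properties


# Placeholder language identifiers; a real graph would use full URIs.
ruka_sr = AnnotatedString("рука", frozenset({("ex:language", "iso639-1:sr")}))
ruka_ru = AnnotatedString("рука", frozenset({("ex:language", "iso639-1:ru")}))
ruka_plain = AnnotatedString("рука")
```

Under this reading the untagged string matches either tagged one, while the two tagged strings do not match each other, because neither property set contains the other.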

"a downward-compatible notation is possible:

CONCERN: "No. All literals MUST have a type, so that queries can have a unique response when they ask for the type or specify the type. The RDF 1.1 WG spent a lot of time and effort on this. Allowing untyped plain literals in RDF 2004 was a bug. Please do not screw this up again. Plain literals are syntactically legal (to preserve backward compatibility) but they now have type xsd:string." https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0149.html

"But this only means that "рука" entails [a xsd:string] . . . . As far as comparisons between strings are concerned, this makes no difference to the example, as the subset relation between the (implicit) properties of "рука"@sr and "рука" still holds" https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0152.html

kasei commented 5 years ago

I'm concerned that a lot of the proposed solutions would bring with them just as many drawbacks. If a big part of the issue here is the challenge of querying language data, perhaps we should be looking first at the ergonomics of using SPARQL on this data and how that could be improved.

iherman commented 5 years ago

I think that looking at language tags in isolation may not be the right approach. To be really international in nature, there are a number of things one may want to "say" about the text, and the natural language is only one of those. For example:

etc. The experience I have with other specifications that rely on RDF literals (through serializations like JSON-LD) is that these issues come and bite you all the time.

Yes, this all may converge towards the separate issue on literal as subject (#21), and may force us to fundamentally re-think how RDF treats literals.

draggett commented 5 years ago

A given word may be applicable to multiple languages, especially for loan words where one language borrows from another. Together with @iherman's points about other related kinds of properties, this suggests that we need a means to model the combination of a string literal with a given set of properties.

I suspect this also fits in with the desire to be able to model property graphs where nodes and links can be associated with sets of property-value pairs, where the values can themselves be sets of property-values and so forth recursively.

Once you have that, it is straightforward to model a value as being a word in a given language with a given pronunciation, writing direction and so forth. We would still need a small set of core data types, e.g. string, number, boolean, ID, link, but others could be layered on top with properties as annotations. A node that is used for a natural language word or phrase could have one property for the string value, another for the language, and another for the pronunciation.
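As a rough sketch of that idea, assuming nested Python dicts stand in for sets of property-value pairs: the property names (`value`, `language`, `pronunciation`, `notation`, `direction`) and the helper `get_path` are made up for illustration and do not come from any vocabulary.

```python
# A node for a natural language word: one property for the string value,
# others for language, pronunciation and writing direction.
word = {
    "value": "chat",
    "language": "fr",
    # A property value can itself be a set of property-value pairs.
    "pronunciation": {"value": "ʃa", "notation": "IPA"},
    "direction": "ltr",
}


def get_path(node: dict, *path: str):
    """Follow a chain of property names through nested property sets."""
    current = node
    for key in path:
        current = current[key]
    return current
```

With this shape, `get_path(word, "pronunciation", "value")` walks two levels down to the pronunciation string, and further annotations can be added without changing the core string type.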

I will expand on this further in another issue.

HughGlaser commented 5 years ago

Hi @iherman, If I understand you correctly, I think I agree. The more knowledge we put into the representation of literals directly, rather than as properties, the harder it is to process it in RDF/SPARQL. A couple of comments on your bullets:

iherman commented 5 years ago

Hi @HughGlaser

I do not want to present myself as an i18n expert, I am very very far from it. I just think that RDF literals may have serious i18n issues, and if we want to review literals in general, we will have to seriously look at that, too...

HughGlaser commented 5 years ago

Yeah @iherman no problem with any of that. The language issues are big, and should be worried about. My worry is if people want to push things into the language tag world, rather than a more comprehensive/cleaner solution where the knowledge about the symbols is represented in RDF. Possibly just to avoid a triple or two that actually says the right thing. (I've never tried to do a lot in JSON-LD, so can't comment on those issues - I tend to prefer N3 or whatever.)

dbooth-boston commented 5 years ago

My personal hope is that we: