w3c / microdata

Moved to https://html.spec.whatwg.org/multipage/microdata.html

microdata doesn't identify language #21

Closed chaals closed 6 years ago

chaals commented 7 years ago

See the Internationalisation and Localisation section ...

danbri commented 7 years ago

This feels to me to have something in common with the question (just opened: #32) of whether Microdata's properties are considered to be ordered. My memory of the original discussion was that this was delegated to vocabulary providers, e.g. I could create a Microdata-oriented vocabulary and declare that "author" was an ordered property. That creates the question of whether RDF-oriented extractions lose that ordering. A reasonable response is that the ordering is still there in the full HTML document, and that extractions lose some information which is still usefully kept for various apps (e.g. editors). You might try running the same argument here. Ok, I'm trying but I don't think the analogy holds, since consumers of text-valued properties are going to want to know the language. This is somewhat related to #4 (markup in property values, e.g. ruby annotations). The best defense of the Microdata design on these two points is that the information is still there in the full HTML and that nobody claimed microdata extractions would be lossless.

gkellogg commented 7 years ago

The value of @content uses element.language, as does any other literal which is generated.

<p itemscope itemtype="http://schema.org/Person">
  This test created by
  <span itemprop="name" lang="en">Gregg Kellogg</span>.
</p>

chaals commented 7 years ago

@gkellogg but that value isn't reflected in the model, right?

I'd love microdata to be a first-class solution, but I think this means changing pretty much everyone's parsers in various ways... I'm inclined to just say "yeah, if you want to do things like produce quality data, you might want to look at adding lots of annotations, or using RDFa to extract them from the source"…

gkellogg commented 7 years ago

@chaals I don't think that Microdata concerned itself too much with data model, or the association of attributes, which is why we did so much work in Microdata to RDF. For JSON-LD, we said that the data model was RDF, so in a sense, the same might be said of Microdata, inasmuch as that is what consumers expect it to be.

Developers who are familiar with Linked Data technologies will recognize the data model as the RDF Data Model.

chaals commented 7 years ago

The data model for microdata isn't RDF, although if it happens to be used in specific ways microdata can be converted to RDF... It can carry a small amount of language information, but it's limited. If you want to do serious work with multiple languages, then microdata probably isn't the answer you were looking for.

So I propose to clarify the situation in the i18n considerations if necessary, and close this.

chaals commented 7 years ago

I'm going to close this unless anyone screams - the internationalisation story is, sadly, "use RDFa, and you get what you need".

gkellogg commented 7 years ago

In the Microdata to RDF spec, the ability to associate language with a literal value is exactly the same as in RDFa.

If the element has a non-empty language, the value is a language-tagged string created from the value with language information set from the language of the property element. Otherwise, the value is a simple literal created from the value.

The "language of the property element" comes from the HTML spec referencing @lang and @xml:lang attributes.

In this case, RDFa offers nothing more than Microdata already does.

Regarding the data model of Microdata, it really depends on your definition of "data model". In my mind, and certainly as expressed in Microdata to RDF, Microdata in HTML describes a directed graph, which is at least isomorphic to what RDF does. Perhaps your notion of "data model" is at a higher level than that. For all practical purposes, the way that Microdata is interpreted by consumers is as if it's RDF, even though they may not be aware of this. The fact that the Linter can validate schema.org Microdata using just RDF rules (and a couple of schema-specific ones) seems to validate this. I presume that the SDTT similarly parses to some abstract model that they can run rules against.
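Editorial sketch: the language rule quoted above from Microdata to RDF (a non-empty element language yields a language-tagged string, otherwise a simple literal) can be illustrated roughly as follows. This is not the algorithm from any spec or existing parser. The names `LangAwareExtractor` and `extract_literals` are hypothetical, and it assumes well-nested markup, looks only at the `lang` attribute (not `xml:lang`), and takes property values from text content only (no `content`, `href`, or nested items).

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LangAwareExtractor(HTMLParser):
    """Hypothetical helper: collects (itemprop, text, language) triples."""

    def __init__(self):
        super().__init__()
        self.lang_stack = [""]   # language in effect for each open element
        self.open_props = []     # itemprop elements that are still open
        self.literals = []       # finished (name, text, lang or None) triples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # An element without its own lang inherits the nearest ancestor's.
        lang = attrs.get("lang")
        if lang is None:
            lang = self.lang_stack[-1]
        self.lang_stack.append(lang)
        if "itemprop" in attrs:
            self.open_props.append({"name": attrs["itemprop"], "lang": lang,
                                    "text": [], "depth": len(self.lang_stack)})

    def handle_data(self, data):
        # Text belongs to every property element that currently encloses it.
        for prop in self.open_props:
            prop["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or len(self.lang_stack) == 1:
            return
        if self.open_props and self.open_props[-1]["depth"] == len(self.lang_stack):
            prop = self.open_props.pop()
            # Empty language means a simple literal, represented here as None.
            self.literals.append((prop["name"], "".join(prop["text"]),
                                  prop["lang"] or None))
        self.lang_stack.pop()


def extract_literals(html):
    parser = LangAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.literals
```

Under these assumptions, a `lang` on the property element or on any ancestor ends up on the extracted value, which is the behaviour described for Microdata to RDF in this comment; a consumer that ignores the language would simply drop the third field.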

aphillips commented 7 years ago

I don't think anyone in the i18n community saw this until now. Would you mind holding it open a bit longer? This looks related to our other issues with language and direction metadata open elsewhere.

chaals commented 7 years ago

@aphillips, I'm OK with holding this open a bit longer. I agree that it is related to the issues you have in various specs that don't handle i18n well.

My basic thinking is that microdata isn't ever going to be good enough for real i18n - but with RDFa you can pass an XMLLiteral, including all the sorts of information you would want in a grown-up solution. (@gkellogg, that's the difference I alluded to above.)

Transforming microdata-marked code to RDFa is fairly simple. I'm inclined to recommend authors do that, and provide a section on how, rather than keep trying to tweak microdata until it has the same capability.

chaals commented 7 years ago

From https://github.com/w3c/microdata/issues/61#issue-239027472

The Values section does not make use of the language of the element (as established using lang or xml:lang on an ancestor or self).

This could certainly pertain to the textContent of an element and potentially the value of the content attribute. RDFa uses the current language when creating a literal from content, but it could be argued either way.

Of course, the JSON expression cannot make use of the language, but it is useful to have in an abstract model for the purposes of generating RDF or JSON-LD.

It would be useful if conversion or other tools picked up language information, but a quick test:

<p itemscope itemtype="http://schema.org/Thing" itemid="http://example.org/ID" lang="es">
 <span itemprop="name">prueba</span>
</p>

did not show any sign that Yandex, Google, or SDL pick up the language information. Does anyone have evidence that something does do this for microdata?

gkellogg commented 7 years ago

@chaals The linter does parse it as a language-tagged string, but doesn't render that. For example, my Distiller produces Turtle with the language preserved, and the parser is the same. It's just that, for the purposes of generating a snippet this is not presented.

The Microdata to RDF spec has always had text in there about preserving the language.

Ivan's processor at W3C also properly handles languages (although file-upload and direct input seem to be broken). Try it with https://raw.githubusercontent.com/ruby-rdf/rdf-microdata/develop/etc/doap.html. Can't say what Yandex or Google do with it internally.

cc/ @iherman
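Editorial sketch: for readers unfamiliar with Turtle, "language preserved" in the comment above means the literal carries a language tag such as `@es`. The snippet below is only an illustration of that syntax, not the Distiller's or any other processor's actual output. The helper `turtle_literal` is hypothetical, and the subject and predicate IRIs are assumptions based on the earlier `itemid="http://example.org/ID"` test and the schema.org vocabulary.

```python
def turtle_literal(value, lang=None):
    """Hypothetical helper: format a string as a Turtle literal."""
    # Escape the backslash first, then the characters Turtle requires
    # to be escaped inside a double-quoted string.
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    quoted = '"' + escaped + '"'
    # A non-empty language becomes a language tag; otherwise a plain literal.
    return quoted + "@" + lang if lang else quoted


# Assumed IRIs, for illustration only.
subject = "<http://example.org/ID>"
predicate = "<http://schema.org/name>"
print(subject, predicate, turtle_literal("prueba", "es"), ".")
```

A processor that drops the language would emit the same triple with a plain `"prueba"` literal instead.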

danbri commented 6 years ago

I have investigated Google's Microdata parser behaviour using @gkellogg 's example above. As far as I can determine, Google (correctly afaik) does not take note of the HTML language attribute when extracting literal values of Microdata properties.

As I mentioned above, there is nothing to stop people making more use of the larger HTML than strictly defined here; language is one of several limitations in classic Microdata.

danbri commented 6 years ago

Closing on basis that the spec never said to do this, and we have found no evidence that implementations have gone further. In the spirit of the spec tracking reality (and not trying to mutate it into RDFa, which already exists) let's close this, and let Microdata be what Microdata is.

gkellogg commented 6 years ago

@danbri, note that when using the RDFa transformation algorithm, you will get language-tagged literals if any ancestor contains @lang.

aphillips commented 6 years ago

The I18N WG has reviewed this thread and actioned me with saying that we do not object to @danbri's resolution of this issue.