Relationship between dc:language and processing language for multilingual resources

gsergiu commented 8 years ago

For multilingual resources (i.e. dc:language has multiple values), is it mandatory that the value of processingLanguage be one of the values in dc:language?

If this supposition is true, please add a non-normative note that the "processingLanguage should have one of the values available in dc:language".

gsergiu commented 8 years ago

PS: this is also a result of splitting #335 into multiple smaller issues.

azaroth42 commented 8 years ago

Good question. I can't think of a case in which the processing language would not be one of the languages in language, but perhaps @r12a @fsasaki can?

If there isn't, happy to add it as a clarification.

fsasaki commented 8 years ago

> Good question. I can't think of a case in which the processing language would not be one of the languages in language

Me neither.

gsergiu commented 8 years ago

Well ... I think that there are; we just haven't analysed these use cases deeply enough. As indicated in #341, there are different types of processors, and this property was introduced precisely because the processors need it, not because the annotation, or the representation of the annotation, needs it.

So ... there are classes of processors that need the exact language of the text, mainly the ones that deal with rendering. And there are classes that don't need the exact language, like indexers or entity recognition tools. Most indexers and ER tools are trained on English texts, so their processing language should be en, but they can be successfully applied to resources in other languages (e.g. German, French, Italian, Romanian).

Therefore it is very important to think about which types/classes of text processors could be used, and what the implications are of using processingLanguage with cardinality 1, given that the correct representation would be the one proposed in #341.

r12a commented 8 years ago

@gsergiu if i understand your intent correctly, i think the answer to your last comment may be that the processingLanguage value is intended, as you say, to simply indicate the actual language of the text that it is associated with. If you want to indicate what processes were run on the string text, that's a different thing altogether.

Getting back to the original question posed in this issue, i assume that when you refer to dc:language you mean the Language property, in contrast to the processingLanguage property. The only situation i can think of where the processingLanguage would not be the same as or one of the languages in the Language property would be where one value was more carefully specified than the other, which seems an odd case anyway.

I don't suppose there's any problem with adding a non-normative note of the kind you mention. It may produce better results in a few circumstances, but wouldn't do any harm afaict.

gsergiu commented 8 years ago

@r12a well the whole processingLanguage was introduced to be used by text processors as I remember. However that topic is addressed in #341

With regard to processing language, I don't expect that the processingLanguage will be more precise than the (dc:)language property. In the language-processingLanguage relationship, the first one is the "master", and the second one was added only to reduce the cardinality. Therefore the natural consequence would be that processingLanguage is one of the dc:language values. This is if you see the problem from this direction.

If you see the problem from the other direction, that the processingLanguage is there to be used by text processors, combined with a "best-effort" strategy for server implementations, one could end up indexing German text with an English indexer. Which language should be set in processingLanguage in this case?

It is still not the proper answer as I strongly support the #341 solution, but we could formulate a note like this.

NOTE: If language has multiple values and the processingLanguage is set in the annotation, the value of the processingLanguage should be one of the values available in the language property. However, in particular cases, it is not prohibited to use a different value for the processingLanguage.
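
A minimal sketch of how a client might check the recommendation in this note, assuming a body or target is available as a plain Python dict with the `language` and `processingLanguage` keys. The function name `processing_language_is_listed` is hypothetical, and the comparison is a simple case-insensitive string match, not full language-tag matching.

```python
def processing_language_is_listed(resource: dict) -> bool:
    """Return True if the note's recommendation holds for this resource."""
    processing = resource.get("processingLanguage")
    if processing is None:
        return True  # processingLanguage not set: nothing to check

    languages = resource.get("language", [])
    if isinstance(languages, str):
        languages = [languages]  # a single value may be serialized as a string
    if len(languages) < 2:
        return True  # the note only covers the multi-valued case

    # Language tags are case-insensitive, so compare in lower case.
    return processing.lower() in {tag.lower() for tag in languages}
```

Since the note is only a "should", a False result would be a hint for a warning rather than a reason to reject the annotation.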

r12a commented 8 years ago

As you say, the fact that one expects that processingLanguage will only be used if Language doesn't specify a single language value significantly reduces the likelihood that errors will creep in. That's why i said that it would be an 'odd case'.

> If you see the problem from the other direction, that the processingLanguage is there to be used by text processors, combined with a "best-effort" strategy for server implementations, one could end up indexing German text with an English indexer. Which language should be set in processingLanguage in this case?

But we don't expect processingLanguage to be combined with a best-effort strategy for server implementations. As already said, processingLanguage just indicates the actual language of the natural language text in the string. What a processor does is find out from the Language/processingLanguage property the language of the text it has before it, and then decide what processing it is able to/ought to apply to it. The processingLanguage value doesn't tell the processor what to do, it just says what the language of the text is.
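
To illustrate that reading, here is a rough sketch of a processor that only looks up the language of the text and then decides for itself what to do. It assumes the resource is a plain Python dict; the names `text_language`, `choose_processing` and the `SUPPORTED_HYPHENATION` set are hypothetical and stand in for whatever a real processor supports.

```python
from typing import Optional

# Hypothetical: languages this particular processor can hyphenate.
SUPPORTED_HYPHENATION = {"en", "de"}


def text_language(resource: dict) -> Optional[str]:
    """Return the single language of the text, or None if it is ambiguous."""
    processing = resource.get("processingLanguage")
    if processing is not None:
        return processing
    languages = resource.get("language", [])
    if isinstance(languages, str):
        languages = [languages]
    # Fall back to language only when it names exactly one value.
    return languages[0] if len(languages) == 1 else None


def choose_processing(resource: dict) -> str:
    """The processor, not the annotation, decides what to apply."""
    language = text_language(resource)
    if language is None:
        return "language-neutral"
    primary = language.lower().split("-")[0]  # primary language subtag
    return "hyphenate" if primary in SUPPORTED_HYPHENATION else "language-neutral"
```

Here the property value only describes the text; whether an English-trained indexer is also run over German text remains the processor's own decision.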

gsergiu commented 8 years ago

@r12a with all due respect, the current version of the draft says something different (see the text below). Even if in particular cases it is the same or a similar thing, in general it is not! (Additionally, the natural language of a multilingual text is the set of languages used (i.e. language), in my opinion.)

> The language to use for text processing algorithms such as line breaking, hyphenation, which font to use, and similar.

Given the definition above, the question is valid: what should a server implementation do if the processingLanguage is not the language of the processor, but the processor could technically still process the text? Is the processing forbidden?

r12a commented 8 years ago

I think you are getting stuck on what is an editorial issue in the spec.

See https://github.com/w3c/web-annotation/issues/345

gsergiu commented 8 years ago

@r12a I was not aware of #345. At least there is a ticket for it ... as the previous tickets were rejected ...

azaroth42 commented 7 years ago

I think /this/ issue is "should there be a recommendation that processingLanguage is one of the entries in language?". I propose that there's no reason to make that a requirement, which allows implementations to generate the values in separate subsystems without reconciling them. Maybe the processingLanguage is "en-GB" (e.g. for spelling "colour") and the language is just "en" to make it easier to discover.
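
As a sketch of why an exact-membership rule would reject that scenario, the snippet below compares a more specific tag against a broader one using prefix matching in the style of RFC 4647 basic filtering. The function name `matches_basic_range` is hypothetical, and nothing in the annotation model requires this check.

```python
def matches_basic_range(tag: str, language_range: str) -> bool:
    """Basic filtering: the range matches the tag exactly or as a subtag prefix."""
    tag = tag.lower()
    language_range = language_range.lower()
    if language_range == "*":
        return True  # wildcard range matches any tag
    # The prefix must end on a subtag boundary, hence the trailing hyphen.
    return tag == language_range or tag.startswith(language_range + "-")


# A specific processingLanguage is covered by the broader language value,
# even though the two strings are not equal.
print(matches_basic_range("en-GB", "en"))
print("en-GB" in ["en"])
```

The first check succeeds and the second fails, which is the gap a strict "must be one of the entries" requirement would turn into an error.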

Propose Close, as there are no use cases/requirements and at least one scenario where not having the requirement makes implementations easier.

iherman commented 7 years ago

Discussed 2016-10-28, closing as wontfix, with proviso of notifying the I18N WG

See at: http://www.w3.org/2016/10/28-annotation-irc#T15-23-52

w3c / web-annotation

Relationship between dc:language and processing language for multilingual resources #343