Closed gsergiu closed 7 years ago
@gsergiu +1 for clarification. I think it is also important to consider who should decide the value of dc:language in the annotation's life cycle. If dc:language cannot fulfill important use cases, we will need to use processingLanguage instead. Then we can draw a line between the two attributes, right?
Because of the Unicode CJK unified ideographs issue, a language attribute is necessary to render text correctly, and I would like to clarify which language value should be applied for that purpose. I think the image in the link below helps explain what the issue is and why a language attribute is necessary for rendering text. https://gist.github.com/tkanai/c9d64283ae14ecc5250f
@tkanai You are perfectly right that the meaning and expected behaviour of dc:language should also be clarified, and that the meaning of dc:language probably needs to be adapted for embedded text and for external resources.
see also: https://github.com/w3c/web-annotation/issues/335#issuecomment-237790411
and .. thank you for the good example of the "language" problem. It is very good to take it as concrete input and to analyze what is needed for correct representation.
The discussion is: https://www.w3.org/2016/05/18-annotation-minutes.html (search for #213) And then subsequently in the comments of #213.
As there's no new information, just a repetition of a long and resolved discussion, I propose (you guessed it...) close wontfix.
Again, if there are concrete suggestions of text to replace the existing text with, we're happy to consider those suggestions.
@azaroth42 With all due respect .. I don't find the answers to the questions raised by this ticket in the minutes https://www.w3.org/2016/05/18-annotation-minutes.html . Moreover, in the text from the minutes I see only arguments like:
In conclusion, the text of the processingLanguage definition drops half of the semantics indicated in the ticket and minutes, and the analysis/use cases that call for the introduction of this property are not reflected at all in the draft: http://w3c.github.io/i18n-discuss/notes/annotation-language-use-cases
The argument for wontfix .. is that there is no new information in the ticket. Oh no .. that's not true:
Actually, these are the reasons why I would consider #213 and #335 as not closed, as the editorial work is incomplete. The text in the draft reflects only a part of the issues and solutions proposed/agreed in the tickets.
... search & indexing scenario.
Let's say I have an annotation which specifies dc:language: en, fr, de in the body, and I have a service which is able to index English and German texts using language-specific stemmers, but not French.
What is the implication of having only one processing language?
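To make the scenario concrete, here is a minimal Python sketch. It assumes the body is a plain dict using the JSON keys `language` (for dc:language) and `processingLanguage`; the helper names and the `SUPPORTED_STEMMERS` set are hypothetical and only stand in for whatever the indexing service actually offers.

```python
# Hypothetical set of languages for which the indexing service has stemmers.
SUPPORTED_STEMMERS = {"en", "de"}


def indexable_languages(body, supported=SUPPORTED_STEMMERS):
    """Return the declared body languages that the indexer can stem."""
    declared = body.get("language", [])
    if isinstance(declared, str):
        # A single language may be given as a bare string.
        declared = [declared]
    return [lang for lang in declared if lang in supported]


def usable_processing_language(body, supported=SUPPORTED_STEMMERS):
    """Return the single processing language if the indexer supports it, else None."""
    lang = body.get("processingLanguage")
    return lang if lang in supported else None


body = {
    "type": "TextualBody",
    "value": "...",
    "language": ["en", "fr", "de"],
    # Assumed for illustration: the creator picked French for processing.
    "processingLanguage": "fr",
}

print(indexable_languages(body))         # languages the indexer could use
print(usable_processing_language(body))  # what it gets from the single value
```

With only the single value, this indexer is told nothing it can use, although two of the three declared languages are supported.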
... search & indexing scenario Nr.2
Let's say we have an annotation that specifies only the Romanian language in the dc:language of the body, and the text uses characters specific to the Romanian language. The server is using an indexing service with an English Snowball stemmer.
Let's say that the annotation server is generating a JPEG image to store the original representation of the resource, and an indexing service for retrieval purposes.
According to the specifications, the server understands that the processing language should be the Romanian language (for correct representation of the text ... ).
What should the server
Given the examples above, my opinion is that the processing language is a matter of client-server negotiation, like the content negotiation between browsers and servers. Think of resources that can be serialized in HTML, JSON or XML. (Content negotiation is based on being able to express multiple preferences.)
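As a rough sketch of what such a negotiation could look like (this is not something the draft defines): the client states weighted language preferences, in the spirit of the `q` values of HTTP `Accept-Language`, and the server picks the best one it can actually process. The function name `negotiate_processing_language` and the weights are hypothetical.

```python
def negotiate_processing_language(preferences, supported):
    """Pick the highest-weighted preferred language that the server supports.

    preferences: dict mapping language tag -> weight between 0 and 1.
    supported:   set of language tags the server can process.
    Returns the chosen tag, or None when nothing matches.
    """
    # Sort by descending weight; the sort is stable, so ties keep client order.
    ranked = sorted(preferences.items(), key=lambda item: -item[1])
    for lang, weight in ranked:
        if weight > 0 and lang in supported:
            return lang
    return None


# The client prefers French, then German, then English.
client_preferences = {"fr": 1.0, "de": 0.8, "en": 0.5}
# The server only has English and German text processors.
server_supported = {"en", "de"}

print(negotiate_processing_language(client_preferences, server_supported))
```

Here the server would settle on German, which a single fixed value of `fr` could not express.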
Yes, it does not solve absolutely every problem of having multiple languages in text, without explicitly identifying which sections of the text are in which language. I don't know how you would expect it to do that, even if it did have multiple values.
The 80% use case we're trying to solve is when there is a primary language, that requires a particular hyphenation algorithm, line breaking or word detection algorithm, a particular font or similar. As per the description in the specification. If there were two processing languages, the client wouldn't be in any better situation than it is with just dc:language -- they'd always be identical.
So you can either have just dc:language, which is generally descriptive of the content and no other property, or you can have dc:language plus textDirection with a single value, to cover the cases where there is a clear language to use for processing the text. We can't solve the unsolvable problem of multiple processing languages. Unless, as repeatedly asked, you have an actual proposal rather than just complaints?
We intentionally do not specify what a client or server SHOULD do with any information in the model. We don't say how a client should render the body, we don't say how it should lay out annotations on annotations, we don't say what it should do with the motivations, nor the agents. We don't say what a server should do with the bodies, or anything else. This is not that specification.
right ... but the 80% of the cases is exactly the one use case in which you have 1 value for dc:language, meaning that one doesn't need to specify a processing language. In exactly 80% of the cases nobody needs that processingLanguage .... In 10% of the cases ... when having multiple dc:language values and only one available NLP service ... the current version of the text is ok. The other 10% of the cases, such as the ones indicated above, are again cases where the existence of the processing language can only break the functionality!
Allowing multiple processing languages would not fix the last 10%, and would break the cases where there are multiple languages and one processing language.
For the final time, unless you have a proposal to solve the problem I'm going to close this issue.
So ... the conclusion is that in 80% of the cases it is not needed, in 10% it is useful, in another 10% it breaks the functionality, and the conclusion is again wontfix (as with all other proposals, it is a recognised issue that the editors chose not to fix).
I don't know how to express it better ... as the others said, this is a property of high risk, with a very big chance to get deprecated in the next version.
80% of the time it's not needed, as dc:language will do. 10% of the time it is necessary. 10% of the time it is not sufficient, and there isn't anything else that would work better.
So ... you can have 80% of the cases covered, or 90% of the cases covered. We choose to have 90% of the cases covered by adding a completely optional property, for hopefully obvious reasons. So it's not that we "choose not to fix" it, it's that there isn't a solution.
As you STILL have not proposed ANY solution despite being asked many times, I'm closing the issue.
There is no proper solution for this problem, as the current definition of the processing language tries to serve several purposes with one field. In this ticket I just tried to demonstrate that cardinality 1 for the processing language is the wrong solution. Also ... it is wrong to mix the concepts of presentation language and processing language, as in practice they often have conflicting values.
An issue does not become invalid when no solution exists.
That is to say ... there is an MxN relationship between the languages of multilingual resources and the text processors that might process that annotation body.
the concrete proposal for improving the processing language was added in #341
There are several complaints about the processingLanguage property, some saying that it is not properly documented, some saying that it is not needed at all (#335).
The processingLanguage was introduced after the i18n review, mainly as a solution to the complaint that dc:language has multiple values and NLP algorithms need exactly 1 language as input (which I doubt).
#213
This ticket is to clarify why the processingLanguage needs to have cardinality 1 in an annotation which has a dc:language with 2 or more values.
Consequently, what is the exact difference between the values that should be included in the processingLanguage and the ones included in dc:language?
Who should decide the value of the processingLanguage in the annotation's life cycle? Is it the end user? (If yes, why?) Is it the client application? (If yes, how should these values be derived?) Is it the server? (Probably, if it uses NLP; which values should be set, and how is consistency with dc:language ensured?)