w3c / web-annotation

Web Annotation Working Group repository, see README for links to specs
https://w3c.github.io/web-annotation/

Cardinality of the processingLanguage? #337

Closed gsergiu closed 7 years ago

gsergiu commented 7 years ago

There are several complaints about the processingLanguage property: some say that it is not properly documented, others that it is not needed at all (#335).

The processingLanguage was introduced after the i18n review, mainly as a solution to the complaint that dc:language can have multiple values while NLP algorithms need exactly 1 language as input (which I doubt).

#213

This ticket is to clarify why processingLanguage needs to have cardinality 1 in an annotation whose dc:language has 2 or more values.

Consequently, what is the exact difference between the values that should be included in the processingLanguage and the ones included in dc:language?

Who should decide the value of the processingLanguage in the annotation's life cycle? Is it the end user? (If yes, why?) Is it the client application? (If yes, how should these values be derived?) Is it the server? (Probably, if it uses NLP; but then which values should be set, and how is consistency with dc:language ensured?)
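To make the question concrete, here is a minimal Python sketch of a body that carries several dc:language values and one processingLanguage. The JSON keys `language` and `processingLanguage`, the helper name `processing_language_problems`, and the rule that the processing language should be one of the listed languages are assumptions made for illustration, not text taken from the draft.

```python
from typing import Any


def processing_language_problems(body: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a body's language fields."""
    problems: list[str] = []
    languages = body.get("language", [])
    if isinstance(languages, str):
        languages = [languages]  # normalise a single value to a list

    processing = body.get("processingLanguage")
    if processing is None:
        return problems  # the property is treated as optional in this sketch
    if not isinstance(processing, str):
        # Cardinality 1: a list of processing languages is rejected.
        problems.append("processingLanguage must be a single value")
    elif languages and processing not in languages:
        # Assumed consistency rule, not quoted from the draft.
        problems.append("processingLanguage is not one of the language values")
    return problems


body = {
    "type": "TextualBody",
    "value": "Hello / Bonjour / Hallo",
    "language": ["en", "fr", "de"],
    "processingLanguage": "en",
}
print(processing_language_problems(body))  # prints []
```

Whoever sets the value (end user, client or server) would have to pick exactly one of the three languages here, which is the choice this ticket asks about.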

tkanai commented 7 years ago

@gsergiu +1 for clarification. I think it is also important to consider who should decide the value of the dc:language in the annotation's life cycle. If dc:language cannot fulfill important use cases, we will need to use processingLanguage instead. Then we can draw a line between the two attributes, right?

Due to the Unicode CJK unified ideographs issue, a language attribute is necessary to render text correctly, and I would like to clarify which language value should be applied for that purpose. I think the image in the link below helps to show what the issue is and why it is necessary to set a language attribute for rendering text. https://gist.github.com/tkanai/c9d64283ae14ecc5250f

gsergiu commented 7 years ago

@tkanai You are perfectly right that the meaning and expected behaviour of dc:language should also be clarified, and that the meaning of dc:language probably needs to be adapted for embedded text and for external resources.

  1. For the embedded text, we know that it is UTF-8/16/32, and we need to see what else is needed for correct representation. (One proposed solution is to split the text into multiple monolingual bodies, another is to use HTML formatting ... it is up to application developers and their scenarios.)
  2. For external resources, dc:language should actually have no influence on the representation of the external resource; it should only be used for search and (NLP) processing purposes. The external resources should provide all the information required for correct representation (i.e. if you cannot render a PDF correctly in Acrobat Reader, it is not likely that adding an annotation will solve the problem of the PDF file ... or, if the annotation is used to fix the file, then we are talking about a new resource).
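
The first option above (splitting embedded text into monolingual bodies) can be sketched as follows. This is only an illustration: the `language` key and the helper `monolingual_bodies` are assumptions, and the text is assumed to be already segmented by language.

```python
def monolingual_bodies(segments: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Turn (language, text) pairs into one textual body per segment."""
    return [
        {"type": "TextualBody", "value": text, "language": language}
        for language, text in segments
    ]


bodies = monolingual_bodies([("en", "Good morning"), ("fr", "Bonjour")])
for body in bodies:
    print(body["language"], body["value"])
```

Each body then has exactly one language, so the question of which language to use for processing does not arise for that body.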

see also: https://github.com/w3c/web-annotation/issues/335#issuecomment-237790411

gsergiu commented 7 years ago

and .. thank you for the good example of the "language" problem. It is very good to take it as concrete input and to analyze what is needed for correct representation.

  1. If the provided webpage is used as an external resource, it is clear that the language needs to be included in the HTML code for correct representation, and not in the annotation. (As I said above, the list of dc:language values can be used for search/filtering purposes, but not for correct rendering.)
  2. The example you provided could also be used as embedded text ... which is probably best formatted as HTML. However, a pure JSON-based API might not like to embed HTML but would rather split the text into multiple bodies, so that a 1-to-1 text-language assignment is possible. (In fact, this is what the i18n reviewers tried to explain ... without having a deep understanding of the WA model.)

azaroth42 commented 7 years ago

The discussion is: https://www.w3.org/2016/05/18-annotation-minutes.html (search for #213) And then subsequently in the comments of #213.

As there's no new information, just a repetition of a long and resolved discussion, I propose (you guessed it...) close wontfix.

Again, if there are concrete suggestions of text to replace the existing text with, we're happy to consider those suggestions.

gsergiu commented 7 years ago

@azaroth42 With all due respect, I don't find the answers to the questions raised by this ticket in the minutes https://www.w3.org/2016/05/18-annotation-minutes.html . Moreover, in the text of the minutes I see only arguments like:

In conclusion, the text of the processingLanguage definition drops half of the semantics indicated in the ticket and minutes, and the analysis/use cases that called for the introduction of this property are not reflected at all in the draft: http://w3c.github.io/i18n-discuss/notes/annotation-language-use-cases

The argument for wontfix .. is that there is no new information in the ticket. Oh no .. that's not true:

Actually, these are the reasons why I would consider #213 and #335 as not closed: the editorial work is incomplete. The text in the draft reflects only a part of the issues and solutions proposed/agreed in the tickets.

gsergiu commented 7 years ago

... search & indexing scenario.

Let's say I have an annotation whose body specifies dc:language: en, fr, de, and I have a service which is able to index English and German texts using language-specific stemmers, but not French.
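
A minimal sketch of this setup, assuming the server simply intersects the body's languages with the stemmers it has available (the function name `candidate_indexes` and the data shapes are hypothetical):

```python
def candidate_indexes(body_languages: list[str], stemmers: set[str]) -> list[str]:
    """Languages of the body for which the service has a stemmer."""
    return [lang for lang in body_languages if lang in stemmers]


body_languages = ["en", "fr", "de"]  # dc:language values of the body
stemmers = {"en", "de"}              # stemmers the indexing service offers
indexes = candidate_indexes(body_languages, stemmers)
print(indexes)  # prints ['en', 'de']
```

Two indexes qualify, but a single processingLanguage can name at most one of them.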

What is the implication of having only one processing language?

  1. Should the server be allowed to process the annotation and include the text only in the English or the German index?
  2. ... if the server decides to include the text in both the English and German indexes, why shouldn't the server be allowed to advertise that the annotation is retrievable from both indices?

gsergiu commented 7 years ago

... search & indexing scenario Nr.2

Let's say we have an annotation that specifies only Romanian in the dc:language of the body, and the text uses characters specific to Romanian. The server is using an indexing service with an English Snowball stemmer.

Let's say that the annotation server generates a JPEG image to store the original representation of the resource, and uses an indexing service for retrieval purposes.

According to the specifications, the server understands that the processing language should be Romanian (for correct representation of the text ...).

  1. What does this mean for the indexing service? Is the server not allowed to index the annotation in the English index? (In practice there are many indexing services that use this stemmer even if the text is in another Latin-based language.)
  2. Or ... why is the server not allowed to advertise that it used 2 different language settings: one processing service with Romanian and one with English?

What should the server

gsergiu commented 7 years ago

Given the examples above, my opinion is that the processing language is a matter of client-server negotiation, like the content negotiation between browsers and servers. Think of resources that can be serialized in HTML, JSON or XML. (Content negotiation is based on being able to express multiple preferences.)
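
As a rough sketch of what such a negotiation could look like: the client states weighted language preferences (in the spirit of the q-values used in HTTP content negotiation) and the server answers with the best one it can process. The function name, the weights and the tie-breaking rule are assumptions for illustration; nothing like this is defined in the draft.

```python
from typing import Optional


def negotiate_processing_language(
    preferences: dict[str, float], supported: set[str]
) -> Optional[str]:
    """Pick the highest-weighted preferred language the server supports."""
    usable = [
        (lang, weight)
        for lang, weight in preferences.items()
        if weight > 0 and lang in supported
    ]
    if not usable:
        return None  # no overlap between preferences and capabilities
    # max() keeps the first entry on ties, so earlier preferences win.
    best_lang, _ = max(usable, key=lambda item: item[1])
    return best_lang


choice = negotiate_processing_language(
    {"fr": 1.0, "de": 0.8, "en": 0.5},  # client preferences
    {"en", "de"},                        # languages the server can process
)
print(choice)  # prints de
```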

azaroth42 commented 7 years ago

Yes, it does not solve absolutely every problem of having multiple languages in text, without explicitly identifying which sections of the text are in which language. I don't know how you would expect it to do that, even if it did have multiple values.

The 80% use case we're trying to solve is when there is a primary language, that requires a particular hyphenation algorithm, line breaking or word detection algorithm, a particular font or similar. As per the description in the specification. If there were two processing languages, the client wouldn't be in any better situation than it is with just dc:language -- they'd always be identical.

So you can either have just dc:language, which is generally descriptive of the content and no other property, or you can have dc:language plus processingLanguage with a single value, to cover the cases where there is a clear language to use for processing the text. We can't solve the unsolvable problem of multiple processing languages. Unless, as repeatedly asked, you have an actual proposal rather than just complaints?

We intentionally do not specify what a client or server SHOULD do with any information in the model. We don't say how a client should render the body, we don't say how it should lay out annotations on annotations, we don't say what it should do with the motivations, nor the agents. We don't say what a server should do with the bodies, or anything else. This is not that specification.

gsergiu commented 7 years ago

right ... but the 80% case is exactly the one in which you have 1 value for dc:language, meaning that one doesn't need to specify a processing language. In exactly 80% of the cases nobody needs processingLanguage ... In 10% of the cases, when there are multiple dc:language values and only one available NLP service, the current version of the text is OK. The other 10% of the cases, such as the ones indicated above, are again cases where the existence of the processing language can only break the functionality!

azaroth42 commented 7 years ago

Allowing multiple processing languages would not fix the last 10%, and would break the cases where there are multiple languages and one processing language.

For the final time, unless you have a proposal to solve the problem I'm going to close this issue.

gsergiu commented 7 years ago

So ... the conclusion is that in 80% of the cases it is not needed, in 10% it is useful, in another 10% it breaks the functionality, and the conclusion is again wontfix (as with all other proposals, it is a recognised issue that the editors chose not to fix).

I don't know how to express it better ... as the others said, this is a property of high risk, with a very big chance to get deprecated in the next version.

azaroth42 commented 7 years ago

80% of the time it's not needed, as dc:language will do. 10% of the time it is necessary. 10% of the time it is not sufficient, and there isn't anything else that would work better.

So ... you can have 80% of the cases covered, or 90% of the cases covered. We choose to have 90% of the cases covered by adding a completely optional property, for hopefully obvious reasons. So it's not that we "choose not to fix" it, it's that there isn't a solution.
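
Read from a client's point of view, the three cases could look like the sketch below. It is not behaviour required by the specification, which deliberately does not say what clients should do; the function name `effective_processing_language`, the JSON keys and the fallback rule are assumptions for illustration.

```python
from typing import Any, Optional


def effective_processing_language(body: dict[str, Any]) -> Optional[str]:
    """Choose a language for processing a body, or None if unclear."""
    processing = body.get("processingLanguage")
    if isinstance(processing, str):
        return processing  # explicit single value: use it

    languages = body.get("language", [])
    if isinstance(languages, str):
        languages = [languages]
    if len(languages) == 1:
        return languages[0]  # one language: dc:language alone will do
    return None  # several (or no) languages and no hint: undecidable here


print(effective_processing_language({"language": "en"}))
print(effective_processing_language({"language": ["en", "fr"], "processingLanguage": "fr"}))
print(effective_processing_language({"language": ["en", "fr"]}))
```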

As you STILL have not proposed ANY solution despite being asked many times, I'm closing the issue.

gsergiu commented 7 years ago

There is no proper solution for this problem, as the current definition of the processing language tries to serve several purposes with one field. In this ticket I just tried to demonstrate that cardinality 1 for the processing language is the wrong solution. Also ... it is wrong to mix the concepts of presentation language and processing language, as in practice they often have conflicting values.

akuckartz commented 7 years ago

An issue does not become invalid when no solution exists.

gsergiu commented 7 years ago

thus saying ... there is an MxN relationship between the languages of multilingual resources and the text processors that might process that annotation body
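
A small sketch of that MxN relationship, with hypothetical processor names and capabilities (none of them come from the draft or from a real service):

```python
def language_processor_pairs(
    languages: list[str], processors: dict[str, set[str]]
) -> list[tuple[str, str]]:
    """List every (language, processor) pair where the processor applies."""
    return [
        (lang, name)
        for lang in languages
        for name, supported in processors.items()
        if lang in supported
    ]


processors = {
    "stemmer-en": {"en"},
    "stemmer-de": {"de"},
    "renderer": {"en", "fr", "de"},
}
pairs = language_processor_pairs(["en", "fr", "de"], processors)
for lang, name in pairs:
    print(lang, name)
```

Each processor ends up with its own subset of the body's languages, which a single value cannot express.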

gsergiu commented 7 years ago

the concrete proposal for improving the processing language was added in #341