w3c / web-annotation

Web Annotation Working Group repository, see README for links to specs
https://w3c.github.io/web-annotation/

exactly 0 or 1 language(s) #213

Closed r12a closed 8 years ago

r12a commented 8 years ago

https://www.w3.org/TR/2016/WD-annotation-model-20160331/#bodies-and-targets

The Body or Target SHOULD have exactly 1 language associated with it, but MAY have 0 or more.

The i18n WG thought this would read better as "The Body or Target SHOULD have exactly 0 or 1 language(s) associated with it." Current wording seems a little odd.

r12a commented 8 years ago

Actually, that should probably read: "The Body or Target MUST have exactly 0 or 1 language(s) associated with it."

We do not expect that it makes sense to have more than one language value, since we assume that the language property indicates what we call the 'text-processing' language for the annotation body, rather than the language(s) of the intended reader. (The text-processing language is what is used for automatic font assignment, spellchecking, hyphenation, line-breaking in CJK, etc.)

asmusf commented 8 years ago

The language SHOULD is usually interpreted as "do something different at your peril". The MAY language means that there isn't a formal requirement for any specific number. So, there's nothing wrong with the language from the formal perspective.

The suggested "read better" would not be an editorial change.

The second proposed alternative is even worse, because it makes it equally recommended to have no language associated.

The example given before is an mp3 file. What if the file contains a text followed by its interpretation? That would seem to be a case where associating it with a single language may not be the best alternative.

asmusf commented 8 years ago

In conclusion, I suggest closing the issue without change.

r12a commented 8 years ago

Having exchanged brief emails with Ivan, i think we may need to take a step back and come at this from a wider perspective. Ivan told me:

I believe the language tag in our case is only used as a 'metadata', in your terminology, to indicate the language of the target or the body resource. How an annotation agent uses this information is beyond the specification; in many cases it actually cannot do anything with the resource (eg, the target), so this term is really only informational.

So the i18n WG's initial assumptions were incorrect. Let me try to outline why i'm concerned.

If an application is going to use the language value provided to perform an operation on the text, it often needs to know what language the text is actually in. For example, such an operation might be running a spellchecker, pronouncing the text in a voice browser, applying hyphenation, case conversion, line breaking and other language-sensitive actions, applying fonts, etc. In these cases it's problematic if you have a list of languages as the value of your language property, because to process the text correctly, you actually need to know whether it's Japanese or whether it's French, for example, that you're dealing with. This is the 'text-processing' application we mention above. In HTML this is the function of the lang attribute, which can only have one language as its value, because it is indicating the actual language of the text.

The i18n WG tends to refer to another type of language annotation as 'metadata'. This typically indicates the intended linguistic audience of the resource as a whole, and it's possible to imagine that this could, for a multilingual resource, involve a property value that is a list of languages.

It may be that the 'language' property when referring to a target is of the metadata kind (since it's informative, the target is not being operated on, and the target ought to have its own text-processing language declarations), whereas it may be more useful to see the language of the body as of the text-processing kind, since that kind of information can be used to indicate to a voice browser how to pronounce the annotation, or to a graphical browser how to break lines of text when displaying the annotation, etc.(?)

In order to know how to specify the content of the values for the language property, then, it's useful to understand, at least to some extent, how the application is likely to use the information about the language. Which is why we originally raised the question in this issue.

Hopefully that clarifies our frame of reference, although it doesn't yet provide a clear way forward.

(There will, of course, be an additional question wrt text-processing language declarations, in that a content author may need to indicate that parts of the annotation are in different languages, though i'm not clear how much of an issue it will be for annotations if that level of detail is not provided. It may not be a common use case, or one that causes major difficulties if missing(?) However, it is usually important to have at least a default idea of which language to assume for purposes of processing the text of the annotation, in order to manage the text when it comes to display or use.)

asmusf commented 8 years ago

If language annotation can be (is required to be) fine grained enough that each piece of content contains only one language, then requiring annotation with the actual language works as expected.

For bulk content (a whole book, a whole movie) limiting the annotation to a single language is less than useful.

A./

gsergiu commented 8 years ago

well ... there are many documents that have multiple languages; in Europeana there are ~4 million records out of 53 million that are marked as containing multiple languages. I think targets and bodies must support multiple languages, even if there are some drawbacks given that it is not clear which parts of the text are written in one language and which are written in another. http://www.europeana.eu/portal/search?f[LANGUAGE][]=mul

r12a commented 8 years ago

to complete an action from the i18n WG, i wrote a summary of what i understand wrt use cases for language information in web annotations (with help from @fsasaki ). See http://w3c.github.io/i18n-discuss/notes/annotation-language-use-cases.
The text is intended to provide a basis for discussion, and can be changed as needed.

iherman commented 8 years ago

Thanks @r12a for writing all this down. And, at the moment, I am torn.

I believe the way forward is to say something like: "an annotation SHOULD have zero or one language terms, and MAY have more than 1 in exceptional cases." @gsergiu's use case may be quoted in an informal note where the MAY comes into effect, but we should also note that implementations/users should really try to use one language, because otherwise problems may occur.

And stop there...

asmusf commented 8 years ago

On 5/23/2016 9:27 AM, Ivan Herman wrote:

Thanks @r12a for writing all this down. And, at the moment, I am torn.

* On the one hand, I think that the situation would indeed become much clearer if a body (more exactly, in our terminology, a Specific Resource or a Textual Body describing a body) used 0 or 1 language tag. If the use case is to have different body resources out there that are conceptually equivalent but are in different languages, then we can use Choice or some of the (newly re-established) composites to separate them from one another. This makes the model clean, and also covers the problems @r12a has.

* On the other hand, we cannot ignore the fact that, out there, there are messy resources that we want to annotate; @gsergiu has clearly indicated that he is facing such a situation. Texts may be out there that /do/ mix languages (whether we like it or not), and the only way of conveying this information in an annotation is to allow for more than one language.

I believe the way forward is to say something like: "an annotation SHOULD have zero or one language terms, and MAY have more than 1 in exceptional cases." @gsergiu's use case may be quoted in an informal note where the MAY comes into effect, but we should also note that implementations/users should really try to use one language, because otherwise problems may occur.

And stop there...

(I didn't see this text on GitHub, so I'm copying the message I am replying to).

SHOULD already implies that one has a good reason for a different choice, so I don't think "exceptional" is either useful or necessary.

The original language had it right:

"The Body or Target SHOULD have exactly 1 language associated with it, but MAY have 0 or more."

As this seemed contradictory to some, perhaps what is needed is an editorial fix, that is, an example:

"The Body or Target SHOULD have exactly 1 language associated with it, but MAY have 0 or more, for example if the language cannot be identified or the resource contains a mix of languages."

(My example for "0" may not be what was intended, so just fix accordingly).
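
As an illustration of how the SHOULD/MAY wording could be read by a checking tool, here is a minimal sketch. It assumes a Body or Target is available as a Python dict whose `language` key holds either one string or a list of strings; the function names and the message texts are hypothetical and not part of any specification. It treats exactly 1 language as the recommended case and reports 0 or several languages as permitted but worth a note, never as an error.

```python
from typing import Any, Dict, List


def language_values(resource: Dict[str, Any]) -> List[str]:
    """Return the resource's language values as a list (possibly empty)."""
    value = resource.get("language")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def check_language_cardinality(resource: Dict[str, Any]) -> List[str]:
    """Return advisory notes; an empty list means the SHOULD is satisfied."""
    count = len(language_values(resource))
    if count == 1:
        return []
    if count == 0:
        # Permitted by the MAY, e.g. when the language cannot be identified.
        return ["0 languages given: allowed, but exactly 1 is recommended"]
    # Permitted by the MAY, e.g. when the resource mixes languages.
    return [f"{count} languages given: allowed, but exactly 1 is recommended"]


single = {"type": "TextualBody", "value": "Bonjour", "language": "fr"}
mixed = {"type": "TextualBody", "value": "Hello / Bonjour", "language": ["en", "fr"]}
print(check_language_cardinality(single))
print(check_language_cardinality(mixed))
```

Because neither keyword is a MUST, the sketch only ever produces notes; a stricter tool could not reject these resources without going beyond the quoted wording.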

azaroth42 commented 8 years ago

My 2c:

Then if there's the case when there are multiple languages and there's a need to specify which one to use for text processing, there's somewhere to do it. However for the simple (and frequent) case of a single language, then the client knows it should use the language property rather than repeat it in both fields.

Thoughts?

iherman commented 8 years ago

That is an acceptable compromise.

On 23 May 2016, at 23:17, Rob Sanderson wrote:

My 2c:

language: The Body or Target SHOULD have exactly 1 language associated with it, but MAY have 0 or more. If the resource contains content in a mixture of languages, and there is a particular language to use for text processing, then that language should be given in the processingLanguage property.

processingLanguage: The language to use for text processing algorithms such as line breaking, hyphenation, which font to use, and similar. Each Body and Target MAY have 0 or exactly 1 processingLanguage. If this property is not present and the language property is given as a single language, then the client SHOULD use that language for processing requirements.

Then if there's the case when there are multiple languages and there's a need to specify which one to use for text processing, there's somewhere to do it. However for the simple (and frequent) case of a single language, then the client knows it should use the language property rather than repeat it in both fields.

Thoughts?
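
The fallback rule in the proposal above can be sketched as follows. This is only an illustration under stated assumptions: the resource is modelled as a plain dict, `language` may be a string or a list, and the function name `resolve_processing_language` is hypothetical. The proposal does not say what a client should do when several languages are listed and no processingLanguage is given, so the sketch returns `None` in that case and leaves the decision to the client.

```python
from typing import Any, Dict, Optional


def resolve_processing_language(resource: Dict[str, Any]) -> Optional[str]:
    """Pick the language to use for text processing, or None if undetermined."""
    explicit = resource.get("processingLanguage")
    if explicit:
        # An explicit processingLanguage always wins.
        return explicit
    languages = resource.get("language")
    if isinstance(languages, str):
        return languages
    if isinstance(languages, (list, tuple)) and len(languages) == 1:
        # A single language doubles as the processing language.
        return languages[0]
    # Zero or several languages and no explicit hint: nothing to fall back on.
    return None


print(resolve_processing_language({"language": "ja"}))
print(resolve_processing_language({"language": ["en", "ru"], "processingLanguage": "ru"}))
print(resolve_processing_language({"language": ["en", "ru"]}))
```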

gsergiu commented 8 years ago

Dear all,

I would make a simple synthesis of the problem from the implementation point of view:

Facts:

· There are many web resources that use multiple languages (and of course we want everything to be annotatable)

· There are also many such resources that don’t even use metadata or markup to advertise the languages used

As the goal is to be able to annotate everything, we can even take into consideration the worst-case scenario, in which resources include text in multiple languages but we don’t know which languages are used. (This is not a rare situation: in Europeana there are 3.77 million records for which we know that the metadata is in multiple languages, but we don’t know which ones they are: http://www.europeana.eu/portal/search?f[LANGUAGE][]=mul )

Expected user behavior:

· I think that the majority of users would agree to add the used language (list) when creating annotations. (mainly for retrieval purposes)

· I don’t think there will be many users who are willing to mark all parts of the texts with the correctly identified language, but there will be use cases in which this is needed

· Audio browsers might be nice and important, especially for blind people, but I doubt that they are able to correctly read texts in any language, especially old languages. (I’m not sure whether we have readers that are able to read Latin or Old German, for example, which are frequently used in Europeana resources. See http://www.europeana.eu/portal/record/92080/FCBC03581F63DA47F920E30CF3000212D7A476F1.html or http://www.univie.ac.at/elib/index.php?title=Greg%C3%B4rje,_b%C3%A2best,_geistlich_vater,_wache_und_brich_abe_d%C3%AEnem_slaf_%28Bruder_Werner%29&redirect=no )

Analysis:

  1. The metadata should be consistent with the resources (if 1 language is used, one should be available in the language property; if 10 are used, then 10 must/should be in the language field)
  2. For the great majority of cases the current text is perfectly fine as it is: it encourages the use of exactly 1 language where possible, but allows multiple as well.
  3. For the i18n problem of multilingual texts, it is clear that we don’t have sufficient information to correctly apply NLP and TTS algorithms, so by removing information we only make it worse, not better. It is obvious that in this case there is no way to derive the required information from the language property (at least not with 100% confidence, although language detection algorithms might be applied). What we need is a way to express which parts of the text (selectors?) are written in which language (BCP 47 / RFC 5646) and in which script (the script is already included in RFC 5646: https://tools.ietf.org/html/rfc5646#ref-ISO15924 , http://unicode.org/iso15924/iso15924-codes.html )

Proposed Approach:

  1. In the general case, when we have only one language, the text-processing language can be derived from the existing “language” property, and the script code as well.

a. Open question: do we really need text direction if we have the script code? Can't the text direction be derived from the script code ( https://github.com/w3c/web-annotation/issues/224 )?

  2. For the correct representation of texts in multiple languages we need additional information, but I wouldn’t advise embedded markup, as we shouldn’t break the functionality of the APIs because of the browsers' problems.

a. As written above, I think that the best way is to have a special (robust?) selector for adding the missing i18n information. Just let the body have a clean representation that is human and machine friendly (as opposed to browser friendly and human/machine unfriendly); the JSON representation should be JSON and not HTML or other markup.
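
To illustrate point 1 of the approach above, here is a deliberately simplified sketch of reading the script code out of a language tag. It follows the subtag order defined by RFC 5646 (language, optional extended language subtags, optional 4-letter script, then region and the rest), but it does not validate tags against the registry and handles only the well-formed common cases; the function name `script_subtag` is hypothetical. When the tag carries no explicit script subtag it returns `None`, and a client would then have to look elsewhere (for example the registry's Suppress-Script field) to infer the script.

```python
from typing import Optional


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag of a language tag (e.g. 'Latn'), or None."""
    subtags = tag.split("-")
    primary = subtags[0].lower()
    # Private-use ('x-...') and grandfathered ('i-...') tags are not analysed.
    if primary in ("x", "i") or not primary.isalpha():
        return None
    extlang_seen = 0
    for subtag in subtags[1:]:
        if len(subtag) == 3 and subtag.isalpha() and extlang_seen < 3:
            extlang_seen += 1  # extended language subtag, e.g. 'cmn' in 'zh-cmn-Hans'
            continue
        if len(subtag) == 4 and subtag.isalpha():
            return subtag.title()  # script subtags are conventionally title-cased
        break  # region, variant, extension or private use: no script present
    return None


for example in ("sr-Latn", "zh-cmn-Hans-CN", "de-CH-1996", "en"):
    print(example, script_subtag(example))
```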

Br,

Sergiu

gsergiu commented 8 years ago

PS: personally I would prefer to have the script code in a separate field (for normalization purposes), but it seems this is not the RFC way of doing it... (However, implementations can perform this normalization, especially for search purposes.)

r12a commented 8 years ago

@azaroth42 's proposal cuts to the centre of what i see as the problem, which is that given a list of languages it's ambiguous which to use for the default text-processing language, so if it's workable to have the processingLanguage property i think that would probably solve the issue.

Just a suggestion: for additional clarity, it may help to add some wording along the lines proposed by Asmus, such as: "The Body or Target SHOULD have exactly 1 language associated with it, but MAY have 0 or more if the language cannot be identified or the resource contains content in more than one language. ..."

gsergiu commented 8 years ago

@r12a @azaroth42 well ... as it seems it is a good thing to add the processingLanguage attribute (no objection against it). However, I might have some preferences about its definitions and maybe on the parts to which this property should/may be attached.

We should first make it clear what this is used for. Is it used only for being able to select "some" text processing algorithms to be applied to the text? Or is it intended to select the "correct" text processing algorithms? Who should add this information to the annotation? Is it the end user (in the general case I doubt that this is the user's responsibility), or is it the implementation (we might think that the first entry that matches some rules from the language property can be copied into processingLanguage), or both?

If you have ~40% text in Russian and ~60% in English... what do we do in this case? Should we say that the text should be processed only by Russian NLP or only by English NLP? I would expect it to be processed by both.

hugomanguinhas commented 8 years ago

Hi all,

I believe that the "processingLanguage" will not solve the issue as it is still necessary to choose one of the languages (if more than 1 exists) which may not be possible to do.

If the issue is for a client application to decide whether the text fits the display language, then it can just check whether that language is one of the languages in dc:language and accept that part of the text may be in a different language than the one the user has selected.

If the issue is for software processing the text to apply a specific NLP then it can either still try to apply it and accept that the results may not be the best, or just ignore it as there is no sufficient information to apply them.
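
The membership check described above might look like the following sketch. It assumes the languages are available under a `language` key (a string or a list) on a dict, and it compares tags with the simple prefix rule of RFC 4647 basic filtering, so that a display language of "en" also accepts "en-GB"; the function names are hypothetical.

```python
from typing import Any, Dict


def language_matches(language_range: str, tag: str) -> bool:
    """Basic filtering: the range equals the tag or is a prefix ending at a '-'."""
    wanted = language_range.lower()
    candidate = tag.lower()
    return candidate == wanted or candidate.startswith(wanted + "-")


def covers_display_language(resource: Dict[str, Any], display_language: str) -> bool:
    """True if any of the resource's languages matches the display language."""
    value = resource.get("language") or []
    tags = [value] if isinstance(value, str) else list(value)
    return any(language_matches(display_language, tag) for tag in tags)


body = {"value": "Hello, привет", "language": ["en-GB", "ru"]}
print(covers_display_language(body, "en"))
print(covers_display_language(body, "fr"))
```

A positive result only says that some of the text is in the display language; as noted above, the rest may still be in another language.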

Best regards, Hugo

gsergiu commented 8 years ago

correct, given that we have a list of languages provided by the annotation creator, why do we need a "default processing language"?

Isn't it the responsibility of the clients to decide which of the given languages should/can be used by the NLP algorithms, which are known only to the client? I don't think that the annotation creator should normatively enforce a processing language (in any case, including the 1 language scenario).

fsasaki commented 8 years ago

For the record, I won't object to the resolution proposed at https://github.com/w3c/web-annotation/issues/213#issuecomment-221098949

iherman commented 8 years ago

Discussed on a joint call with the I18N WG on 2016-05-26; the resolution is to accept the proposal in: https://github.com/w3c/web-annotation/issues/213#issuecomment-221098949

See: http://www.w3.org/2016/05/26-i18n-irc#T15-24-21

azaroth42 commented 8 years ago

@hugomanguinhas, @gsergiu: If it's not possible to choose a language to use for processing the text, then indeed, it's not possible. We can't solve that, but we can allow annotation publishing systems that /do/ know the language to use to provide it, rather than requiring the client to guess. It would not be required, so systems that either do not know or do not wish to provide it, are not adversely affected. The model would not make mandatory requirements on consuming agents as to what to do ... that would be very restrictive and just result in the proposal not getting through CR.

It is also important to remember that there's the possibility of using HTML or other serialization that can record language, fonts, and so forth within the target or body resource. Then it is up to the rendering client to process that as specified by the format's specification.

gsergiu commented 8 years ago

Well ... if the solution is to add some redundancy because some people/scenarios need it, I have no problem with that, given that these fields are not mandatory. However, I have the feeling, or more than that, I'm convinced that the solution is incomplete. The clients still need "to guess" some properties in order to be able to correctly process the text with NLP or TTS.

As I indicated above, the "script code" part of RFC 5646 is the key information needed by these algorithms. While it is still valid to add this bit of information in the "language" property (at least according to the current specifications), this is not the recommended way to do it.

Was this aspect discussed? By following the other things that got own fields, like "text direction", I would claim that the "script code" should be also explicitly represented in the annotations.

If there was no decision/recommendation taken in this direction, I would be glad to create a new ticket.

BR, Sergiu

iherman commented 8 years ago

We did discuss, on a slightly more general level, that this solution will not cover all the possible cases and, because the format of the body and target is completely open-ended, it is impossible to cover them. It was agreed that this solution covers the vast majority of the cases (the magic 80/20 cut…) and we would stop there.

gsergiu commented 8 years ago

Ok .. thanks for the answer. Does this imply that the WA recommendation will say "use the script code in the language tag, if you need it"? ... I can live with this, but it is against the i18n recommendations, and I would say it is worth writing an informative note on it.

Br, Sergiu

azaroth42 commented 8 years ago

I don't think we need to say that explicitly, it's part of BCP47 and hence is available for use along with all of the other features.

gsergiu commented 8 years ago

well ... this is a kind of implicit recommendation, which it would make sense to add as a note, especially given the 20% mentioned by @iherman that are not covered by the current solution of using processingLanguage and textDirection. These 20% are covered by placing the script code in the language tag, but currently there is no mention of the script code in the WA draft (I suppose).

I would propose to add a note for the language like: "If the script of the text is different from the Suppress-Script in the language subtag registry, it is recommended to add the script code to the language tag, for correct representation of the text on different systems/platforms"

http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
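
A minimal sketch of the rule in the proposed note, for illustration only. The `SUPPRESS_SCRIPT` table is a small hand-copied excerpt of Suppress-Script values, not the full registry; a real tool would parse the registry file linked above. The function name `tag_with_script` is hypothetical. The script code is added only when it differs from the registered Suppress-Script, or when the language has no Suppress-Script entry at all.

```python
from typing import Dict, Optional

# Hand-copied excerpt of Suppress-Script values from the language subtag
# registry; languages missing here are treated as having no default script.
SUPPRESS_SCRIPT: Dict[str, str] = {
    "en": "Latn",
    "fr": "Latn",
    "ru": "Cyrl",
    "ja": "Jpan",
}


def tag_with_script(language: str, script: Optional[str]) -> str:
    """Build a language tag, adding the script only when it is informative."""
    if script is None:
        return language
    script = script.title()  # script subtags are conventionally title-cased
    if SUPPRESS_SCRIPT.get(language.lower()) == script:
        # Same as the registered default script: leave it out of the tag.
        return language
    return f"{language}-{script}"


print(tag_with_script("ru", "Cyrl"))
print(tag_with_script("ru", "Latn"))
print(tag_with_script("sr", "Cyrl"))
```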