w3c / wcag2ict

WCAG2ICT deliverable of Accessibility Guidelines WG
https://wcag2ict.netlify.app/

Programmatically determine the language of text #463

Closed xfq closed 5 days ago

xfq commented 1 month ago

https://www.w3.org/TR/wcag2ict-22/#applying-sc-3-1-2-language-of-parts-to-non-web-documents-and-software

The human language of each passage or phrase in the [non-web document or software] can be programmatically determined except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text.

What are you trying to accomplish in this statement? Does "programmatically determined" mean heuristic detection of language? If so, please note that such detection will often be wrong.

Also, why do these exceptions exist?

GreggVan commented 1 month ago

WCAG was trying to make it so that when a phrase in text was in a different language -- screen readers and other AT could identify them and pronounce them properly.

It is phrased as "programmatically determined" because

  1. that means AT (a program) can determine it
  2. today sufficient techniques require that it be marked up because heuristics and AI have not IN THE PAST been good enough
  3. BUT -- in the near future they will be plenty good enough - and programs themselves will be able to determine this without markup. WHEN (and only when) this is true - markup would no longer be needed for this requirement to be met.

mitchellevan commented 1 month ago

I agree 100% with @GreggVan's last comment.

In addition, there is another scenario. With or without AI, a web app or native app might use automation, built into the app itself, to mark up its own content programmatically with the HTML lang attribute or a native code equivalent. Such an automated mechanism could pass SC 3.1.2 Language of Parts — but only if the mechanism made no mistakes or omissions, which is unlikely today.
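As a rough illustration of the scenario above, here is a minimal sketch of an app marking up its own output with the HTML `lang` attribute. It assumes the app already knows the language of each segment (for example, from its own translation resources); the function name `mark_up_language` and the segment format are hypothetical, not part of WCAG or WCAG2ICT.

```python
from html import escape


def mark_up_language(segments, default_lang):
    """Render (text, lang) segments as HTML.

    Only segments whose language differs from the surrounding default
    language are wrapped in a <span lang="...">. The (text, lang) input
    format is an assumption made for this sketch.
    """
    parts = []
    for text, lang in segments:
        safe_text = escape(text)
        if lang and lang != default_lang:
            parts.append(f'<span lang="{escape(lang)}">{safe_text}</span>')
        else:
            parts.append(safe_text)
    return "".join(parts)


segments = [
    ("The motto is ", "en"),
    ("Liberté, égalité, fraternité", "fr"),
    (".", "en"),
]
html_out = mark_up_language(segments, default_lang="en")
print(html_out)
```

The sketch is only as good as the language tags it is given: a wrong or missing tag in `segments` produces wrong or missing markup, which is the failure mode described above.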

r12a commented 1 month ago

Does 'programmatically determined' mean:

  1. can be guessed at from plain text using heuristics, by a program which then adds markers or field boundaries such as lang attributes
  2. can be identified by a program from existing markup or other annotations in the text, etc.

The wording of the definition of that phrase is so vague that I can't really tell which of these two opposing meanings is relevant.
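To make the second meaning concrete, the sketch below reads the language of each text run purely from `lang` attributes already present in the markup, with no guessing from the text itself. It uses Python's standard `html.parser`; the class name `LangExtractor` and the short list of void elements are choices made for this illustration, and the code is not a model of how any particular screen reader works.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not disturb the language stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class LangExtractor(HTMLParser):
    """Collect (text, language) pairs using only lang attributes in the markup."""

    def __init__(self):
        super().__init__()
        self.lang_stack = [None]  # None means no language has been declared
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        declared = dict(attrs).get("lang")
        # An element inherits its parent's language unless it declares its own.
        self.lang_stack.append(declared or self.lang_stack[-1])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((data, self.lang_stack[-1]))


parser = LangExtractor()
parser.feed('<p lang="en">She said <span lang="fr">bonjour</span> and left.</p>')
parser.close()
print(parser.runs)
```

Text with no enclosing `lang` attribute is reported with a language of `None`, which is where the first meaning (guessing from plain text) would have to take over.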

xfq commented 1 month ago

We discussed this during the 8 August i18n working group teleconference, and we would like AG to add a note to clarify the intention of this statement, especially about the list of exceptions:

except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text

ChrisLoiselle commented 3 weeks ago

@mitchellevan @GreggVan @xfq, do you have a proposal that we could bring forward? Happy to assist. @pday1 for reference, either you or I can add to Google Doc and move forward.

ChrisLoiselle commented 3 weeks ago

@maryjom see prior comment. I've self-assigned.

GreggVan commented 3 weeks ago

@r12a asked

Does 'programmatically determined' mean

  • can be guessed at from plain text using heuristics, by a program which then adds markers or field boundaries such as lang attributes
  • can be identified by a program from existing markup or other annotations in the text, etc.

Well - almost both. But heuristics is NOT the right word (see definition in Wiktionary)

@ChrisLoiselle asked for wording for definition so here you are.

'programmatically determined' (as used in WCAG)

determined by software using the content as input

ChrisLoiselle commented 3 weeks ago

@xfq is your ask to AG or WCAG2ICT ? If WCAG2ICT, @xfq please provide a proposal of the change that I can then bring to the task force for response. Thanks for clarifying!

xfq commented 3 weeks ago

Sorry, I was asking WCAG2ICT. I cannot give a proposal at the moment, because we want to first get clarification about the intention behind this statement, especially the intention of the list of exceptions:

except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text

because we think the statement (and the definition of "programmatically determined") is rather vague, and readers who read this sentence may not necessarily understand how the language is determined, nor why these exceptions exist.

GreggVan commented 3 weeks ago

Let me take a pass at answering your question about intent.

The intent was to prevent authors from having to mark up things (or be called out for not marking up things) that did not need to be marked up or that an author would not think needed to be marked up.

First -- why do we ask that things from another language be marked? So that they can be pronounced properly. It is not so that you can look up the meanings. If that were true we would have had to put in a rule that all idioms and all words above high school level should be marked up so people could go find out what they meant (especially idioms).

Now -- why not mark up words that are from another language -- but are used in the common vernacular (commonly used in the language of the page)? That would mean marking up a whole slew of English words like Rendezvous (French), Déjà vu (French), Café (French), Ballet (French), Fiesta (Spanish), Tsunami (Japanese), Kindergarten (German), Safari (Swahili), Pizza (Italian), Karaoke (Japanese), and Taco (Spanish) that many people do not even know are foreign words. And in Holland -- you would have to mark up a majority(?) of words since their commonly spoken language is an amalgam of many different languages including their own.

Why not mark up technical terms from another language? (Same thing.) And many languages have no word other than the technical word. Transistor, for example, is used in all sorts of languages to stand for -- transistor. Like the words above -- they are considered the word for that concept in their language.

Why not words from an indeterminate language? Well, this one is self-explanatory. You can't mark it up if you can't determine what language it came from.

Why not proper names? Proper names are really hard. Marking them up actually gives you no clue as to how they should be pronounced. For example, there is a city in Illinois called Des Plaines. It is French in origin but is not pronounced in French (at least not if you live there). And there is a town called Peru which is pronounced PEEE RUE. So marking it up would cause it to be mispronounced. And then there are names like Smith, which is pronounced in at least five different ways that I know of, including with short and long i and more.

Does this help?

bruce-usab commented 2 weeks ago

@xfq -- please share if you are satisfied (or not) with the explanations provided.

Does "programmatically determined" mean heuristic detection...

The term means algorithmic detection.

maryjom commented 2 weeks ago

Final WCAG2ICT Task Force Answer:

@xfq Thank you for your comment. To answer your questions:

Does "programmatically determined" mean heuristic detection of language?

Programmatically determined means determined by software from author-supplied data, per the definition of ‘programmatically determined’ in WCAG. Examples are provided indicating what would be considered author-supplied data that user agents and assistive technologies can extract and present to the user (e.g. markup language elements and attributes or technology-specific data structures today and possibly direct determination by software in the future).

Why do these exceptions exist?

These exceptions were defined by WCAG 2.0 (and have persisted without change ever since).

Note that the WCAG2ICT TF has not modified these exceptions in any way. The only change WCAG2ICT made to the language for 3.1.2 Language of Parts is the word substitution exchanging web-centric terms with terms more applicable to non-web technologies. Per the WCAG2ICT Task Force’s scope of work, we are unable to change the meaning of a requirement by modifying normative language, including the exceptions.

If there is a concern with the exceptions or with the definition of “programmatically determined,” we recommend that you raise an issue in the WCAG repository. The WCAG2ICT repository is limited to comments on the interpretation, word substitutions, and notes describing how to apply WCAG in a non-web context. You could also refer to WCAG 2.2 Understanding SC 3.1.2 Language of Parts.

Unless the WCAG language gets changed by the Accessibility Guidelines Working Group (AGWG), the WCAG2ICT Task Force cannot make any further modifications to them in the WCAG2ICT Group Note.

xfq commented 1 week ago

Thank you for your reply. We discussed this during the 5 September i18n working group teleconference.

We understand that the definition of ‘programmatically determined’ and the list of exceptions are based on WCAG. The i18n WG has been promoting language metadata for strings (see specdev and string-meta). One of the reasons is that without language metadata, the language can only be determined accurately enough for a small number of languages through some algorithm (AI for language detection is improving, but it can still give false positives). We would like WCAG2ICT to have a note to point this out and encourage authors to provide language metadata (by using markup, for example) when possible.
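As a small illustration of what language metadata for a string can look like outside of markup, the sketch below pairs a string value with a language tag and a base direction and serializes it as JSON. The class name `LocalizedString` and the field names `value`, `lang`, and `dir` are assumptions chosen for this example rather than names required by any specification.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalizedString:
    """A string that carries its own language metadata (hypothetical shape)."""

    value: str
    lang: Optional[str] = None  # BCP 47 language tag such as "fr"; None if unknown
    dir: Optional[str] = None   # base direction: "ltr", "rtl", or None if unknown


title = LocalizedString(value="Le Petit Prince", lang="fr", dir="ltr")

# Serialize so the metadata travels with the string, e.g. in an API response.
payload = json.dumps(asdict(title), ensure_ascii=False)
print(payload)
```

A consumer that receives this payload can read the language directly instead of having to detect it from the text.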

For the text in WCAG, we hope to change this sentence to something like this (we will raise an issue against WCAG):

You SHOULD programmatically determine the language of the text. If metadata or author-supplied information about the language is not available, you MAY use a heuristic (or algorithm) to determine the language.
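The order of preference in this proposed wording can be sketched as a small function: use author-supplied language metadata when it exists, and fall back to detection only when it does not. Everything here is illustrative. `resolve_language` is a hypothetical name, and `toy_detector` is a deliberately crude placeholder that only recognizes Japanese kana, not a real language-detection algorithm.

```python
from typing import Callable, Optional, Tuple


def resolve_language(
    text: str,
    declared_lang: Optional[str],
    detect: Optional[Callable[[str], Optional[str]]] = None,
) -> Tuple[Optional[str], str]:
    """Return (language, source), preferring author-supplied metadata.

    source is "declared", "guessed", or "unknown" so that callers can
    treat a guess with less confidence than declared metadata.
    """
    if declared_lang:
        return declared_lang, "declared"
    if detect is not None:
        guess = detect(text)
        if guess:
            return guess, "guessed"
    return None, "unknown"


def toy_detector(text: str) -> Optional[str]:
    """Placeholder: report "ja" if the text contains hiragana or katakana."""
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return "ja"
    return None


print(resolve_language("こんにちは", "ja"))
print(resolve_language("こんにちは", None, toy_detector))
print(resolve_language("hello", None, toy_detector))
```

Returning the source alongside the language keeps the distinction between metadata and a guess visible to whatever consumes the result.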

maryjom commented 1 week ago

@xfq It is good to see that the i18n WG is promoting language metadata, as this helps content developers have documented methods of meeting SC 3.1.2 Language of Parts. Once published, these Notes will be good documents for WCAG to reference.

Because all of the content in the WCAG2ICT Note is non-normative, our Task Force has to be mindful to not provide techniques for meeting or not meeting particular success criteria. Doing so would be going outside of our statement of work.

However, we could encourage use of actual metadata and markup by providing that as an example in our note on 3.1.2 Language of Parts, drafted below.

Here's a possible draft change to the note in WCAG2ICT (by adding the first sentence):

Note 1: Examples of programmatic identification include language metadata or markup. There are some software and non-web document technologies where there is no assistive technology supported method for marking the language for the different passages or phrases in the non-web document or software, and it would not be possible to meet this success criterion with those technologies.

Would this change be sufficient to address your concern? If not, please help us draft what you are looking for, as we are very close to publication of the finalized WCAG2ICT Group Note and need to get any substantive changes incorporated as soon as possible for the AG WG final review before their CfC to publish.

xfq commented 6 days ago

Works for me. We will continue to discuss changes with the WCAG folks. Thank you for working on this!

maryjom commented 6 days ago

Closing as answered. Thanks @xfq.

xfq commented 5 days ago

I don't seem to see these changes in https://w3c.github.io/wcag2ict/#applying-sc-3-1-2-language-of-parts-to-non-web-documents-and-software . Are the changes still pending?

daniel-montalvo commented 5 days ago

Hi @xfq @maryjom

I also don't seem to find any PR addressing this. I'll reopen this issue and link #512, so that the issue closes when the PR is merged.

maryjom commented 5 days ago

@daniel-montalvo @xfq Thanks for double checking. Not sure what happened to my PR. Might have missed the second step of creating the PR and thought I had done it. The language is in the editor's draft now. See 3.1.2 Language of Parts Note 1.