w3c / w3c-website

W3C Website feedback and bug reports
https://www.w3.org/

How to sort the list of languages? #97

Closed vivienlacourba closed 1 year ago

vivienlacourba commented 2 years ago

Describe the issue

When a page has multiple translations available it is not clear in which order the various languages should be listed.

The translation component does not specify this. https://design-system.w3.org/components/translations.html

URL

Any page that has translations for example press releases e.g. https://www-dev.w3.org/press-releases/2021/webrtc-rec/

But also when displaying the various translations available for a TR. https://design-system.w3.org/templates/tr.html

2 translations (日本語, Magyar) for Accessible Rich Internet Applications (WAI-ARIA) 1.0

Recommended solution

That seems to be a tricky UX problem, which I started to discuss with @jean-gui, @simonrjones and @koalie. See also the User Experience Stack Exchange question "How do you sort a list of languages?".

My initial suggestion was to start with the languages we most often use on our site, which would be "English", then "Chinese", "Japanese" and "French", then the others in whatever order we decide.

@simonrjones looked at the BBC World Service news language order. It looks like they order by the Latin name (e.g. Chinese before Japanese). Otherwise ordering by popularity makes sense.

@jean-gui confirmed that sorting by English name was easy.

@r12a do you have a specific suggestion?

gosko commented 2 years ago

One easy option would be to sort by language code. This seems to be the order used on the W3C i18n site, for example on the articles & tutorials page.

r12a commented 2 years ago

This is a long-standing question, and i'm not sure there is any perfect answer. I also think the answer depends a little on the size and visibility of the list. Here are some thoughts.

Sorting by most-often-used language is mostly basing the decision on our view of the world, rather than the user's needs. The issue at hand is rather how to help the individual user locate their language as painlessly as possible, and with the minimal amount of implied bias. I don't think it should be an exercise in classification. We should look for a way of ordering items that implies no bias, and is predictable.

Let me make suggestions separately about general ordering and about raising things to the top of the list. We'll start with the former, because the general ordering is needed anyway whether or not the raising occurs.

Let me also preface what i say with the thought that there are different types of use case. Mostly, it seems to me that we'll be dealing with a smallish number of languages which will all be visible to the user, rather than dealing with a very long list of languages in a selection control that requires scrolling. (This may actually make it less important to raise certain languages to the top, but see below.)

Ordering by Latin name is problematic, mainly because it is highly biased to one culture and smacks of either cultural imperialism, or lack of concern. But it also has practical ramifications: a person looking up their own language would have to know how it is written in English, eg. the endonym Surayt is Turoyo in English, Farsi is Persian or Dari depending on the region, Nasa Yuwe is Páez, etc. It may also mean deciding between two or more alternative names that change the order – should a user expect to find their language under Jula or Djoula, Burmese or Myanmar, Swahili or Kiswahili, or Tamazight or Berber (which, although the more common name in English, is a non-preferred name for its speakers because it means 'barbarian')? Again, it doesn't seek to help the user quickly locate their language, but is a method that is simply convenient for the content creators.

Another possibility is to sort the items using the Unicode Collation Algorithm. This produces a fixed and predictable order for any sequence of items, but in this case all items using the same script are presented together – so the user looks for the script first, and then for the language. The appropriate order of languages within a script group is a little odd for the average user, since it won't follow the tailored collation algorithms for their particular language (not least because those alphabetic rules won't address all the characters needed for all the languages). This may not be an issue for typical lists of non-Latin script, since the number of items is likely to remain small, but for Latin-script (and Cyrillic or Arabic script languages) where the number of items might be larger, then it won't correspond to the alphabetic ordering for each language (eg. ä comes after z in Swedish, ch comes after h in Slovak, mb comes before ɓ and then c in Fula, etc.)

The way we order the 22 languages in the selector at https://www.w3.org/International/articlelist is to go by the English alphabetic order of the BCP language subtags for each language. It's not a perfect solution, but at least it produces a predictable order, and with slightly less apparent bias, since it's based on a global standard, rather than on English. For example, Greek is sorted under e for el, and German is under d for de. It also avoids the need to worry about language-specific tailoring of collation.
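As a rough illustration of ordering by language subtag, here is a minimal sketch. The sample data, the `tag`/`label` field names and the helper names are assumptions made for the example; this is not the code used on the i18n site.

```javascript
// Hypothetical sample data: a BCP 47 tag plus the name shown to the user.
const translations = [
  { tag: "ja", label: "日本語" },
  { tag: "el", label: "Ελληνικά" },
  { tag: "de", label: "Deutsch" },
  { tag: "hu", label: "Magyar" },
  { tag: "fr", label: "Français" },
];

// Compare by lowercased tag using plain code-unit order, so the result
// does not depend on the locale of the machine doing the sort.
function compareByTag(a, b) {
  const x = a.tag.toLowerCase();
  const y = b.tag.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
}

// Return a sorted copy and leave the input array untouched.
function sortByTag(items) {
  return [...items].sort(compareByTag);
}

console.log(sortByTag(translations).map((t) => t.tag).join(", "));
// prints "de, el, fr, hu, ja"
```

German lands under d and Greek under e, as described above, and no language-specific collation tailoring is involved.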

Now for raising certain languages to the top.

I always find it annoying when a pull-down list puts USA or US-English at the top, and i have to scroll forever to find UK or UK-English. In those situations, i can't help feeling a little as if the content developers thought i was less important than our American cousins. Sometimes, in a long list, if UK-English isn't at the top, i'll waste time scrolling down to find that it isn't there anyway, and i have to waste more time going back to the beginning.

Note that this is not so much of an issue if you're only dealing with up to 10 or 15 languages that are simultaneously visible, however if done well it could still be nice for the user. The question is how to do it well.

Any kind of ordering based on page usage rates sounds either like a non-user-centric view of the world, or implies a ranking of importance. It may also produce different orders from page to page or from time to time, which is also problematic.

I think that raising items to the top of the list needs to be done in a way that is clearly aimed at helping each user access their own language quickly, taking into account their individual point of view on the world.

So here's a suggestion. I think that a whizz bang implementation could look at the browser language preferences of the user and pull those items, in their already ranked order, to the top of the list. This would be very user-centric – adapting the list to reflect who is looking at it. Then the remaining languages would be ordered per one of the default orderings described above (i favour the language subtag approach). (Yes, sometimes, the user's language preferences won't be set in a way that reflects their actual language preferences, but actually much of the time it will, since those preferences tend to be set when the user installs a browser, and can also be changed by the user.)

To make it clear to the user what's going on, it would probably be best to visually show a clear division between the items that are raised to the top, and those that follow.
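The suggestion above could be sketched roughly as follows. The function name `splitByPreference` and the returned `raised`/`rest` shape are assumptions for illustration; keeping the two groups separate is meant to make it easy to render a visible division between them.

```javascript
// Split the available tags into those matching the user's preferences
// (kept in the user's ranked order) and the rest (sorted by tag).
// Tags are lowercased for comparison, so the result is lowercased too.
function splitByPreference(availableTags, preferredTags) {
  const available = availableTags.map((t) => t.toLowerCase());
  const raised = [];
  for (const pref of preferredTags.map((t) => t.toLowerCase())) {
    // Try the exact tag first, then fall back to the primary language subtag.
    for (const candidate of [pref, pref.split("-")[0]]) {
      if (available.includes(candidate) && !raised.includes(candidate)) {
        raised.push(candidate);
      }
    }
  }
  // Default ordering for everything else: plain sort by language tag.
  const rest = available.filter((t) => !raised.includes(t)).sort();
  return { raised, rest };
}

const result = splitByPreference(
  ["ja", "hu", "fr", "de", "zh-Hans"], // translations available on the page
  ["fr-FR", "en-GB", "de"]             // the user's ranked preferences
);
console.log(result.raised.join(", ")); // prints "fr, de"
console.log(result.rest.join(", "));   // prints "hu, ja, zh-hans"
```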

hope that helps

gosko commented 2 years ago

Sorting by language subtag sounds good to me.

If we also want to reorder the list or highlight entries based on the user's preference we would likely need to do that using javascript since most pages won't be customized per-user (for performance and cache/CDN efficiency)

It looks like javascript has access to accept-language using navigator.languages (and can fall back to UI language if desired)
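A minimal sketch of that lookup with the fallback, assuming a helper named `getPreferredLanguages` (a name made up here) that takes a navigator-like object so it can also be exercised outside a browser:

```javascript
// Return the user's ranked language preferences as an array of tags.
// `nav` is expected to look like the browser's `navigator` object.
function getPreferredLanguages(nav) {
  if (Array.isArray(nav.languages) && nav.languages.length > 0) {
    return [...nav.languages];
  }
  // Fall back to the single UI language when the ranked list is unavailable.
  return nav.language ? [nav.language] : [];
}

// In a browser this would be called as getPreferredLanguages(navigator).
console.log(getPreferredLanguages({ languages: ["fr-FR", "en"], language: "fr-FR" }));
console.log(getPreferredLanguages({ language: "de" }));
```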

vivienlacourba commented 2 years ago

Thx @r12a for the detailed answer.

Based on your input this is what systeam will implement:

r12a commented 2 years ago

Thanks Vivien. We also discussed this during our i18n telecon last week. There was some question about whether reordering was needed for short lists of links, where all are visible and readily accessible – compared to long lists in select controls, where it helps avoid scrolling.

There's obviously a clear practical benefit for long lists. But I don't have a strong opinion about whether reordering is useful for short, all-visible lists. At first, i thought it may not be needed, since it's easy enough to spot say French in a list of 10 to 15 items, however I imagine that it could in fact be useful in that the user will always know where to find the language(s) that they are most interested in. Such short lists can change from page to page, depending on what translations are available, which means that, although the relative order of the default sorting is always the same, the position of say French (if it's there at all) relative to other links can potentially vary. If the user is a French speaker, it therefore perhaps makes some sense to move French to the top of the list, so they can find it even quicker.

On the client side @gosko will work on the JavaScript to re-order this list based on user's browser preference.

preference -> preferences. I expect that if the user has 2 or 3 languages set in their preferences, then it will be useful to raise all of them to the top. The order of the raised items should be the same as the ranking used for the browser preferences.

hope that helps.

gosko commented 1 year ago

I just noticed lists of translations are still not displayed in any particular order (at least not any order I can identify), for example in translated specs on the TR page or a group's publications page

This is probably not super important, but @jean-gui would it be easy to sort these lists by their hreflangs?

jean-gui commented 1 year ago

Sorting of pages was implemented a while ago, and I just implemented sorting of spec translations. I'm leaving this issue open so @denis and/or @gosko can implement the client-side part.

deniak commented 1 year ago

I'll take care of the JS

gosko commented 1 year ago

Thanks @deniak, let me know if I can help or review.

deniak commented 1 year ago

The following JS should work on /TR and the group publications page:

document.addEventListener("DOMContentLoaded", function() {
    const userLangs = navigator.languages || [navigator.language || navigator.userLanguage];

    // add the primary language subtag after each tag (e.g. fr-fr -> fr-fr, fr), then drop duplicates
    const userLangVars = userLangs.flatMap(i => [i, i.split("-")[0]]);
    const langs = userLangVars.filter((item, index) => userLangVars.indexOf(item) === index);

    // only look for specs with at least 2 translations
    const secondTranslations = document.querySelectorAll("p > a[hreflang]:nth-child(2)");
    secondTranslations.forEach((link) => {
        const parent = link.parentNode;
        const children = parent.querySelectorAll("a[hreflang]");
        const sortedChildren = [...children].sort((a, b) => {
            const aLang = a.getAttribute("hreflang");
            const bLang = b.getAttribute("hreflang");
            const aIndex = langs.indexOf(aLang);
            const bIndex = langs.indexOf(bLang);
            if (aIndex === -1 && bIndex === -1) {
                return 0;
            } else if (aIndex === -1) {
                return 1;
            } else if (bIndex === -1) {
                return -1;
            } else {
                return aIndex - bIndex;
            }
        });

        children.forEach((child, index) => {
            let parentIndex = index * 2 + 1; // skips comma separator
            const newNode = sortedChildren[index].cloneNode(true); // clone to avoid moving the node
            if (child !== sortedChildren[index]) {
                parent.replaceChild(newNode, parent.childNodes[parentIndex]);
            }
        });
    });
});

One thing to note is the language code associated with a translation is different from the ones the browsers allow you to select. In the symfony backend, we rely on the type LocaleType so the code can be fr, fr_FR, zh or zh_Hant. That's the code we use for the hreflang attribute.

On the other hand, browsers follow RFC 5646 to identify languages. So navigator.languages will return values like fr, fr-FR, zh and zh-CN.

I'm not sure what's the best way to handle that particular issue. I was thinking we could simply drop the part after the country code when sorting the list but I don't know if this is acceptable.

xfq commented 1 year ago

One thing to note is the language code associated with a translation is different from the ones the browsers allow you to select. In the symfony backend, we rely on the type LocaleType so the code can be fr, fr_FR, zh or zh_Hant. That's the code we use for the hreflang attribute.

I think this is wrong, because the value of hreflang must be a valid BCP 47 language tag. The system should turn underscores into hyphens. As an example, the hreflang="zh_Hans" and lang="zh_Hans" in https://beta.w3.org/TR/?filter-tr-name=XML+Information+Set+%28Second+Edition%29 should be changed.

I'm not sure what's the best way to handle that particular issue. I was thinking we could simply drop the part after the country code when sorting the list but I don't know if this is acceptable.

The part after the primary language subtag is not needed in most cases, but it is sometimes necessary. For example, zh-Hans and zh-Hant are used to distinguish Simplified Chinese and Traditional Chinese, which is very useful, otherwise the browser may choose the wrong font when rendering the text, choose the wrong dictionary when doing spell checking, or search engines won't be able to show the most appropriate version to the reader, etc.
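One possible way to turn the stored locale codes into valid language tags on the JavaScript side is sketched below. `toLanguageTag` is a hypothetical helper name; the sketch replaces underscores with hyphens and then relies on the standard `Intl.getCanonicalLocales` function for casing. It is not what the Symfony backend does, and it assumes the input is otherwise a well-formed tag.

```javascript
// Convert a locale code such as "zh_Hant" or "fr_FR" into a BCP 47 tag.
// Intl.getCanonicalLocales throws a RangeError for structurally invalid tags.
function toLanguageTag(locale) {
  const hyphenated = locale.replace(/_/g, "-");
  return Intl.getCanonicalLocales(hyphenated)[0];
}

console.log(toLanguageTag("zh_Hant")); // prints "zh-Hant"
console.log(toLanguageTag("fr_FR"));   // prints "fr-FR"
console.log(toLanguageTag("fr"));      // prints "fr"
```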

deniak commented 1 year ago

One thing to note is the language code associated with a translation is different from the ones the browsers allow you to select. In the symfony backend, we rely on the type LocaleType so the code can be fr, fr_FR, zh or zh_Hant. That's the code we use for the hreflang attribute.

I think this is wrong, because the value of hreflang must be a valid BCP 47 language tag. The system should turn underscores into hyphens. As an example, the hreflang="zh_Hans" and lang="zh_Hans" in https://beta.w3.org/TR/?filter-tr-name=XML+Information+Set+%28Second+Edition%29 should be changed.

Good catch. @jean-gui, I guess this means updating the type of the field and the existing records, or at least, find a way to convert the code. @xfq, are you saying zh_Hans should be translated to zh-Hans? Are you able to configure your browser to get that tag from your browser using navigator.languages? I'm only able to have zh, zh-cn, zh-hk, zh-sg and zh-tw.

I'm not sure what's the best way to handle that particular issue. I was thinking we could simply drop the part after the country code when sorting the list but I don't know if this is acceptable.

The part after the primary language subtag is not needed in most cases, but it is sometimes necessary. For example, zh-Hans and zh-Hant are used to distinguish Simplified Chinese and Traditional Chinese, which is very useful, otherwise the browser may choose the wrong font when rendering the text, choose the wrong dictionary when doing spell checking, or search engines won't be able to show the most appropriate version to the reader, etc.

I'm not saying we shouldn't add the subtag. I'm only talking about how we should/could sort the translations based on the hreflang. If we can have a consistent mapping between the language tags we store for a translation and the languages a user can select from their browser, then the sort should be fairly easy but that's not the case today.

xfq commented 1 year ago

One thing to note is the language code associated with a translation is different from the ones the browsers allow you to select. In the symfony backend, we rely on the type LocaleType so the code can be fr, fr_FR, zh or zh_Hant. That's the code we use for the hreflang attribute.

I think this is wrong, because the value of hreflang must be a valid BCP 47 language tag. The system should turn underscores into hyphens. As an example, the hreflang="zh_Hans" and lang="zh_Hans" in https://beta.w3.org/TR/?filter-tr-name=XML+Information+Set+%28Second+Edition%29 should be changed.

Good catch. @jean-gui, I guess this means updating the type of the field and the existing records, or at least, find a way to convert the code. @xfq, are you saying zh_Hans should be translated to zh-Hans?

Yes.

Are you able to configure your browser to get that tag from your browser using navigator.languages? I'm only able to have zh, zh-cn, zh-hk, zh-sg and zh-tw.

I tried setting the preferred language to "Chinese (Simplified)" in Chrome and Firefox and the result was indeed zh-cn, but in theory it should be zh-hans.

Maybe we should map zh-cn/zh-sg to zh-hans, and map zh-hk/zh-tw to zh-hant. @r12a, WDYT?

I'm not sure what's the best way to handle that particular issue. I was thinking we could simply drop the part after the country code when sorting the list but I don't know if this is acceptable.

The part after the primary language subtag is not needed in most cases, but it is sometimes necessary. For example, zh-Hans and zh-Hant are used to distinguish Simplified Chinese and Traditional Chinese, which is very useful, otherwise the browser may choose the wrong font when rendering the text, choose the wrong dictionary when doing spell checking, or search engines won't be able to show the most appropriate version to the reader, etc.

I'm not saying we shouldn't add the subtag. I'm only talking about how we should/could sort the translations based on the hreflang. If we can have a consistent mapping between the language tags we store for a translation and the languages a user can select from their browser, then the sort should be fairly easy but that's not the case today.

OK. Thanks for the explanation.

deniak commented 1 year ago

Are you able to configure your browser to get that tag from your browser using navigator.languages? I'm only able to have zh, zh-cn, zh-hk, zh-sg and zh-tw.

I tried setting the preferred language to "Chinese (Simplified)" in Chrome and Firefox and the result was indeed zh-cn, but in theory it should be zh-hans.

Maybe we should map zh-cn/zh-sg to zh-hans, and map zh-hk/zh-tw to zh-hant. @r12a, WDYT?

I'm also wondering if there's an official list of all the BCP47 language tags. I'm only able to find some lists people uploaded (https://appmakers.dev/bcp-47-language-codes-list/ or https://gist.github.com/typpo/b2b828a35e683b9bf8db91b5404f1bd1) but they seem far from complete and they also don't mention zh-hans or zh-hant

r12a commented 1 year ago

@deniak please read https://www.w3.org/International/articles/language-tags/ TLDR: there is no single list of BCP47 language tags but there are definitive lists of the subtags that make them up. See https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry or for a more user friendly option see my app at https://r12a.github.io/app-subtags/

Chinese is a special case, so you'll need to recognise zh-CN as Simplified Chinese as well as zh-Hans, but it's best to produce zh-Hans rather than perpetuate the bad usage. Same applies for traditional wrt zh-TW and zh-Hant.
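A small sketch of recognising the region-based Chinese tags while producing the script-based ones. The table and the function name are illustrative assumptions; zh-CN and zh-TW come from the comment above, while the zh-SG and zh-HK entries follow the mapping proposed earlier in this thread.

```javascript
// Region-based Chinese tags that browsers report, mapped to script-based tags.
const CHINESE_REGION_TO_SCRIPT = new Map([
  ["zh-cn", "zh-Hans"],
  ["zh-sg", "zh-Hans"],
  ["zh-tw", "zh-Hant"],
  ["zh-hk", "zh-Hant"],
]);

// Return the script-based tag for a known Chinese region tag,
// or the original tag unchanged for anything else.
function normalizeChineseTag(tag) {
  return CHINESE_REGION_TO_SCRIPT.get(tag.toLowerCase()) ?? tag;
}

console.log(normalizeChineseTag("zh-CN")); // prints "zh-Hans"
console.log(normalizeChineseTag("zh-TW")); // prints "zh-Hant"
console.log(normalizeChineseTag("fr-CA")); // prints "fr-CA"
```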

deniak commented 1 year ago

@jean-gui I've submitted a PR to fix the lang attributes. Once it's merged and deployed, we should be able to add the following JS to sort translations based on user's preferences:

document.addEventListener("DOMContentLoaded", function() {
    const userLangs = navigator.languages || [navigator.language || navigator.userLanguage];

    // add the primary language subtag after each tag (e.g. fr-fr -> fr-fr, fr), then drop duplicates
    const userLangVars = userLangs.flatMap(i => [i.toLowerCase(), i.split("-")[0].toLowerCase()]);
    const langs = userLangVars.filter((item, index) => userLangVars.indexOf(item) === index);
    // add zh-hans and zh-hant
    if (langs.some(i => ["zh-cn", "zh-sg", "zh"].includes(i))) {
        langs.splice(Math.max(langs.indexOf("zh-cn"), langs.indexOf("zh-sg"), langs.indexOf("zh")), 0, "zh-hans");
    }
    if (langs.some(i => ["zh-hk", "zh-tw"].includes(i))) {
        langs.splice(Math.max(langs.indexOf("zh-tw"), langs.indexOf("zh-hk")), 0, "zh-hant");
    }

    // only look for specs with at least 2 translations
    const secondTranslations = document.querySelectorAll("p > a[hreflang]:nth-child(2)");
    secondTranslations.forEach((link) => {
        const parent = link.parentNode;
        const children = parent.querySelectorAll("a[hreflang]");
        const sortedChildren = [...children].sort((a, b) => {
            const aLang = a.getAttribute("hreflang").toLowerCase();
            const bLang = b.getAttribute("hreflang").toLowerCase();
            const aIndex = langs.indexOf(aLang);
            const bIndex = langs.indexOf(bLang);
            if (aIndex === -1 && bIndex === -1) {
                return 0;
            } else if (aIndex === -1) {
                return 1;
            } else if (bIndex === -1) {
                return -1;
            } else {
                return aIndex - bIndex;
            }
        });

        children.forEach((child, index) => {
            let parentIndex = index * 2 + 1; // skips comma separator
            const newNode = sortedChildren[index].cloneNode(true); // clone to avoid moving the node
            if (child !== sortedChildren[index]) {
                parent.replaceChild(newNode, parent.childNodes[parentIndex]);
            }
        });
    });
});

@xfq, @r12a, I'm treating zh/zh-cn/zh-sg as zh-hans and zh-hk/zh-tw as zh-hant.

vivienlacourba commented 1 year ago

@deniak as discussed, your proposed JS hardcodes some markup assumptions, e.g. that the list of translations is a series of links within a <p> and that those links are separated by commas. To make it more generic and resilient to markup changes, would it be cleaner to use a "language-list" class for finding that list, and maybe use an explicit list with <ul>?

deniak commented 1 year ago

@deniak as discussed, your proposed JS hardcodes some markup assumptions, e.g. that the list of translations is a series of links within a <p> and that those links are separated by commas. To make it more generic and resilient to markup changes, would it be cleaner to use a "language-list" class for finding that list, and maybe use an explicit list with <ul>?

+1 to adding a class to the <p>, but in that particular case the list of languages is in the middle of a <p>, so a <ul> would break the sentence. I don't think we should use such a tag here.
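One way to avoid depending on the comma positions, sketched under the assumption that the container carries the suggested "language-list" class: sort only the link nodes and put them back into the slots the links already occupied, leaving every other child (commas, surrounding sentence text) where it was. The helper name and the commented browser wiring are illustrative, not the deployed code.

```javascript
// Return a copy of `nodes` in which the items accepted by `isLink` appear
// in sorted order, while all other items keep their original positions.
function reorderLinksInPlace(nodes, isLink, compare) {
  const sortedLinks = nodes.filter(isLink).sort(compare);
  let next = 0;
  return nodes.map((node) => (isLink(node) ? sortedLinks[next++] : node));
}

// Possible browser wiring (assumes the class name and a `compare` function
// that ranks <a hreflang> elements by the user's preferences):
//
//   const list = document.querySelector(".language-list");
//   const isLink = (n) => n.nodeType === 1 && n.matches("a[hreflang]");
//   list.replaceChildren(...reorderLinksInPlace([...list.childNodes], isLink, compare));

// Stand-in data for trying the helper outside a browser:
// strings play the role of text nodes, objects the role of links.
const parts = ["Translations: ", { lang: "ja" }, ", ", { lang: "hu" }, "."];
const out = reorderLinksInPlace(
  parts,
  (n) => typeof n === "object",
  (a, b) => (a.lang < b.lang ? -1 : a.lang > b.lang ? 1 : 0)
);
console.log(out.map((n) => (typeof n === "object" ? n.lang : n)).join(""));
// prints "Translations: hu, ja."
```

Because the separators never move, this works the same whether the links sit in a dedicated element or in the middle of a sentence.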