w3c / csswg-drafts

Consider Canonicalization of language tags in :lang() selector matching #4154

Open frivoal opened 5 years ago

frivoal commented 5 years ago

Cantonese may be described with any of the following language tags: zh-HK, zh-yue, yue.

Even though these refer to the same language, :lang(yue) would only match the last one.

RFC 5646 section 4.5 together with the IANA Language Subtag Registry define canonicalization and mappings that would allow matching the last two (by extlang form). RFC 4647 section 3.2 says we should use this.

The last paragraph of that same section 3.2 also says we may also want to consider mappings like zh-HK to zh-yue, and that too could seem appropriate, but as far as I know there’s no equivalent to IANA Language Subtag registry for such mappings.

While authors can deal with this themselves by using :lang(yue, zh-yue, zh-HK), to the extent we can automate this, they shouldn't have to.

(I'm taking Cantonese as an example, but the same can be said about other languages)
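
To make the current behaviour concrete, here is a minimal Python sketch of RFC 4647 extended filtering (section 3.3.2), the matching that :lang() is assumed to build on. The function name `extended_filter` is illustrative and not taken from any spec or library, and no canonicalization is applied. It tests the three tags from the workaround selector above against the range yue.

```python
def extended_filter(language_range: str, tag: str) -> bool:
    """Return True if `tag` matches `language_range` (RFC 4647, section 3.3.2)."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # Step 1: the first subtags must match, unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # wildcards are skipped
        elif t >= len(tag_subtags):
            return False                # the range asks for more than the tag has
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton such as "x"
        else:
            t += 1                      # skip an unmatched subtag in the tag
    return True


cantonese_tags = ["zh-HK", "zh-yue", "yue"]
matched = [tag for tag in cantonese_tags if extended_filter("yue", tag)]
print(matched)  # ['yue']
```

Because the first subtag must match exactly, the range yue can never match a tag that starts with zh, which is why the author currently has to list all three.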

r12a commented 5 years ago

Section 4.5 talks about canonicalising items that are marked in the registry as deprecated (eg. grandfathered tags) or that contain extlang subtags (eg. zh-yue -> yue). These are things that can be determined automatically by using the information in the registry. But it doesn't include zh-HK.

Section 3.2 mentions zh-CN as a possible equivalent to zh-Hans, but I read this as a configuration option offered to the user, not an assumption that they mean the same thing in all contexts (which is what canonicalisation means) or that the author intended one while writing the other. And I think assuming that zh-HK means yue is more of a stretch than the zh-CN/zh-Hans assumption.

I think we need to be careful about making inappropriate assumptions on behalf of the content author. zh-HK may be used to mean Mandarin Chinese, but with traditional script, or possibly written text that includes the few additional characters that are used in Hong Kong. In fact, I thought that the predominant legacy usage for zh-HK arose from an early workaround before zh-Hant existed, not for yue. Of course, it could also be used for Cantonese. But it's possible that it's even used for Hakka or Minnan Chinese as spoken in Hong Kong. It depends, really, on what the author needed and intended. It may also depend on whether it is used for a spoken or written phrase.

So I think it makes sense to assume equivalence for language tags that can be automatically paired using information in the registry, but not for things like zh-HK.
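
As a rough illustration of what "automatically paired using information in the registry" could look like, the Python sketch below reads a few records written in the IANA Language Subtag Registry's record format and applies their Preferred-Value fields. The records are abbreviated and copied by hand, so treat their contents as assumptions to check against the live registry. The helper names `parse_records` and `canonicalize` are hypothetical.

```python
# ASSUMPTION: abbreviated, hand-copied records in the registry's format.
REGISTRY_EXCERPT = """\
Type: redundant
Tag: zh-yue
Description: Cantonese
Preferred-Value: yue
%%
Type: extlang
Subtag: yue
Description: Yue Chinese
Preferred-Value: yue
Prefix: zh
%%
Type: grandfathered
Tag: i-klingon
Description: Klingon
Preferred-Value: tlh
"""


def parse_records(text):
    """Split the excerpt on '%%' and turn each record into a dict of fields."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            key, _, value = line.partition(": ")
            record[key] = value
        if record:
            records.append(record)
    return records


def canonicalize(tag, records):
    """Apply whole-tag mappings first, then the prefix + extlang replacement."""
    lowered = tag.lower()
    for rec in records:
        if (rec["Type"] in ("redundant", "grandfathered")
                and rec["Tag"].lower() == lowered
                and "Preferred-Value" in rec):
            return rec["Preferred-Value"]
    subtags = tag.split("-")
    if len(subtags) >= 2:
        for rec in records:
            if (rec["Type"] == "extlang"
                    and rec["Subtag"] == subtags[1].lower()
                    and rec["Prefix"] == subtags[0].lower()):
                # Drop the enclosing primary language, keep any later subtags.
                return "-".join([rec["Preferred-Value"]] + subtags[2:])
    return tag


RECORDS = parse_records(REGISTRY_EXCERPT)
for tag in ["zh-yue", "zh-yue-Hant", "i-klingon", "zh-HK"]:
    print(tag, "->", canonicalize(tag, RECORDS))
```

Nothing in these records touches zh-HK, so it comes back unchanged. That is the distinction drawn above: the registry data can pair zh-yue with yue, but it says nothing about zh-HK.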

css-meeting-bot commented 4 years ago

The CSS Working Group just discussed canonicalization of :lang() selectors.

The full IRC log of that discussion
<heycam> Topic: canonicalization of :lang() selectors
<heycam> github: https://github.com/w3c/csswg-drafts/issues/4154
<heycam> florian: the :lang selector lets you select pieces of the DOM for styling based on the language
<heycam> ... it's already somewhat smart, since lang tags are structured
<heycam> ... selecting zh, and the document saying zh-Hant, it will do the right thing and match it
<heycam> ... that logic is already built in
<heycam> ... the IANA maintains a registry of the languages that exist and what they mean
<heycam> ... tags and subtags
<heycam> ... and in addition to just listing them, there is logic in that registry. some languages are a deprecated version of some other languages
<heycam> ... Cantonese used to be zh-yue. that is deprecated and replaced with yue
<heycam> ... the lang selector does not take that logic into account
<heycam> ... so if you have a document marked as lang="yue", and you are matching :lang(zh) or :lang(zh-yue), it won't match
<heycam> ... we may want to use the registry definitions of how to match
<heycam> ... I propose we do that
<heycam> addison: some tag canonicalization is defined by BCP 47 to consume some of the information in the registry
<heycam> ... you've been corresponding on the IETF languages list and I think some of your questions have been about handling macro-languages -- zh-yue is a macro language
<heycam> florian: zh-yue is a macro language, zh is a macro language
<heycam> addison: there's a separate thing. previous to the current BCP 47, there was a mechanism for registering whole tags
<heycam> ... that's grandfathered now
<heycam> ... some of them match subtags, some don't
<heycam> ... [...] is replaced by xtg
<heycam> addison: ignoring grandfathered tags, they all map to something. the ones you're referring to are structurally identical, the tags are composed of subtags
<heycam> ... like zh-yue
<heycam> florian: the way I'm looking at this, there are a variety of reasons for why certain languages might be the same
<heycam> ... there is a defined canonicalization that handles some of them
<heycam> addison: for the BCP 47 canonicalization, that will do away with the grandfathered ones and other structural weirdness
<heycam> florian: it won't deal with the two types of norwegian
<heycam> ... this is a complicated topic with many weird variants
<heycam> addison: there's a subset there that's well defined
<heycam> ... there's a second set of rules, which are in CLDR
<heycam> ... UTF 35
<AmeliaBR> s/UTF/UTR/
<heycam> ... for handling some additional cases around Chinese, where you have different script subtags that you want to appear or not in some circumstances
<heycam> ... some of those may be of interest, but it's more complicated
<heycam> ... I don't want to pretend that doesn't exist, but they do
<heycam> florian: if you have a link, please drop it
<heycam> addison: defining matching, if you're just using BCP 47 "lookup" IINM
<heycam> florian: extended filtering
<heycam> ... the text for extended filtering says you should canonicalize
<heycam> addison: yes you should
<heycam> florian: thanks for bringing up that the topic is broader
<heycam> addison: if you do the minimum set, it'll make it the most predictable. the other aspects are worth studying
<heycam> ... there are some annoying corner cases in Chinese
<heycam> florian: I hear support for the current proposal, and complicated problems to think about in addition to that
<heycam> addison: yes I agree with your current proposal and then do further study, and track the other standards happening in that space
<heycam> florian: there is a PR for this
<heycam> addison: should we review that?
<fantasai> https://github.com/frivoal/csswg-drafts/commit/3cff5d844b6415ef30d3e2dac221f9479e0ec7aa
<heycam> florian: if you haven't I suggest you do
<heycam> AmeliaBR: the other question on the topic, do we have implementor commitments?
<heycam> r12a: the current text I'm looking at says "... must be converted to extlang form"
<heycam> ... that's a slightly different discussion from what you canonicalize it as
<heycam> ... zh-yue would become yue
<heycam> florian: I had that discussion on the list as well
<heycam> ... this is the right direction
<heycam> ... zh doesn't match yue. so if you canonicalize both to extlang format, it'll match
<heycam> florian: I raised this on the mailing list, and they agreed it was the right form to canonicalize it to
<heycam> addison: some people on the list did
<heycam> ... the challenge is that this will bring you more promiscuous matching than the author may have intended
<heycam> ... it'll make Cantonese match Mandarin Chinese in some cases
<heycam> florian: if you want to match Mandarin specifically that's also possible
<heycam> addison: normally Mandarin is tagged just as zh
<heycam> r12a: for all the macro languages there's usually a preferred language
<heycam> fantasai: if the author cares that much, they can put the information there
<Rossen__> q?
<heycam> addison: that's right
<duerst> q+
<heycam> ... you don't want to have them with a correctly tagged document, have the :lang match things they were [...]
<xfq> ack du
<heycam> duerst: that mailing list is no longer a WG
<addison> http://www.unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers
<heycam> ... so people can give you opinions and background knowledge, but no formal resolutions
<AmeliaBR> So, two cases: (A) author used zh in stylesheet and yue in HTML; doesn't expect a match. (B) author used zh in stylesheet and zh-yue in HTML; does expect a match. Canonicalizing both yue and zh-yue to the same value will break one or the other.
<Rossen__> q?
<heycam> florian: I agree that the problem can exist in both directions, too much or not enough, I think since we're doing it for typographical purposes, and the languages are related, most of the time if you have zh styles you want it to match Cantonese too
<addison> http://www.unicode.org/reports/tr35/#Likely_Subtags
<heycam> ... it's possible to style Mandarin differently from Cantonese, Hakka, etc., but that's rare
<heycam> r12a: it's not just Chinese we're talking about
<heycam> ... there are other languages that have much more differentiation between the language depending on which of the subtags you choose
<AmeliaBR> q+ to suggest that this is better dealt with in the user agent stylesheet
<heycam> ... the point I wanted to make was that we said that let's go ahead with the proposal at the moment
<heycam> ... looking at the issue, there was a proposal you wrote, I responded saying you had to modify that
<heycam> ... the PR doesn't say much
<heycam> ... not sure what the exact proposal is
<heycam> ... I think this information we're talking about now should also be part of that
<heycam> florian: the earlier proposal that you rightfully pointed out I wrote too much, including making zh-HK match yue and things like this, that's not defined in the repo I'm referring to
<heycam> ... I'm just saying, just the canonicalization to extlang form as defined by BCP 47
<heycam> ... and as supported by the mailing list that used to be the WG defining that document
<heycam> ... but whichever way we go, including no change at all, has a risk of mismatching things in some cases
<heycam> addison: not all tags match all values, otherwise what's the point
<dbaron> s/WG defining/WG that used to define/
<heycam> ... the problem is to arrive at something that authors understand how to get the results they want
<heycam> ... we'll make some compromises, the question is which ones
<heycam> fantasai: based on the conversation so far, it seems like I don't think canonicalizing yue to zh-yue is going to be good. either we don't canonicalize, or in a direction where zh encompasses Cantonese
<heycam> ... I am sure there are style sheets that just use :lang(zh), and they'll expect it to match
<heycam> addison: the other possibility is that the inclusion or non-inclusion of the enclosing subtag -- in this case zh -- is a choice the author is making deliberately. if they've made that choice deliberately, if we mess with their tags when doing matching it may produce results they don't expect
<heycam> ... most of the matching algorithms are strict "remove from right" subtag matching
<heycam> ... to make it obvious what's happening
<heycam> ... when you start adding or subtracting subtags in ways other than the deprecation/renaming, I think that has more risk to it in your space
<heycam> ... since it's not obvious what's going to happen
<heycam> ... I would support doing the mappings that are in the registry, since that's where if you have multiple variations, because people have older documents and style sheets, they'll get the right answer
<heycam> ... that's different than adding or subtracting subtags
<xfq> ack Ame
<Zakim> AmeliaBR, you wanted to suggest that this is better dealt with in the user agent stylesheet
<heycam> AmeliaBR: we covered a lot of what I was going to say, but with a different conclusion
<heycam> ... it's important that when matching a style sheet and a document that we respect the way that the author matched it, don't want to introduce spurious matching from canonicalization
<heycam> ... also don't want to break matching
<heycam> ... from the examples brought up, it's obvious that any canonicalization may end up breaking one site or the other
<heycam> ... the question is then how do we make it easier in the general case for having new style sheets or new UA style rules deal with all these deprecated synonyms
<heycam> ... at the UA style sheet, that can just be an advice to UAs to look up the BCP deprecation list
<heycam> ... then also include the deprecated synonyms
<heycam> ... that doesn't work for things like a style sheet that is coming from a library or CSS reset
<heycam> ... or the case of newer code, writing a new style sheet, but still applying to the old pages with the older language tags
<heycam> ... one approach that might address that use case is something like what we do with case insensitive selector matching
<heycam> ... a flag in the selector that means "this value or any synonyms"
<heycam> florian: so an opt in for canonicalization
<heycam> addison: there are three sets
<heycam> ... the grandfathered list is permanently fixed and has been for 10 years
<heycam> ... all those tags have explicit mappings, you can safely map them to modern equivalents or vv
<heycam> addison: individual subtags that have mappings, it's mostly about countries going out of business
<heycam> ... yiddish has two subtags, hebrew has two subtags, there's a canonical one
<heycam> ... the third thing is the extlang thing, which is inconvenient
<heycam> ... because there's two ways to say things. with or without the enclosing subtag
<heycam> ... the canonicalization rule in BCP 47 says you can drop the primary language subtag and use the extlang by itself
<heycam> ... it's permissible for implementations to do that
<heycam> ... I don't recall it says you can put it back
<heycam> florian: there are 2 sets of rules
<heycam> ... one that just strips it off. the other says when you're done stripping it off, put it back
<heycam> r12a: it says you could consider doing that
<heycam> addison: the first two are completely safe
<heycam> ... you want to do those
<heycam> ... for interop
<heycam> ... the extlang thing, I think you can choose
<heycam> ... whether to put the enclosing subtag on
<heycam> ... the challenge is that Chinese you'd want to do that, but some of the other macro languages are not as crisp. Arabic is one of these, Malaysian
<r12a> https://r12a.github.io/app-subtags/
<heycam> r12a: Omani Arabic and Moroccan Arabic, which treat certain things differently, may have different font requirements
<heycam> ... but they both resolve to "ar" if we follow this PR
<heycam> ... but that's used for standard Arabic
<heycam> florian: I think we're not ready to merge the PR
<heycam> ... action items: the safe subset of canonicalization, I don't think it's defined as a canonicalizing operation separately from the extlang thing
<heycam> ... action on me to find out if we can
<heycam> addison: this is an area that probably deserves better documentation from us
<heycam> ... we can go offline and make sure we get the right answer
<heycam> ... we can go back and talk to the locale folks at Unicode and the languages list and make sure we're capturing the sense of this
<heycam> florian: one, figure out if the safe subset exists as a standard operation
<heycam> ... two, if we do what I'm proposing, look at the affected languages and see if it's good for them
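
The second "safe" set mentioned in the log (individual deprecated subtags with a replacement) can be sketched in Python as below. The tables hold a handful of well-known replacements written out by hand for illustration; a real implementation would presumably derive them from the registry. The function name `replace_deprecated` is hypothetical, and grandfathered whole tags and extlangs are deliberately left out.

```python
# ASSUMPTION: small hand-written excerpts, not the full registry data.
DEPRECATED_LANGUAGES = {"iw": "he", "ji": "yi", "in": "id"}  # Hebrew, Yiddish, Indonesian
DEPRECATED_REGIONS = {"BU": "MM", "DD": "DE"}                # Burma, East Germany


def replace_deprecated(tag: str) -> str:
    """Swap a deprecated language or region subtag for its replacement."""
    subtags = tag.split("-")
    subtags[0] = DEPRECATED_LANGUAGES.get(subtags[0].lower(), subtags[0])
    for i in range(1, len(subtags)):
        if len(subtags[i]) == 1:
            break  # extension or private-use singleton: leave the rest alone
        if len(subtags[i]) == 2 and subtags[i].isalpha():
            # The first two-letter subtag after the language is the region.
            subtags[i] = DEPRECATED_REGIONS.get(subtags[i].upper(), subtags[i])
            break
    return "-".join(subtags)


for tag in ["iw", "ji-US", "my-BU", "zh-Hant-HK"]:
    print(tag, "->", replace_deprecated(tag))
```

These replacements rename a subtag without adding or removing one, so prefix-based matching behaves the same before and after. That is the property that separates them from the extlang question.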

aphillips commented 4 years ago

Thanks for the opportunity to discuss at TPAC. I'm going to add some personal notes here that hopefully will help further discussion.

The section in BCP47 on canonicalization can be found here. As noted, this process includes several operations for dealing with different types of mapping for grandfathered/redundant tags as well as changes to region or language subtags over time. All of these mappings further matching operations in a positive way and should be part of CSS Selectors.

The canonical form of language tags is without extlangs: this was the intention when extlangs were created. The extlang form exists because there are cases where content authors may find some utility in using them, but BCP47 implementations are encouraged to "lose" the enclosing primary language subtag. CSS may still choose to do differently.

The matching challenge has multiple considerations. I'll try to illustrate using the yue (Cantonese) and zh (Chinese) subtags. Suppose you have a stylesheet and content like so:

:lang(zh) { /* something */ }
:lang(yue) { /*something else */}

<p lang="zh-Hans">...
<p lang="yue">...

With the basic canonicalization of language tags, the range zh only matches the first <p> and the range yue only the second. With the extlang canonicalization, we see the above transformed to the following for matching:

:lang(zh) { }
:lang(zh-yue) { }

<p lang="zh-Hans">...
<p lang="zh-yue">...

Now the first selector matches both content items (the second still only matches one item). This is probably not the intention of the author. Obviously, as was pointed out in the meeting, the opposite case exists (if one started with zh-yue and wrote styles for :lang(zh)).
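
The two outcomes can be reproduced with a short Python sketch. It uses a simplified prefix match instead of full extended filtering, and `to_extlang_form` only knows about Cantonese; both are assumptions made to keep the example small, and all names are illustrative.

```python
def prefix_match(language_range: str, tag: str) -> bool:
    """Simplified matching: the range equals the tag or is a subtag-wise prefix of it."""
    language_range, tag = language_range.lower(), tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")


def to_extlang_form(tag: str) -> str:
    # ASSUMPTION: only yue is handled, to mirror the example above.
    return "zh-" + tag if tag.lower().split("-")[0] == "yue" else tag


selectors = ["zh", "yue"]
content = ["zh-Hans", "yue"]


def matches(transform):
    """For each selector, list the original content tags it matches after `transform`."""
    return {
        sel: [tag for tag in content if prefix_match(transform(sel), transform(tag))]
        for sel in selectors
    }


canonical = matches(lambda tag: tag)   # tags compared in canonical form
extlang = matches(to_extlang_form)     # tags compared in extlang form

print(canonical)  # {'zh': ['zh-Hans'], 'yue': ['yue']}
print(extlang)    # {'zh': ['zh-Hans', 'yue'], 'yue': ['yue']}
```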

Note that the tag zh-yue was registered during the RFC3066 era and is one of the so-called "redundant" grandfathered tags. Its replacement is in the BCP47 registry as yue (and not zh-yue). However, tags such as zh-yue-Hant or zh-yue-CN or such are well-formed and valid since RFC4646.

Which sort of content-to-stylesheet canonicalization incompatibility to incur is the question here. Note there are quite a few macrolanguages in the registry: this affects more than the Chinese complex of languages, and the degree to which the extlang form matches or interferes with use varies by macrolanguage. Arabic or Malay, to pick two examples, probably do not want the extlang form in as many cases as Chinese, since the enclosed languages really have different needs.

I will supply additional links and some further guidance anon: I've solicited more input from the CLDR community.

frivoal commented 5 months ago

The PR got accidentally merged, but we didn't have a resolution on this.

aphillips commented 1 week ago

Reviewing as part of w3c/i18n-action#117.

In reviewing #4212, I disagree with using the extlang form. The matching of extlangs to the primary language is too complicated for most page authors to deal with and it's a little weird to create subtags out of nothing (e.g. yue => zh-yue) when the created subtags interfere with prefix-based matching. CLDR and BCP47 contain data to allow some subtag inference (notably, the script subtag).

Both the [=content language=] and the [=language range=] must be canonicalized and converted to extlang form as per section 4.5 of [[!RFC5646]] prior to the extended filtering operation. The matching is performed case-insensitively within the ASCII range.

Would be better off as:

Both the [=content language=] and the [=language range=] must be canonicalized as per section 4.5 of [[!RFC5646]] prior to the extended filtering operation. Such matching is performed [=ASCII case-insensitively=].

Happy to discuss. Don't forget that at least one of the BCP47 authors 🙈 is available to you for consultation when dealing with these issues.

macchiati commented 1 week ago

At least 2!
