w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

Poor description of :lang() pseudo-class selector [css-3] [selectors-3] #3022

Open GLRoylance opened 6 years ago

GLRoylance commented 6 years ago

https://drafts.csswg.org/selectors-3/#lang-pseudo

The draft says

The pseudo-class :lang(C) represents an element that is in language C. Whether an element is represented by a :lang() selector is based solely on the element's language value (normalized to BCP 47 syntax if necessary) being equal to the identifier C, or beginning with the identifier C immediately followed by "-" (U+002D). The matching of C against the element's language value is performed case-insensitively within the ASCII range. The identifier C does not have to be a valid language name.

The phrase "normalized to BCP 47 syntax if necessary" opens a can of worms. It implies that the user agent should take locale strings such as "en_US" or "it_IT.utf8" and convert them to BCP 47 syntax (which would be "en-US" and "it-IT"). Please do not suggest that an element's language can be set with xml:lang="en_US" or lang="it_IT.utf8" and the user agent will "normalize" it to a BCP 47 language tag.
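
For reference, the matching rule in the quoted paragraph can be sketched in a few lines of Python. This is only an illustration of the quoted text, not code from any user agent: the helper names `_ascii_lower` and `lang_matches` are made up here, and the sketch does no normalization at all, so it treats the element's language value exactly as given.

```python
def _ascii_lower(s: str) -> str:
    """Lowercase only A-Z, leaving non-ASCII characters untouched."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in s)


def lang_matches(element_lang: str, c: str) -> bool:
    """Hypothetical helper sketching the Selectors 3 rule for :lang(C).

    True if element_lang equals C, or begins with C immediately followed
    by "-" (U+002D), comparing case-insensitively within the ASCII range.
    """
    value = _ascii_lower(element_lang)
    ident = _ascii_lower(c)
    return value == ident or value.startswith(ident + "-")


print(lang_matches("en-US", "en"))    # True
print(lang_matches("EN-us", "en"))    # True
print(lang_matches("en_US", "en"))    # False: "_" is not "-"
print(lang_matches("english", "en"))  # False: prefix must be followed by "-"
```

Under the literal rule, a value such as "en_US" is not matched by :lang(en) unless something converts it first, which is why the "normalized" wording matters.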

AmeliaBR commented 6 years ago

I suspect part of the vagueness is because the CSS pseudo-class is designed to work with many different document types, which may have their own syntaxes for specifying the element language.

CSS uses BCP 47 in the :lang() selector. If a document type uses a different syntax, the user agent needs to convert it to BCP 47 in order to test equality.

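The allowed values for the lang attribute in HTML https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes and the xml:lang attribute in XML https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-lang-tag are defined in those specifications. HTML specifies BCP 47, XML references RFC 3066. I'm not an expert on the differences between those two, but I'm pretty sure "en_US" isn't valid for either.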

GLRoylance commented 6 years ago

BCP 47 is the concatenation of RFC 5646 and RFC 4647.

RFC 5646 supersedes RFC 4646 which supersedes RFC 3066.

AmeliaBR commented 6 years ago

By the way: Selectors Level 4, which is the version that is being actively edited, already has additional clarifying notes on this point. In particular:

Note: The content language of an element is defined by the document language. For example, in HTML, the content language is determined by a combination of the lang attribute, information from meta elements, and possibly also the protocol (e.g. from HTTP headers). XML languages can use the xml:lang attribute to indicate language information for an element.
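
As a rough illustration of the lookup order that note describes for HTML, here is a minimal Python sketch. The `Element` class and the `content_language` function are hypothetical stand-ins, not a DOM API. The order used (nearest lang attribute on the element or an ancestor, then a default declared via a meta element, then a default from the protocol) is an assumption based on the note and on my reading of the HTML spec, and xml:lang handling is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element; not a real DOM API."""
    attrs: dict = field(default_factory=dict)
    parent: Optional["Element"] = None


def content_language(el: Element,
                     meta_default: Optional[str] = None,
                     http_default: Optional[str] = None) -> str:
    """Return the language value for el, or "" if it is unknown."""
    node: Optional[Element] = el
    while node is not None:
        if "lang" in node.attrs:
            # The nearest lang attribute wins; lang="" means "unknown".
            return node.attrs["lang"]
        node = node.parent
    if meta_default is not None:
        return meta_default      # default declared via a meta element
    if http_default is not None:
        return http_default      # e.g. from an HTTP header
    return ""


html = Element({"lang": "en"})
p = Element({}, parent=html)
span = Element({"lang": "fr-CA"}, parent=p)

print(content_language(p))                            # en
print(content_language(span))                         # fr-CA
print(content_language(Element(), http_default="de")) # de
```

Whatever value this lookup yields is what :lang() is then compared against, so the selector itself never needs to know which attribute or header the value came from.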

svgeesus commented 6 years ago

HTML specifies BCP 47, XML references rfc 3066.

Almost.

The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor

(my italics on "or its successor"). So XML in practice uses BCP 47, same as HTML.

fantasai commented 5 years ago

@AmeliaBR is exactly right. Selectors can be used with markup languages other than HTML, and not all of them will use BCP47 syntax to represent the content language, so Selectors requires the UA to convert to BCP47 syntax before making the comparison. For example, DocBook 3.1 accepts en_US https://tdg.docbook.org/tdg/3.1/refelem.html#DBRE.X.COMMON on its lang attribute.
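
To make the conversion step concrete, here is a minimal Python sketch of what a user agent might do for a document language that stores POSIX-locale-style values. Everything in it is an assumption for illustration: `locale_to_bcp47` is a made-up name, the codeset and modifier are simply dropped, and no subtag validation is attempted. It is not taken from Selectors, DocBook, or any implementation.

```python
def locale_to_bcp47(value: str) -> str:
    """Hypothetical converter from a POSIX-locale-style value to BCP 47 syntax.

    Simplifications of this sketch: the codeset (".utf8") and any
    modifier ("@euro") are discarded, and subtags are not validated.
    """
    base = value.split(".", 1)[0].split("@", 1)[0]
    return base.replace("_", "-")


print(locale_to_bcp47("en_US"))       # en-US
print(locale_to_bcp47("it_IT.utf8"))  # it-IT
```

Once the value is in this form, the usual comparison applies, so :lang(en) would match a DocBook element whose lang attribute is en_US. The conversion belongs to the document language's definition of its own lang value, not to arbitrary strings found in an HTML lang attribute.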

I've tweaked the wording a bit so that it no longer implies we're normalizing arbitrary strings to BCP47: https://github.com/w3c/csswg-drafts/commit/2df8680b5aa0be3ba3dca0ae512c62aad7a39c8e (see https://drafts.csswg.org/selectors-4/#the-lang-pseudo). Let me know if this is acceptable, @GLRoylance.