[css-syntax] Give up on <urange> production?

tabatkins commented 1 year ago

In the previous versions of Syntax I tried my best to be grammar-agnostic, so you could parse CSS generically, and only throw out invalid rules/declarations as a final step. Implementations actually grammar-checked as they went and didn't add invalid stuff to the stylesheets, but that was meant to be an unobservable performance optimization.

However, we also wanted to fix the "u+a { color: red; } is invalid because <urange> isn't valid in a selector" problem; CSS2 defined <urange> as a generic token. We discussed this in the mailing list (transferred to #3588), and I ended up solving it by defining a very complex production that required preserving the representation of tokens, which nothing else in the whole spec required. Accordingly, no browser has actually implemented this: Chrome just never fixed the bug, and Firefox implemented it by invoking a different tokenizer.

(<an+b> is also now a complex production, but aside from preserving whether a positive number had a + or not, it relied on the standard token values, so wasn't as problematic. Everybody seems to have actually followed the spec text.)

However, now that Syntax is required to grammar-check as it parses, we can potentially correct this in a less weird way, by having a tokenizer flag that gets invoked only for the 'unicode-range' descriptor and allows the tokenizer to produce urange tokens again. This'll better match what Firefox is doing, and should make it easy for Chrome to correctly implement as well.

Thoughts?

tabatkins commented 1 year ago

/cc @emilio @sesse @andruud (I'm not sure who to ping from WebKit anymore)

emilio commented 1 year ago

cc @SimonSapin in case he has any opinions, since he implemented the Firefox parser

SimonSapin commented 1 year ago

However, now that Syntax is required to grammar-check as it parses,

What does this mean?

tabatkins commented 1 year ago

@SimonSapin The new Nesting behavior we agreed on is that, in nesting contexts, you first attempt to parse a declaration and grammar-validate it; if that fails, you back up and instead parse as a rule. This is how we resolved the grammar overlap with foo:bar ... looking like both a declaration and a rule. You can see this in https://drafts.csswg.org/css-syntax/#consume-block-contents (note that the "validate" check is done in https://drafts.csswg.org/css-syntax/#consume-declaration).

SimonSapin commented 1 year ago

Firefox has a mechanism to "rewind" arbitrarily far to implement <something> | <other> alternate syntax, so that aspect of Nesting isn’t be a problem. But that’s all at the parser level.

Having a descriptor require switching "modes" in the tokenizer is more than that. It feels unsatisfactory but if it’s the least bad solution given other constraints then :shrug:

emilio commented 1 year ago

Yeah I don't think this is needed for nesting afaict.

tabatkins commented 1 year ago

Yeah, this isn't about Nesting, it's just a wart I've been worrying about in Syntax since I wrote it. Literally nobody implements the spec as written.

Having a descriptor require switching "modes" in the tokenizer is more than that.

That is, effectively, what Firefox is doing right now anyway, in https://github.com/servo/rust-cssparser/blob/master/src/unicode_range.rs#L43. Once you see a u ident, you grab the rest of the tokens in the property, drop back to the text range they covered, and then run a custom parser over the text to try and extract a UnicodeRange token.

andruud commented 1 year ago

I'm sceptical about switching tokenizer modes as a general concept that can happen anywhere, because tokenization is logically separated from parsing (in Blink anyway), and in many cases we tokenize the input in full before looking at any of the tokens (which means it's too late to switch modes by the time we discover the need for that switch).

I think the current spec is not too unreasonble, except for the requirement that tokens retain their representation.

Thought: implementations must already be able to retain the original string for declarations due to custom properties, can we work with that here?

Remove <urange>, instead:
unicode-range: <any-value>.
Spec that unicode-range retains the original string of its declaration, exactly like custom props.
At some point interpret that original string into a range per 7.1.

How would we define unicode-range if we introduced it today? unicode-range: <string>?

tabatkins commented 1 year ago

Thought: implementations must already be able to retain the original string for declarations due to custom properties, can we work with that here?

Yeah, that probably works, tho it would mean that we lose the ability to check the syntax for the property at parse time. Maybe we can do this, but still hardcode a check into the parser that works on the string value of the entire declaration?

How would we define unicode-range if we introduced it today? unicode-range: ?

Maybe a string, but we could also have really easy used existing syntax. For example, if we'd just used u- rather than u+ this wouldn't be a problem at all: u-14a and u-100-14a are single idents with unambiguously interpretable values, and u-1?? is an ident followed by ? delim tokens. The grammar would just have been <urange> = <custom-ident> '?'*, with a restriction that no whitespace is allowed in the production, and then some prose rules about how to interpret the ident's value + question marks into a range. Or we could've done the same with a hash token, since that's already used for hex values in colors; it also uses ident syntax after the hash, so it would parse identically - #100-14a, etc.

andruud commented 1 year ago

Maybe we can do this, but still hardcode a check into the parser that works on the string value of the entire declaration?

Yeah, that was my thinking. We say in prose that (original) strings that don't match the microsyntax are invalid (parse-time). This is arguably "ugly", but at least it's isolated to unicode-range and the rest of CSS doesn't have to care.

Maybe a string, but we could also have really easy used existing syntax.

If we need unicode ranges in a new property/descriptor in the future, would we use a <string> for that? Asking because we should probably keep the look-at-the-original-string-hacks to a bare minimum.

having a tokenizer flag that gets invoked only for the 'unicode-range' descriptor and allows the tokenizer to produce urange tokens again.

By the way, I think we can still implement "parse the original string" (in Blink) even if it's specified like this, as it's not observable what we actually do (I think). I'm just a little worried that we're opening the door to lots "oh, we'll just switch tokenizer mode to solve that" in the future, which may result in a lot of new "Chrome never fixed the bug"-situations. On the other hand we seem to be opening some unwanted door regardless, so if the mode-switch is a substantially nicer spec than the alternatives, then maybe it indeed is the least bad option ...

SimonSapin commented 1 year ago

Thought: implementations must already be able to retain the original string for declarations due to custom properties,

Is that really the case? Do custom properties not work if an implementation saves tokens? (Regardless of whether it might prefer a string as it’s more compact in memory)

tabatkins commented 1 year ago

Do custom properties not work if an implementation saves tokens? (Regardless of whether it might prefer a string as it’s more compact in memory)

Yes, there's a very explicit requirement for saving the original string representation (or enough information to reconstruct it), because of things like putting numeric-ish data into custom properties (for non-CSS purposes) and expecting it to come out unmangled by serialization when they read it later. See https://drafts.csswg.org/css-variables/#serializing-custom-props (particularly the Note at the end of the section) and the discussion at https://logs.csswg.org/irc.w3.org/css/2016-09-19/#e722375.

I mean, you can save tokens, so long as each one individually remembers the bit of the string it was parsed from. There's no observable difference between that and saving the entire property value as a unit.

tabatkins commented 1 year ago

By the way, I think we can still implement "parse the original string" (in Blink) even if it's specified like this, as it's not observable what we actually do (I think). I'm just a little worried that we're opening the door to lots "oh, we'll just switch tokenizer mode to solve that" in the future, which may result in a lot of new "Chrome never fixed the bug"-situations. On the other hand we seem to be opening some unwanted door regardless, so if the mode-switch is a substantially nicer spec than the alternatives, then maybe it indeed is the least bad option ...

That's a fair worry! I can assure that, at least so long as I'm around to keep a hand in CSS's syntax, that won't happen. I'm extremely annoyed that we have to do this in the first place; I'm just trying to find the most realistic way to write it that implementations will actually follow, since my first attempt failed (Chrome never updated to it; Firefox does something completely different).

But since we do have a "preserve the string exactly" requirement in custom properties (which should probably be at least noted in the Syntax spec, since it does inform how you have to build your parser somewhat), I think leaning on that for unicode-range is probably the most straightforward way to go. (Plus, Firefox's behavior will then match the spec.) I'll put down some text and see how I feel about it.

tabatkins commented 1 year ago

Okay, first pass at a new unicode-range approach is checked in. The tokenizer now has a flag that allows or disallows unicode ranges, and the "consume a declaration" explicitly grabs the text of the declarations' value and retokenizes it with the flag on if it sees the declaration's name is "unicode-range". The tokenization rules were taken verbatim from the old version of CSS Syntax, so the Chrome impl of the token is exactly correct still. (Firefox is close but not exact - it'll consume more than 6 digits+question marks, and just fail to produce a token if it does so. This doesn't matter in practice, tho.)

The invocation is very obviously clumsy, and there's a note in the algorithm explicitly calling out that this is to fix a legacy syntax mistake, so it should be clear this isn't to be reproduced.

I also explicitly introduced the concept that your token stream needs to be able to access the underlying string, and invoke that capability in "consume a declaration" if you're parsing a custom property. That should be the only place it's technically needed.

cdoublev commented 1 year ago

I mean, you can save tokens, so long as each one individually remembers the bit of the string it was parsed from. There's no observable difference between that and saving the entire property value as a unit.

I think there are minor differences with trailing whitespaces/comments at the top-level of the declaration value.

You also loose the ability to serialize the fallback argument of var() with the original string but the other parts as tokens. This is something I suggested in #6484 but I am aware that this may not be ideal for memory savings.

This may be subjective, but it also seems easier to handle whitespaces disallowed in <an+b> subproductions (rather than for a whole production) when each token remembers its own string bit.

tabatkins commented 1 year ago

The only part of the "original string" that an+b needed to remember was whether a positive number had a + or not on it; I've reworded the spec to just track that information explicitly on the token and rely on it, rather than preserve the exact string.

cdoublev commented 7 months ago

Do you think unicode-range should keep <urange># as its value definition, and require the start/end code point to be lower than 110000, and start <= end?

U+110000, U+11????, U+1-0, are currently invalid unicode-range values on WPT and in Chrome and FF.

Check if three code points would start a unicode-range does not allow u+? and similar, which are valid on WPT and in Chrome and FF.

I do not think <unicode-range-token> can be produced when setting CSSFontFaceRule.style.unicodeRange, at the moment.

mscuthbert commented 5 months ago

If <urange> does disappear I hope that the unicode-range value can be specified more completely than just <string> for the aid of parsers. The ability to specify more than one (comma-separated) unicode range is at present scarcely documented, but is crucial for users of fonts whose ranges have been broken up due to historical circumstances in the unicode spec. The musical symbols ranges, for instance, is divided into at least two (three or more depending on completeness) ranges, and some fonts (including the W3C Standard Music Font Layout, SMuFL which defines both Unicode standard and Private Use definitions) depend on implementors being able to subset non-connected ranges to create composite-font rendering experiences.

cdoublev commented 5 months ago

It can be parsed against <unicode-range-token>#, ignoring the minor flaws reported above.

w3c / csswg-drafts

[css-syntax] Give up on <urange> production? #8835