[selectors-4] Augment the grammar to unambiguously encode handling of white-space?

I've now read the useful prior, since-closed issue at https://github.com/w3c/csswg-drafts/issues/7027, which touches upon a similar subject.

I must say it helps greatly that the Selectors grammar is defined in a more or less formal notation (I'm not calling it outright informal, since the "Values & Units" draft specifies the notation quite well). However, the paragraph(s) following the grammar that clarify handling of white-space don't dispel my confusion about implementing a parser [generator] correctly, I must admit. By that I mean it's not made clear where the parser should allow white-space and where it should not, at least not clearly enough that I can confidently insert the "rules" into the grammar proper.

In the aforementioned related issue, @tabatkins says, quoting, "whitespace is allowed between any two grammar productions". Is this still the case with the Editor's Draft of the Selectors 4 specification? It's not stated outright there, and I think it wouldn't hurt to do so, even informally.

As for the issue I wanted to raise: would it not make sense to encode handling of white-space into the grammar, to eliminate ambiguity of interpretation? The white-space production is, after all, defined in the Syntax module [Level 3, Editor's Draft] spec: https://www.w3.org/TR/css-syntax/#whitespace. Why not build upon that and insert it as required into the Selectors grammar?

Specifically, what about replacing the combinator? part (which currently requires the less formal clarification, following the grammar, about omitting the production that the ? applies to) with e.g. [ combinator | whitespace ]? (For the sake of the example, let's assume a rule equivalent to <whitespace> = '\n' | '\t' | ' '.)

...and, in similar fashion, rewriting the informal part(s) following the grammar so that they are specified in the grammar itself.

This isn't critical, admittedly, but for me personally it makes for a hard choice: whether I can use a parser generator (taking a grammar file as input) or must resort to a hand-written parser, since the quality of a machine-generated parser stands and falls on the quality of the grammar that is its chief input.

Yes, it is still the case that whitespace is allowed, but not required, between any two tokens, except for the rare cases where some text defines specific requirements about it. It's not defined in Selectors because it's defined in Values & Units, which defines the grammar CSS uses.

In general (unless specific text overrides), it's not that whitespace is allowed/disallowed, it's that whitespace is ignored; all that matters is that the tokens are parsed properly. Whitespace is a convenient way to ensure that two things are tokenized separately, but not the only way: foo bar produces two idents, but so does foo/**/bar, with no whitespace. Sometimes you can use neither - foo()bar produces a function followed by an ident, same as foo() bar.
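The three inputs mentioned above can be checked with a toy tokenizer. This is only a minimal sketch: the patterns and the `significant_tokens` name are assumptions made for illustration, and the real tokenization algorithm in CSS Syntax handles escapes, numbers, strings, and much more.

```python
import re

# Simplified token patterns (an assumption for illustration only; the real
# CSS Syntax tokenizer is considerably more involved).
TOKEN_PATTERNS = [
    ("comment", r"/\*.*?\*/"),
    ("whitespace", r"[ \t\n]+"),
    ("function", r"[A-Za-z_-][A-Za-z0-9_-]*\("),
    ("ident", r"[A-Za-z_-][A-Za-z0-9_-]*"),
    ("delim", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS),
    re.DOTALL,
)


def significant_tokens(text):
    """Tokenize text and drop whitespace and comments."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup  # name of the alternative that matched
        if kind not in ("whitespace", "comment"):
            tokens.append((kind, match.group()))
    return tokens


print(significant_tokens("foo bar"))     # [('ident', 'foo'), ('ident', 'bar')]
print(significant_tokens("foo/**/bar"))  # [('ident', 'foo'), ('ident', 'bar')]
print(significant_tokens("foo()bar") == significant_tokens("foo() bar"))  # True
```

In this sketch the closing parenthesis shows up as a plain delimiter; the point is only that the significant tokens are identical whether the separator is a space, a comment, or nothing at all.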

Regarding parser generators, we don't optimize for that. If it's clearer to read and write to put a condition in the surrounding text rather than in the grammar, we're happy to do that instead. But a parser generator would have to be optimized for CSS anyway (due to its specific tokenization rules, and its whitespace rules), and you'd have to manually handle the few cases where whitespace is required or disallowed.
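One case in Selectors that typically needs this kind of manual handling is the descendant combinator, which is written as whitespace. The following is a minimal sketch of how a hand-written parser might treat it, assuming a hypothetical, simplified token model of (kind, value) pairs and supporting only type selectors; `parse_complex_selector` and `COMBINATORS` are illustrative names, not anything defined by the specs.

```python
COMBINATORS = {">", "+", "~"}


def parse_complex_selector(tokens):
    """Return alternating compounds and combinators, e.g. ['a', ' ', 'b'].

    `tokens` is a list of (kind, value) pairs where kind is "ident",
    "delim", or "whitespace" (a hypothetical, simplified token model).
    """
    parts = []
    saw_space = False   # whitespace seen since the last compound
    combinator = None   # explicit combinator seen since the last compound
    for kind, value in tokens:
        if kind == "whitespace":
            saw_space = True
        elif kind == "delim" and value in COMBINATORS:
            if not parts or combinator is not None:
                raise ValueError(f"unexpected combinator {value!r}")
            combinator = value
        elif kind == "ident":
            if parts:
                if combinator is None and not saw_space:
                    raise ValueError("adjacent type selectors need a combinator")
                # Whitespace alone means "descendant"; around an explicit
                # combinator it is simply ignored.
                parts.append(combinator or " ")
            parts.append(value)
            saw_space = False
            combinator = None
        else:
            raise ValueError(f"unsupported token {value!r}")
    if combinator is not None:
        raise ValueError("selector ends with a combinator")
    return parts
```

With this sketch, the tokens for `a b` yield `['a', ' ', 'b']`, while the tokens for `a > b` and `a>b` both yield `['a', '>', 'b']`.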

From an implementer's perspective, the issue with describing white-space handling informally in the Selectors spec is that the grammar evidently addresses tokens, not e.g. Unicode code points: the spec defers to the Syntax specification, which defines tokenization, and just expects a stream of tokens. The implication is that, per the Syntax spec, white-space tokens are a thing; they're a "first-class citizen", so to speak.

Now, for the Selectors grammar to simply omit their presence in productions, instead opting for informally defining how these tokens are to be dealt with by parsers -- in prose, in my opinion does a real disservice to parser writers? This is a bold claim, I admit, so let me try to elaborate. If we assume that a non-trivial percentage of people reading the Selectors spec are doing so in order to implement a CSS selectors parser, then, while making the spec readable is generally a sound decision, in this specific case it apparently comes at the expense of making a parser easier to implement, through a grammar that omits the first-class CSS citizens that white-space tokens are!

I am not advocating for dispensing with the grammar. For my part, it has made implementing a selector parser much easier, since I could e.g. just copy it, as-is, into a file, have it parsed according to the corresponding notation (defined in large part in Values & Units), then feed the resulting grammar object to a parser generator which would, in theory, get me a working CSS selectors parser. Or I could express the equivalent of the grammar myself in code (without a parser generator) and feed it to a general parser. The latter is what I am doing now in my implementation, despite having said I am "using a parser generator"; I should have made it clear that this is the goal, not the current state of affairs.

In either case, I have had to "manually" insert <whitespace-token>? elements into parts of my grammar expression in order to implement what is otherwise specified in prose in the Selectors document. Since white-space tokens are, after all, specified and vended by the abstract tokenization procedure defined in Syntax, I cannot see why the Selectors grammar can't just include the corresponding white-space productions explicitly and dispense with having to specify the language informally. Especially since white-space handling isn't as simple as "all whitespace is optional, everywhere": there are just enough exceptions that codifying all of it in the grammar, as suggested, may be of great benefit.

The same would go for comment tokens, since the above wouldn't cover parsing of foo/**/bar. But that is another issue. For my part I solved it by having comment tokens be white-space tokens (not vice versa), although on reflection I should have a common superclass called SpaceToken, of which WhiteSpaceToken and CommentToken are subclasses. Then the parser could be oblivious to the kind of white-space it encounters, dealing with <space-token> productions, but with these being explicit in the grammar.

Both white-space and comments are treated as "transparent" by the specification, but for a parser that isn't always clear-cut: turning a parse tree back into the original text is much easier when the parser has discarded no information. The Syntax spec does, however, instruct parsers to effectively discard comments. That blurs the line between concrete and abstract syntax trees, and makes reconstructing the original text far less trivial than if the parser retained comments as tokens (which is what mine does, as hinted at in the previous paragraph).
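The token hierarchy and the round-tripping argument can be sketched together. This is a hypothetical illustration rather than the actual implementation: only the names SpaceToken, WhiteSpaceToken and CommentToken come from the description above, while `Token`, `IdentToken`, `serialize` and `significant` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str  # the exact source text this token was produced from


class SpaceToken(Token):
    """Anything the grammar treats as transparent spacing."""


class WhiteSpaceToken(SpaceToken):
    """A run of spaces, tabs, or newlines."""


class CommentToken(SpaceToken):
    """A /* ... */ comment, retained instead of being discarded."""


class IdentToken(Token):
    """An identifier (hypothetical helper for this example)."""


def serialize(tokens):
    """Reconstruct the original text; lossless because nothing was dropped."""
    return "".join(token.text for token in tokens)


def significant(tokens):
    """The view a grammar matcher would see: spacing of any kind removed."""
    return [token for token in tokens if not isinstance(token, SpaceToken)]


tokens = [IdentToken("foo"), CommentToken("/**/"), IdentToken("bar")]
print(serialize(tokens))    # foo/**/bar
print(significant(tokens))  # [IdentToken(text='foo'), IdentToken(text='bar')]
```

Keeping the spacing tokens in the stream and filtering them only at matching time is one way to get both behaviours: the matcher stays oblivious to the kind of spacing, and serialization still reproduces the input exactly.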

"Now, for the Selectors grammar to simply omit their presence in productions, instead opting for informally defining how these tokens are to be dealt with by parsers -- in prose, in my opinion does a real disservice to parser writers?"

They're omitted by definition - the grammar that we use in CSS specs is defined as applying to tokens and component values, as produced by the parser defined in CSS Syntax, and then further details are defined in Values & Units (or, in some cases, the Syntax spec as well). One of those details in V&U's definition is that whitespace tokens are implicitly allowed between any two tokens but never required, unless prose specifically requires or forbids them in a particular location.

This rule results in grammars that are dramatically easier to read (and write), as we don't have to clutter grammar definitions with technical details of whitespace placement. Basically, every single pair of tokens in every single grammar, save the tiny handful of locations that require/forbid whitespace somewhere, would contain a <ws>? term between them. That sort of repetition helps no-one; it makes almost all grammars harder to read, and makes it harder to spot when there is a special behavior, as it would only be indicated by either the lack of ? (for required) or the lack of the token altogether (for forbidden).
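To illustrate the clutter, here is a minimal sketch that makes the implicit whitespace explicit for a flat sequence of grammar terms. The `with_explicit_whitespace` helper is hypothetical and deliberately ignores brackets, multipliers, and the locations where whitespace is required or forbidden.

```python
def with_explicit_whitespace(terms, ws="<ws>?"):
    """Join grammar terms, placing an optional-whitespace term between each pair."""
    explicit = []
    for index, term in enumerate(terms):
        if index:  # every term except the first is preceded by <ws>?
            explicit.append(ws)
        explicit.append(term)
    return " ".join(explicit)


print(with_explicit_whitespace(["<a>", "<b>", "<c>"]))
# <a> <ws>? <b> <ws>? <c>
```

Even in this three-term case, nearly half of the resulting grammar is whitespace bookkeeping.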

And I know from experience (both personal, and observing other grammars that do explicitly indicate almost-always-optional whitespace) that it's easy to accidentally screw up a detail of this, and accidentally forbid or require whitespace in a location where that isn't intended. It would be difficult to tell if this was an accident or not; you'd want to provide accompanying prose repeating the requirement/restriction anyway to make it clear when it was intended.

So the end result is that you'd have more cluttered, difficult-to-read-and-write grammars, where it's easy to make mistakes and hard to spot those mistakes, and still have the prose descriptions of requirements/restrictions.

In theory we could have special productions that only indicate the requirement or restriction of whitespace, and use those instead of prose descriptions when it happens. For example, we moved simple numeric range restrictions into the grammar (like <length [0, 100px]>) when they were previously only indicated in prose. So far the number of such requirements/restrictions is so low that it hasn't proven necessary.

Note that it will always be the case that some grammar restrictions are expressed in prose rather than by complicating the grammar syntax. As I said earlier, we optimize for understandability, not machine-handling; the specs are written and read by humans. If your hope is that you could compile an "all of CSS" grammar and just throw that at a parser generator, you will unfortunately be disappointed.

[selectors-4] Augment the grammar to unambiguously encode handling of white-space? #10940