w3c / css-validator

W3C CSS Validation Service
https://jigsaw.w3.org/css-validator/

Comments at token boundaries break selectors (false positive) #236

Closed squidfunk closed 2 years ago

squidfunk commented 5 years ago

When encountering /* in the token stream, the tokenizer algorithm described in Syntax Level 3 states:

If the next input code point is U+002A ASTERISK (*), consume it and all following code points up to and including the first U+002A ASTERISK (*) followed by a U+002F SOLIDUS (/), or up to an EOF code point. Then consume a token and return it.

Furthermore, later on:

The tokenizer described in this specification does not produce tokens for comments, or otherwise preserve them in any way. Implementations may preserve the contents of comments and their location in the token stream. If they do, this preserved information must have no effect on the parsing step.

This means that the tokenizer must effectively ignore comments or, if it does emit tokens for them, the parser must ignore those tokens. Comments at token boundaries should therefore have absolutely no effect. The following works in every major browser, but won't validate:

./**/foo/**/:/**/hover {}

This will validate:

.foo/**/:hover {}

Note that in both examples, all comments are located at token boundaries. The tokenizer must produce the same token stream for both:

[DELIM "."] [IDENT "foo"] [COLON] [IDENT "hover"] [WHITESPACE]...
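To make the rule concrete, here is a minimal, hypothetical tokenizer fragment (not the validator's code) that consumes comments before producing every token, so comments at token boundaries can never influence the token stream:

```java
// Hypothetical sketch of the Level 3 rule: before producing any token,
// consume every /* ... */ run; an unterminated comment simply runs to EOF.
// Comments therefore never reach the parser.
public class CommentSkipper {
    private final String css;
    private int pos = 0;

    public CommentSkipper(String css) { this.css = css; }

    /** Consume zero or more comments starting at the current position. */
    void consumeComments() {
        while (css.startsWith("/*", pos)) {
            int end = css.indexOf("*/", pos + 2);
            pos = (end == -1) ? css.length() : end + 2;  // EOF ends a comment
        }
    }

    /** Tiny demo "next token": idents, colons, single-char delims only. */
    String nextToken() {
        consumeComments();                        // comments emit no token
        if (pos >= css.length()) return "<EOF>";
        char c = css.charAt(pos);
        if (c == ':') { pos++; return "COLON"; }
        if (Character.isLetter(c)) {              // simplified ident rule
            int start = pos;
            while (pos < css.length() && Character.isLetter(css.charAt(pos))) pos++;
            return "IDENT " + css.substring(start, pos);
        }
        pos++;
        return "DELIM " + c;
    }

    public static void main(String[] args) {
        CommentSkipper t = new CommentSkipper("./**/foo/**/:/**/hover");
        StringBuilder out = new StringBuilder();
        for (String tok = t.nextToken(); !tok.equals("<EOF>"); tok = t.nextToken())
            out.append('[').append(tok).append(']');
        System.out.println(out);  // [DELIM .][IDENT foo][COLON][IDENT hover]
    }
}
```

Running it on the failing example above yields `[DELIM .][IDENT foo][COLON][IDENT hover]`, the same stream produced for `.foo:hover`.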

I know that this might seem very pedantic, and I totally understand if this issue goes too far, so just consider this an FYI if somebody encounters this problem.

ylafon commented 5 years ago

Pedantry is what validators are for; not the only thing, but they should be pedantic. An issue with EOF-terminated comments is also tracked in #229

squidfunk commented 5 years ago

I agree. So how could this be fixed? I have no knowledge of the parser/tokenizer architecture of this project, but after building my own CSS parser, I see only a single way of parsing CSS according to the spec: implementing the exact algorithm described in Syntax Level 3. This effectively means that tokenizing must be separated from parsing, because tokenizing must always terminate without errors.
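That separation could look roughly like this (a hypothetical sketch, not this project's architecture): the tokenizer is total, turning any input into a token list without ever failing, and all validity judgments happen in a separate parsing step over the tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-stage pipeline: a total tokenizer (it never fails,
// whatever the input) feeding a parser that alone decides what is an error.
public class TwoStage {
    record Token(String type, String value) {}

    // Simplified tokenizer: idents, single-char delims, skipped comments.
    static List<Token> tokenize(String css) {
        List<Token> out = new ArrayList<>();
        int pos = 0;
        while (pos < css.length()) {
            if (css.startsWith("/*", pos)) {                // comments vanish
                int end = css.indexOf("*/", pos + 2);       // (EOF-terminated
                pos = (end == -1) ? css.length() : end + 2; //  comments too)
            } else if (Character.isLetter(css.charAt(pos))) {
                int start = pos;
                while (pos < css.length() && Character.isLetter(css.charAt(pos))) pos++;
                out.add(new Token("IDENT", css.substring(start, pos)));
            } else {
                out.add(new Token("DELIM", String.valueOf(css.charAt(pos++))));
            }
        }
        return out;                                         // always terminates
    }

    // Only the parser judges validity: here, "is this a class selector?"
    static boolean isClassSelector(List<Token> tokens) {
        return tokens.size() == 2
            && tokens.get(0).type().equals("DELIM") && tokens.get(0).value().equals(".")
            && tokens.get(1).type().equals("IDENT");
    }

    public static void main(String[] args) {
        System.out.println(isClassSelector(tokenize("./**/class"))); // true
        System.out.println(isClassSelector(tokenize(".cl/**/ass"))); // false
    }
}
```

Note that `.cl/**/ass` tokenizes into two idents and is rejected by the parser, while `./**/class` is accepted, without the tokenizer ever reporting an error.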

Regarding comments: they can appear at positions where removing them would change how the surrounding characters tokenize, which makes it impossible to strip them in a preprocessing step (as many parsers do):

./**/class     /* [DELIM .][IDENT class] */
.cl/**/ass     /* [DELIM .][IDENT cl][IDENT ass] */

or

@media         /* [AT_TOKEN @media] */
@/**/media     /* [DELIM @][IDENT media] */

However, comments are not recognized inside URL tokens; their characters are instead consumed as part of the URL value:

url(...)       /* [URL ...] */
url/**/(...)   /* [IDENT url][(] ... [)] */
url(/**/...)   /* [URL], comment chars become part of the value */
url(.../**/)   /* [URL], comment chars become part of the value */
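A hypothetical, heavily simplified consumer for an unquoted url() token in the spirit of the Level 3 algorithm shows why: once inside `url(`, `/*` is not comment syntax, and `/` and `*` are ordinary URL code points, so the comment characters simply become part of the URL value (escapes, trailing whitespace, and non-printables are omitted from this sketch):

```java
// Hypothetical, simplified consumer for an unquoted url() token per the
// spirit of CSS Syntax Level 3. Inside the token, "/*" is NOT comment
// syntax: "/" and "*" are ordinary URL code points and are appended as-is.
public class UrlConsumer {
    /** Input is positioned just after "url("; returns "URL <value>" or "BAD_URL". */
    static String consumeUrl(String rest) {
        StringBuilder value = new StringBuilder();
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (c == ')') return "URL " + value;   // end of the url token
            if (c == '"' || c == '\'' || c == '(') // simplified bad-url cases;
                return "BAD_URL";                  // the real algorithm also
            value.append(c);                       // handles whitespace,
        }                                          // escapes, non-printables
        return "URL " + value;                     // EOF also ends the token
    }

    public static void main(String[] args) {
        System.out.println(consumeUrl("/**/x)"));  // URL /**/x
        System.out.println(consumeUrl("x/**/)"));  // URL x/**/
    }
}
```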

sideshowbarker commented 5 years ago

So how could this be fixed? I have no knowledge of the parser/tokenizer architecture of this project

The source for the CSS validator parser is here:

https://github.com/w3c/css-validator/blob/master/org/w3c/css/parser/analyzer/CssParser.jj

The tokenizer part of that is here:

https://github.com/w3c/css-validator/blob/master/org/w3c/css/parser/analyzer/CssParser.jj#L484

As you can see, it’s a grammar specification. It’s read by javacc, a parser generator which then generates the Java source code for the parser.

I guess it’s also worth mentioning that the CSS validator project was started more than 20 years ago, and (I think) it’s had that javacc parser-generator structure since the beginning.

squidfunk commented 5 years ago

I guess it’s also worth mentioning that the CSS validator project was started more than 20 years ago, and (I think) it’s had that javacc parser-generator structure since the beginning.

20 years and still working; that's not bad in the software domain. I skimmed through the source and see tokenizing heavily entangled with parsing. I totally understand that some time ago there was a need for multiple versions of CSS to be supported, but is this still necessary today? With the advent of "CSS 3" and the modularization of the specs, Syntax Level 3 should be the single source of truth when writing a CSS parser. The parsing of specific parts (selectors, media queries, values, etc.) could then be delegated to more specific parsers.

I'm not proficient in Java, but maybe there's a third-party project implementing a CSS parser that adheres to the specs which could be used upstream. Otherwise, my personal recommendation would be a rewrite, as the current parser looks rather "grown" to me, but, as I said, I'm not a Java developer.

sideshowbarker commented 5 years ago

I totally understand that some time ago there was a need for multiple versions of CSS to be supported, but is this still necessary today?

I’m not the right person to speak to that. I’m just a contributor and downstream embedder. But for what it’s worth I can say that for my use case — which is, embedding the CSS checker into the Nu Html Checker — I don’t have need to support multiple CSS versions; I just need the very latest.

I'm not proficient in Java, but maybe there's a third-party project implementing a CSS parser that adheres to the specs which could be used upstream.

There isn’t such an existing parser written in Java. I’ve looked.

Otherwise, my personal recommendation would be a rewrite, as the current parser looks rather "grown" to me, but, as I said, I'm not a Java developer.

I can’t really call myself a Java developer either (despite the fact I write a lot of Java code…), but I also feel that, ideally, a rewrite would be best at this point.

But even if we had complete agreement for doing a rewrite, it’s not gonna happen unless/until someone’s got a significant amount of time free they’re able to commit to doing it. I personally do not.

Maybe over the long term we can recruit a GSOC student or such to work on it. It’d be a great project for someone with the right skills looking to do some significant contribution to a really useful project.

ylafon commented 2 years ago

This specific issue is fixed (some others may linger; don't hesitate to raise new issues pertaining to the tokeniser)