Fix bug in CSS tokenizer

fredden commented 8 months ago

Yes, I know that the non-PHP tokenizers are going away in version 4.0. Meanwhile, the 3.x release line has a bug which can be fixed.

Description

While working with phpcbf, I noticed that some LESS files (which are parsed as CSS due to the specific ruleset being used) had unexpected content injected into them.

/* file: test.less */
// /**
//  * this is a comment
//  * this is also a comment
//  */

(Note the extra trailing newline at the end of the file.)

phpcs --basepath=. -s --standard=Squiz --sniffs=Squiz.WhiteSpace.SuperfluousWhitespace --extensions=less/CSS test.less

FILE: test.less
-------------------------------------------------------------------------------------------------------------
FOUND 1 ERROR AFFECTING 1 LINE
-------------------------------------------------------------------------------------------------------------
 7 | ERROR | [x] Additional whitespace found at end of file
   |       |     (Squiz.WhiteSpace.SuperfluousWhitespace.EndFile)
-------------------------------------------------------------------------------------------------------------
PHPCBF CAN FIX THE 1 MARKED SNIFF VIOLATIONS AUTOMATICALLY
-------------------------------------------------------------------------------------------------------------

Time: 45ms; Memory: 10MB

After running phpcbf (with the same flags as above), I get a file with this content:

/* file: test.less */
// /**
?>^PHPCS_CSS_T_CLOSE_TAG^^PHPCS_CSS_T_CLOSE_TAG^//  * this is a comment
//  * this is also a comment
//  */

(Note the extra newline at the end of the file has been fixed.)

Suggested changelog entry

Fixed bug in CSS tokenizer

Types of changes

Bug fix (non-breaking change which fixes an issue)

PR checklist

[x] I have checked there is no other PR open for the same change.
[x] I have read the Contribution Guidelines.
[x] I grant the project the right to include and distribute the code under the BSD-3-Clause license (and I have the right to grant these rights).
[x] I have added tests to cover my changes.
[x] I have verified that the code complies with the projects coding standards.

jrfnl commented 8 months ago

@fredden Thanks for this PR. While I'm not keen on putting any more effort into the CSS tokenizer, the outcome of this bug is pretty undesirable, so I'm willing to accept this fix.

✔️ I have confirmed and been able to reproduce the issue. ✔️ I have confirmed that the added test demonstrates the issue and can safeguard the fix without needing to add a dedicated test file for this issue for the CSS tokenizer. ✔️ I have confirmed that the issue is gone with the proposed fix.

However, looking at the token output, the problem is actually bigger and not really fixed.

Originally, the code sample in the description above was tokenized like so:

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  0]:
  1 | L1 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  2 | L1 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  3 | L1 | C  4 | CC 0 | ( 0) | T_DOC_COMMENT_OPEN_TAG     | [  3]: /**
  4 | L1 | C  7 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE   | [  0]:

  5 | L2 | C  1 | CC 0 | ( 0) | T_DOC_COMMENT_STRING       | [  2]: ?>
  6 | L2 | C  3 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  7 | L2 | C  5 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
  8 | L2 | C  7 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
  9 | L2 | C  8 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 10 | L2 | C  9 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 11 | L2 | C 13 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 12 | L2 | C 14 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 13 | L2 | C 16 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 14 | L2 | C 17 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 15 | L2 | C 18 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 16 | L2 | C 19 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 17 | L2 | C 26 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 18 | L3 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 19 | L3 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 20 | L3 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 21 | L3 | C  6 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 22 | L3 | C  7 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 23 | L3 | C 11 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 24 | L3 | C 12 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 25 | L3 | C 14 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 26 | L3 | C 15 | CC 0 | ( 0) | T_STRING                   | [  4]: also
 27 | L3 | C 19 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 28 | L3 | C 20 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 29 | L3 | C 21 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 30 | L3 | C 22 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 31 | L3 | C 29 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 32 | L4 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 33 | L4 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 34 | L4 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 35 | L4 | C  6 | CC 0 | ( 0) | T_DIVIDE                   | [  1]: /
 36 | L4 | C  7 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 37 | L5 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 38 | L6 | C  1 | CC 0 | ( 0) | T_CLOSE_TAG                | [  0]:

Take note of token 5, the inserted ?>, which looks to be the cause of the problem for the Squiz sniff.

With this fix applied, the code sample is tokenized as

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  0]:
  1 | L1 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  2 | L1 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  3 | L1 | C  4 | CC 0 | ( 0) | T_DOC_COMMENT_OPEN_TAG     | [  3]: /**
  4 | L1 | C  7 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE   | [  0]:

  5 | L2 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  6 | L2 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
  7 | L2 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
  8 | L2 | C  6 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  9 | L2 | C  7 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 10 | L2 | C 11 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 11 | L2 | C 12 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 12 | L2 | C 14 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 13 | L2 | C 15 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 14 | L2 | C 16 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 15 | L2 | C 17 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 16 | L2 | C 24 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 17 | L3 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 18 | L3 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 19 | L3 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 20 | L3 | C  6 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 21 | L3 | C  7 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 22 | L3 | C 11 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 23 | L3 | C 12 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 24 | L3 | C 14 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 25 | L3 | C 15 | CC 0 | ( 0) | T_STRING                   | [  4]: also
 26 | L3 | C 19 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 27 | L3 | C 20 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 28 | L3 | C 21 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 29 | L3 | C 22 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 30 | L3 | C 29 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 31 | L4 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 32 | L4 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 33 | L4 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 34 | L4 | C  6 | CC 0 | ( 0) | T_DIVIDE                   | [  1]: /
 35 | L4 | C  7 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 36 | L5 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 37 | L6 | C  1 | CC 0 | ( 0) | T_CLOSE_TAG                | [  0]:

What I mean to point out by posting these tokenizations is that the comment line tokenization is completely borked for this code sample, what with all lines after the first line getting confused.

@fredden Did you notice this ? What is your opinion on this ?

As I said above, I'm okay with just accepting this fix as-is and not spending more time on this as CSS tokenizer support will be dropped, but I did want to document this finding.

jrfnl commented 8 months ago

@fredden I'm also trying to remember if // is even a valid comment format in CSS and IIRC it is not, so the fact that a file which would cause a "parse error" in CSS would lead to tokenization problems is not really surprising.

fredden commented 8 months ago

Thanks for looking at this @jrfnl

the comment line tokenization is completely borked

Yes, I did notice this. I don't yet understand why there's any special-case handling for comments in the CSS tokenizer. I decided to not spend additional brain-power working this out though, as the plan is to delete non-PHP tokenizers for version 4.0. This pull request doesn't make it any worse; I intentionally took a minimal-change approach.

if // is even a valid comment format in CSS

No, I don't think it is; I'll check... Some basic local testing with the three browsers I have installed currently all show that // is not treated as a comment. https://lesscss.org/#comments says that LESS supports that comment syntax; the file I was working with when the problem came to light was a LESS file. https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Organizing#comment_your_css shows only /* ... */ style comments.

jrfnl commented 8 months ago

if // is even a valid comment format in CSS

No, I don't think it is; I'll check... Some basic local testing with the three browsers I have installed currently all show that // is not treated as a comment. https://lesscss.org/#comments says that LESS supports that comment syntax; the file I was working with when the problem came to light was a LESS file. https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Organizing#comment_your_css shows only /* ... */ style comments.

@fredden Thanks for checking that.

IIRC the CSS tokenizer never had any support for LESS. If it worked, that should be considered coincidental, not by design.

As CSS doesn't support the // comment syntax, I'm now inclined to close this PR as accepting it would expand the scope of the CSS Tokenizer to a syntax which is not even valid in CSS.

fredden commented 8 months ago

As CSS doesn't support the // comment syntax, I'm now inclined to close this PR as accepting it would expand the scope of the CSS Tokenizer to a syntax which is not even valid in CSS.

I don't think it's valid for the CSS tokenizer to inject a rogue ?> token where it doesn't exist in the source file. The PHP tokenizer does not do this.

Example of PHP tokenizer handling a parse error without adding an extra ?> token

`cat test.php` ```php ` tokens in this output.

I haven't seen any cases of phpcs causing issues with this (invalid CSS) syntax, but I have not gone looking for any cases either. The browsers that I have installed here all cope gracefully with a "selector" which has // in it, and parse the rest of the CSS in the file/element as expected. I don't know to what extent people are using PHP_CodeSniffer on CSS files.

In my experience, this bug is mostly masked. It shows up when PHP_CodeSniffer tries to re-write the file as it will output all the tokens into the file after phpcbf has done its magic. With this extra / rogue token*, the resulting file is different in more ways than just what the selected sniffs have intentionally modified.

* I have not looked into why ^PHPCS_CSS_T_CLOSE_TAG^ gets added twice during auto-fixing, as eliminating the rogue ?> makes that symptom go away.

jrfnl commented 8 months ago

@fredden Okay, I'm going to merge this PR as the rogue token output is just weird. Having said that, I would caution that further PRs adding support for LESS to the CSS tokenizer are strongly discouraged.

fredden commented 8 months ago

Agreed. LESS and CSS are not the same and one shouldn't be parsed as the other. I look forward to version 4.0 where these other-language tokenizers will be gone.

jrfnl commented 7 months ago

FYI: this fix is included in today's PHP_CodeSniffer 3.8.0 release.

As per #3932, development on PHP_CodeSniffer will continue in the PHPCSStandards/PHP_CodeSniffer repository. If you want to stay informed, you may want to start "watching" that repo (or watching releases from that repo).

squizlabs / PHP_CodeSniffer