Closed fredden closed 1 year ago
@fredden Thanks for this PR. While I'm not keen on putting any more effort into the CSS tokenizer, the outcome of this bug is pretty undesirable, so I'm willing to accept this fix.
✔️ I have confirmed and been able to reproduce the issue. ✔️ I have confirmed that the added test demonstrates the issue and can safeguard the fix without needing to add a dedicated test file for this issue for the CSS tokenizer. ✔️ I have confirmed that the issue is gone with the proposed fix.
However, looking at the token output, the problem is actually bigger and not really fixed.
Originally, the code sample in the description above was tokenized like so:
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
0 | L1 | C 1 | CC 0 | ( 0) | T_OPEN_TAG | [ 0]:
1 | L1 | C 1 | CC 0 | ( 0) | T_STRING | [ 2]: //
2 | L1 | C 3 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
3 | L1 | C 4 | CC 0 | ( 0) | T_DOC_COMMENT_OPEN_TAG | [ 3]: /**
4 | L1 | C 7 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE | [ 0]:
5 | L2 | C 1 | CC 0 | ( 0) | T_DOC_COMMENT_STRING | [ 2]: ?>
6 | L2 | C 3 | CC 0 | ( 0) | T_STRING | [ 2]: //
7 | L2 | C 5 | CC 0 | ( 0) | T_WHITESPACE | [ 2]: ⸱⸱
8 | L2 | C 7 | CC 0 | ( 0) | T_MULTIPLY | [ 1]: *
9 | L2 | C 8 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
10 | L2 | C 9 | CC 0 | ( 0) | T_STRING | [ 4]: this
11 | L2 | C 13 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
12 | L2 | C 14 | CC 0 | ( 0) | T_STRING | [ 2]: is
13 | L2 | C 16 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
14 | L2 | C 17 | CC 0 | ( 0) | T_STRING | [ 1]: a
15 | L2 | C 18 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
16 | L2 | C 19 | CC 0 | ( 0) | T_STRING | [ 7]: comment
17 | L2 | C 26 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
18 | L3 | C 1 | CC 0 | ( 0) | T_STRING | [ 2]: //
19 | L3 | C 3 | CC 0 | ( 0) | T_WHITESPACE | [ 2]: ⸱⸱
20 | L3 | C 5 | CC 0 | ( 0) | T_MULTIPLY | [ 1]: *
21 | L3 | C 6 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
22 | L3 | C 7 | CC 0 | ( 0) | T_STRING | [ 4]: this
23 | L3 | C 11 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
24 | L3 | C 12 | CC 0 | ( 0) | T_STRING | [ 2]: is
25 | L3 | C 14 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
26 | L3 | C 15 | CC 0 | ( 0) | T_STRING | [ 4]: also
27 | L3 | C 19 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
28 | L3 | C 20 | CC 0 | ( 0) | T_STRING | [ 1]: a
29 | L3 | C 21 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
30 | L3 | C 22 | CC 0 | ( 0) | T_STRING | [ 7]: comment
31 | L3 | C 29 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
32 | L4 | C 1 | CC 0 | ( 0) | T_STRING | [ 2]: //
33 | L4 | C 3 | CC 0 | ( 0) | T_WHITESPACE | [ 2]: ⸱⸱
34 | L4 | C 5 | CC 0 | ( 0) | T_MULTIPLY | [ 1]: *
35 | L4 | C 6 | CC 0 | ( 0) | T_DIVIDE | [ 1]: /
36 | L4 | C 7 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
37 | L5 | C 1 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
38 | L6 | C 1 | CC 0 | ( 0) | T_CLOSE_TAG | [ 0]:
Take note of token 5, the inserted ?>
, which looks to be the cause of the problem for the Squiz sniff.
With this fix applied, the code sample is tokenized as
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
0 | L1 | C 1 | CC 0 | ( 0) | T_OPEN_TAG | [ 0]:
1 | L1 | C 1 | CC 0 | ( 0) | T_STRING | [ 2]: //
2 | L1 | C 3 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
3 | L1 | C 4 | CC 0 | ( 0) | T_DOC_COMMENT_OPEN_TAG | [ 3]: /**
4 | L1 | C 7 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE | [ 0]:
5 | L2 | C 1 | CC 0 | ( 0) | T_STRING | [ 2]: //
6 | L2 | C 3 | CC 0 | ( 0) | T_WHITESPACE | [ 2]: ⸱⸱
7 | L2 | C 5 | CC 0 | ( 0) | T_MULTIPLY | [ 1]: *
8 | L2 | C 6 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
9 | L2 | C 7 | CC 0 | ( 0) | T_STRING | [ 4]: this
10 | L2 | C 11 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
11 | L2 | C 12 | CC 0 | ( 0) | T_STRING | [ 2]: is
12 | L2 | C 14 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
13 | L2 | C 15 | CC 0 | ( 0) | T_STRING | [ 1]: a
14 | L2 | C 16 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
15 | L2 | C 17 | CC 0 | ( 0) | T_STRING | [ 7]: comment
16 | L2 | C 24 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
17 | L3 | C 1 | CC 0 | ( 0) | T_STRING | [ 2]: //
18 | L3 | C 3 | CC 0 | ( 0) | T_WHITESPACE | [ 2]: ⸱⸱
19 | L3 | C 5 | CC 0 | ( 0) | T_MULTIPLY | [ 1]: *
20 | L3 | C 6 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
21 | L3 | C 7 | CC 0 | ( 0) | T_STRING | [ 4]: this
22 | L3 | C 11 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
23 | L3 | C 12 | CC 0 | ( 0) | T_STRING | [ 2]: is
24 | L3 | C 14 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
25 | L3 | C 15 | CC 0 | ( 0) | T_STRING | [ 4]: also
26 | L3 | C 19 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
27 | L3 | C 20 | CC 0 | ( 0) | T_STRING | [ 1]: a
28 | L3 | C 21 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
29 | L3 | C 22 | CC 0 | ( 0) | T_STRING | [ 7]: comment
30 | L3 | C 29 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
31 | L4 | C 1 | CC 0 | ( 0) | T_STRING | [ 2]: //
32 | L4 | C 3 | CC 0 | ( 0) | T_WHITESPACE | [ 2]: ⸱⸱
33 | L4 | C 5 | CC 0 | ( 0) | T_MULTIPLY | [ 1]: *
34 | L4 | C 6 | CC 0 | ( 0) | T_DIVIDE | [ 1]: /
35 | L4 | C 7 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
36 | L5 | C 1 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
37 | L6 | C 1 | CC 0 | ( 0) | T_CLOSE_TAG | [ 0]:
What I mean to point out by posting these tokenizations is that the comment line tokenization is completely borked for this code sample, what with all lines after the first line getting confused.
@fredden Did you notice this ? What is your opinion on this ?
As I said above, I'm okay with just accepting this fix as-is and not spending more time on this as CSS tokenizer support will be dropped, but I did want to document this finding.
@fredden I'm also trying to remember if //
is even a valid comment format in CSS and IIRC it is not, so the fact that a file which would cause a "parse error" in CSS would lead to tokenization problems is not really surprising.
Thanks for looking at this @jrfnl
the comment line tokenization is completely borked
Yes, I did notice this. I don't yet understand why there's any special-case handling for comments in the CSS tokenizer. I decided to not spend additional brain-power working this out though, as the plan is to delete non-PHP tokenizers for version 4.0. This pull request doesn't make it any worse; I intentionally took a minimal-change approach.
if
//
is even a valid comment format in CSS
No, I don't think it is; I'll check... Some basic local testing with the three browsers I have installed currently all show that //
is not treated as a comment. https://lesscss.org/#comments says that LESS supports that comment syntax; the file I was working with when the problem came to light was a LESS file. https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Organizing#comment_your_css shows only /* ... */
style comments.
if
//
is even a valid comment format in CSSNo, I don't think it is; I'll check... Some basic local testing with the three browsers I have installed currently all show that
//
is not treated as a comment. https://lesscss.org/#comments says that LESS supports that comment syntax; the file I was working with when the problem came to light was a LESS file. https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Organizing#comment_your_css shows only/* ... */
style comments.
@fredden Thanks for checking that.
IIRC the CSS tokenizer never had any support for LESS. If it worked, that should be considered coincidental, not by design.
As CSS doesn't support the //
comment syntax, I'm now inclined to close this PR as accepting it would expand the scope of the CSS Tokenizer to a syntax which is not even valid in CSS.
As CSS doesn't support the
//
comment syntax, I'm now inclined to close this PR as accepting it would expand the scope of the CSS Tokenizer to a syntax which is not even valid in CSS.
I don't think it's valid for the CSS tokenizer to inject a rogue ?>
token where it doesn't exist in the source file. The PHP tokenizer does not do this.
?>
tokenI haven't seen any cases of phpcs
causing issues with this (invalid CSS) syntax, but I have not gone looking for any cases either. The browsers that I have installed here all cope gracefully with a "selector" which has //
in it, and parse the rest of the CSS in the file/element as expected. I don't know to what extent people are using PHP_CodeSniffer on CSS files.
In my experience, this bug is mostly masked. It shows up when PHP_CodeSniffer tries to re-write the file as it will output all the tokens into the file after phpcbf
has done its magic. With this extra / rogue token*, the resulting file is different in more ways than just what the selected sniffs have intentionally modified.
* I have not looked into why ^PHPCS_CSS_T_CLOSE_TAG^
gets added twice during auto-fixing, as eliminating the rogue ?>
makes that symptom go away.
@fredden Okay, I'm going to merge this PR as the rogue token output is just weird. Having said that, I would caution that further PRs adding support for LESS to the CSS tokenizer are strongly discouraged.
Agreed. LESS and CSS are not the same and one shouldn't be parsed as the other. I look forward to version 4.0 where these other-language tokenizers will be gone.
FYI: this fix is included in today's PHP_CodeSniffer 3.8.0 release.
As per #3932, development on PHP_CodeSniffer will continue in the PHPCSStandards/PHP_CodeSniffer repository. If you want to stay informed, you may want to start "watching" that repo (or watching releases from that repo).
Yes, I know that the non-PHP tokenizers are going away in version 4.0. Meanwhile, the 3.x release line has a bug which can be fixed.
Description
While working with
phpcbf
, I noticed that some LESS files (which are parsed as CSS due to the specific ruleset being used) had unexpected content injected into them.(Note the extra trailing newline at the end of the file.)
phpcs --basepath=. -s --standard=Squiz --sniffs=Squiz.WhiteSpace.SuperfluousWhitespace --extensions=less/CSS test.less
After running
phpcbf
(with the same flags as above), I get a file with this content:(Note the extra newline at the end of the file has been fixed.)
Suggested changelog entry
Fixed bug in CSS tokenizer
Types of changes
PR checklist