Open jrfnl opened 2 months ago
This was changed by @iluuu1994 during https://github.com/php/php-src/commit/f291d37a1a7d78e841f40a4410359548bc73de1b in response to #10083...
Tip: for grammar questions, one of the first places to check for answers is zend_language_scanner.
Damnit, beat me on commenting literally by seconds :joy: Indeed, this is a side effect of the bugfix and imo not a bug.
I do kind of consider it a bug as it is an undocumented change in behaviour - the changelog entry really doesn't cover the potential/actual impact of the change.
Knowing where this is coming from, I'm surprised I haven't seen more breakage due to this change in PHPCS. Then again, that would require the PHPCS tests to hit the changed tokenization behaviour and as this kind of thing was previously a parse error, I can understand why this is not covered by the existing tests.
If this change stays, in my opinion, the changelog entry needs to be updated to more comprehensively state what has actually changed and what the impact of this. I also think it should be mentioned in the CHANGELOG, not just in NEWS.
Additionally, I'd recommend for tests to be added with the above code sample for yield from
to prevent a regression in the future. Preferably tests which would not just check that the above code sample is not regarded as a parse error, but tests which also check the tokenization is consistent.
P.S.: Thanks @damianwadley and @nielsdos for your fast response to my query!
Since this affects PHP 8.3, I'm not sure whether updating CHANGELOG makes much sense now, but it's certainly worth documenting it in the migration guide of the PHP documentation.
@jrfnl I indeed didn't think about linters, this should have gotten an UPGRADING entry. Sorry about that.
IMO it is not a bug, as comments are (should be) allowed between any code token in general.
This is probably already documented or at least it is "generally expected". So https://github.com/php/php-src/commit/f291d37a1a7d78e841f40a4410359548bc73de1b is a fix.
@mvorisek What Juliette references is the patch changing the output of PhpToken::tokenize()
, with the tokenizer treating yield /* comment */ from
as a singular token. This is the only feasible solution without switching to a GLR parser.
That is indeed not good and such token should be broken into separate tokens further.
repro: https://3v4l.org/7VmX1
Forgive me if I don't use the correct terminology for some things, I'm hoping that even if the terminology is not 100% correct for C, you can still understand what I'm trying to say.
IMO it is not a bug, as comments are (should be) allowed between any code token in general.
... which is exactly the reason why I was testing the Tokenizer behaviour with comments between the words.
Having said that, it's also not true....
In nearly all cases, tokens consist of a "single entity", so no comments can exist within a token.
This includes "compound operators", like ??
, where no whitespace or comment is allowed between the first and the second ?
.
A typical case where the tokenization changed from multiple tokens to single token is the PHP 8.0 "namespaced names" change. However, that change also explicitly removed the ability to have whitespace or comments within a namespaced name.
Another one which comes to mind are cast tokens, like (bool)
. These tokens can contain whitespace, i.e. ( bool )
, but they cannot contain comments. See: https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8
So, in practice, yield from
is (AFAICS) the only "multi-entity" token which can contain comments... (since PHP 8.3).
_Retrospectively, it would have been better if yield from
would have been introduced as yield_from
(note the underscore). This would be in line with other multi-word keywords, like include_once
or as yieldfrom
, which would be in line with keywords like endif
, but considering it has been in the language for such a long time now, changing this now would be a huge BC-break._
I consider the change in https://github.com/php/php-src/commit/f291d37a1a7d78e841f40a4410359548bc73de1b a bug as it is an undocumented (breaking) change in behaviour for the Tokenizer for projects consuming the token output. That bug could be fixed by either fixing/updating the documentation or by reverting the change.
So, what remains then is the discussion about how yield from
should be tokenized.
I'm not sure about this and you can also see some thoughts about this in the linked PHPCS ticket (as PHPCS needs to ensure the token stream sniffs receive is the same on all PHP versions, so I need to make a choice on how to get round the changed tokenization).
On the one hand I agree that comments between the keywords should be ignored - as they mostly are elsewhere -, on the other hand, comments cause a parse error in other "multi entity" tokens, so this change introduces an inconsistency.
In nearly all cases, tokens consist of a "single entity", so no comments can exist within a token. This includes "compound operators", like ??, where no whitespace or comment is allowed between the first and the second ?.
That's true, but what qualifies as a token is usually very intuitive. yield from
is one of the few exceptions, and it's primary rationale for being a single token to avoid making from
itself a keyword.
So, in practice, yield from is (AFAICS) the only "multi-entity" token which can contain comments... (since PHP 8.3).
I think that's correct. The other cases only use lookahead for comments.
So, what remains then is the discussion about how yield from should be tokenized.
Unfortunately, for yield from
to not cause parser ambiguitites, it must be a single token. Hence, when a comment does appear in that spot, it is implicitly part of the token. Since 8.3 has been out for a while, I don't think removing it now is feasible, and if it is removed from 8.4, then that still leaves 8.3 as an issue for you.
It might be possible for you to artificially split yield /* comment */ from
into multiple tokens. It's not great to push that responsibility onto users, but I don't see a great alternative at the moment. If you have any other ideas that might help with your situation, please let me know.
Unfortunately, for
yield from
to not cause parser ambiguitites, it must be a single token. Hence, when a comment does appear in that spot, it is implicitly part of the token.
And that's what I don't necessarily agree with - and nor did PHP 7.0 - 8.2 in which this was a parse error.
Since 8.3 has been out for a while, I don't think removing it now is feasible
And again, I don't agree with this. I only discovered this bug now as the change wasn't documented. That doesn't make it any less of a bug/inconsistency with every single other token, as none of them allow for comments within the token.
If it had been included in the changelog, I would probably have questioned the change/introduction of this inconsistency before the release was out.
Note: I'm only challenging the change to the yield from
tokenization. Not the other changes in the same commit.
Well, we can't go back in time, so the question is what would break more code: reverting the change for yield
from
, or leaving it as is. I'd hope the former, but I'm afraid the latter might actually be the case (coders are so "creative": "let's just put a comment between yield
and from
, because we can").
I would agree with Christoph. From what I understand, PHP_CodeSniffer would either not correctly format the comment, or mistakenly remove it. While that's not great, it's not breaking. Removing parsing at this point would fatal error on a PHP patch update, and I think that's generally a no-go. Please correct me if my understanding is incorrect.
The other side of that argument is that the chances of people having discovered that they can place a comment between yield
and from
now are slim as the change was not documented/publicized and has only been out 7 months...
Maybe we should ask for opinions on the internals mailing list?
But what is the "solution" anyway? Complicating the lexer to deal with "yield from" specially?
But what is the "solution" anyway? Complicating the lexer to deal with "yield from" specially?
I think this is about reverting
<ST_IN_SCRIPTING>"yield"{WHITESPACE_OR_COMMENTS}"from"[^a-zA-Z0-9_\x80-\xff] {
to the previous
<ST_IN_SCRIPTING>"yield"{WHITESPACE}"from"[^a-zA-Z0-9_\x80-\xff] {
or not.
I think the proper way of fixing this would be introducing T_FROM, in a special lexer mode.
I.e.:
<ST_IN_SCRIPTING>"yield"{WHITESPACE_OR_COMMENTS}"from"[^a-zA-Z0-9_\x80-\xff] {
yyless(5);
yy_push_state(ST_LOOKING_FOR_FROM);
RETURN_TOKEN(T_YIELD);
}
<ST_LOOKING_FOR_FROM>"from" {
yy_pop_state();
RETURN_TOKEN(T_FROM);
}
And then add ST_LOOKING_FOR_FROM
to the whitespace, comment etc. targets.
Sort of related issue is #14961. Basically needs the same handling than that one.
@bwoebi Wouldn't that be a much larger BC-break as it would turn from
into a (partially?) reserved keyword ? and would remove the T_YIELD_FROM
token ?
@jrfnl It would be a keyword only if preceded by yield
, i.e. changing nothing on the status quo. But yes, it would remove the T_YIELD_FROM token (and a T_COMMENT could now be between T_YIELD and T_FROM).
@bwoebi, I guess that would cause even more trouble for @jrfnl and others.
Anyhow, we need a solution soon; otherwise the solution would be to stick with what we have since PHP 8.3.10 is already on it's way (RC1 has been tagged yesterday).
I've posted to Internals now: https://externals.io/message/124462 Let's see if we get a response.
Hmm, the discussion on the internals mailing list ended without consensus, not even agreement that the current behavior is a bug; as such we can at most keep that as feature request, but I guess that would not help @jrfnl in any way, and may cause even more work(-arounds).
So, what to do, @jrfnl?
@cmb69 Ha, you beat me to it.
Interesting to see your take on it. The way I read it, is that there is consensus that there is a bug, just not whether the bug is the change in the tokenizer or the missing documentation ;-)
I also see consensus that the current tokenization is not ideal, but no consensus on how it should be changed and whether there should be consistency rules for whitespace/comments in tokens.
So, what to do, @jrfnl?
As I also shared on the list:
I agree the improved tokenization proposed in the PR makes better sense.
Having said that, it will make the breaking change much much larger as it now would no longer just be a change which can be handled in the PHPCS token stream, but one which will impact individual sniffs, what with the removal and introduction of new tokens.
If that PR gets merged, it will mean that PHPCS will need to undo the PHP 8.3 and PHP 8.4 tokenization for the time being (in the 3.x major) and will only "polyfill" the new tokenization as of PHPCS 4.0. In practice, we would be moving the breaking change to PHPCS 4.0 version to not break sniffs.
Interesting to see your take on it. The way I read it, is that there is consensus that there is a bug, just not whether the bug is the change in the tokenizer or the missing documentation ;-)
Oh, right. In this case I was referring to an implementation bug, since documenting the issue wouldn't really help you.
Regarding PR #15041: I feel that this is curing the symptoms, but not the desease, namely that from
is not a keyword. Changing this in a minor version is off the table, and even changing this in a major may not be accepted. And anyway, it wouldn't help right now with the issue at hand.
And anyway, it wouldn't help right now with the issue at hand.
Well, until there is clarity where this is going, I can't fix things for PHPCS, so PHPCS will stay broken until a solution is found.
Well, who is up to decide what we do about this? Maybe the 8.3 release managers have something to say: @adoy, @bukka, @ericmann, thoughts?
I don't think changing that in 8.3 is acceptable. It's been out for some time and doing such break in bug fixing release is not a good idea IMHO. We allow some breaks in minor version so this is just a break in minor version even though it is a bit unfortunate.
Okay, if we don't change the behavior, we should document the issue. Thus, I've submitted PR #15276.
Thank you! That makes sense to me. Let's not merge https://github.com/php/php-src/pull/15041 without a further discussion or RFC then.
PR #15276 has been applied; I'm not transferring this ticket to doc-en to avoid confusion (it's a rather long discussion), but will update the PHP manual soonish.
Description
The following code:
https://3v4l.org/2SI2Q#veol
Resulted in this output:
But I expected this output instead:
For a full analysis of the differences in tokenization, see: https://github.com/PHPCSStandards/PHP_CodeSniffer/issues/529#issuecomment-2209337419
While I'm not necessarily challenging this change, I'd like to know if this was a deliberate/intentional change.
I have not been able to find anything in the PHP 8.3 CHANGELOG about this change, nor in the NEWS file. I can't even seem to find the commit which caused this change.
If this change was intentional, this is probably a documentation issue and the change should be annotated in the PHP 8.3 CHANGELOG.
If this change was unintentional, I believe it may be prudent to revert the change (or at least revert the side-effects of the commit which incidentally caused this change).
I'd like to get some clarity about this as it will inform how the PHP_CodeSniffer issue linked above should be fixed.
PHP Version
PHP 8.3.8
Operating System
Not relevant