PHP 8.3: "yield /*comment*/ from" is no longer a parse error ?

jrfnl commented 2 months ago

Description

The following code:

<?php
function generator()
{
    yield from gen2();

    yield /* comment */ from gen2();

    yield
    from
    gen2();

    yield // comment
    from gen2();

    yield
    /* comment */
    from
    gen2();
}

https://3v4l.org/2SI2Q#veol

Resulted in this output:

_no output_

But I expected this output instead:

Parse error: syntax error, unexpected identifier "gen2" in /in/2SI2Q on line 7

For a full analysis of the differences in tokenization, see: https://github.com/PHPCSStandards/PHP_CodeSniffer/issues/529#issuecomment-2209337419

While I'm not necessarily challenging this change, I'd like to know if this was a deliberate/intentional change.

I have not been able to find anything in the PHP 8.3 CHANGELOG about this change, nor in the NEWS file. I can't even seem to find the commit which caused this change.

If this change was intentional, this is probably a documentation issue and the change should be annotated in the PHP 8.3 CHANGELOG.

If this change was unintentional, I believe it may be prudent to revert the change (or at least revert the side-effects of the commit which incidentally caused this change).

I'd like to get some clarity about this as it will inform how the PHP_CodeSniffer issue linked above should be fixed.

PHP Version

PHP 8.3.8

Operating System

Not relevant

damianwadley commented 2 months ago

This was changed by @iluuu1994 during https://github.com/php/php-src/commit/f291d37a1a7d78e841f40a4410359548bc73de1b in response to #10083...

Tip: for grammar questions, one of the first places to check for answers is zend_language_scanner.

nielsdos commented 2 months ago

Damnit, beat me on commenting literally by seconds :joy: Indeed, this is a side effect of the bugfix and imo not a bug.

jrfnl commented 2 months ago

I do kind of consider it a bug as it is an undocumented change in behaviour - the changelog entry really doesn't cover the potential/actual impact of the change.

Knowing where this is coming from, I'm surprised I haven't seen more breakage due to this change in PHPCS. Then again, that would require the PHPCS tests to hit the changed tokenization behaviour and as this kind of thing was previously a parse error, I can understand why this is not covered by the existing tests.

If this change stays, in my opinion, the changelog entry needs to be updated to more comprehensively state what has actually changed and what the impact of this. I also think it should be mentioned in the CHANGELOG, not just in NEWS.

Additionally, I'd recommend for tests to be added with the above code sample for yield from to prevent a regression in the future. Preferably tests which would not just check that the above code sample is not regarded as a parse error, but tests which also check the tokenization is consistent.

jrfnl commented 2 months ago

P.S.: Thanks @damianwadley and @nielsdos for your fast response to my query!

cmb69 commented 2 months ago

Since this affects PHP 8.3, I'm not sure whether updating CHANGELOG makes much sense now, but it's certainly worth documenting it in the migration guide of the PHP documentation.

iluuu1994 commented 2 months ago

@jrfnl I indeed didn't think about linters, this should have gotten an UPGRADING entry. Sorry about that.

mvorisek commented 2 months ago

IMO it is not a bug, as comments are (should be) allowed between any code token in general.

This is probably already documented or at least it is "generally expected". So https://github.com/php/php-src/commit/f291d37a1a7d78e841f40a4410359548bc73de1b is a fix.

iluuu1994 commented 2 months ago

@mvorisek What Juliette references is the patch changing the output of PhpToken::tokenize(), with the tokenizer treating yield /* comment */ from as a singular token. This is the only feasible solution without switching to a GLR parser.

mvorisek commented 2 months ago

That is indeed not good and such token should be broken into separate tokens further.

repro: https://3v4l.org/7VmX1

jrfnl commented 2 months ago

Forgive me if I don't use the correct terminology for some things, I'm hoping that even if the terminology is not 100% correct for C, you can still understand what I'm trying to say.

IMO it is not a bug, as comments are (should be) allowed between any code token in general.

... which is exactly the reason why I was testing the Tokenizer behaviour with comments between the words.

Having said that, it's also not true....

In nearly all cases, tokens consist of a "single entity", so no comments can exist within a token. This includes "compound operators", like ??, where no whitespace or comment is allowed between the first and the second ?.

A typical case where the tokenization changed from multiple tokens to single token is the PHP 8.0 "namespaced names" change. However, that change also explicitly removed the ability to have whitespace or comments within a namespaced name.

Another one which comes to mind are cast tokens, like (bool). These tokens can contain whitespace, i.e. ( bool ), but they cannot contain comments. See: https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8

So, in practice, yield from is (AFAICS) the only "multi-entity" token which can contain comments... (since PHP 8.3).

_Retrospectively, it would have been better if yield from would have been introduced as yield_from (note the underscore). This would be in line with other multi-word keywords, like include_once or as yieldfrom, which would be in line with keywords like endif, but considering it has been in the language for such a long time now, changing this now would be a huge BC-break._

I consider the change in https://github.com/php/php-src/commit/f291d37a1a7d78e841f40a4410359548bc73de1b a bug as it is an undocumented (breaking) change in behaviour for the Tokenizer for projects consuming the token output. That bug could be fixed by either fixing/updating the documentation or by reverting the change.

So, what remains then is the discussion about how yield from should be tokenized.

I'm not sure about this and you can also see some thoughts about this in the linked PHPCS ticket (as PHPCS needs to ensure the token stream sniffs receive is the same on all PHP versions, so I need to make a choice on how to get round the changed tokenization).

On the one hand I agree that comments between the keywords should be ignored - as they mostly are elsewhere -, on the other hand, comments cause a parse error in other "multi entity" tokens, so this change introduces an inconsistency.

iluuu1994 commented 2 months ago

In nearly all cases, tokens consist of a "single entity", so no comments can exist within a token. This includes "compound operators", like ??, where no whitespace or comment is allowed between the first and the second ?.

That's true, but what qualifies as a token is usually very intuitive. yield from is one of the few exceptions, and it's primary rationale for being a single token to avoid making from itself a keyword.

So, in practice, yield from is (AFAICS) the only "multi-entity" token which can contain comments... (since PHP 8.3).

I think that's correct. The other cases only use lookahead for comments.

So, what remains then is the discussion about how yield from should be tokenized.

Unfortunately, for yield from to not cause parser ambiguitites, it must be a single token. Hence, when a comment does appear in that spot, it is implicitly part of the token. Since 8.3 has been out for a while, I don't think removing it now is feasible, and if it is removed from 8.4, then that still leaves 8.3 as an issue for you.

It might be possible for you to artificially split yield /* comment */ from into multiple tokens. It's not great to push that responsibility onto users, but I don't see a great alternative at the moment. If you have any other ideas that might help with your situation, please let me know.

jrfnl commented 2 months ago

Unfortunately, for yield from to not cause parser ambiguitites, it must be a single token. Hence, when a comment does appear in that spot, it is implicitly part of the token.

And that's what I don't necessarily agree with - and nor did PHP 7.0 - 8.2 in which this was a parse error.

Since 8.3 has been out for a while, I don't think removing it now is feasible

And again, I don't agree with this. I only discovered this bug now as the change wasn't documented. That doesn't make it any less of a bug/inconsistency with every single other token, as none of them allow for comments within the token.

If it had been included in the changelog, I would probably have questioned the change/introduction of this inconsistency before the release was out.

Note: I'm only challenging the change to the yield from tokenization. Not the other changes in the same commit.

cmb69 commented 2 months ago

Well, we can't go back in time, so the question is what would break more code: reverting the change for yield from, or leaving it as is. I'd hope the former, but I'm afraid the latter might actually be the case (coders are so "creative": "let's just put a comment between yield and from, because we can").

iluuu1994 commented 2 months ago

I would agree with Christoph. From what I understand, PHP_CodeSniffer would either not correctly format the comment, or mistakenly remove it. While that's not great, it's not breaking. Removing parsing at this point would fatal error on a PHP patch update, and I think that's generally a no-go. Please correct me if my understanding is incorrect.

jrfnl commented 2 months ago

The other side of that argument is that the chances of people having discovered that they can place a comment between yield and from now are slim as the change was not documented/publicized and has only been out 7 months...

cmb69 commented 2 months ago

Maybe we should ask for opinions on the internals mailing list?

nielsdos commented 2 months ago

But what is the "solution" anyway? Complicating the lexer to deal with "yield from" specially?

cmb69 commented 2 months ago

But what is the "solution" anyway? Complicating the lexer to deal with "yield from" specially?

I think this is about reverting

<ST_IN_SCRIPTING>"yield"{WHITESPACE_OR_COMMENTS}"from"[^a-zA-Z0-9_\x80-\xff] {

to the previous

<ST_IN_SCRIPTING>"yield"{WHITESPACE}"from"[^a-zA-Z0-9_\x80-\xff] {

or not.

bwoebi commented 2 months ago

I think the proper way of fixing this would be introducing T_FROM, in a special lexer mode.

I.e.:

<ST_IN_SCRIPTING>"yield"{WHITESPACE_OR_COMMENTS}"from"[^a-zA-Z0-9_\x80-\xff] {
    yyless(5);
    yy_push_state(ST_LOOKING_FOR_FROM);
    RETURN_TOKEN(T_YIELD);
}

<ST_LOOKING_FOR_FROM>"from" {
    yy_pop_state();
    RETURN_TOKEN(T_FROM);
}

And then add ST_LOOKING_FOR_FROM to the whitespace, comment etc. targets.

Sort of related issue is #14961. Basically needs the same handling than that one.

jrfnl commented 2 months ago

@bwoebi Wouldn't that be a much larger BC-break as it would turn from into a (partially?) reserved keyword ? and would remove the T_YIELD_FROM token ?

bwoebi commented 2 months ago

@jrfnl It would be a keyword only if preceded by yield, i.e. changing nothing on the status quo. But yes, it would remove the T_YIELD_FROM token (and a T_COMMENT could now be between T_YIELD and T_FROM).

cmb69 commented 2 months ago

@bwoebi, I guess that would cause even more trouble for @jrfnl and others.

Anyhow, we need a solution soon; otherwise the solution would be to stick with what we have since PHP 8.3.10 is already on it's way (RC1 has been tagged yesterday).

jrfnl commented 2 months ago

I've posted to Internals now: https://externals.io/message/124462 Let's see if we get a response.

cmb69 commented 1 month ago

Hmm, the discussion on the internals mailing list ended without consensus, not even agreement that the current behavior is a bug; as such we can at most keep that as feature request, but I guess that would not help @jrfnl in any way, and may cause even more work(-arounds).

So, what to do, @jrfnl?

jrfnl commented 1 month ago

@cmb69 Ha, you beat me to it.

Interesting to see your take on it. The way I read it, is that there is consensus that there is a bug, just not whether the bug is the change in the tokenizer or the missing documentation ;-)

I also see consensus that the current tokenization is not ideal, but no consensus on how it should be changed and whether there should be consistency rules for whitespace/comments in tokens.

So, what to do, @jrfnl?

As I also shared on the list:

I agree the improved tokenization proposed in the PR makes better sense.

Having said that, it will make the breaking change much much larger as it now would no longer just be a change which can be handled in the PHPCS token stream, but one which will impact individual sniffs, what with the removal and introduction of new tokens.

If that PR gets merged, it will mean that PHPCS will need to undo the PHP 8.3 and PHP 8.4 tokenization for the time being (in the 3.x major) and will only "polyfill" the new tokenization as of PHPCS 4.0. In practice, we would be moving the breaking change to PHPCS 4.0 version to not break sniffs.

cmb69 commented 1 month ago

Interesting to see your take on it. The way I read it, is that there is consensus that there is a bug, just not whether the bug is the change in the tokenizer or the missing documentation ;-)

Oh, right. In this case I was referring to an implementation bug, since documenting the issue wouldn't really help you.

Regarding PR #15041: I feel that this is curing the symptoms, but not the desease, namely that from is not a keyword. Changing this in a minor version is off the table, and even changing this in a major may not be accepted. And anyway, it wouldn't help right now with the issue at hand.

jrfnl commented 1 month ago

And anyway, it wouldn't help right now with the issue at hand.

Well, until there is clarity where this is going, I can't fix things for PHPCS, so PHPCS will stay broken until a solution is found.

cmb69 commented 1 month ago

Well, who is up to decide what we do about this? Maybe the 8.3 release managers have something to say: @adoy, @bukka, @ericmann, thoughts?

bukka commented 1 month ago

I don't think changing that in 8.3 is acceptable. It's been out for some time and doing such break in bug fixing release is not a good idea IMHO. We allow some breaks in minor version so this is just a break in minor version even though it is a bit unfortunate.

cmb69 commented 1 month ago

Okay, if we don't change the behavior, we should document the issue. Thus, I've submitted PR #15276.

iluuu1994 commented 1 month ago

Thank you! That makes sense to me. Let's not merge https://github.com/php/php-src/pull/15041 without a further discussion or RFC then.

cmb69 commented 1 month ago

PR #15276 has been applied; I'm not transferring this ticket to doc-en to avoid confusion (it's a rather long discussion), but will update the PHP manual soonish.

php / php-src