Character and string token definitions need updating.

ehuss commented 5 years ago

There are multiple issues here. Some of this has changed in 1.37 via https://github.com/rust-lang/rust/pull/60793.

[x] RAW_BYTE_STRING_LITERAL no longer allows bare CR (new 1.37). #1459
[x] The "raw string" and "raw byte string" descriptions need to be updated to say that CRLF is converted to LF (new in 1.37). #1459
[ ] Several tokens need to sync the English text with the "Lexer" definition.
- STRING_LITERAL indicates several rules (for example, isolated CRs are not allowed), but the text does not mention any of those restrictions.
- CHAR_LITERAL says "single Unicode character…except U+0027", which is incomplete.
- RAW_STRING_LITERAL does not allow bare CRs.
- BYTE_LITERAL escapes are not described.
- BYTE_STRING_LITERAL restrictions are not described.
- In general, just make sure they are all in sync!
[x] Typo in RAW_BYTE_STRING_CONTENT, points to RAW_STRING_CONTENT when it should be RAW_BYTE_STRING_CONTENT. #818
[x] I cannot find anywhere that mentions CRLF in a string is converted to LF. Am I blind? #1459
[x] The description for string continuations says "\ immediately before U+000A", but the backslash can also come before CRLF. How should this be handled? I haven't looked at how it is implemented, but are all CRLFs translated everywhere? Should there just be a blanket statement somewhere about this, to avoid having to discuss it in every string literal definition? #1459

I may be missing some things here. Everything needs a very thorough review to make sure it is correct and up to date with the changes from rust-lang/rust#60793.
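
As a rough illustration of the two newline rules in the checklist above (CRLF is converted to LF, and a bare CR is rejected), here is a minimal sketch. The function `normalize_newlines` is a hypothetical helper written for this comment. It is not how rustc implements the rule, and it says nothing about where in the pipeline the conversion happens.

```rust
/// Hypothetical helper (not rustc's implementation): converts every CRLF
/// pair to a single LF and rejects any CR that is not followed by LF.
/// On failure, returns the byte offset of the offending bare CR.
fn normalize_newlines(src: &str) -> Result<String, usize> {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.char_indices().peekable();
    while let Some((offset, c)) = chars.next() {
        if c == '\r' {
            match chars.peek() {
                // CRLF: drop the CR here; the LF is pushed on the next iteration.
                Some(&(_, '\n')) => {}
                // Bare CR (followed by something else, or at end of input).
                _ => return Err(offset),
            }
        } else {
            out.push(c);
        }
    }
    Ok(out)
}
```

Under this sketch, `normalize_newlines("a\r\nb")` yields `"a\nb"`, while `normalize_newlines("a\rb")` reports the bare CR at offset 1. A single blanket rule like this is one way to avoid repeating the CRLF discussion in every string literal definition.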

mattheww commented 9 months ago

https://github.com/rust-lang/rust/pull/118699#issuecomment-1852867466 should be helpful.

mattheww commented 9 months ago

~The current description says that forms like 'a'b are acceptable as a BYTE_LITERAL with a suffix, but in fact they're rejected (to avoid confusion with two LIFETIME_LABEL tokens).~

The current description says that forms like 'ab'c are acceptable as two LIFETIME_LABEL tokens, but in fact they're rejected ("character literal may only contain one codepoint"; the c is taken as a suffix).

Perhaps this could be documented via another reserved form.
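
For comparison, here is a small sketch of the neighbouring forms that are accepted, using a token-counting macro. The macro `count_tts` and the two wrapper functions are names made up for this illustration; the rejected form is left in a comment because it does not compile.

```rust
/// Hypothetical helper macro: counts the token trees it is given.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

/// Two lifetime-like tokens separated by whitespace are two token trees.
fn two_lifetimes() -> usize {
    count_tts!('a 'b)
}

/// A quoted single character is one character literal token.
fn one_char_literal() -> usize {
    count_tts!('a')
}

// Rejected at lexing time, as described above
// ("character literal may only contain one codepoint"):
// count_tts!('ab'c)
```

Here `two_lifetimes()` evaluates to 2 and `one_char_literal()` to 1.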

mattheww commented 9 months ago

A form like b"\u{00a0}" is rejected at lexing time ("unicode escape in byte string").

But as it doesn't match either BYTE_STRING_LITERAL or RESERVED_TOKEN_DOUBLE_QUOTE, the current description says there's a valid tokenisation as the identifier b followed by "\u{00a0}".

So if we keep on with the current mechanism for documenting such rejected tokens, I think we'd need yet more reserved forms.

There are probably other similar cases. I think after rust-lang/rust#119172 a C string literal containing a NUL is one.
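
To make the byte string case concrete, here is a minimal sketch of the accepted spellings next to the rejected one. The function names are made up for this illustration, and the rejected literal is left in a comment because it does not compile.

```rust
/// A byte string literal using a `\xHH` escape: exactly one byte, 0xA0.
fn byte_string_bytes() -> &'static [u8] {
    b"\xa0"
}

/// An ordinary string literal using a Unicode escape: U+00A0 is stored
/// as its two-byte UTF-8 encoding.
fn str_bytes() -> &'static [u8] {
    "\u{00a0}".as_bytes()
}

// Rejected at lexing time, as described above
// ("unicode escape in byte string"):
// let bad = b"\u{00a0}";
```

With these definitions, `byte_string_bytes()` is `[0xA0]` and `str_bytes()` is `[0xC2, 0xA0]`.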

rust-lang / reference

Character and string token definitions need updating. #626