jimbaker / tagstr

This repo contains an issue tracker, examples, and early work related to PEP 999: Tag Strings

Update proof of concept with respect to PEP 701 changes #22

Open jimbaker opened 1 year ago

jimbaker commented 1 year ago

PEP 701, as tracked in the CPython implementation and recently merged by @pablogsal , https://github.com/python/cpython/issues/102856, required changes to the tokenizer and parser for the cleaned-up implementation of f-strings. This impacts the initial proof of concept work in #1 written by @gvanrossum.

First, we have discussed two possible syntaxes, as seen in the below examples:

  1. html'<li>{name}</li>' (this is what we have generally done, with the tag not being dotted and no whitespace between it and the string part)
  2. html @ f'<li>{name}</li>', as with the transpiler work by @rmorshea in #20

In my reading of the PEP, the easiest/most straightforward change is to leverage PEP 701 f-strings directly, producing these tokens in either case. (So the scheme in #20 more explicitly maps to this.)

NAME - "html"
FSTRING_START - "'" 
FSTRING_MIDDLE - '<li>'
LBRACE - '{'
NAME - 'name'
RBRACE - '}'
FSTRING_MIDDLE - '</li>'
FSTRING_END - "'"

In other words, this is exactly the tokenization produced now, apart from the leading NAME token. This tokenization should be straightforward to parse, including any necessary updates to the POC (possibly even removing more of the changes that were done, one can hope!).

The advantage of this approach (at least at first glance) is that we avoid duplicating a lot of the new work that was done for PEP 701, such as working with brace nesting, and then any subsequent tokenization of the interpolations with potential nesting.
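The token stream above can be inspected with the stdlib `tokenize` module on Python 3.12+, where PEP 701 f-string tokens are exposed (the leading tag NAME is of course not produced by a stock interpreter, so this only shows the f-string portion; the guard on the version is mine):

```python
# Show the PEP 701 token stream for an f-string using the stdlib
# tokenize module. Requires Python 3.12+, where tokenize emits
# FSTRING_START / FSTRING_MIDDLE / FSTRING_END tokens.
import io
import sys
import tokenize

src = "f'<li>{name}</li>'\n"
if sys.version_info >= (3, 12):
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
```

Note that the pure-Python `tokenize` module reports the braces as OP tokens rather than LBRACE/RBRACE, but the FSTRING_* structure matches the scheme above.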

rmorshea commented 1 year ago

To clarify, I wasn't proposing html @ f'<li>{name}</li>' as a competing syntax. I chose that purely because I was interested in using tag strings in ReactPy before the PEP itself was accepted. In that context, picking a currently valid syntax would allow me to use existing tooling (e.g. black, flake8, mypy) without issue.

pauleveritt commented 1 year ago

@rmorshea We just chatted about it, so good timing. Should we find a Discord somewhere to keep you in the loop?

rmorshea commented 1 year ago

@pauleveritt, I'm happy to hop on discord.

pauleveritt commented 1 year ago

@rmorshea I'm sitting in the Python Discord ATM as pauleveritt -- give me a nudge and we'll chat.

gvanrossum commented 1 year ago

I am slowly working through the update (starting from scratch, but expect to start copying from #1 soon). See https://github.com/python/cpython/compare/main...gvanrossum:tag-strings-v2?expand=1

gvanrossum commented 1 year ago

Ready, see https://github.com/python/cpython/pull/103835. Have fun!

jimbaker commented 1 year ago

My first reaction from only looking at the PR (without trying out the tag string examples) is that the use of the recent changes in PEP 701 has very nicely simplified the necessary code for tag strings. So this validates our thinking that PEP 701's formalization could help support tag strings and remove hacky code, even though it wasn't explicitly written to support them.

jimbaker commented 1 year ago

Most of the examples in this repo do work, but I saw two problems with the current PR.

First, a useful helper function:

def tag(*args):
    return args

So in general, it works as expected:

>>> tag"...{f(42)}..."
('...', (<function <lambda> at 0x7ff5202b7490>, 'f(42)', None, None), '...')

But one assertion failure:

>>> tag"...{x=}..."
python: Parser/action_helpers.c:1205: decode_fstring_buffer: Assertion `tok_mode->last_expr_buffer != NULL' failed.
Aborted

And it's unclear to me how we should handle Unicode escapes now (I guess I will have to read the code here more closely):

>>> tag"...\N{GRINNING FACE}..."
  File "<stdin>", line 1
    tag"...\N{GRINNING FACE}..."
              ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> tag"...\\N{GRINNING FACE}..."
  File "<stdin>", line 1
    tag"...\\N{GRINNING FACE}..."
               ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> f"...\N{GRINNING FACE}..."
'...πŸ˜€...'
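The args shape shown above (static strings interleaved with `(getvalue, raw, conversion, format_spec)` tuples) can be consumed by a tag function today by hand-building the call; a minimal sketch, with the hand-built tuple standing in for what the patched interpreter would pass:

```python
# Sketch of a tag function consuming tag-string arguments: static
# strings interleaved with (getvalue, raw, conversion, format_spec)
# interpolation tuples. The call below is hand-built, standing in for
# greet"Hello, {name}!" on the tag-strings branch.
def greet(*args):
    parts = []
    for arg in args:
        if isinstance(arg, str):
            parts.append(arg)
        else:
            getvalue, raw, conversion, format_spec = arg
            parts.append(format(getvalue(), format_spec or ''))
    return ''.join(parts)

name = 'world'
result = greet('Hello, ', (lambda: name, 'name', None, None), '!')
print(result)  # Hello, world!
```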

gvanrossum commented 1 year ago

Thanks for flagging those. I'll try to figure out what's going on with the crash.

For \N, what do "rf" strings do?

gvanrossum commented 1 year ago

I have pushed a fix for the crash. It was a bit weird.

I'm still pondering the \N thing. It works sanely in the old tag-string branch, so I think it's another bug.

gvanrossum commented 1 year ago

Hm, "sanely" may be an exaggeration. In the old branch tag"\N{SPACE}" treats {SPACE} as an interpolation. That matches what rf"\N{SPACE}" does, so maybe it's inevitable. It does mean that tag strings cannot easily use the \N{...} notation. At the very least that needs to be called out in PEP and documentation.

~A workaround in the user-level tag function could be to check if the previous item is a constant string ending in \N, and in that case do a unicode lookup of the string field. That feels brittle though (to be correct you'd have to count backslashes).~ [UPDATE: That doesn't work for tag"\N{GRINNING FACE}" because GRINNING FACE is not a valid expression.]

jimbaker commented 1 year ago

Speaking of the treatment of something like \N{GRINNING FACE} I had to consult myself from last year: https://github.com/jimbaker/tagstr/issues/6#issuecomment-1121648575

There have been some changes since then. Let's look at the following:

>>> def tag(*args): return args
...
>>> def decode(raw): return raw.encode('utf-8').decode('unicode-escape')
...
>>> decode(fr'\N{{GRINNING FACE}}')
'πŸ˜€'
>>> tag'...\N{{GRINNING FACE}}...'
  File "<stdin>", line 1
    tag'...\N{{GRINNING FACE}}...'
                                 ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 3-5: malformed \N character escape
>>> tag'...\\N{{GRINNING FACE}}...'
('...\\N{', 'GRINNING FACE}', '...')
>>> x = 42
>>> fr'...{x}...'
'...42...'

So at the very least we should expect it to behave like fr with '\N{{UNICODE NAME}}'. Supporting it as '\N{UNICODE NAME}' is trickier, as you mention, but seems closer to what a user would expect - the backslash is left unprocessed but can then be processed as expected, and it's still possible to just write the codepoint out as text.

As to which one is sane... it's an interesting corner case. The argument for processing it like fr is simply that there are various DSLs out there. While I cannot think of one that even uses the convention \N{SOME NAME}, it could exist.
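The `decode` helper used in the REPL session above is the cooking step a tag function would apply to raw static strings; as a standalone sketch (the caveat is my note, not from the thread: this round trip assumes the raw text is ASCII, since unicode-escape decodes the intermediate bytes as Latin-1 and would mangle literal non-ASCII characters):

```python
# Cook a raw static string: process backslash escapes, including
# \N{UNICODE NAME}, via the utf-8 / unicode-escape round trip.
# Caveat: literal non-ASCII characters in `raw` would be mangled,
# since unicode-escape treats the encoded bytes as Latin-1.
def decode(raw: str) -> str:
    return raw.encode('utf-8').decode('unicode-escape')

print(decode(r'\N{GRINNING FACE}'))  # πŸ˜€
print(decode(r'a\tb'))               # a<TAB>b
```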

gvanrossum commented 1 year ago

Pushed a fix for \N{GRINNING FACE}. It is a syntax error, because GRINNING FACE is not a valid expression:

>>> tag"\N{GRINNING FACE}"
  File "<stdin>", line 1
    tag"\N{GRINNING FACE}"
           ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> 

The workaround is to double the curlies:

>>> tag"\N{{GRINNING FACE}}"
('\\N{', 'GRINNING FACE}')
>>> 

jimbaker commented 1 year ago

Pushed a fix for \N{GRINNING FACE}. It is a syntax error, because GRINNING FACE is not a valid expression:

>>> tag"\N{GRINNING FACE}"
  File "<stdin>", line 1
    tag"\N{GRINNING FACE}"
           ^^^^^^^^^^^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> 

The workaround is to double the curlies:

>>> tag"\N{{GRINNING FACE}}"
('\\N{', 'GRINNING FACE}')
>>> 

This is probably the right way to do it, given that we get fr semantics for literals.

jimbaker commented 1 year ago

@gvanrossum Never mind for now - I tried running on another Linux box and it does work. It's always possible I have an out-of-date header or whatever for SQLite, or something else in the toolchain. But it's still strange that it only happens with this usage.

All examples in the examples directory now work (for some definition of "work"; I need to file a tests issue so some of them can be turned into automated tests).

~~try this example code - I'm getting a segmentation fault when I run it. It's moderately complex in terms of the nesting, so it might be exercising something interesting. Or maybe it's my bug, but I wouldn't expect a segmentation fault regardless: https://github.com/jimbaker/tagstr/blob/main/examples/sql.py~~

(While there is an example of usage as well with SQLAlchemy, it's an optional demo.)

gvanrossum commented 1 year ago

@gvanrossum try this example code - I'm getting a segmentation fault when I run it. It's moderately complex in terms of the nesting, so it might be exercising something interesting. Or maybe it's my bug, but I wouldn't expect a segmentation fault regardless: https://github.com/jimbaker/tagstr/blob/main/examples/sql.py

Hm, interesting. It's not in the parser or bytecode compiler, as it prints the message about pip installing sqlalchemy when I try to run it. Note: the instructions contain a bug -- you need to quote the greenlet>=2.0.0a2 argument to pip install otherwise it creates an interesting filename. Alas, even with that issue fixed, greenlet fails to compile for me (on Mac). So not sure how to continue debugging this.

gvanrossum commented 1 year ago

Hm, interesting that you have it working on a Linux box. Maybe sqlalchemy doesn't depend on greenlet there?

gvanrossum commented 1 year ago

So I looked at the failing tests in my draft PR, and tried to fix them. Unfortunately I found that one test seems to be failing due to an issue with resetting the tokenizer; I'll probably have to get @pablogsal or @lysnikolaou to help me with that.

Most of the failures were of a simpler nature, which we will need to mention in the PEP and docs: where previously an incorrect letter or letters immediately preceding a string literal (e.g. urf"blah") raised a SyntaxError, this will now (usually) raise a NameError. I guess if we had chosen to use backticks instead of generalizing f-strings (e.g., tag`foo{bar}`) this would not have been a problem.

gvanrossum commented 1 year ago

@lysnikolaou Any chance you can help me with getting this last test passing? See the commit in https://github.com/python/cpython/pull/103835: https://github.com/python/cpython/pull/103835/commits/e37d679a7970573052c468931c7e4ea96159b5d2

lysnikolaou commented 1 year ago

Sure I can help!

From the commit message of https://github.com/python/cpython/pull/103835/commits/e37d679a7970573052c468931c7e4ea96159b5d2:

It seems that if there is a syntax error involving mismatched braces in an f-string interpolation, the parser backtracks and the tokenizer produces TAGSTRING_START instead of FSTRING_START, and other state is not initialized properly.

I don't think that the parser backtracks and the tokenizer produces TAGSTRING_START instead of FSTRING_START. My understanding is that the conversion character s, together with the quote immediately following it, is recognized as TAGSTRING_START. The full list of tokens for one of the offending cases:

❯ ./python.exe
Python 3.12.0a7+ (heads/tag-strings-v2:e37d679a79, Apr 26 2023, 22:15:21) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize
>>> src = "f'{3!s'"
>>> for token in tokenize._generate_tokens_from_c_tokenizer(src):
...     print(token)
... 
TokenInfo(type=61 (FSTRING_START), string="f'", start=(1, 0), end=(1, 2), line="f'{3!s'\n")
TokenInfo(type=25 (LBRACE), string='{', start=(1, 2), end=(1, 3), line="f'{3!s'\n")
TokenInfo(type=2 (NUMBER), string='3', start=(1, 3), end=(1, 4), line="f'{3!s'\n")
TokenInfo(type=54 (EXCLAMATION), string='!', start=(1, 4), end=(1, 5), line="f'{3!s'\n")
TokenInfo(type=62 (TAGSTRING_START), string="s'", start=(1, 5), end=(1, 7), line="f'{3!s'\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lysnikolaou/repos/python/cpython/Lib/tokenize.py", line 688, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source):
  File "<string>", line 1
    f'{3!s'
         ^
SyntaxError: unterminated f-string literal (detected at line 1)

I think what we will need to do is emit NAME after an exclamation point in an f-string interpolation even if it's followed by a quote. I can work on that if you want.

By the way, congrats on everyone involved. This is indeed very exciting!

lysnikolaou commented 1 year ago

There's also the slightly more difficult case of something like f'{x'. The error message we should produce in this case is not clear, since both of the following are correct.

  1. Adding a } will fix the SyntaxError, since the f-string f'{x}' is valid.
  2. Closing the tag string and adding a } afterwards is also okay, since quote reuse is now allowed and, thus, the following is also valid: f'{x''}'

With the current implementation the syntax error message reads SyntaxError: unterminated f-string literal, while the previous message was f-string: expecting '}'. I would suggest keeping a modified version of the old message, but I'm not really sure about it.

gvanrossum commented 1 year ago

Thanks! You're likely right. Interestingly you can get the main branch to emit the unexpected error too:

>>> f'{3!f'
  File "<stdin>", line 1
    f'{3!f'
         ^^
SyntaxError: f-string: invalid conversion character
>>> 

(should be: "expecting '}'").

So your fix for this should be applied to main, not just to my branch. :-)

The same argument applies to the other problem case:

>>> f'{f'
  File "<stdin>", line 1
    f'{f'
       ^
SyntaxError: unterminated f-string literal (detected at line 1)
>>> 

And why is that treated differently than this?

>>> f'{b'
  File "<stdin>", line 1
    f'{b'
       ^
SyntaxError: f-string: expecting '}'
>>> 

Honestly I don't care that much about getting the right error in every case -- mismatched string quotes can lead to many different errors. (E.g. what if a user accidentally puts an unescaped quote inside a string.) But the tests insist on certain error messages, so something needs to be done.

lysnikolaou commented 1 year ago

I think that the error message on the main branch is okay, since f really is an invalid conversion character. It's a custom error implemented at the parser level, if I'm not mistaken. The one in the tag-strings branch (SyntaxError: unterminated f-string literal (detected at line 1)) is the one that I think we should change.

And why is that treated differently than this?

The difference is that string prefixes that were there before (u, b, r) go into the path that emits STRING tokens, while all the others go into code that emits FSTRING_START or TAGSTRING_START tokens. The fact that the error message is different doesn't really make much sense though, so I'll look into that as well.


There's another point that I'd like to bring up, just in case you haven't considered it for the PEP. I had a flash the other day about the time I tried to add a specific syntax error for invalid string prefixes. People started complaining because it was backwards-incompatible for cases like this:

def f():
    return'something'

or

'something' if expression else'something else'

Basically, all code that does not add spaces between a keyword and a string breaks. And that's true for this change as well.

On main:

❯ ./python.exe     
Python 3.12.0a7+ (heads/main:1d99e9e46e, May  3 2023, 13:07:40) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...     return'1'
... 
>>> f()
'1'
>>> 'something' if 1 else'something else'
'something'

On the tag-string branch:

❯ ./python.exe  
Python 3.12.0a7+ (heads/tag-strings-v2:e37d679a79, May  3 2023, 13:11:32) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def f():
...     return'1'
... 
>>> f()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in f
NameError: name 'return' is not defined
>>> 'something' if 1 else'something else'
  File "<stdin>", line 1
    'something' if 1 else'something else'
    ^^^^^^^^^^^^^^^^
SyntaxError: expected 'else' after 'if' expression

We ended up reverting that commit. There might be ways around this (one of which is transferring keyword handling in the tokenizer, which probably isn't a good idea) and I'd be up to do the work for it, even if it's just writing a section on the PEP, in case there's not one already.
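The breakage above comes down to where keywords are recognized: in released CPython the tokenizer emits a plain NAME for `return` and a separate STRING for `'1'`, and only the parser treats the NAME as a keyword. A tokenizer that fuses NAME-plus-quote into TAGSTRING_START changes this, which is why keyword'...' code breaks. The stock behavior can be observed directly:

```python
# In released CPython, return'1' tokenizes as NAME + STRING; keywords
# are only recognized later, at the parser level. A tag-string
# tokenizer that fuses a NAME and an immediately following quote into
# TAGSTRING_START would instead see else'...' as a tag string.
import io
import tokenize

src = "return'1'\n"
toks = [(tokenize.tok_name[t.type], t.string)
        for t in tokenize.generate_tokens(io.StringIO(src).readline)]
print(toks[:2])  # [('NAME', 'return'), ('STRING', "'1'")]
```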

gvanrossum commented 1 year ago

Yeah, there are endless variations on the theme of invalid conversion characters. Latest fun (on main):

>>> f'{3!r'''
... 
  File "<stdin>", line 1
    f'{3!r'''
         ^
SyntaxError: unterminated triple-quoted string literal (detected at line 1)
>>> 

(There's an invisible ^C at the ... prompt.)

Personally I think the lexer probably has to special-case the character after ! in interpolations.

About return'foo' and friends, the simplest solution would sure be to move reserved word recognition into the lexer. I personally think that's fine (we can find a way for pegen to write the list of reserved words gleaned from the grammar to a file that can be included by the lexer).

jimbaker commented 2 months ago

Following up on this issue, there are a few specific changes we need to make to clarify the PEP. Other details are with respect to the implementation and can be deferred for now.

  1. The PEP should make it clear that reserved words, per https://docs.python.org/3/reference/lexical_analysis.html#keywords, are not allowed tag names, including in dotted names. @pauleveritt let's add this restriction to https://github.com/jimbaker/tagstr/blob/main/pep.rst#valid-tag-names.

  2. We need to ensure good testing around reserved words, including producing the same errors. For example, this is seen in the recent rebasing by @lysnikolaou (https://github.com/lysnikolaou/cpython/tree/tag-strings-rebased):

>>> def f():
...     if 0:
...         pass
...     else'hi, {name}!'
...
>>> f()
Traceback (most recent call last):
  File "<python-input-22>", line 1, in <module>
    f()
    ~^^
  File "<python-input-21>", line 4, in f
    else'hi, {name}!'
    ^^^^^
NameError: name 'else' is not defined. Did you mean: 'False'?

whereas current CPython main behaves as expected:

>>> def f():
...     if 0:
...         pass
...     else'hi, {name}!'
  File "<unknown>", line 4
    else'hi, {name}!'
        ^^^^^^^^^^^^^
SyntaxError: expected ':'

The straightforward solution proposed in https://github.com/jimbaker/tagstr/issues/22#issuecomment-1539400754 is probably what we need here.

  3. While a parser for the tag function could handle split strings like the following, as seen in the current implementation:
    >>> def tag(*args): return args
    ...
    >>> tag"\N{{GRINNING FACE}}"
    ('\\N{', 'GRINNING FACE}')

    from a PEP perspective, we should follow what the user has written: exactly one string (specifically Decoded) is provided for each static string in the source tag string. So Decoded.raw would actually be

    '\\N{GRINNING FACE}'

    and this cooks to

    '\\N{GRINNING FACE}'.encode('utf-8').decode('unicode-escape')
    'πŸ˜€'

@pauleveritt let's add a subsection to https://github.com/jimbaker/tagstr/blob/main/pep.rst#tag-function-arguments that states this - it feels very similar to what we already state about "No Empty Decoded String".

  4. The PEP is already clear with respect to ...!x, where x is not a valid conversion - "Note that as with f-strings, no other conversions are supported."
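The "one Decoded string per static source string" rule above can be sketched as a small str subclass; the class name and .raw attribute follow the draft PEP, but the implementation details here are purely illustrative:

```python
# Illustrative sketch of the Decoded string described above: the
# cooked text is the str value itself, with the raw source text kept
# on a .raw attribute. Names follow the draft PEP; the eager cooking
# in __new__ is an assumption for this sketch.
class Decoded(str):
    raw: str

    def __new__(cls, raw: str):
        cooked = raw.encode('utf-8').decode('unicode-escape')
        self = super().__new__(cls, cooked)
        self.raw = raw
        return self

d = Decoded(r'...\N{GRINNING FACE}...')
print(str(d))  # ...πŸ˜€...
print(d.raw)   # the raw source text, backslash intact
```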

pauleveritt commented 2 months ago

Just sent @jimbaker a PR for review on this.