alexmojaki closed this 2 years ago
At this stage I mostly wanted to get feedback on the general plan and proposed API. Quite a bit more code needs to be written to fulfill what I've written above. Are you happy with that plan? Since you've accepted this PR, does that mean I should merge this and then write the rest in another PR?
The primary motivation for me is to avoid the performance cost of tokenizing an entire file and then matching those tokens with nodes. I don't generally use information relating to tokens, I usually only care about AST nodes and their locations. Solving the problem of f-strings would be a nice bonus.
If people want to work with tokens within f-strings, then I think that requires obtaining the source code of the `ast.FormattedValue` and tokenizing that source in isolation. An easy way to get that source code is with `get_text_unmarked`. Otherwise you need some workaround like https://github.com/gristlabs/asttokens/issues/6#issuecomment-504760714.
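To make that two-step idea concrete, here is a stdlib-only sketch of the second step. `get_text_unmarked` is the method proposed in this PR, so the inner source is hardcoded here as an assumption standing in for its result:

```python
import io
import tokenize

# Suppose get_text_unmarked(formatted_value_node) returned "x + 1"
# for the FormattedValue inside f"{x + 1}". We hardcode that here.
inner_source = "x + 1"

# Tokenize just that snippet, in isolation from the outer file.
tokens = list(tokenize.generate_tokens(io.StringIO(inner_source).readline))
print([tok.string for tok in tokens[:3]])  # ['x', '+', '1']
```

These tokens are self-consistent but know nothing about the outer file, which is exactly the merging problem described next.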
Once you have that inner set of tokens, you still have to deal with the fact that they're somewhat separate from the original outer tokens. Do you merge everything into one big set? Do you discard the 'official' Python token corresponding to the entire f-string? Do you invent new tokens for the curly brackets and the intervening constant parts of the string? Do you adjust each `Token.index`? Or if you keep the sets of tokens separate, do you define new semantics for `next_token` and `prev_token`?
As mentioned in https://github.com/ipython/ipython/issues/13731#issuecomment-1227272665, I think keeping the token functionality here is still best, even if tokens aren't usually used.
For now I've just added two methods to `ASTTokens` so that they can be tested and compared to the original methods. Here's a proposal for an actual API:

- Add a boolean parameter `mark` (or `mark_tokens`) to the `ASTTokens` constructor.
- If `mark=False`, then don't call `self.mark_tokens`.
- `mark=True` by default for backwards compatibility. Personally I've often used the `first_token` and `last_token` attributes added to AST nodes.
- `mark=False` is ignored for Python versions older than 3.8. It's still allowed so that callers don't have to check the Python version themselves. Same for PyPy, which apparently doesn't have correct position info either, even in 3.8.
- `mark=False` is an error for an astroid tree. Even if it's easy to get it to work, I'm personally not interested in maintaining that.
- The `get_text` and `get_text_range` methods should automatically pick the correct implementation based on whether the nodes have been marked with tokens. Then code that only uses these methods only needs to add `mark=False` to the constructor.
- For node types whose position info is known to be unreliable, call `mark_tokens` first if necessary. Currently, this means `arg` and `Starred` nodes for Python 3.8, i.e. there's no problem for 3.9+. Those node types only seem to fail under specific conditions which I think are known, but for safety we can just check the type.
- Whatever `ast.get_source_segment` does in these cases, the new implementation should do too, hopefully transparently.
- There should be equivalents of `node.first_token.start` and `node.last_token.end`, i.e. they should return (lineno, col_offset) pairs instead of string indices. These will usually be the same as the standard `ast` values but not always.
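To make those position semantics concrete: the unmarked implementation would lean on the compiler-provided (lineno, col_offset) pairs, much as `ast.get_source_segment` already does. A stdlib-only sketch (independent of asttokens, so none of the proposed parameters appear here):

```python
import ast

source = "x = (1 +\n     2)"
tree = ast.parse(source)
value = tree.body[0].value  # the BinOp inside the parentheses

# On CPython 3.8+ every expression node carries start and end positions
# as (lineno, col_offset) pairs: lineno is 1-based, col_offset 0-based.
start = (value.lineno, value.col_offset)
end = (value.end_lineno, value.end_col_offset)

# ast.get_source_segment recovers the text purely from those positions.
segment = ast.get_source_segment(source, value)
print(start, end)     # (1, 5) (2, 6)
print(repr(segment))  # '1 +\n     2'
```

Note that the surrounding parentheses fall outside the node's reported range here, one example of how these positions may differ from the token-based `first_token`/`last_token` attributes.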