gristlabs / asttokens

Annotate Python AST trees with source text and token information
Apache License 2.0

Versions of get_text and get_text_range that don't need mark_tokens #92

Closed alexmojaki closed 2 years ago

alexmojaki commented 2 years ago

As mentioned in https://github.com/ipython/ipython/issues/13731#issuecomment-1227272665, it's worth having versions of get_text and get_text_range that don't need mark_tokens, although I think keeping the token functionality here is still best, even if tokens aren't usually used.

For now I've just added two methods to ASTTokens so that they can be tested and compared to the original methods. Here's a proposal for an actual API:

alexmojaki commented 2 years ago

At this stage I mostly wanted to get feedback on the general plan and proposed API. Quite a bit more code needs to be written to fulfill what I've written above. Are you happy with that plan? Since you've accepted this PR, does that mean I should merge this and then write the rest in another PR?

alexmojaki commented 2 years ago

The primary motivation for me is to avoid the performance cost of tokenizing an entire file and then matching those tokens with nodes. I don't generally use information relating to tokens; I usually only care about AST nodes and their locations. Solving the problem of f-strings would be a nice bonus.
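To illustrate the token-free approach being discussed: on Python 3.8+, AST nodes carry `lineno`/`col_offset`/`end_lineno`/`end_col_offset`, so node text can be sliced directly from the source without tokenizing. This is just a minimal sketch of the idea, not the PR's actual implementation; `get_text_by_pos` is a hypothetical helper name.

```python
import ast

source = "result = compute(a, b) + 1\n"
tree = ast.parse(source)

def get_text_by_pos(source, node):
    # Slice the source using the node's position attributes (Python 3.8+),
    # avoiding tokenization entirely.
    lines = source.splitlines(keepends=True)
    if node.lineno == node.end_lineno:
        return lines[node.lineno - 1][node.col_offset:node.end_col_offset]
    parts = [lines[node.lineno - 1][node.col_offset:]]
    parts += lines[node.lineno:node.end_lineno - 1]
    parts.append(lines[node.end_lineno - 1][:node.end_col_offset])
    return "".join(parts)

call = tree.body[0].value.left  # the compute(a, b) Call node
print(get_text_by_pos(source, call))  # -> "compute(a, b)"
```

The catch, and part of what this PR is about, is that before Python 3.12 the positions reported for nodes inside f-strings are unreliable, which is why f-strings come up below.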

If people want to work with tokens within f-strings, then I think that requires obtaining the source code of the ast.FormattedValue and tokenizing that source in isolation. An easy way to get that source code is with get_text_unmarked. Otherwise you need some workaround like https://github.com/gristlabs/asttokens/issues/6#issuecomment-504760714.
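Tokenizing an isolated snippet is straightforward with the stdlib `tokenize` module; the snippet below assumes `inner_src` is the expression text that something like get_text_unmarked would return for a FormattedValue. Note that the resulting positions are relative to the snippet, not to the original file, which is exactly the mismatch discussed next.

```python
import io
import tokenize

# Hypothetical input: the source of a FormattedValue's expression,
# extracted from an f-string like f"{x + 1}".
inner_src = "x + 1"

# Tokenize the snippet in isolation; positions start at line 1, column 0
# of the snippet, not of the enclosing file.
tokens = list(tokenize.generate_tokens(io.StringIO(inner_src).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```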

Once you have that inner set of tokens, you still have to deal with the fact that they're somewhat separate from the original outer tokens. Do you merge everything into one big set? Do you discard the 'official' Python token corresponding to the entire f-string? Do you invent new tokens for the curly brackets and the intervening constant parts of the string? Do you adjust each Token.index? Or if you keep the sets of tokens separate, do you define new semantics for next_token and prev_token?
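If one went the "merge everything into one big set" route, the inner tokens would at least need their positions shifted from snippet coordinates to file coordinates. A rough sketch of that re-basing step (assuming a hypothetical `rebase_tokens` helper; 1-based rows as in `tokenize`):

```python
import io
import tokenize

def rebase_tokens(inner_src, base_row, base_col):
    # Shift snippet-relative token positions so they refer to the
    # enclosing file. base_row is the 1-based file line where the
    # snippet starts, base_col its 0-based column on that line.
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(inner_src).readline):
        (srow, scol), (erow, ecol) = tok.start, tok.end
        # Columns only shift on the snippet's first line; later lines
        # already start at column 0 of their own file lines.
        new_start = (srow + base_row - 1, scol + base_col if srow == 1 else scol)
        new_end = (erow + base_row - 1, ecol + base_col if erow == 1 else ecol)
        out.append(tok._replace(start=new_start, end=new_end))
    return out
```

Even with positions fixed up, this says nothing about Token.index, the displaced f-string token, or next_token/prev_token semantics, so it only addresses the easiest part of the problem.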