gristlabs / asttokens

Annotate Python AST trees with source text and token information
Apache License 2.0

Versions of get_text and get_text_range that don't need mark_tokens #92

Closed alexmojaki closed 2 years ago

alexmojaki commented 2 years ago

As mentioned in https://github.com/ipython/ipython/issues/13731#issuecomment-1227272665, it's worth having versions of get_text and get_text_range that don't need mark_tokens, although I think keeping the token functionality here is still best, even if tokens aren't usually used.

For now I've just added two methods to ASTTokens so that they can be tested and compared to the original methods. Here's a proposal for an actual API:

alexmojaki commented 2 years ago

At this stage I mostly wanted to get feedback on the general plan and proposed API. Quite a bit more code needs to be written to fulfill what I've written above. Are you happy with that plan? Since you've accepted this PR, does that mean I should merge this and then write the rest in another PR?

alexmojaki commented 2 years ago

The primary motivation for me is to avoid the performance cost of tokenizing an entire file and then matching those tokens with nodes. I don't generally use information relating to tokens; I usually only care about AST nodes and their locations. Solving the problem of f-strings would be a nice bonus.
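To illustrate the token-free approach being discussed: on Python 3.8+, AST nodes carry `lineno`/`col_offset`/`end_lineno`/`end_col_offset`, so node text can be sliced directly from the source without tokenizing. This is just a minimal sketch of the idea, not the PR's actual implementation; `get_text_by_pos` is a hypothetical helper name.

```python
import ast

source = "result = compute(a, b) + 1\n"
tree = ast.parse(source)

def get_text_by_pos(source, node):
    # Slice the source using the node's position attributes (Python 3.8+),
    # avoiding tokenization entirely.
    lines = source.splitlines(keepends=True)
    if node.lineno == node.end_lineno:
        return lines[node.lineno - 1][node.col_offset:node.end_col_offset]
    parts = [lines[node.lineno - 1][node.col_offset:]]
    parts += lines[node.lineno:node.end_lineno - 1]
    parts.append(lines[node.end_lineno - 1][:node.end_col_offset])
    return "".join(parts)

call = tree.body[0].value.left  # the compute(a, b) Call node
print(get_text_by_pos(source, call))  # -> "compute(a, b)"
```

The catch, and part of what this PR is about, is that before Python 3.12 the positions reported for nodes inside f-strings are unreliable, which is why f-strings come up below.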

If people want to work with tokens within f-strings, then I think that requires obtaining the source code of the ast.FormattedValue and tokenizing that source in isolation. An easy way to get that source code is with get_text_unmarked. Otherwise you need some workaround like https://github.com/gristlabs/asttokens/issues/6#issuecomment-504760714.
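Tokenizing an isolated snippet is straightforward with the stdlib `tokenize` module; the snippet below assumes `inner_src` is the expression text that something like get_text_unmarked would return for a FormattedValue. Note that the resulting positions are relative to the snippet, not to the original file, which is exactly the mismatch discussed next.

```python
import io
import tokenize

# Hypothetical input: the source of a FormattedValue's expression,
# extracted from an f-string like f"{x + 1}".
inner_src = "x + 1"

# Tokenize the snippet in isolation; positions start at line 1, column 0
# of the snippet, not of the enclosing file.
tokens = list(tokenize.generate_tokens(io.StringIO(inner_src).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```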

Once you have that inner set of tokens, you still have to deal with the fact that they're somewhat separate from the original outer tokens. Do you merge everything into one big set? Do you discard the 'official' Python token corresponding to the entire f-string? Do you invent new tokens for the curly brackets and the intervening constant parts of the string? Do you adjust each Token.index? Or if you keep the sets of tokens separate, do you define new semantics for next_token and prev_token?
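If one went the "merge everything into one big set" route, the inner tokens would at least need their positions shifted from snippet coordinates to file coordinates. A rough sketch of that re-basing step (assuming a hypothetical `rebase_tokens` helper; 1-based rows as in `tokenize`):

```python
import io
import tokenize

def rebase_tokens(inner_src, base_row, base_col):
    # Shift snippet-relative token positions so they refer to the
    # enclosing file. base_row is the 1-based file line where the
    # snippet starts, base_col its 0-based column on that line.
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(inner_src).readline):
        (srow, scol), (erow, ecol) = tok.start, tok.end
        # Columns only shift on the snippet's first line; later lines
        # already start at column 0 of their own file lines.
        new_start = (srow + base_row - 1, scol + base_col if srow == 1 else scol)
        new_end = (erow + base_row - 1, ecol + base_col if erow == 1 else ecol)
        out.append(tok._replace(start=new_start, end=new_end))
    return out
```

Even with positions fixed up, this says nothing about Token.index, the displaced f-string token, or next_token/prev_token semantics, so it only addresses the easiest part of the problem.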