ASTText class that doesn't require tokens

alexmojaki commented 2 years ago

Following up on https://github.com/gristlabs/asttokens/pull/92.

Closes #97.

Adds a new class ASTText that:
- Allows using get_text[_range] while avoiding the performance cost of dealing with tokens for most node types and newer Python versions.
- Supports nodes inside f-strings for newer Python versions.
- Lazily initialises tree if one isn't passed, so parse=True isn't needed.
Adds a new method get_text_positions to both ASTText and ASTTokens which returns start and end (lineno, col_offset) pairs, as an alternative to node.first/last_token.
Adds a boolean argument padded to all 3 get_text* methods on both classes which matches the behaviour of the same argument in ast.get_source_segment i.e. includes leading indentation for multiline statements. However the default is True (instead of False as in ast.get_source_segment) in the existing methods get_text[_range] for backwards compatibility. For the new get_text_positions there is no default to force the user to stop and think. It can't be False because that would be confusing compared to the existing methods, but False matches the behaviour of using node.first/last_token which the method is meant to replace.

PeterJCLaw commented 2 years ago

Just a thought here -- would using something like https://pypi.org/project/cached-property/ make this easier given the goal of lazy evaluation? That would sidestep the need for various places to check whether the tokens have been initialised (I'm imagining that consumers would just access self.tokens or self.token_offsets which would facade out the lazy-parsing side of things) and make the type checkers happier too.

Not sure how big an issue it would be to take the extra dependency for Python < 3.8 though? (functools learns cached_property in 3.8)

If that approach does work, given the conversation at https://github.com/gristlabs/asttokens/issues/94#issuecomment-1252862533 would it make sense to make the AST similarly lazy (rather than potentially None)?

alexmojaki commented 2 years ago

The tokens property isn't lazily initialising _tokens, it's asserting it for the sake of type checkers. Using .tokens when they're not initialised is meant to throw an error.

PeterJCLaw commented 2 years ago

Ah, ok. I think I'd misunderstood from the discussion on that issue.

Given that goal I would repeat the encouragement I made on that issue -- of exploring having variations of ASTTokens for the no-tokens and no-tree cases so that the default case requires both. This would greatly simplify the type checking and in turn make consumer code clearer. I'd be happy to put together a sketch of that if you're open to that approach.

alexmojaki commented 2 years ago

Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is atok = ASTTokens(source, parse=True) and I don't see much of a problem that needs improving.

PeterJCLaw commented 2 years ago

Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is atok = ASTTokens(source, parse=True) and I don't see much of a problem that needs improving.

Sure, though I'll take it to a separate issues to avoid taking this PR too far off topic.

PeterJCLaw commented 2 years ago

Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is atok = ASTTokens(source, parse=True) and I don't see much of a problem that needs improving.

Sure, though I'll take it to a separate issues to avoid taking this PR too far off topic.

https://github.com/gristlabs/asttokens/issues/97

alexmojaki commented 2 years ago

@dsagal the names I chose initially are not great. What do you think of replacing unmarked and init_tokens with use_tokens? So the constructor and get_text methods would both have use_tokens=True by default, and the method .init_tokens() would be .use_tokens(). supports_unmarked could be supports_no_tokens.

alexmojaki commented 2 years ago

Here's a comparison to just before phase 1 (#92) to show the whole picture which can be helpful: https://github.com/gristlabs/asttokens/compare/2.0.8...unmarked2

alexmojaki commented 2 years ago

Found a problem while testing this in futurecoder. I had some code that relied on the fact that .get_text(node) returning something non-empty means that node.first_token exists. The new way of supporting f-strings means that this implication no longer holds, even without setting init_tokens=False. This led to an AttributeError. I don't see anything we can do about that here, so that's a small break in compatibility that probably means a new major version is needed.

alexmojaki commented 2 years ago

Two more f-string problems:

.get_text(node) suddenly working can just break things in other ways. Previously I was excluding nodes from a particular set when get_text didn't work. f-string contents suddenly appearing in the set was a problem. In particular, I was using asttokens.util.replace to modify some source code, and it assumes non-overlapping ranges, but now the inner node ranges were overlapping with the full f-string range. The fact that I was using an asttokens.util function here is essentially a coincidence, things could just break in non-asttokens related ways. In theory I suppose there could be another boolean arg for compatibility to continue ignoring f-string nodes, but that seems a bit extreme.
The direct children of f-strings (but not the regular expression nodes within) have broken positions and need further special handling.

PeterJCLaw commented 2 years ago

While rebasing #98 I hit a test failure which I think is present in this PR too -- the f-string handling in atok.get_text implicitly relies on mark_tokens having run before it's called. This seems to happen in the example source because of the function definition there, however commenting that out (the def foo on line 16--17 in tests/test_unmarked.py) causes test_unmarked to fail on the f-string handling.

The specific error comes from the assertion that atok.get_text for f-strings return empty text: self.assertEqual(atok_text, "", node).

I'm actually not sure why the test is asserting that that value is empty given that the text which would otherwise be returned appears to be correct to me (it matches what python's AST returns, which seems fine; removing the conditional block gets the tests passing for #98).

alexmojaki commented 2 years ago

I've explained the reason for the conditional block in https://github.com/gristlabs/asttokens/pull/98#discussion_r985298862. However you're right about this:

the f-string handling in atok.get_text implicitly relies on mark_tokens having run before it's called

Initially this was intentional and fine because the only f-string handling was _in_f_string (now called _dont_use_tokens) saying that even though we have tokens you still shouldn't use them.

However it was a mistake to also add _broken_positions in the same place, because indeed this means that marking tokens changes the behaviour of get_text. The tests actually do reflect this now but I forgot about this point.

So another solution is needed if we want the 'broken' inner nodes of f-strings to return empty text instead of the whole f-string. There's 3 types of these nodes:

FormattedValue: easy, this type seems to always be broken.
Str/Constant: quite easy if we just notice that the returned text starts with f or rf (or F etc.) instead of " or '.
JoinedStr: this is for the format spec. This is a problem, since the returned text is the full f-string which is the correct type. So I think we either have to:
- Iterate through the whole tree on initialisation to mark nodes. Not great since until now we managed to avoid paying any similar upfront cost.
- Parse the returned text and see if it matches the node we have. Makes each get_text call potentially significantly more expensive, but only for JoinedStr nodes.

dsagal commented 2 years ago

@dsagal the names I chose initially are not great. What do you think of replacing unmarked and init_tokens with use_tokens? So the constructor and get_text methods would both have use_tokens=True by default, and the method .init_tokens() would be .use_tokens(). supports_unmarked could be supports_no_tokens.

Sure! I haven't followed the full discussion on this or the related issues. I'd be fine with different names for params/methods. Having separate classes for different semantics seems reasonable too, especially if that helps with typings.

alexmojaki commented 2 years ago

@PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here, and a summary of the benefits?

alexmojaki commented 2 years ago

@dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.

dsagal commented 2 years ago

@dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.

It seems to me like f-strings are an area similar to where the AST itself was when asttokens was created: the built-in functionality has bugs that make it insufficient, and to work around them, asttokens needs to do some manual parsing (searching for curly braces, :, !). Sorry, I don't have anything more helpful to suggest...

alexmojaki commented 2 years ago

Well since you noticed that arbitrary types of expression nodes inside f-strings can have wrong positions, I think that settles the fact that walking through the tree to mark various nodes is necessary.

PeterJCLaw commented 2 years ago

@PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here,

I think the tests in #98 cover this? Essentially construction is similar to that of plain ASTTokens, however usage of the lazy class is limited to the functionality that doesn't "obviously" need tokens.

I'm not really expecting consumers to use the LazyHelper.asttokens property, though they could if they wanted (and would get what they expected in return).

and a summary of the benefits?

The main benefits as I see them are:

ASTTokens and LazyHelper have useful static analysis (which the consequences of ASTTokens(init_tokens=False) does not), allowing both type checkers and editor completions to be essentially correct without additional effort (including any changes necessary due to opting in to the functionality in this PR #93)
The semantics of what is possible with each class are consequently clearer as they're static (indeed this was how the f-strings bug was highlighted), resulting in clearer behaviour for both consumers and maintainers of this package (and in turn hopefully easier maintenance)
The semantics of _all_uses of LazyHelper are much more obvious in terms of their impact on node.{first,last}_token -- specifically that consumers should not use those added attributes on the grounds that they might not be present (but that's what consumers are opting in to by using this class)
(LazyHelper also happens to allow for the case of lazily evaluating the AST too, meaning other use-cases can choose to benefit from laziness without this functionality further complicating the ASTTokens initialiser)
Consumers that really want to access the additional ASTTokens functionality can still do so (and the point at which the lazy-loading happens will be obvious to them as a result of their needing to access an indirect property to do so)
Easier maintenance as logically separate code is in separate places (including e.g: the TextRangeBase class) and in particular the optimising/special-casing being added for the lazy case are limited to one specific place in the code (the LazyHelper class)

The potential drawbacks I can see are:

To opt in to the new behaviour consumers need to use a new class:
- potentially needing more changes than just the initialiser argument (minor IMO given that they'd be changing code anyway)
- counterpoint: most additional changes they need to make will be highlighted to them by a type checker (if they're using one)
Some mental overhead during maintenance from having the logic split between multiple classes

alexmojaki commented 2 years ago

With @PeterJCLaw's changes:

The ASTTokens class only uses tokens and doesn't use the new method at all.
This means that it can't handle f-strings at all again.
The previously mentioned compatibility problems are gone.
The special attribute _dont_use_tokens (from earlier in this PR) is no longer needed and has been removed.

gristlabs / asttokens

ASTText class that doesn't require tokens #93