gristlabs / asttokens

Annotate Python AST trees with source text and token information
Apache License 2.0
172 stars 34 forks source link

ASTText class that doesn't require tokens #93

Closed alexmojaki closed 2 years ago

alexmojaki commented 2 years ago

Following up on https://github.com/gristlabs/asttokens/pull/92.

Closes #97.

PeterJCLaw commented 2 years ago

Just a thought here -- would using something like https://pypi.org/project/cached-property/ make this easier given the goal of lazy evaluation? That would sidestep the need for various places to check whether the tokens have been initialised (I'm imagining that consumers would just access self.tokens or self.token_offsets which would facade out the lazy-parsing side of things) and make the type checkers happier too.

Not sure how big an issue it would be to take the extra dependency for Python < 3.8 though? (functools learns cached_property in 3.8)

If that approach does work, given the conversation at https://github.com/gristlabs/asttokens/issues/94#issuecomment-1252862533 would it make sense to make the AST similarly lazy (rather than potentially None)?

alexmojaki commented 2 years ago

The tokens property isn't lazily initialising _tokens, it's asserting it for the sake of type checkers. Using .tokens when they're not initialised is meant to throw an error.

PeterJCLaw commented 2 years ago

Ah, ok. I think I'd misunderstood from the discussion on that issue.

Given that goal I would repeat the encouragement I made on that issue -- of exploring having variations of ASTTokens for the no-tokens and no-tree cases so that the default case requires both. This would greatly simplify the type checking and in turn make consumer code clearer. I'd be happy to put together a sketch of that if you're open to that approach.

alexmojaki commented 2 years ago

Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is atok = ASTTokens(source, parse=True) and I don't see much of a problem that needs improving.

PeterJCLaw commented 2 years ago

Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is atok = ASTTokens(source, parse=True) and I don't see much of a problem that needs improving.

Sure, though I'll take it to a separate issues to avoid taking this PR too far off topic.

PeterJCLaw commented 2 years ago

Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is atok = ASTTokens(source, parse=True) and I don't see much of a problem that needs improving.

Sure, though I'll take it to a separate issues to avoid taking this PR too far off topic.

https://github.com/gristlabs/asttokens/issues/97

alexmojaki commented 2 years ago

@dsagal the names I chose initially are not great. What do you think of replacing unmarked and init_tokens with use_tokens? So the constructor and get_text methods would both have use_tokens=True by default, and the method .init_tokens() would be .use_tokens(). supports_unmarked could be supports_no_tokens.

alexmojaki commented 2 years ago

Here's a comparison to just before phase 1 (#92) to show the whole picture which can be helpful: https://github.com/gristlabs/asttokens/compare/2.0.8...unmarked2

alexmojaki commented 2 years ago

Found a problem while testing this in futurecoder. I had some code that relied on the fact that .get_text(node) returning something non-empty means that node.first_token exists. The new way of supporting f-strings means that this implication no longer holds, even without setting init_tokens=False. This led to an AttributeError. I don't see anything we can do about that here, so that's a small break in compatibility that probably means a new major version is needed.

alexmojaki commented 2 years ago

Two more f-string problems:

  1. .get_text(node) suddenly working can just break things in other ways. Previously I was excluding nodes from a particular set when get_text didn't work. f-string contents suddenly appearing in the set was a problem. In particular, I was using asttokens.util.replace to modify some source code, and it assumes non-overlapping ranges, but now the inner node ranges were overlapping with the full f-string range. The fact that I was using an asttokens.util function here is essentially a coincidence, things could just break in non-asttokens related ways. In theory I suppose there could be another boolean arg for compatibility to continue ignoring f-string nodes, but that seems a bit extreme.
  2. The direct children of f-strings (but not the regular expression nodes within) have broken positions and need further special handling.
PeterJCLaw commented 2 years ago

While rebasing #98 I hit a test failure which I think is present in this PR too -- the f-string handling in atok.get_text implicitly relies on mark_tokens having run before it's called. This seems to happen in the example source because of the function definition there, however commenting that out (the def foo on line 16--17 in tests/test_unmarked.py) causes test_unmarked to fail on the f-string handling.

The specific error comes from the assertion that atok.get_text for f-strings return empty text: self.assertEqual(atok_text, "", node).

I'm actually not sure why the test is asserting that that value is empty given that the text which would otherwise be returned appears to be correct to me (it matches what python's AST returns, which seems fine; removing the conditional block gets the tests passing for #98).

alexmojaki commented 2 years ago

I've explained the reason for the conditional block in https://github.com/gristlabs/asttokens/pull/98#discussion_r985298862. However you're right about this:

the f-string handling in atok.get_text implicitly relies on mark_tokens having run before it's called

Initially this was intentional and fine because the only f-string handling was _in_f_string (now called _dont_use_tokens) saying that even though we have tokens you still shouldn't use them.

However it was a mistake to also add _broken_positions in the same place, because indeed this means that marking tokens changes the behaviour of get_text. The tests actually do reflect this now but I forgot about this point.

So another solution is needed if we want the 'broken' inner nodes of f-strings to return empty text instead of the whole f-string. There's 3 types of these nodes:

dsagal commented 2 years ago

@dsagal the names I chose initially are not great. What do you think of replacing unmarked and init_tokens with use_tokens? So the constructor and get_text methods would both have use_tokens=True by default, and the method .init_tokens() would be .use_tokens(). supports_unmarked could be supports_no_tokens.

Sure! I haven't followed the full discussion on this or the related issues. I'd be fine with different names for params/methods. Having separate classes for different semantics seems reasonable too, especially if that helps with typings.

alexmojaki commented 2 years ago

@PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here, and a summary of the benefits?

alexmojaki commented 2 years ago

@dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.

dsagal commented 2 years ago

@dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.

It seems to me like f-strings are an area similar to where the AST itself was when asttokens was created: the built-in functionality has bugs that make it insufficient, and to work around them, asttokens needs to do some manual parsing (searching for curly braces, :, !). Sorry, I don't have anything more helpful to suggest...

alexmojaki commented 2 years ago

Well since you noticed that arbitrary types of expression nodes inside f-strings can have wrong positions, I think that settles the fact that walking through the tree to mark various nodes is necessary.

PeterJCLaw commented 2 years ago

@PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here,

I think the tests in #98 cover this? Essentially construction is similar to that of plain ASTTokens, however usage of the lazy class is limited to the functionality that doesn't "obviously" need tokens.

I'm not really expecting consumers to use the LazyHelper.asttokens property, though they could if they wanted (and would get what they expected in return).

and a summary of the benefits?

The main benefits as I see them are:

The potential drawbacks I can see are:

alexmojaki commented 2 years ago

With @PeterJCLaw's changes: