Closed alexmojaki closed 2 years ago
Just a thought here -- would using something like https://pypi.org/project/cached-property/ make this easier, given the goal of lazy evaluation? That would sidestep the need for various places to check whether the tokens have been initialised (I'm imagining that consumers would just access `self.tokens` or `self.token_offsets`, which would facade out the lazy-parsing side of things) and make the type checkers happier too.

Not sure how big an issue it would be to take the extra dependency for Python < 3.8 though? (`functools` gains `cached_property` in 3.8.)
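A minimal sketch of the idea being suggested (the class and source here are illustrative, not asttokens' actual implementation): the tokens are computed on first access and cached, so consumers never check for initialisation.

```python
import io
import tokenize
from functools import cached_property
from typing import List

class LazyTokens:
    """Sketch only: not asttokens' real class. Tokens are computed on
    first attribute access and cached thereafter."""

    def __init__(self, source: str) -> None:
        self._source = source

    @cached_property
    def tokens(self) -> List[tokenize.TokenInfo]:
        # Runs once; later accesses return the cached list, so callers
        # never need to check whether tokens were initialised.
        return list(tokenize.generate_tokens(io.StringIO(self._source).readline))

obj = LazyTokens("x = 1\n")
first = obj.tokens
```

Since `cached_property` stores the result on the instance, repeated accesses return the very same list object.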
If that approach does work, given the conversation at https://github.com/gristlabs/asttokens/issues/94#issuecomment-1252862533, would it make sense to make the AST similarly lazy (rather than potentially `None`)?
The `tokens` property isn't lazily initialising `_tokens`; it's asserting it for the sake of type checkers. Using `.tokens` when the tokens aren't initialised is meant to throw an error.
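The pattern being described -- a property that narrows an `Optional` for type checkers rather than lazily initialising -- looks roughly like this (a sketch, not asttokens' actual code):

```python
from typing import List, Optional

class Example:
    def __init__(self) -> None:
        # Populated later by an explicit initialisation step, if at all.
        self._tokens: Optional[List[str]] = None

    @property
    def tokens(self) -> List[str]:
        # Not lazy initialisation: the assert narrows Optional[List[str]] to
        # List[str] for type checkers, and accessing .tokens before
        # initialisation fails loudly instead of silently computing tokens.
        assert self._tokens is not None, "tokens are not initialised"
        return self._tokens

obj = Example()
failed = False
try:
    obj.tokens
except AssertionError:
    failed = True

obj._tokens = ["x", "=", "1"]
```

The assert is the whole point: accessing the property too early is a bug, not a trigger for lazy computation.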
Ah, ok. I think I'd misunderstood from the discussion on that issue.
Given that goal, I would repeat the encouragement I made on that issue: explore having variations of `ASTTokens` for the no-tokens and no-tree cases, so that the default case requires both. This would greatly simplify the type checking and in turn make consumer code clearer. I'd be happy to put together a sketch of that if you're open to that approach.
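For illustration only -- the class names below are made up here and are not the actual proposal from #98 -- the split might look something like this, with the default class guaranteeing both a tree and tokens so that no `Optional` checks are needed:

```python
import ast
import io
import tokenize
from typing import List

class TreeOnly:
    """Illustrative: has a tree but no tokens; would expose only
    token-free functionality."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.tree: ast.Module = ast.parse(source)

class TreeAndTokens(TreeOnly):
    """Illustrative default case: both tree and tokens always exist,
    so neither type checkers nor consumers need None-checks."""

    def __init__(self, source: str) -> None:
        super().__init__(source)
        self.tokens: List[tokenize.TokenInfo] = list(
            tokenize.generate_tokens(io.StringIO(source).readline)
        )

both = TreeAndTokens("x = 1\n")
```

The point of the split is that the type of the object, not a runtime flag, tells you which functionality is available.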
Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is `atok = ASTTokens(source, parse=True)` and I don't see much of a problem that needs improving.
> Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is `atok = ASTTokens(source, parse=True)` and I don't see much of a problem that needs improving.

Sure, though I'll take it to a separate issue to avoid taking this PR too far off topic.
@dsagal the names I chose initially are not great. What do you think of replacing `unmarked` and `init_tokens` with `use_tokens`? So the constructor and get_text methods would both have `use_tokens=True` by default, and the method `.init_tokens()` would be `.use_tokens()`. `supports_unmarked` could be `supports_no_tokens`.
Here's a comparison to just before phase 1 (#92) to show the whole picture which can be helpful: https://github.com/gristlabs/asttokens/compare/2.0.8...unmarked2
Found a problem while testing this in futurecoder. I had some code that relied on the fact that `.get_text(node)` returning something non-empty means that `node.first_token` exists. The new way of supporting f-strings means that this implication no longer holds, even without setting `init_tokens=False`. This led to an `AttributeError`. I don't see anything we can do about that here, so that's a small break in compatibility that probably means a new major version is needed.
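The broken assumption can be illustrated with the stdlib alone (the `get_text` call itself is elided; this only shows why the attribute access fails): inner f-string nodes can now yield text without ever having been token-marked, so `first_token` may simply not be there.

```python
import ast

# Hypothetical consumer pattern: the code assumed that a non-empty
# result from atok.get_text(node) implied that node.first_token had
# been attached during token marking. An inner node of an f-string is
# a perfectly ordinary AST node, but nothing ever set first_token on it:
node = ast.parse("f'{x + 1}'", mode="eval").body.values[0].value  # the BinOp
missing = not hasattr(node, "first_token")
```

Accessing `node.first_token` directly on such a node is exactly the `AttributeError` described above.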
Two more f-string problems:

`.get_text(node)` suddenly working can just break things in other ways. Previously I was excluding nodes from a particular set when `get_text` didn't work; f-string contents suddenly appearing in the set was a problem. In particular, I was using `asttokens.util.replace` to modify some source code, and it assumes non-overlapping ranges, but now the inner node ranges were overlapping with the full f-string range. The fact that I was using an `asttokens.util` function here is essentially a coincidence; things could just break in non-asttokens-related ways. In theory I suppose there could be another boolean arg for compatibility to continue ignoring f-string nodes, but that seems a bit extreme.

While rebasing #98 I hit a test failure which I think is present in this PR too -- the f-string handling in `atok.get_text`
implicitly relies on `mark_tokens` having run before it's called. This seems to happen in the example source because of the function definition there; however, commenting that out (the `def foo` on lines 16--17 in `tests/test_unmarked.py`) causes `test_unmarked` to fail on the f-string handling.

The specific error comes from the assertion that `atok.get_text` for f-strings returns empty text: `self.assertEqual(atok_text, "", node)`.

I'm actually not sure why the test is asserting that that value is empty, given that the text which would otherwise be returned appears to be correct to me (it matches what Python's AST returns, which seems fine; removing the conditional block gets the tests passing for #98).
I've explained the reason for the conditional block in https://github.com/gristlabs/asttokens/pull/98#discussion_r985298862. However you're right about this:

> the f-string handling in atok.get_text implicitly relies on mark_tokens having run before it's called

Initially this was intentional and fine, because the only f-string handling was `_in_f_string` (now called `_dont_use_tokens`) saying that even though we have tokens you still shouldn't use them.

However it was a mistake to also add `_broken_positions` in the same place, because indeed this means that marking tokens changes the behaviour of `get_text`. The tests actually do reflect this now, but I forgot about this point.
So another solution is needed if we want the 'broken' inner nodes of f-strings to return empty text instead of the whole f-string. There's 3 types of these nodes, including those inside strings prefixed with `f` or `rf` (or `F` etc.) instead of plain `"` or `'`. This would make the `.get_text` call potentially significantly more expensive, but only for `JoinedStr` nodes.

> @dsagal the names I chose initially are not great. What do you think of replacing `unmarked` and `init_tokens` with `use_tokens`? So the constructor and get_text methods would both have `use_tokens=True` by default, and the method `.init_tokens()` would be `.use_tokens()`. `supports_unmarked` could be `supports_no_tokens`.
Sure! I haven't followed the full discussion on this or the related issues. I'd be fine with different names for params/methods. Having separate classes for different semantics seems reasonable too, especially if that helps with typings.
@PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here, and a summary of the benefits?
@dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.
> @dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.
It seems to me like f-strings are an area similar to where the AST itself was when `asttokens` was created: the built-in functionality has bugs that make it insufficient, and to work around them, `asttokens` needs to do some manual parsing (searching for curly braces, `:`, `!`). Sorry, I don't have anything more helpful to suggest...
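As a rough illustration of what that kind of manual parsing involves, here is a deliberately simplified sketch (assumptions: only balanced braces are handled; escaped `{{`/`}}` and the finer points of the `!` conversion and `:` format-spec markers are ignored):

```python
# Find the (start, end) character spans of replacement fields inside the
# body of an f-string, by scanning for balanced curly braces.
def field_spans(fstring_body):
    spans = []
    depth = 0
    start = None
    for i, ch in enumerate(fstring_body):
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                spans.append((start, i))
    return spans
```

For example, `field_spans("hello {name}!")` locates the span covering `name`. Real handling would also need to split off the `!conversion` and `:format_spec` parts, which is where the fiddliness lies.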
Well since you noticed that arbitrary types of expression nodes inside f-strings can have wrong positions, I think that settles the fact that walking through the tree to mark various nodes is necessary.
> @PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here,
I think the tests in #98 cover this? Essentially construction is similar to that of plain `ASTTokens`; however, usage of the lazy class is limited to the functionality that doesn't "obviously" need tokens.

I'm not really expecting consumers to use the `LazyHelper.asttokens` property, though they could if they wanted (and would get what they expected in return).
> and a summary of the benefits?
The main benefits as I see them are:

- `ASTTokens` and `LazyHelper` have useful static analysis (which the consequences of `ASTTokens(init_tokens=False)` do not), allowing both type checkers and editor completions to be essentially correct without additional effort (including any changes necessary due to opting in to the functionality in this PR #93).
- `LazyHelper`'s semantics are much more obvious in terms of their impact on `node.{first,last}_token` -- specifically that consumers should not use those added attributes on the grounds that they might not be present (but that's what consumers are opting in to by using this class).
- (`LazyHelper` also happens to allow for the case of lazily evaluating the AST too, meaning other use-cases can choose to benefit from laziness without this functionality further complicating the `ASTTokens` initialiser.)
- Consumers who want the full `ASTTokens` functionality can still do so (and the point at which the lazy-loading happens will be obvious to them as a result of their needing to access an indirect property to do so).
- The common functionality is shared (via the `TextRangeBase` class), and in particular the optimising/special-casing being added for the lazy case is limited to one specific place in the code (the `LazyHelper` class).

The potential drawbacks I can see are:
With @PeterJCLaw's changes:

- The `ASTTokens` class only uses tokens and doesn't use the new method at all.
- `_dont_use_tokens` (from earlier in this PR) is no longer needed and has been removed.
Following up on https://github.com/gristlabs/asttokens/pull/92.

Closes #97.

- New class `ASTText` that:
  - supports `get_text[_range]` while avoiding the performance cost of dealing with tokens for most node types and newer Python versions;
  - parses the `tree` if one isn't passed, so `parse=True` isn't needed.
- New method `get_text_positions` on both `ASTText` and `ASTTokens` which returns start and end `(lineno, col_offset)` pairs, as an alternative to `node.first/last_token`.
- New argument `padded` on all 3 `get_text*` methods of both classes which matches the behaviour of the same argument in `ast.get_source_segment`, i.e. includes leading indentation for multiline statements. However, the default is `True` (instead of `False` as in `ast.get_source_segment`) in the existing methods `get_text[_range]` for backwards compatibility. For the new `get_text_positions` there is no default, to force the user to stop and think: it can't be `False` because that would be confusing compared to the existing methods, but `False` matches the behaviour of using `node.first/last_token`, which the method is meant to replace.
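To illustrate the semantics being matched, here is the stdlib `ast.get_source_segment` behaviour (Python 3.8+) that `padded` mirrors: for a multiline statement, `padded=True` includes the leading indentation of the first line.

```python
import ast

# A multiline assignment nested inside an `if`, so it has leading indentation.
source = "if True:\n    x = (1 +\n         2)\n"
stmt = ast.parse(source).body[0].body[0]  # the assignment statement

plain = ast.get_source_segment(source, stmt)               # stdlib default: padded=False
padded = ast.get_source_segment(source, stmt, padded=True)  # pads the first line
```

Here `plain` starts directly at `x`, while `padded` keeps the four leading spaces, matching the statement's original column.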