Use a zipper to manage parent nodes and updates

pylint-bot / test

0 stars 0 forks source link

Use a zipper to manage parent nodes and updates #163

Open pylint-bot opened 9 years ago

pylint-bot commented 9 years ago

Originally reported by: BitBucket: ceridwenv, GitHub: @ceridwen?

After experimenting more with the process of refactoring Macropy to use astroid, I've concluded that the solution I came up with of having two methods is creating a lot of boilerplate in tree creation and maintaining the synchronization manually is bug-prone. After researching possible alternatives, I'd like to argue for using a zipper to wrap the AST. There's a simple Python implementation at https://github.com/trivio/zipper. A zipper keeps track of parent nodes without needing a doubly-linked tree, which will make the constructors simpler and remove most of the boilerplate in tree creation, and allows the AST to be immutable. The current mutability of the AST makes tracing the state of notes unnecessarily difficult, as I learned while writing the initial patch for node constructors, especially given how complex the state of any given astroid node is. An immutable AST is compatible with lazy properties that are calculated when they're first called but afterwards remain unchanged. It should be possible to keep the external API the same using properties. Along the way, it will probably be necessary to refactor some things that are currently mutating the AST (transforms, inference) to be functions/visitors acting on the AST, but you've already indicated you want to do that in https://bitbucket.org/logilab/astroid/issues/116/add-a-separate-step-for-transforms and https://bitbucket.org/logilab/astroid/issues/147/lots-of-circular-dependencies-in-the.

Bitbucket: https://bitbucket.org/logilab/astroid/issue/163

pylint-bot commented 9 years ago

Original comment by Claudiu Popa (BitBucket: PCManticore, GitHub: @PCManticore?):

In general, I like the idea, but I want to see some sort of proof of concept of how this will look, as well as what needs to be changed in the inference and the transforms. Also, I'm not sure if this changes anything, but the transforms can mutate a node, not necessarily return a new one. Probably we should use something else than that zipper library, it doesn't seem maintained and the implementation seems messy.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

I'll write a proof of concept. A zipper can work on mutable data, it just doesn't offer any advantages. In the long run, the purpose of changing the representation would be to limit the amount of mutation to make it easier to reason about state, which at the moment is very complicated and hard to understand (or at least, that's my feeling having tried to understand the code to write my patches). I agree that we shouldn't use that zipper library because it's unmaintained.

pylint-bot commented 9 years ago

Original comment by Claudiu Popa (BitBucket: PCManticore, GitHub: @PCManticore?):

Cool, I'm looking forward to it.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

I'm going to start on this now, but I wamted to ask first if I should submit the debugging improvement patches to make debugging this patch less challenging than my last one.

pylint-bot commented 9 years ago

Original comment by Claudiu Popa (BitBucket: PCManticore, GitHub: @PCManticore?):

Yeah, you could send them first.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

Zipper prototype.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

I added a prototype, which is more a skeleton of what a working implementation would look like than a specific proposal per se. I only implemented basic movement and editing primitives. To preserve the API, I think the real implementation should be a proxy for the node it contains and parent should be aliased to up() using a property. This would mean that there are certain node methods that at the moment will fail when called on raw nodes not enclosed in a zipper. I don't see this as a problem since all the public APIs ought to return ASTs properly enclosed in zippers, but tell me if you think differently. The node classes themselves will need two new methods, children and make_node (those names aren't set in stone), and after it's done I can remove parent from __init__ and combine all the postinit arguments into init. We should still talk about exactly what movement and editing primitives should be available and data structure issues. I discuss the data structures a bit more in my comments in the prototype. For movement and editing primitives, the Clojure implementation and the other Python implementation I mentioned above have some ideas, but I honestly don't expect to get this right on the first draft. After I finish this, I'm going to adapt Macropy to use astroid, which ought to give me more insight into what editing/movement primitives will be needed in a real application, and then submit another pull request.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

The long-range project I've been working on is to make astroid more useful for non-pylint projects (in particular Macropy) and closer to a drop-in replacement for the standard lib ast module. One of the major problems I found when I was first looking at trying to use astroid for Macropy is that the logic for handling the ASTs was scattered all over astroid and poorly encapsulated, so working with astroid ASTs required detailed knowledge of the undocumented-internals and manual calls to various astroid methods to get a proper astroid AST. So far, I've been focusing on solving this for node creation: when I gave the nodes constructors, I moved some of the logic out of rebuilder.py and into the __init__ and postinit methods, which are also vaguely more documented now so one could at least theoretically create an AST by looking at the __init__ and postinit arguments without needing to read through large chunks of the astroid code. Moving navigation to the zipper, making the tree singly-linked, and condensing node creation down to __init__ will again improve this.

Unfortunately, I'm increasingly seeing that alone won't solve the problems. The biggest are in the variable name/scope handling, so I'll use that as an example, but there are some other similar cases. There are still several explicit calls of set_local() in rebuilder.py, and some of them are in TreeRebuilder._save_assignment(). To properly construct an astroid AST at the AST level effectively requires reading the code to find these calls and mimicking them, somehow using the private API to handle it for you, or both. When altering an astroid AST, like I would often need to in Macropy, would require using set_local() by hand, because there doesn't seem to be any comprehensive code for maintaining variable name/scope synchronization when changing the AST.

My diagnosis is that all these problems (including the related problems in #169) come from storing AST-level properties on the nodes. This makes the ASTs badly nonlocal because a change in one part of the AST can require changes on arbitrary nodes in any other part of the AST. Frankly, I think the right way to fix this is the obvious approach: stop storing AST-level properties on the nodes! AST-level properties should be calculated by functions (speaking in a generalized sense to include visitors, classes with __call__, etc.) acting on ASTs. If performance is an issue, we need a way to identify an AST and memoize the result of the function call with respect to that AST so that repeated calls on the same AST access a cached result, rather than trying to store cached results on the nodes themselves. Compare, for instance, how Macropy uses a visitor to calculate scoping: https://github.com/ceridwen/macropy/blob/astroid/macropy/core/analysis.py#L43 . This is a simpler, better-encapsulated solution than what astroid has now.

I want to do this right, so I want to refactor the AST-level properties (pretty much everything that lives in _other_other_fields and some other things besides) into functions/visitors and handle the caching some other way. On the performance issue: do we have any existing benchmarks for astroid's or pylint's performance I can use to assess the impact of changes I make? Do we have any idea where the bottlenecks in the code are to know what parts of the code need optimization? I will note that just by eyeballing the code, I see astroid does a lot of things that will make it unnecessarily slow on CPython.

pylint-bot commented 9 years ago

Original comment by Claudiu Popa (BitBucket: PCManticore, GitHub: @PCManticore?):

First of all, thank you for the work you're doing with astroid. I'm in total agreement with you regarding all these issues you mentioned, from this point of view, astroid shows not only its age, but a lack of a comprehensive design, which causes problems such as #147 and #108. So if you want to tackle the change of moving away from AST level properties into function computations through visitors or other methods, then I'm happy with that. Regarding the performance, no, there's nothing yet for measuring astroid's performance, but I might do it if it helps you with your work. I presume for now that the bottleneck is actually the inference and not the tree rebuilding. What things will make it unnecessarily slow on CPython?

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

I would also be surprised if the AST operations are the bottleneck. If you could write some kind of profiling, that would be a huge help. I think for the moment I'm going to not worry about performance too much, pending data showing it's a real problem compared to the inference or pylint operations.

My guess, without data to go on, is that a lot of time is being spent in small functions with multiple levels of indirection and on temporary copies. Unfortunately in CPython, avoiding code duplication and performance are in opposition, because all abstractions in CPython have a cost. The copying should have an impact even on PyPy.

pylint-bot commented 9 years ago

Original comment by Claudiu Popa (BitBucket: PCManticore, GitHub: @PCManticore?):

Increasing the priority, since I want to have this before releasing 1.5, so that the API isn't changed for two versions in a row. I'll try to have a performance monitoring tool available for you this week.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

I'm in the middle of implementing the zipper and want to consult on one of the larger design decisions. In the prototype, I talked about the two simple cases of nodes, nodes whose children are all direct attributes (accessed by node.foo) and nodes with only one sequence of children. In general nodes can contain multiple sequences of children, though, and so the zipper has to accommodate that. There are, AFAICT, two approaches to the problem: allow the zipper to focus on sequences of children as well as nodes themselves, or have zipper only focus on nodes. The latter requires providing an API on the nodes that returns a nested sequence of children, where each element corresponds to an attribute on the node, which element may itself be another node or a sequence of nodes. There will need to be more kinds of traversal functions in this case, and some of them will look like child_sequence and locate_child. The former will requires either doing type dispatch in the zipper itself, having code that branches on whether the zipper is focusing on a node or a sequence of nodes, or writing a new sequence type with an API that matches that of nodes. This will probably also require additional traversal functions for traversals that are sure to only return nodes.

Because the existing API contains a mixture of both approaches to traversal (parent will always return a node, child_or_sequence will may return a sequence), there's not really a clear precedent. Do you have any thoughts on which approach would be better?

pylint-bot commented 9 years ago

Original comment by Claudiu Popa (BitBucket: PCManticore, GitHub: @PCManticore?):

This is interesting, since algorithms and data structures aren't my strongest point, hope I'm not wandering too much with these ideas.

I think the solution is a hybrid approach and I'll try in the following to emphasize why is that. Let's take for example the AST of this code:

#!python

def test():
   a = 1
   b = 2
   if a == b:
      print('here')
   else:
      print('there')
   return 'other block'

which has this repr_tree:

#!python

FunctionDef(
   name='test',
   doc=None,
   decorators=None,
   args=Arguments(),
   body=[
      Assign(
         targets=[AssignName(name='a')],
         value=Const(value=1)),
      Assign(
         targets=[AssignName(name='b')],
         value=Const(value=2)),
      If(
         test=Compare(
            left=Name(name='a'),
            ops=[['==', Name(name='b')]]),
         body=[Expr(value=Call(
                  func=Name(name='print'),
                  args=[Const(value='here')],
                  keywords=None))],
         orelse=[Expr(value=Call(
                  func=Name(name='print'),
                  args=[Const(value='there')],
                  keywords=None))]),
      Return(value=Const(value='other block'))],
   returns=None)

If the zipper's focus would be the If node, then we can do the following

retrieve the left part of the tree from zipper's current point of view. I think this is the first two assignments, to a and b, which is obviously a collection of nodes.
retrieve the right part of the tree from.
retrieve the children, which is also a collection of nodes.
the only one that is different is the behaviour for up() / parent, because there can be only one parent for a node. For this, it might not make sense to return a collection, but internally you could do that to ease your implementation.

The problem with this example is the If itself, which as you said, is a composite node, with both children as attributes and sequences of children. The zipper should deal with nodes only, since it makes most sense of all if you view the sequence of children as a subtree of the node in question, rather than a Python collection, as in this example. The Else and Body nodes can be seen as fake nodes, since they only purpose is to show that there's a subtree living there.

Regarding what it should return, why the list has to be nested, why not simply flattening everything together? Unless we're expecting to build the tree bottom-up I don't see an use case right now.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

If the zipper's focus would be the If node, then we can do the following

retrieve the left part of the tree from zipper's current point of view. I think this is the first two assignments, to a and b, which is obviously a collection of nodes.

retrieve the right part of the tree from.

retrieve the children, which is also a collection of nodes.

the only one that is different is the behaviour for up() / parent, because there can be only one parent for a node. For this, it might not make sense to return a collection, but internally you could do that to ease your implementation.

For the most part you're right, but not all of this can be done easily at the same time. If the zipper can only focus on nodes, then it can't return sequences in any case. Moving left from the If will return the second Assign, then the first Assign. Moving right will return the Return node, then nothing. Going down will return the Compare node. Moving up will return the FunctionDef.

If the zipper can focus on sequences, it could return a sequence of the two assignments, and going down then right will return the sequence contained in If.body. However, it will also mean that going up will return the sequence in FunctionDef.body.

Note that it's possible to write all kinds of traversal functions so to some extent this a question of what functions will be simplest and perform best.

Regarding what it should return, why the list has to be nested, why not simply flattening everything together? Unless we're expecting to build the tree bottom-up I don't see an use case right now.

The zipper needs to be able to reconstruct the children of a node properly when moving up. If you flatten everything together, the zipper doesn't know which children belong to which sequences in the parent node and so can't call the constructor properly.

I hope I'm being clear, there are a lot of details here. If the zipper can only focus on nodes, what should happen to child_or_sequence?

pylint-bot commented 9 years ago

Original comment by Claudiu Popa (BitBucket: PCManticore, GitHub: @PCManticore?):

One thing doesn't seem right: "However, it will also mean that going up will return the sequence in FunctionDef.body." Going up means retrieving the direct parent, similar to what we're currently having. Why does it have to return the entire sequence of functiondef's body, since functiondef is the only parent? That's the case for If, but I would expect to be the same for Compare for instance, up() should retrieve only the direct ancestor.

pylint-bot commented 9 years ago

Original comment by BitBucket: ceridwenv, GitHub: @ceridwen?:

The zipper needs to be consistent to work correctly, or at least, if there's a way to avoid that, it's quite complicated, and I don't know how I'd make it work immediately. If it can focus on sequences and nodes, then sometimes moving will return a sequence, not a node. If it can focus on only nodes, moving will always return a node. That said, it's still possible to write traversal functions that only return nodes (or sequences, for that matter) using isinstance.

# Zipper focuses on nodes and sequences
    @property
    def parent(self):
        up = self.up()
        if isinstance(up, bases.NodeNG):
            return up
        else:
            return up.up()

# Zipper focuses on only nodes
    @property    
    def parent(self):
         return self.up()

The trickiest part, I think, is if the zipper focuses only on nodes, what about the traversal functions that can currently return nodes and sequences, like child_or_sequence? The only way I see to do that is to explicitly wrap each of the nodes in in the sequence in a zipper, which will involve a performance hit.

I should emphasize again that as far as I can tell at this point, it should be possible to maintain consistency with the external API no matter which choice we make here. This should be an implementation detail, which will affect the complexity of the code, where that complexity lives (which operations are simple and which are more complicated), and performance.