vyperlang / vyper

Pythonic Smart Contract Language for the EVM
https://vyperlang.org
Other
4.9k stars 801 forks source link

VIP: Structured AST Output #2276

Open haltman-at opened 3 years ago

haltman-at commented 3 years ago

Simple Summary

So, I work on the Truffle Debugger, and I recently added rudimentary Vyper support for it (with more hopefully to come). However, further Vyper support -- especially decoding of variables -- is made difficult by the fact that Vyper's AST format (as emitted by the compiler) is difficult to work with. I was hoping maybe that a future version of Vyper could introduce a new or improved AST format that would be more usable, which would go a long way towards allowing fuller Vyper support in Truffle Debugger.

Motivation

So first I should be clear here that I'm only talking about the AST emitted by the compiler. I have no interest in any intermediate AST format that is used internally, so if the existing format needs to continue to be used internally at some point, that is fine. My interest is in the AST emitted by the compiler so that Truffle Debugger can make use of it.

With that out of the way, why is the existing AST format unsatisfactory, and how would I like to see it improved?

Fundamentally, the problem with the existing AST format is that it's not so much of an abstract syntax tree as a concrete syntax tree; it doesn't so much describe what things are, as how they are written. For instance, if I write

a: String[64] = "hello"
b: uint256 = c[1]

then the nodes corresponding to String[64] and c[1] both have the same ast_type, "Index", despite meaning entirely different things -- one being a special sort of type name, and one being an expression indicating an access into an array.

While I would consider this the most fundamental problem, I should note that a carefully-written program meant to process the AST can still likely work with this. However, there are some key points where it is outright missing needed information.

The worst case of this, I would say, is the declaration of a loop variable. Most variable declarations are of the AnnAssign ast_type and as such include type information in the annotation field. However, loop variable declarations are of the Name ast_type and do not include any type information. That's because this reflects the syntax, where most variables are declared with a specified type, but loop variables are not. However the result is that determining the type of a loop variable is, while perhaps technically possible, not realistically feasible for an external tool.

But I want to be clear that while the lack of type information for loop variables is the immediate problem, I don't think the correct solution is to simply add that type information in somehow. Doing so wouldn't be very compatible with the current format, after all, since then the tree would no longer reflect the concrete syntax. What I'd really like to see is not a minor modification to cover this case, but rather something like Solidity has -- where every node that has a type (variable, expression, whatever) has type information attached to it. For instance, with the current format, it might be possible to find a way to decode mapping keys, but it would be difficult.

While I'm not suggesting you copy the Solidity AST -- Vyper is a different language, after all, and so many things in the Solidity AST would likely make no sense in this new context -- I think studying it could be instructive. Here are some features of it that I think are useful and that you would likely want to implement:

  1. As mentioned above, the Solidity AST attaches type information, in a machine-readable format, to every node that has a type; for user-defined types, this should likely include an ID (see below for more about IDs).
  2. Nodes in the Solidity AST can refer to other nodes by ID. For instance, each variable declaration has a scope field, giving the node ID of that variable's scope. Any use of something that has been previously declared by the user has a referencedDeclaration field, giving the node ID of its declaration. Etc.
  3. In order to make the above work, node IDs in Solidity are unique per-compilation, rather than per-source. While Vyper does less with imports than Solidity does, I think it still does enough with them to make this appropriate. Of course, you might find some other alternative rather than doing this, e.g. doing something like having referencedDeclarationSource (with the source index) and referencedDeclaration (with the ID). I don't know what the best way is; I'm just tossing ideas out here.
  4. As mentioned above, ast_types should be more about the node's meaning, rather than how it's written, and nodes should generally have more useful information attached to them rather than being a simple parse of the syntax (this is really the fundamental problem, as mentioned above). As I said, I think studying the Solidity AST format could be instructive regarding what sorts of useful information could be added or what ast_types could potentially exist.

So, those are examples of the sort of thing I'd like to see. Reworking the emitted AST format along these lines would make it much easier for Truffle Debugger (or other tools) to make use of it, both for variable decoding and for other purposes.

Thank you!

Backwards Compatibility

Depending on how this VIP is implemented, it could easily change the emitted AST format to something entirely incompatible with what already exists. That said, this would only be relevant to tools that make use of the emitted AST, and I'm not sure how prevalent those are.

Copyright

Copyright and related rights waived via CC0

haltman-at commented 3 years ago

Oh, one other note, if you do do this, ideally this new format should be readily distinguishable from the old one via something that's detectable at the top-level node. I'm not saying you need to put an explicit flag or anything, but ideally there should be something different there that will give it away. :) Thanks again!

haltman-at commented 3 years ago

Sorry, one more note -- I wanted to give a better example of what's wrong. (Other than iterator variables, I mean.) I don't feel like my Index example above did a great job illustrating the problem, because, maybe it's not too unreasonable that those are the same, right?

Well, I remembered a better example. And this will illustrate both things being the same that shouldn't be, but also the worse problem of things being different that shouldn't be. (Although iterator variables are already an example there, since, most variable declarations are AnnAssigns but those ones are instead Names!)

Consider the following declarations:

a: uint256
b: public(uint256)

The first one, like most variable declarations, is an AnnAssign node, with the name info in target and the type info (in this case, a Name) in annotation.

However, the second one, under annotation, instead has a Call! The Call has as its func a Name with id "public", and as its args an array containing just the node with the type info.

So, these two things are nearly the same, but in one under annotation you have the type info, and in the other you have it wrapped in a Call. (And of course the use of the Call type isn't great since nothing at all like a function call is in any way happening here!) Ideally, the fact that b is public would be a property of the variable declaration node, not some intermediate layer that wraps around the type where you expect the type to be!

Anyway, hoping that's a better example. Thank you again!

fubuloubu commented 3 years ago

I see the issue, yes we will be implementing a custom parser and updating our AST nodes to be specific to Vyper, intead of the strange PyAST/VyAST hybrid we have now. This should help :smile:


Structuring the output is definitely something we'd like to discuss with you more.

fubuloubu commented 3 years ago

Meeting notes: will explore refactoring our AST information so it is easier to work with.

sambacha commented 3 years ago

I see the issue, yes we will be implementing a custom parser and updating our AST nodes to be specific to Vyper, intead of the strange PyAST/VyAST hybrid we have now. This should help 😄

Structuring the output is definitely something we'd like to discuss with you more.

New parser? Any details on this?

fubuloubu commented 3 years ago

I see the issue, yes we will be implementing a custom parser and updating our AST nodes to be specific to Vyper, intead of the strange PyAST/VyAST hybrid we have now. This should help smile Structuring the output is definitely something we'd like to discuss with you more.

New parser? Any details on this?

quite old, but basically: https://github.com/vyperlang/vyper/issues/1363

Relevant work: https://github.com/fubuloubu/vyper-grammar https://github.com/fubuloubu/lll-compiler https://github.com/apeworx/node-utils

charles-cooper commented 3 years ago

@haltman-at is there a specific ask here? Or is this a more general "AST could be better"?

haltman-at commented 3 years ago

It's both. It's both a general "AST should be better", and examples of a few particular ways I would like it to be better. Note that making these improvements would probably require an entirely new AST format, rather than just tweaks to the existing one, as I suspect the existing one just isn't really suitable for these changes.

As for what these specific improvements would be, well, I was going to put them into a numbered list here, only to realize that I basically already did that in the original issue writeup. (Although some of them may rely on context in the paragraphs preceding said list.) So, um, just go reread that list, and also the issue more generally, for even more information? Not sure what else to add here that I didn't already say.

charles-cooper commented 3 years ago

@haltman-at I agree that most of the things you are asking for aren't really tweaks to the existing AST, they are kind of the result of using the Python AST parser instead of having a custom Vyper parser (see #1363). So I'm not sure if what you're asking for are specific changes to the AST or if this issue is putting on paper several issues that would be solved by #1363 (but effectively is a sub-issue of #1363).

I want to add that in general, it is much less confusing, and contributors are much more likely to want to help by working on an issue or VIP if it has one "main thing to do". The umbrella "AST should be better" issue is a pretty abstract issue. Meanwhile, there are several distantly related suggestions in this issue and it's hard for me to figure out which ones are important and which ones are just thoughts that happened to make their way into the VIP. So it would make it a lot easier for Vyper contributors to prioritize and take on the work - and it would be much appreciated! - if it were parceled out into separate, relatively self-contained issues.

haltman-at commented 3 years ago

Ah, yes, I would say that this is a sub-issue of #1363. I would say that (1)-(3) are pretty essential for my purposes; (4) maybe not so much.

charles-cooper commented 3 years ago

@haltman-at can you check out https://github.com/charles-cooper/vyper/tree/ast_output and https://github.com/charles-cooper/vyper/tree/ast_output2? The first one returns the AST after analysis (type checking, "expansion" and "folding" stages) instead of before. The second one is similar but it also exposes internal type information. (Obviously this is a proof of concept, not a stable format)

haltman-at commented 3 years ago

Are there any instructions for how I would use Vyper from source? If not, could you maybe just attach examples of the output? (Ideally in all three formats for comparison?) I can specify what specific features I'd like to see compared if you want, although it's basically just the stuff above -- do public declarations no longer look totally different from non-public ones, do loop variables have type info, do references to structs, constants, or events include the ID of what they're referencing... (I talked about the ID thing in general, because it should be a general thing, but I think it's structs, constants, and events that I care most about.)

charles-cooper commented 3 years ago

@haltman-at not sure how you're using vyper but you can usually pip install directly from a URL.

Here's an example contract, with output from 3 different branches - current master, and the two others I linked to before. https://gist.github.com/charles-cooper/b35dc96c4f952b586f9ec654ba252c6b

haltman-at commented 3 years ago

@haltman-at not sure how you're using vyper but you can usually pip install directly from a URL.

Oh, cool! I wasn't aware; I'll have to give these a whirl then (and also look at your example but I expect I'll want to try more myself :) ).

haltman-at commented 3 years ago

OK, quick impressions: The output from ast_output is not a big improvement from what currently exists. As for ast_output2... that one's more interesting, but it's hard to judge quickly. I don't think it meets all my criteria -- I don't see anything that looks like nodes referring to other nodes, for instance -- but it might at least meet some of them. There's clearly some sort of type information here. (It doesn't meet the "public state variables look essentially the same as non-public ones" criterion, but, given sufficient type information elsewhere, that criterion isn't truly a necessity, I suppose; it's something you'd want to do in a new AST format, but not something you can do by modifying the existing one, obviously.) My suspicion is that it won't quite be enough, but it does seem to be an improvement; I'll have to try it out more and comment more with what I find. Anyway, thanks a bunch for putting this up, this is quite interesting indeed.

haltman-at commented 3 years ago

@haltman-at not sure how you're using vyper but you can usually pip install directly from a URL.

Sorry, how does one actually do this? I'm trying to test this out right now and pip install https://github.com/charles-cooper/vyper/tree/ast_output2 doesn't seem to do it (I get a "cannot detect archive format" error).

Edit: Sorry, nevermind, got this working, the command is

pip install git+https://github.com/charles-cooper/vyper.git@ast_output2
haltman-at commented 3 years ago

OK, so, having tried this out, here's my quick assessment. So, we're trying to do two things here: Track variables and track mapping keys. I'm going to ignore the latter for now as I think if we can do the former we can basically do the latter, although it will probably require some special handling at some points, but like I said I want to ignore all this for now. Tracking variables means tracking their location (we can do that) and tracking their types; the former we can do, the latter is the problem we're trying to solve.

So, first question: Does this do everything that I would want from an AST format?

...well, obviously, no. It's still basically the same AST format, just with some transformations made and some information added. So that's obviously not happening until the whole AST is reworked.

But, OK -- let's work with what we have. Is this good enough to easily or mostly-uniformly handle the type-tracking problem?

Unfortunately, I'd say this is still a no. To qualify as a "yes", I'd say we ought to be able to rely primarily on the added type info, and only occasionally having to get type info via processing the AST itself (especially because the AST format, while handleable, is not great). This works fine for value types; but for strings, bytestrings, arrays, structs, and mappings, the added type info just tells you which one of those that it is, not the more detailed type information. Moreover, not all variables seem to get the type metadata at all; it seems to be missing on function arguments.

But, OK -- so we can't do things in a more uniform manner as we'd like, but... if this is what we had to go with, using whatever means feasible... could we do it? If we absolutely had to, is this good enough?

The answer there is... maybe. Or rather, it's yes, conditionally. The condition is that I think this is good enough, if we absolutely had to, for this version of Vyper; we could write mechanisms ot handle this that would work with the current version of Vyper. However, such mechanisms would rely on assumptions about what's currently possible in Vyper, assumptions that may be invalidated by future versions could add support for features that could easily render such mechanisms untenable, and possibly truly require a new AST format. As such, I would be reluctant to write support for this intermediate format.

So why do I say it would be enough for the current version? Well, because we could mostly ignore the type metadata and just process the AST, like I had initially planned. Doing so, while a pain, would be doable, except, of course, for the case of loop variables. But, with this added type metadata, we would now have a way to handle loop variables... in this version of Vyper. Currently, as best I can tell, only value types are permitted for loop variables; and value types are precisely the types for which the type metadata tells us everything. But if a future version of Vyper ever expanded the allowed types on loop variables, well, we'd be stuck. The type metadata simply isn't good enough for those cases.

(Another example of this, relevant not to variables but rather to mapping keys, is the assumption that Vyper doesn't permit pointers to mappings like Solidity does. When I said above that mapping keys seem handleable, it was conditional on this assumption. If Vyper added that, we'd have to rely on the type metadata, and for mappings it wouldn't be good enough.)

Also, the fact that this AST output is emitted after various other transformations has some unrelated downsides. Like, constants completely vanish! We'd like to be able to report on the values of constants when possible, but we can't do that if their AST entries go away.

So overall I'd call this a substantial improvement over the existing format, but it's not a strict improvement, and it still has enough missing that, while it's technically enough for current versions, I'd be quite wary of attempting to support it due to the possibility that future versions would invalidate our assumptions.

charles-cooper commented 3 years ago

@haltman-at I think this conversation would be a lot more focused with some examples. Could you maybe write up some sample contracts + the ideal AST output in a gist?

haltman-at commented 3 years ago

Well, there's no such thing as "the ideal AST output"; AST output necessarily involves arbitrary choices. And of course there's the question of whether we're talking about something close to what you already have, or wholly reinventing it as I imagine might happen in #1363. But I can certainly provide written-out examples illustrating the sorts of features I've described above.

I'll often illustrate with bits of Solidity ASTs, since it has a lot of these things, but obviously this isn't a suggestion that you should adopt their format whole cloth! Vyper is a different language -- and one with a different type system, which will certainly affect things. But I'm going to use it for reference.

So, let's compare some pairs of analogous declarations in both languages, and what we get for each.

Comparison 1: Declaring a fixed-length, nested array as a state variable.

x: uint256[2][2]

vs

contract C {
  uint256[2][2] x;
}

Here's the relevant part of the Vyper AST output, using the ast_output2 branch. (To keep things shorter, I'm going to remove keys that seem irrelevant.)

      {
        "_metadata": {
          "type": "ArrayDefinition"
        },
        "annotation": {
          "_metadata": {},
          "ast_type": "Subscript",
          "node_id": 4,
          "slice": {
            "_metadata": {},
            "ast_type": "Index",
            "node_id": 11,
            "src": "14:1:0",
            "value": {
              "_metadata": {},
              "ast_type": "Int",
              "node_id": 12,
              "src": "14:1:0",
              "value": 2
            }
          },
          "src": "3:13:0",
          "value": {
            "_metadata": {},
            "ast_type": "Subscript",
            "node_id": 5,
            "slice": {
              "_metadata": {},
              "ast_type": "Index",
              "node_id": 8,
              "src": "11:1:0",
              "value": {
                "_metadata": {},
                "ast_type": "Int",
                "node_id": 9,
                "src": "11:1:0",
                "value": 2
              }
            },
            "src": "3:10:0",
            "value": {
              "_metadata": {},
              "ast_type": "Name",
              "id": "uint256",
              "node_id": 6,
              "src": "3:7:0"
            }
          }
        },
        "ast_type": "AnnAssign",
        "node_id": 1,
        "simple": 1,
        "src": "0:16:0",
        "target": {
          "_metadata": {
            "type": "ArrayDefinition"
          },
          "ast_type": "Name",
          "id": "x",
          "node_id": 2,
          "src": "0:1:0"
        },
        "value": null
      }

Here's the relevant part of the Solidity AST output: (Again, to keep things shorter, I'm going to remove keys that seem irrelevant.)

          {
            "constant": false,
            "id": 6,
            "mutability": "mutable",
            "name": "x",
            "nodeType": "VariableDeclaration",
            "scope": 7,
            "src": "15:15:0",
            "typeDescriptions": {
              "typeIdentifier": "t_array$_t_array$_t_uint256_$2_storage_$2_storage",
              "typeString": "uint256[2][2]"
            },
            "typeName": {
              "baseType": {
                "baseType": {
                  "id": 1,
                  "name": "uint256",
                  "nodeType": "ElementaryTypeName",
                  "src": "15:7:0",
                  "typeDescriptions": {
                    "typeIdentifier": "t_uint256",
                    "typeString": "uint256"
                  }
                },
                "id": 3,
                "length": {
                  "hexValue": "32",
                  "id": 2,
                  "kind": "number",
                  "nodeType": "Literal",
                  "src": "23:1:0",
                  "typeDescriptions": {
                    "typeIdentifier": "t_rational_2_by_1",
                    "typeString": "int_const 2"
                  },
                  "value": "2"
                },
                "nodeType": "ArrayTypeName",
                "src": "15:10:0",
                "typeDescriptions": {
                  "typeIdentifier": "t_array$_t_uint256_$2_storage_ptr",
                  "typeString": "uint256[2]"
                }
              },
              "id": 5,
              "length": {
                "hexValue": "32",
                "id": 4,
                "kind": "number",
                "nodeType": "Literal",
                "src": "26:1:0",
                "typeDescriptions": {
                  "typeIdentifier": "t_rational_2_by_1",
                  "typeString": "int_const 2"
                },
                "value": "2"
              },
              "nodeType": "ArrayTypeName",
              "src": "15:13:0",
              "typeDescriptions": {
                "typeIdentifier": "t_array$_t_array$_t_uint256_$2_storage_$2_storage_ptr",
                "typeString": "uint256[2][2]"
              }
            },
            "visibility": "internal"
          }

So, let's compare. I think the Solidity tree does things several things better (although it's still not ideal; I'll get to that).

Firstly, it puts type information on everything with a type. The Vyper example has nonempty _metadata only on the AnnAssign (i.e., variable declaration) node itself; the other nodes have empty _metadata. The Solidity example has typeDescriptors not only on the VariableDeclaration node itself, but also on both ArrayTypeName nodes and the ElementaryTypeName node. (And the Literal nodes, though obviously we don't need that here.) Is this redundant? Yes! But redundancy is helpful.

I'm not going to show a contrasting example to illustrate this, but Solidity really does put typeDescriptors on everything with a type -- not just variable declarations and the types in those declarations, but every expression, too. The ast_output2 branch of Vyper does a lot of this too... but not consistently. For example, as mentioned earlier, function parameters (and outputs) seem to have empty _metadata.

Secondly, the strings used to describe the type truly do describe the complete type. Not the typeStrings, mind you; in this case those are good enough, but in other cases they're not. However, the typeIdentifier strings really are one-to-one with the types they describe (modulo some business about some of them ending in ptr or not but that's a small enough wart to be handleable; I expect there's a good reason for the difference internally but we ignore it). Their format is a bit hairy, but they're made to be machine-readable, and (as long as you don't have to deal with function types, those ones are nasty) you barely even have to write a proper parser for them.

By contrast, in the Vyper example, the only real type string we get is ArrayDefinition. This tells us that the variable is an array... but not how long, nor of what. The type information is incomplete.

Now in this case, where the variable is being declared, we don't necessarily need a whole type string, because we have the AST for the declaration (although that's not great either because it's made up of Subscript and Index nodes, which are not really appropriate here; contrast with the Solidity example which is made up of ArrayTypeName nodes). However, when the variable is being used, or when we're looking at a more general expression rather than a variable, we won't have that. (In Solidity, a use of a variable would have a referencedDeclaration key with the node ID of where it was declared; so you could look up the declaration node and extract the type info from that. However obviously such a mechanism can't be applied to more general expressions.)

This brings me to the one point where I think the Solidity example has a noticeable flaw -- having these string type identifiers is nice, yes, but they're still strings that need to be parsed. So what would be really nice would be if the type information was simply in the form of a tree itself so we didn't need to do any parsing! :) Although, having it in string form is also really convenient, for other reasons... we exploit the string form as part of how we handle mapping keys... so why not both? :) Sure, technically it's redundant -- we could stringify the tree in our own way, but that risks getting things wrong, or including irrelevant keys that make is so that strings aren't truly one-to-one with types. :) Maybe just have both, redundancy is good. :)

By the way, let me point out one more thing the Solidity example does. Notice that the VariableDeclaration node has a scope property, which gives the node ID for that portion of the AST for which the variable is in-scope. This is really helpful information to have! It is something I'd like to see included as part of the more general allowing-nodes-to-refer-to-other-nodes concept. (Which, again, likely requires node IDs to be unique per-compilation rather than per-source.)

This is getting pretty long, so I'm just going to do one more example.

Case 2: Declaring public and constant state variables

p: public(uint256)
c: constant(uint256) = 3

vs

contract C {
  uint266 public p;
  uint256 constant c = 3;
}

Here's the relevant part of the Vyper AST output; I'm switching back to the released 0.3.0 for this, since ast_output2 transforms the AST too much and in particular skips constants. (Again, to keep things shorter, I'm going to remove keys that seem irrelevant.)

    "body": [
      {
        "annotation": {
          "args": [
            {
              "ast_type": "Name",
              "id": "uint256",
              "node_id": 7,
              "src": "10:7:0"
            }
          ],
          "ast_type": "Call",
          "func": {
            "ast_type": "Name",
            "id": "public",
            "node_id": 5,
            "src": "3:6:0"
          },
          "keyword": null,
          "keywords": [],
          "node_id": 4,
          "src": "3:15:0"
        },
        "ast_type": "AnnAssign",
        "node_id": 1,
        "simple": 1,
        "src": "0:18:0",
        "target": {
          "ast_type": "Name",
          "id": "p",
          "node_id": 2,
          "src": "0:1:0"
        },
        "value": null
      },
      {
        "annotation": {
          "args": [
            {
              "ast_type": "Name",
              "id": "uint256",
              "node_id": 15,
              "src": "31:7:0"
            }
          ],
          "ast_type": "Call",
          "func": {
            "ast_type": "Name",
            "id": "constant",
            "node_id": 13,
            "src": "22:8:0"
          },
          "keyword": null,
          "keywords": [],
          "node_id": 12,
          "src": "22:17:0"
        },
        "ast_type": "AnnAssign",
        "node_id": 9,
        "simple": 1,
        "src": "19:24:0",
        "target": {
          "ast_type": "Name",
          "id": "c",
          "node_id": 10,
          "src": "19:1:0"
        },
        "value": {
          "ast_type": "Int",
          "node_id": 17,
          "src": "42:1:0",
          "value": 3
        }
      }
    ],

Here's the relevant part of the Solidity AST output: (Again, to keep things shorter, I'm going to remove keys that seem irrelevant.)

        "nodes": [
          {
            "constant": false,
            "functionSelector": "9ae8886a",
            "id": 2,
            "mutability": "mutable",
            "name": "p",
            "nodeType": "VariableDeclaration",
            "scope": 6,
            "src": "15:16:0",
            "typeDescriptions": {
              "typeIdentifier": "t_uint256",
              "typeString": "uint256"
            },
            "typeName": {
              "id": 1,
              "name": "uint256",
              "nodeType": "ElementaryTypeName",
              "src": "15:7:0",
              "typeDescriptions": {
                "typeIdentifier": "t_uint256",
                "typeString": "uint256"
              }
            },
            "visibility": "public"
          },
          {
            "constant": true,
            "id": 5,
            "mutability": "constant",
            "name": "c",
            "nodeType": "VariableDeclaration",
            "scope": 6,
            "src": "35:22:0",
            "typeDescriptions": {
              "typeIdentifier": "t_uint256",
              "typeString": "uint256"
            },
            "typeName": {
              "id": 3,
              "name": "uint256",
              "nodeType": "ElementaryTypeName",
              "src": "35:7:0",
              "typeDescriptions": {
                "typeIdentifier": "t_uint256",
                "typeString": "uint256"
              }
            },
            "value": {
              "hexValue": "33",
              "id": 4,
              "kind": "number",
              "nodeType": "Literal",
              "src": "56:1:0",
              "typeDescriptions": {
                "typeIdentifier": "t_rational_3_by_1",
                "typeString": "int_const 3"
              },
              "value": "3"
            },
            "visibility": "internal"
          }
        ],

So, what I want to point out here is how consistent the Solidity AST is, and how semantically meaningful the nodes are. (I mean, in this case; it's not always so nice as this, but it's generally pretty decent about this.) For each VariableDeclaration node, the AST node of the type is always under typeName. The public and constant modifiers are reflected in fields in the VariableDeclaration nodes; public causes visibility to be set to "public" rather than "internal", and constant causes mutability to be set to "constant" rather than "mutable". (There's also the old constant flag that's still present, but, uh, I want to stick to the current main way it outputs information.)

By contrast, if we look at the Vyper example, we see that for an AnnAssign, the annotation property does not always contain the AST for the variable's type. In our first example it did, but here it instead contains a Call AST node (another meaningless node type, that reflects only the syntax!), and that tells us that the variable is public or constant -- there's no fields for this on the variable declarations itself -- and then the AST node for the actual type is under the args[0] of that call.

So, I'm hoping those examples make concrete enough what I'm talking about? Again, I can't design a format for you, because that would require both making arbitrary choices that it's not my place to make, and probably also a good knowledge of Vyper internals that I certainly don't have. But I hope these examples demonstrate concretely where the Vyper AST falls down, how Solidity handles these situations better, and where I think the Solidity AST could itself be improved.

I think these are the highlights in terms of features, although it's possible there's something I've still forgotten. I realize not all of this can be adopted into the current AST format. But I'm hoping they can all make it into a new one that is redesigned around better principles -- one that works to be more semantically meaningful rather than matching the syntax one-for-one. I hope this is helpful.

charles-cooper commented 3 years ago

Great, we'll look these over. Thanks!

haltman-at commented 3 years ago

OK, I'm glad that's finally concrete enough! I'll probably come back later and add an example with structs too because I think I could probably also stand to illustrate better what I've said about those.

haltman-at commented 3 years ago

OK, so, the structs example I was talking about.

We'll look at the following Vyper code:

struct S:
    x: uint256
s: S

and the following Solidity code:

contract C {
  struct S {
    uint256 x;
  }
  S s;
}

So, again, let's look at the relevant parts of the ASTs.

First Vyper, with irrelevant keys removed as usual. I switched back to the ast_output2 branch for this to get additional type information.

    "body": [
      {
        "_metadata": {},
        "ast_type": "StructDef",
        "body": [
          {
            "_metadata": {},
            "annotation": {
              "_metadata": {},
              "ast_type": "Name",
              "id": "uint256",
              "node_id": 5,
              "src": "17:7:0"
            },
            "ast_type": "AnnAssign",
            "node_id": 2,
            "simple": 1,
            "src": "14:10:0",
            "target": {
              "_metadata": {},
              "ast_type": "Name",
              "id": "x",
              "node_id": 3,
              "src": "14:1:0"
            },
            "value": null
          }
        ],
        "name": "S",
        "node_id": 1,
        "src": "0:24:0"
      },
      {
        "_metadata": {
          "type": "StructDefinition"
        },
        "annotation": {
          "_metadata": {},
          "ast_type": "Name",
          "id": "S",
          "node_id": 10,
          "src": "28:1:0"
        },
        "ast_type": "AnnAssign",
        "node_id": 7,
        "simple": 1,
        "src": "25:4:0",
        "target": {
          "_metadata": {
            "type": "StructDefinition"
          },
          "ast_type": "Name",
          "id": "s",
          "node_id": 8,
          "src": "25:1:0"
        },
        "value": null
      }
    ],

Versus the Solidity, again with the usual caveats:

        "nodes": [
          {
            "canonicalName": "C.S",
            "id": 3,
            "members": [
              {
                "constant": false,
                "id": 2,
                "mutability": "mutable",
                "name": "x",
                "nodeType": "VariableDeclaration",
                "scope": 3,
                "src": "30:9:0",
                "typeDescriptions": {
                  "typeIdentifier": "t_uint256",
                  "typeString": "uint256"
                },
                "typeName": {
                  "id": 1,
                  "name": "uint256",
                  "nodeType": "ElementaryTypeName",
                  "src": "30:7:0",
                  "typeDescriptions": {
                    "typeIdentifier": "t_uint256",
                    "typeString": "uint256"
                  }
                },
                "visibility": "internal"
              }
            ],
            "name": "S",
            "nodeType": "StructDefinition",
            "scope": 7,
            "src": "15:29:0",
            "visibility": "public"
          },
          {
            "constant": false,
            "id": 6,
            "mutability": "mutable",
            "name": "s",
            "nodeType": "VariableDeclaration",
            "scope": 7,
            "src": "47:3:0",
            "typeDescriptions": {
              "typeIdentifier": "t_struct$_S_$3_storage",
              "typeString": "struct C.S"
            },
            "typeName": {
              "id": 5,
              "nodeType": "UserDefinedTypeName",
              "pathNode": {
                "id": 4,
                "name": "S",
                "nodeType": "IdentifierPath",
                "referencedDeclaration": 3,
                "src": "47:1:0"
              },
              "referencedDeclaration": 3,
              "src": "47:1:0",
              "typeDescriptions": {
                "typeIdentifier": "t_struct$_S_$3_storage_ptr",
                "typeString": "struct C.S"
              }
            },
            "visibility": "internal"
          }
        ],

Now, some of the features here I've already commented on -- the presence of type information on everything; the constant use of referencedDeclaration fields to refer back to other nodes where things were declared; etc. (Although actually, I believe I previously commented on that, but hand't illustrated it. So here's an actual example -- when the struct S is referred to, there's a reference back to where it was declared!)

But what I want to focus on here is the typeIdentifier for the variable s, of type S. In my earlier comment, I used as an example the type identifier for the type uint256[2][2], which was "t_array$_t_array$_t_uint256_$2_storage_$2_storage". I didn't go on about it in detail, but you can see its recursive structure -- how it indicates that it's an array, of length 2, in storage; and that what it's an array of is an array, of length 2, in storage; and what that is an array of is uint256.

(Arguably, the way that Solidity includes locations (memory vs storage vs calldata) in type identifiers for reference types is also something of a wart; we have to do some massaging of that in certain cases. But then again, in other cases it's absolutely necessary, so I guess it's actually a good thing on the whole. But then again, if they had done all this as a tree instead of a string, then they could have left it out of the string, perhaps...?)

But what about for the variable s? Its type identifier is "t_struct$_S_$3_storage". Notice that number 3 in there... what is that?

It's the struct's ID! Type identifiers are one-to-one with types (again, modulo the location and pointer warts I've mentioned), but there could be multiple structs named S; there could even be multiple structs named C.S because there could be multiple contracts named C. So, each user-defined type (structs, enums, contracts, whatever) are all given an ID, which is unambiguous, and goes in the type identifier. So it's possible to disambiguate between different structs (or enums or whatever) with the same name.

The way it gives them an ID is, you may notice, to simply use the ID of the delcaring node. :) But the important part is that it gets some ID, that is unique and unambiguous, and that lets us find the declaring node somehow.

OK, I think that should be enough examples now, unless you really think more are necessary!

sambacha commented 3 years ago

just input code here https://sambacha.github.io/solidity-parser-explorer/ and click on portions of the generated AST it will illustrate what @haltman-at is describing.

so like xPath for source code AST is what you want right @haltman-at ?

haltman-at commented 3 years ago

@sambacha, I don't understand your comment. What I'm asking for has nothing to do with syntax used to access the AST, it has to do with the format of the actual AST itself.

I don't think that I would recommend that Solidity parser explorer that you link for illustrating what I'm talking about; it seems to lack a lot of the information that's actually in the full Solidity AST. (It also oddly seems to include other things that Solidity doesn't put in its ASTs, such as comments...)

charles-cooper commented 3 years ago

@haltman-at would it help if we exported something like a "memory layout" which maps variables to their locations in memory?

Also, I think it might be more efficient to chat about this offline, what is the best way to do that? Are you able to join our discord?

fubuloubu commented 3 years ago

@haltman-at would it help if we exported something like a "memory layout" which maps variables to their locations in memory?

Keep in mind that unlike Solidity, all of our user-facing local variables are in memory and not on the stack

sambacha commented 3 years ago

@haltman-at would it help if we exported something like a "memory layout" which maps variables to their locations in memory?

Keep in mind that unlike Solidity, all of our user-facing local variables are in memory and not on the stack

Yes it would

haltman-at commented 3 years ago

@charles-cooper Well, that depends on what you mean by "would it help".

If you mean, "would the extra information potentially be useful", sure! More information is always useful.

If you mean, "would it resolve the essential problems", no. I'm pretty sure we can figure out where in memory variables are stored without needing it put explicitly in the AST. Obviously it's different from in Solidity, what with them going in memory rather than the stack, but I've looked into this and I think only a slight tweak on our Solidity-handling mechanism is necessary. The key remaining problem is the type information, not the location information. If I thought we truly needed memory location information in the AST, I would have made a point of mentioning it. :)

That said, as mentioned, more information is always useful, and would certainly be appreciated! Every piece of information we can get from the compiler, and which will presumably be maintained in future compiler versions, is one less piece of information we have to figure out ourselves and potentially maintain if they change in future compiler versions.

(Although some variables, from what I've seen, live in calldata, right, rather than memory? And then of course there's state variables in storage, and (sort of) constants...)

haltman-at commented 3 years ago

Also, I think it might be more efficient to chat about this offline, what is the best way to do that? Are you able to join our discord?

Well, you can always email me at harry@trufflesuite.com, but I could potentially also join your Discord if you give me the information?

fubuloubu commented 3 years ago

Also, I think it might be more efficient to chat about this offline, what is the best way to do that? Are you able to join our discord?

Well, you can always email me at harry@trufflesuite.com, but I could potentially also join your Discord if you give me the information?

Please join the Ethereum Python Devs discord (which has a #vyper channel) via this link: https://discord.gg/VXjZUnSj

charles-cooper commented 8 months ago

i think this issue will be resolved by https://github.com/vyperlang/vyper/pull/3829