vyperlang / vyper

Pythonic Smart Contract Language for the EVM
https://vyperlang.org
Other
4.85k stars 791 forks source link

The road to a custom parser #1363

Open jacqueswww opened 5 years ago

jacqueswww commented 5 years ago

The road to a custom parser

As we have discussed this at lengths as parts of weekly calls, this is the proposed road map to swap out the current python standard library. Also lots of discussion on github (see https://github.com/ethereum/vyper/issues/563).

jacqueswww commented 5 years ago

I plan on starting with Step1 some time next week ;) let me know if I should add anythin.

pipermerriam commented 5 years ago

Some naming suggestions to closer conform to what I think are standard-ish names (feel free to ignore any of these an use your best judgement).

And maybe missing

Forgive any lack of context on my part for these suggestions. Very much like the direction of this.

jacqueswww commented 5 years ago

@pipermerriam definitely keen to renaming the classes to be more sensible (and less abbreviated) ! Will add link to this on Step 1 (or maybe will make it part of Step 2, to make the PR smaller - will see).

Side Note: Else is quite interesting, it's all part of If currently using ast.If.orelse .

davesque commented 5 years ago

@pipermerriam Yeah, I would expect else to be a property on an If ast object; rather like body is a list of statements on a function definition object.

fubuloubu commented 5 years ago

I think it is noted, but exposing an API to get the AST for a given source program would be quite handy for Step 1.

For Step 3, my wishlist might include being able to "extend" the Vyper module by inserting tools "in between" stages of the compiler, such as advanced static analysis on the AST that may be able to reduce the AST nodes through some custom rewrite rules, then inserting the optimized AST back into the next stage of the compiler (code translation to IR). Similar idea for the IR as well (for example, symbolic analysis-based optimizations that exhaust all possible paths and remove assertion checks as impossible)

davesque commented 5 years ago

@pipermerriam Also, not sure if you noticed but I think @jacqueswww chose those names largely based on the naming scheme in the python standard ast library.

pipermerriam commented 5 years ago

:+1: on ignoring my name suggestions, or only using the ones you like. I don't have strong feelings on any of them.

davesque commented 5 years ago

I've begun some experimental work on a parser using lark. I've got it hosted here for the time being. It might also be wise to create a CST parser using TatSu since that seems like a very good potential alternative to lark. However, I'm using lark for now since it was easier to get up and going with it.

I've currently got a working parser that parses code files into a CST. I've begun adding some custom AST classes so that I have something to convert the CST into. After that's done, some routines will need to be written to convert the CST into an AST represented with AST class instances. This file in the cpython implementation should act as a good guide for that: https://github.com/python/cpython/blob/master/Python/ast.c .

davesque commented 5 years ago

Also, a note about my repo that I linked to. It's really just meant to act as a place to store code that I'm hacking on. So there are no tests and no real structure to the repo. So don't freak out :).

davesque commented 5 years ago

Custom Vyper AST classes in the vyper_parser repo are done (see here). Currently, they attempt to mirror the naming scheme and structure of the ASDL-derived Python AST classes as precisely as possible.

As a quick aside, Python's ASDL definitions are found here. CPython uses a specific grammar to define its collection of AST classes. The C implementation parses the ASDL grammar and uses it to create all the AST classes using the C API.

Next, I'll work on defining some routines to convert the lark parse tree into an AST composed of our custom AST classes. Currently, I'm targeting the entire Python AST. I think this will be useful in verifying the correctness of the implementation. Our test routines can just compare the results of Python's native ast.parse function to our custom parser for all python files in the standard library. After this is done, we can cut back the definition to just what we need.

davesque commented 5 years ago

I went ahead and migrated my code into a proper project (found here). I don't necessarily intend to have this work exist in a separate package, but I wanted to take advantage of our project template test rigging. I added a test that automatically generates fixtures from python files in the standard library.

davesque commented 5 years ago

Here's a fun development from today. For the string literal parsing portion of our custom parser, I'm going to just defer to python's parsing facilities. Parsing string literals is pretty complicated and I think we can safely just use python's parser and convert the result to a tree of our custom AST classes. In order to do this, I needed to write some code to convert python AST trees into trees of our Vyper-specific AST classes. Here it is: https://github.com/davesque/vyper-parser/blob/master/vyper_parser/ast.py#L59

It passes tests for all python files in my standard lib directory: https://github.com/davesque/vyper-parser/blob/master/tests/test_ast.py#L21

This is something we could potentially begin using today if we're interested in converting Vyper to using custom AST classes. Although I'd probably recommend holding off a little longer in case I want to change some things before committing to the API that I've come up with.

davesque commented 5 years ago

The VyperAST.from_python_ast method I just mentioned has one potential issue at this point. It automatically infers the proper VyperAST subclass to use for an instance of python's standard ast.AST subclasses. It does this by automatically collecting all subclasses of VyperAST. This means that someone could potentially create one of their own subclasses of VyperAST and get weird behavior when it happens to shadow an existing class name. This can be fixed by just explicitly creating a mapping from python AST classes to vyper AST classes. Just wanted to make a note of that here.

jacqueswww commented 5 years ago

@davesque very cool! It's coming together nicely, I like the layout of the VyperASTs - I suggest we adopt those, but we start with empty the _fields on each classes init. Then we only handle the fields we need :)

rocky commented 5 years ago

Added a comment to https://gist.github.com/jacqueswww/aa24d22d95a578429e0f25f5f2de0b36 which I won't duplicate here. This gist is to follow solc's AST where it makes sense (e.g. in how to specify a location in a general, flexible and compact way), that maybe some thought should be given to how this is specified when writing to JSON so as to be compatible again with solc's JSON AST.

jacqueswww commented 5 years ago

@rocky this id field is unique per node? Or can nodes share ID's, if not I am not sure how would one would make the deterministic - unless there is a global auto increment counter on each file compiled (global context in our code base).

davesque commented 5 years ago

@jacqueswww I think all of that is clarified in the link that @rocky posted in his gist comment: https://solidity.readthedocs.io/en/develop/miscellaneous.html#source-mappings

fubuloubu commented 5 years ago

Really cool extra for hypothesis that can generate strategies for a lark grammar: https://hypothesis.readthedocs.io/en/latest/extras.html#hypothesis-lark

fubuloubu commented 4 years ago

Thanks to @iamdefinitelyahuman for helping us achieve Step 2 here!

jacqueswww commented 4 years ago

Thanks to @iamdefinitelyahuman for helping us achieve Step 2 here!

excited

0xalpharush commented 1 year ago

I think this is a worthwhile investment. Many of the issues I've opened recently have to do with trying to use the Python AST while supporting syntax in Vyper that is semantically different than the node produced by the Python parser. I think it would simplify the compiler by making AST nodes single purpose, removing the need to interoperate with changes to the Python AST. It would also mitigate issues like https://github.com/vyperlang/vyper/issues/3475 since a custom parser would require types and not backfill the AST of a dynamically type language with type annotations. I would imagine there's also the added benefit of no longer having to worry about anything related to Python when discussing features and syntax.

fubuloubu commented 1 year ago

Definitely something we want to do, we have a Lark grammar that's been tested and a part of the core now for over a year. We haven't had issues with it for a while.

The biggest problem with custom grammar is that we would get a ton of parsing issues grieving the users, the python ast solves that by being extremely well used and debugged. Hopefully could work on switching over to it as a part of a major revision

What's still missing is that even though the parser parses reliably, we have not yet done the work of having it produces Vyper AST nodes or checked for equivalency with the current parser

That would be next steps before we can swap it in (Taking any contributors)

charles-cooper commented 1 year ago

I would imagine there's also the added benefit of no longer having to worry about anything related to Python when discussing features and syntax.

This is a double edged sword, really. Actually the main reason I haven't gotten rid of the existing parser flow so far is not technical, it's that in some sense it "keeps us honest" -- we can't really break away from python syntax too easily.