tree-sitter / tree-sitter-ruby

Ruby grammar for tree-sitter
MIT License

Proposal: Adopt YARP #240

Closed by chtzvt 1 year ago

chtzvt commented 1 year ago

Background / Motivation

Some exciting developments have occurred in the Ruby language infrastructure & tooling space. In particular, folks at Shopify (especially @kddnewton) have introduced an all-new portable, error-tolerant parser called YARP:

"YARP is a parser for the Ruby programming language. It is designed to be portable, error tolerant, and maintainable. It is written in C99 and has no dependencies. It is currently being integrated into CRuby, JRuby, TruffleRuby, Sorbet, and Syntax Tree."

The YARP project is intended to unify Ruby tooling around a standard portable, compatible parser. The YARP and CRuby teams have collaborated closely on the parser's design, and the CRuby team has given their approval to merge YARP as a replacement for MRI's existing parser implementation:

"Once Matz and the CRuby team were happy with the design, agreed on the approach, and determined that they would merge YARP in when it was ready, the work began in earnest."

Benefits of YARP

Quotes below are taken from the YARP announcement.

Compatibility

As of the date of this post, YARP can parse a semantically equivalent syntax tree to Ruby 3.3 on every Ruby file in Shopify’s main codebase, GitHub’s main codebase, CRuby, and the 100 most popular gems downloaded from rubygems.org. We recently got approval to merge this work into CRuby, and are very excited to share our work with the community.

Maintainability

Much of YARP's code is generated from a single YAML configuration file describing Ruby tokens and nodes.
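To make the idea concrete, here is a minimal sketch of config-driven code generation. The YAML schema (`nodes`, `name`, `fields`) and the helper `generate_node_classes` are hypothetical and invented for illustration; they are not YARP's actual configuration format or generator.

```ruby
require "yaml"

# Hypothetical node description. This schema is an assumption made for
# illustration and is not YARP's real configuration file.
NODE_CONFIG = <<~CONFIG
  nodes:
    - name: IntegerNode
      fields: [value]
    - name: CallNode
      fields: [receiver, name, arguments]
CONFIG

# Turn each node description into Ruby source for a Struct-based class.
def generate_node_classes(config_text)
  config = YAML.safe_load(config_text)
  config.fetch("nodes").map do |node|
    fields = node.fetch("fields").map { |field| ":#{field}" }.join(", ")
    "#{node.fetch("name")} = Struct.new(#{fields})"
  end.join("\n")
end

puts generate_node_classes(NODE_CONFIG)
# IntegerNode = Struct.new(:value)
# CallNode = Struct.new(:receiver, :name, :arguments)
```

Adding a node type then only requires a new entry in the config, and every generated artifact (node classes, visitors, bindings for other languages) can be rebuilt from the same description.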

Portability

Currently, the [existing CRuby] parser is tightly tied to CRuby internals, requiring data structures and functions only available in the CRuby codebase. This makes it impossible to use in other tooling.

Accordingly, the community fractured and developed multiple solutions, each with their own issues. Over the years there have been many other parsers written, almost all by taking the grammar file and generating a new kind of parser. In our research, we found parsers written in 9 different languages. Some of these made their way into academic papers, others into production systems. As of writing, we know of 12 that are being actively maintained (6 runtimes, 6 tools) [...]

Each of these parsers besides the reference implementation has its own issues. This means that each of the tools built on these parsers inherits those same issues. The fracture therefore spreads into tooling. For example, some tools are based on Ripper, including Syntax Tree, rubyfmt, rufo, syntax_suggest, and ruby-lsp. Even more are based on the parser gem, including rubocop, standard, unparser, ruby-next, solargraph, and steep. Others are based on the ruby_parser gem, such as debride, flay, flog, and fasterer.

Every time new syntax is introduced into Ruby, all of the parsers have to update. This means opportunities to introduce bugs, which all get flushed down to their corresponding tools. As an example, Ruby 2.7 was released 4 years ago, and it came along with pattern matching syntax. Of the 10 non-CRuby parsers, only 5 of them support all of pattern matching to this day, and only 2 of them without any caveats.

Standardized AST

The tree redesign has ended up being one of the most important parts of the project. It has delivered something that Ruby has never had before: a standardized syntax tree. With a standard in place, the community can start to build a collective knowledge and language around how we discuss Ruby structure, and we can start to build tooling that can be used across all Ruby implementations. Going forward this can mean more cross-collaboration between tools (like Rubocop and Syntax Tree), maintainers, and contributors.

Integration / Performance

What we can share so far is that YARP is able to parse around 50,000 of Shopify’s Ruby files in about 4.49 seconds, with a peak memory footprint of 10.94 Mb.

We also worked with other tools to validate that our tree contained enough metadata for static analysis and compilation. Syntax Tree is a syntax tree tool suite that can also be used as a formatter, and it has an experimental branch running with YARP as its parser instead of Ripper. Early results show that replacing Ripper with YARP improved performance nearly twofold in some cases. We also built a VSCode plugin that you can find inside the repository to ensure that our error locations and messages were correct, and work continues on that today.

Recently, we began experimenting with generating the same syntax tree as the parser and ruby_parser gems in order to seamlessly allow consumers of these libraries to benefit from the new parser. Early results are very promising and show both a reduction in memory and an increase in speed.

Error Tolerance

YARP includes a number of error tolerance features out of the box, and we are planning on adding many more in the months/years to come.

Whenever source code is being edited, it almost always contains syntax errors until the developer gets to the end of the expression. As such, it’s common for the underlying syntax tree to be missing tokens and nodes that it would otherwise have in a valid program. The first error tolerance feature that we built, therefore, is the ability to insert missing tokens. For example, if the parser encounters a missing end keyword where one was expected, it will automatically insert the missing token and continue parsing the program.
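The missing-token idea can be sketched with a toy parser for a made-up language of `begin ... end` blocks and identifiers. This is not YARP's implementation; `BlockToken`, `BlockParser`, and `expect` are hypothetical names chosen for illustration. When `expect` does not find the required token, it records an error and returns a synthesized token flagged as missing, so the caller can keep building the tree.

```ruby
# Toy sketch of missing-token insertion (not YARP's actual code).
BlockToken = Struct.new(:type, :value, :missing)

def tokenize_blocks(source)
  source.split.map do |word|
    type = %w[begin end].include?(word) ? word.to_sym : :ident
    BlockToken.new(type, word, false)
  end
end

class BlockParser
  attr_reader :errors

  def initialize(tokens)
    @tokens = tokens
    @pos = 0
    @errors = []
  end

  # statements := (block | ident)*
  # Parsing stops at `end` or at the end of input.
  def parse_statements
    statements = []
    while (token = @tokens[@pos]) && token.type != :end
      statements << (token.type == :begin ? parse_block : advance.value)
    end
    statements
  end

  private

  # block := "begin" statements "end"
  def parse_block
    advance # consume `begin`
    body = parse_statements
    { block: body, closed_by: expect(:end) }
  end

  # Return the expected token, or synthesize one without consuming input.
  def expect(type)
    token = @tokens[@pos]
    return advance if token && token.type == type

    @errors << "expected `#{type}` at token #{@pos}"
    BlockToken.new(type, type.to_s, true)
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end
end

parser = BlockParser.new(tokenize_blocks("begin a begin b end"))
tree = parser.parse_statements
puts tree.first[:closed_by].missing # true: the outer `end` was synthesized
puts parser.errors.inspect          # ["expected `end` at token 5"]
```

The tree still contains both blocks; only the outer block's closing token is marked as missing, which lets an editor report the error without losing the rest of the structure.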

YARP can also insert missing nodes in the syntax tree. For example, if the parser encounters an expression like 1 + without a right-hand side, it will insert a missing node for the right-hand side and continue parsing the program.
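As a rough illustration of the `1 +` case, the toy parser below handles sums of integers and substitutes a placeholder node when an operand is absent. The names `SumParser`, `IntegerOperand`, `MissingOperand`, and `Addition` are hypothetical and do not correspond to YARP's node types.

```ruby
# Toy sketch of missing-node insertion (not YARP's actual code).
IntegerOperand = Struct.new(:value)
MissingOperand = Struct.new(:position)
Addition = Struct.new(:left, :right)

class SumParser
  attr_reader :errors

  def initialize(source)
    @tokens = source.scan(/\d+|\+/)
    @pos = 0
    @errors = []
  end

  # sum := operand ("+" operand)*
  def parse
    left = parse_operand
    while @tokens[@pos] == "+"
      @pos += 1
      left = Addition.new(left, parse_operand)
    end
    left
  end

  private

  # Return an integer node, or a placeholder when no operand is present.
  def parse_operand
    token = @tokens[@pos]
    if token && token.match?(/\A\d+\z/)
      @pos += 1
      IntegerOperand.new(token.to_i)
    else
      @errors << "expected an operand at token #{@pos}"
      MissingOperand.new(@pos)
    end
  end
end

parser = SumParser.new("1 +")
p parser.parse
p parser.errors # ["expected an operand at token 2"]
```

The result is still an `Addition` whose left side is the integer 1; the right side is a `MissingOperand` that downstream tools can detect and report.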

Additionally, when YARP encounters a token in a context that it simply cannot understand, it skips past that token and attempts to continue parsing. This is useful when something gets copy-pasted and there is extra surrounding content that accidentally sneaks in.
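A minimal sketch of the skip-token strategy, assuming a made-up language that consists only of `name = integer` assignments. The function `parse_assignments` and the `Assignment` struct are hypothetical; the point is that a token that cannot start a statement is recorded and stepped over, and parsing resumes at the next token.

```ruby
# Toy sketch of skipping unexpected tokens (not YARP's actual code).
Assignment = Struct.new(:name, :value)

def parse_assignments(source)
  tokens = source.scan(/[a-z_]+|\d+|=|\S/)
  assignments = []
  skipped = []
  pos = 0

  while pos < tokens.length
    name, equals, value = tokens[pos, 3]
    if name.match?(/\A[a-z_]+\z/) && equals == "=" && value&.match?(/\A\d+\z/)
      assignments << Assignment.new(name, value.to_i)
      pos += 3
    else
      # The token cannot start an assignment: record it and move on.
      skipped << tokens[pos]
      pos += 1
    end
  end

  [assignments, skipped]
end

assignments, skipped = parse_assignments("a = 1 >>> b = 2")
p assignments.map(&:to_a) # [["a", 1], ["b", 2]]
p skipped                 # [">", ">", ">"]
```

Here the stray `>>>` (for example, pasted from a diff or an email quote) is dropped, and both assignments on either side of it are still recovered.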

Finally, YARP includes a technique we’re calling context-based recovery, which allows it to recover from syntax errors by analyzing the context in which the error occurred. This is similar to a method employed by Microsoft when they wrote their own PHP parser. For example, if the parser encounters:

foo.bar(baz, qux1 + qux2 + qux3 +)

it will insert a missing node into the + call on qux3, then bubble all of the way up to parsing the arguments because it knows that the ) character closes the argument list. At this point it will continue parsing as if there were nothing wrong with the arguments.
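The general shape of this recovery can be sketched with a toy parser for calls like the one above. This is an illustration of the idea under simplifying assumptions, not YARP's algorithm: `CallParser`, its `TERMINATORS` table, and the node structs are hypothetical. The parser keeps a stack of enclosing contexts; when an operand is expected but the current token is one that an enclosing context knows how to handle (here `)` or `,` for an argument list), it inserts a missing node and returns control upward. Tokens that no context recognizes are skipped.

```ruby
# Toy sketch of context-based recovery (not YARP's actual code).
Name = Struct.new(:value)
Missing = Struct.new(:position)
Add = Struct.new(:left, :right)
Call = Struct.new(:callee, :arguments)

class CallParser
  # Tokens that each enclosing context knows how to handle.
  TERMINATORS = { arguments: [",", ")"] }.freeze

  attr_reader :errors

  def initialize(source)
    @tokens = source.scan(/[\w.]+|[+(),]|\S/)
    @pos = 0
    @contexts = []
    @errors = []
  end

  # call := name "(" (sum ("," sum)*)? ")"
  def parse_call
    callee = Name.new(advance)
    expect("(")
    @contexts.push(:arguments)
    arguments = []
    until peek.nil? || peek == ")"
      arguments << parse_sum
      advance if peek == ","
    end
    @contexts.pop
    expect(")")
    Call.new(callee, arguments)
  end

  private

  # sum := operand ("+" operand)*
  def parse_sum
    left = parse_operand
    while peek == "+"
      advance
      left = Add.new(left, parse_operand)
    end
    left
  end

  def parse_operand
    loop do
      token = peek
      if token && token.match?(/\A[\w.]+\z/)
        return Name.new(advance)
      elsif token.nil? || closes_enclosing_context?(token)
        # An outer context can handle this token: insert a placeholder
        # and let the caller bubble back up to that context.
        @errors << "missing operand at token #{@pos}"
        return Missing.new(@pos)
      else
        @errors << "unexpected #{token.inspect} at token #{@pos}, skipping"
        advance
      end
    end
  end

  def closes_enclosing_context?(token)
    @contexts.any? { |context| TERMINATORS.fetch(context).include?(token) }
  end

  def expect(token)
    return advance if peek == token

    @errors << "expected #{token.inspect} at token #{@pos}"
    nil
  end

  def peek
    @tokens[@pos]
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end
end

parser = CallParser.new("foo.bar(baz, qux1 + qux2 + qux3 +)")
call = parser.parse_call
p call.arguments.length # 2
p call.arguments.last.right # the Missing placeholder after the final +
p parser.errors # ["missing operand at token 10"]
```

Both arguments survive, the trailing `+` gets a `Missing` right-hand side, and the closing `)` is consumed by the argument list as usual, so only one error is reported.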

Rust Bindings

At the moment, a draft PR is in progress to add Rust bindings to YARP.

aryx commented 1 year ago

I don't think this aligns well with tree-sitter. A few tools (e.g., neovim, emacs, semgrep) rely on a precise tree-sitter API and concrete syntax trees to work, and the beauty of tree-sitter is that it works consistently for many languages, not just Ruby, so I don't think it makes sense to adopt YARP. Just my 2c.