kamadorueda / nixel

Parser for the Nix Expressions Language
Other
84 stars 3 forks source link

Comparison with rnix-parser #1

Closed thufschmitt closed 2 years ago

thufschmitt commented 2 years ago

There’s already a Nix parser written in Rust at https://github.com/nix-community/rnix-parser (but from what I could see based on a very different parsing method and producing a somewhat different result), how do the two of them compare? Do you have any benchmark or qualitative comparison?

kamadorueda commented 2 years ago

rnix-parser is faster (at least for now), but much less correct, so that's the trade-off

rnix-parser is based on a less generic algorithm: recursive descent, while Nixel is based on a modified Earley. This is an important difference with big implications becase Nix requires GLR, which is in the same category as Earley (they can parse ambiguous languages like Nix), which are 1 level above the category of deterministic parsers like recursive descent, LALR, LR, PEG, etc. If you use an algorithm unable to handle ambiguous grammars you'll always encounter code that is incorrectly parsed, as you can find in rnix-parser issues. On the other hand, NixEL performs the same lexical analysis as the original Nix, and uses similar parse tables, and that allows to handle correctly (as in exactly equal to Nix) logic like unescaping strings, dedenting multiline strings, the funny stuff around the or variable name, path interpolations and disambiguation and precedence, which rnix-parser does not do as of today, not to mention the complex logic behind the lexer states, which is very hard (if not very tedious) to get correctly on a hand-written parser like rnix-parser. On NixEl everything is declarative, much easier to maintain

So in short, correctness, that's the comparison, and probably types, since rnix-parser only has nodes and tokens, and here we have a richly typed data-structure, much easier to work with

oberblastmeister commented 2 years ago

One important difference is that rnix-parser is error resistant, you can give the parser gibberish and it will always return a tree. This is very important for things like IDEs and formatters. rnix-parser will soon have a typed api https://github.com/nix-community/rnix-parser/pull/105.

If you use an algorithm unable to handle ambiguous grammars you'll always encounter code that is incorrectly parsed, as you can find in rnix-parser issues.

Those issues are just normal issues with the parser, not because the algorithm is unable to handle an ambiguous grammar. Of course using a parser generator makes the parser easier to maintain, but in return we get more control. I would be interested to see an example that cannot be parsed due to being ambiguous. As for speed, recursive descent is $O(n)$ while GLR and Earley are both $O(n^3)$. GLR has the advantage of being $O(n)$ for most inputs.

kamadorueda commented 2 years ago

Since version 5.0 NixEL is the closest parser to the original Nix implementation, essentially because now NixEL is a copy-paste of the original lexer and parser definitions, and uses the same lexer and parser generator (Flex and GNU Bison).

Since version 5.0 NixEL also uses GLR instead of Earley, it parses the entirety of Nixpkgs in under 25 seconds (single-threaded) while keeping memory safety, a typed API, offering single-line comments, multi-line comments, whitespace, string unescaping, multi-line string indentation handling, starting and end positions of everything, and excellent documentation.

All of that is without the bugs inherent to using a hand-crafted backend as the one rnix-parser uses

Those issues are just normal issues with the parser, not because the algorithm is unable to handle an ambiguous grammar. Of course using a parser generator makes the parser easier to maintain, but in return we get more control.

At NixEL I don't consider parsing issues "normal" because the whole point of a parser for Nix is parsing Nix correctly, ideally exactly equal to the reference implementation. I don't see the point in having "more control" if it cannot be used for that purpose.

That being said, the feature of parsing partially valid Nix files is perfectly possible with the current architecture of Nixel, and first-class citizen of GNU Bison: https://github.com/kamadorueda/nixel/issues/3. So I foresee a future where NixEL supports that as well.

I think some interesting future work would be to upstream back the knowledge I got from working on NixEL to Nix, since in the original parser implementation of Nix they mention some memory leaks, which I was able to find, confirm and fix while working on NixEL. This has the potential of being very beneficial, especially when evaluating big projects like Nixpkgs, which requires a considerable amount of memory right now, so let me know if that's something the Nix Core Developers would be interested in merging.

Lastly, I'm closing this issue. I'm not interested in creating a discussion between library authors here, but I understand that putting excellent software out there can create some tension between the alternatives, as happened when I published Alejandra. Let's not take it personally. We all are members of the same community, and the community benefits from the software we publish. That's how FOSS works.