Canonical Rust grammar distinct from parser (tracking issue for RFC #1331)

nikomatsakis commented 8 years ago

This is the tracking issue for rust-lang/rfcs#1331, which specifies a procedure for creating a canonical grammar apart from the compiler. This is a multi-phase process. I think the first step, honestly, is just to lay out a firm plan of how to proceed -- what kind of automatic testing to use and so forth. The RFC provides a general plan but it needs to be made more concrete. Once this issue has an owner, I (or they) can update this summary and try to keep it up-to-date.

steveklabnik commented 8 years ago

@nagisa , as it was your RFC, do you have opinions on the strategy we take here? Or should I be thinking about this.

EDIT: @nagisa mentioned on IRC that they are too busy, so I'll come up with aplan.

SimonSapin commented 8 years ago

I’m curious what kind of testing can check that a grammar matches the not-grammar-based parser.

steveklabnik commented 8 years ago

Well, it can show the presence of bugs, but not the absence of them.

https://github.com/rust-lang/rust/blob/master/mk/grammar.mk is what we used to do. Just vestigial at this point :)

matklad commented 8 years ago

I’m curious what kind of testing can check that a grammar matches the not-grammar-based parser.

In intellij-rust, we have regression tests for the parser, which consist of a Rust file and a serialized AST. Then we check that the parser produces the expected AST. We also have some negative "you shall not parse" tests. The same technique can be applied to the reference grammar, if we ensure that the serialized AST is sufficiently abstract and doesn't leak grammar "implementation details".

The question here is what tool should be used for the reference grammar? I think it has to goals:

Create two rust parsers, which can be verified against each other.
Make it easier to learn and reason about Rust grammar.

One option is to continue to use Bison. In my opinion, it doesn't fulfill the second goal: grammar is difficult to read because of large amount of duplication and low level details. Regular expression extensions would be really nice to have in the canonical grammar.

Another extreme option is to build a custom parser generator (LALRPOP ?) :) It would have a nice side effect of making Rust more suitable for efficiently implementing programming languages: a realm currently occupied by C/C++.

nikomatsakis commented 8 years ago

I have opinions but not much time. I've been interesting in porting the Rust grammar to LALRPOP -- I think it'd be much cleaner, as LALRPOP has a number of features that should allow us to avoid some of the abuse of precedence declarations and the like. (I found the existing grammar not very useful in evaluating syntactic changes because of those precedence declarations, for example.)

(As an aside, this would probably be a good stress test for LALRPOP, I assume we'd have to fix various bugs or add new features to scale it up that large.)

nikomatsakis commented 8 years ago

So this morning I started porting the yacc grammar in the repository to LALRPOP, just to see what would happen. I made some progress. But I was wondering: is that grammar the most up to date version, or is the one in the repo more up to date? Does anyone know what the differences are between them?

fhahn commented 8 years ago

I was thinking about working on that as well, but I unfortunately don't have too much time in the next month to work on this. What do you think about creating a repo to collaborate? (I could take care of that as well) On Jan 28, 2016 18:51, "Niko Matsakis" notifications@github.com wrote:

So this morning I started porting the yacc grammar in the repository to LALRPOP, just to see what would happen. I made some progress. But I was wondering: is that grammar the most up to date version, or is the one in the repo more up to date? Does anyone know what the differences are between them?

— Reply to this email directly or view it on GitHub https://github.com/rust-lang/rust/issues/30942#issuecomment-176303713.

nagisa commented 8 years ago

The (parser-lalr) grammar originally comes from https://github.com/bleibig/rust-grammar, which hasn’t been updated in a while.

nikomatsakis commented 8 years ago

@fhahn

I was thinking about working on that as well, but I unfortunately don't have too much time in the next month to work on this. What do you think about creating a repo to collaborate? (I could take care of that as well)

I will do this. I am right now just working through some LALRPOP issues that arose (I'm not sure if it's a bug or if LALRPOP just needs optimization; I suspect both :) but once I get that sorted out a bit I will open up a repo and post the link here. Hopefully tomorrow. It seems @TyOverby may be interested as well, and no doubt others. For example, I was talking to @jorendorff, who developed https://github.com/jorendorff/rust-grammar/, as well.

nikomatsakis commented 8 years ago

OK, I put up a definite work in progress here: https://github.com/nikomatsakis/rustypop

I also described some conventions that I am aiming towards, but have not yet achieved.

This is hard work to parallelize in the early stages, but if you're interested in hacking on it, let me know over IRC or what have you (or just open some PRs). I'll probably alternate between working on it and improving LALRPOP (for example, I am now highly motivated to get a better printout on shift-reduce errors). =)

glaebhoerl commented 8 years ago

If we can generate a parser from the official grammar, and if we can run rustc in parse-only mode, and if we can make the two of them produce output and errors in identical format, then I wonder if we could use AFL to automatically generate test cases for them. Of course even this could not prove that they match, but it might give a higher level of assurance than a relatively smaller number of ad hoc human-devised test cases.

eternaleye commented 8 years ago

@glaebhoerl: Well, there's also that the entire formalism the official grammar is based on on is called "generative grammars" - using them to create working parsers came after using them to create exemplars, which is essentially a matter of transforming the grammar into a tree and executing a depth-first search.

For example:

start := possibility start?
possibility := light | neutral | dark
light := "Solitari" | "Gandalf"
neutral := "Lunitari" | "Switzerland"
dark := "Nuitari" | "Chocolate"

We form the following tree:

start
- possibility
- light
  - "Solitari"
  - "Gandalf"
- ...
- possibility start
- ...

With that, we then walk the tree, generating each possible string covered by the grammar.

One can do this efficiently, without actually generating the tree, by assigning each optional subrule a bit, treating the collection of those bits as a number, and counting.

Enabling a subrule will sometimes reveal another optional; in that case, push the current "number" onto a queue, and when you've finished counting the "basic optionals", pop items off the queue and count the newly-revealed optionals again, adding them to the queue if they reveal more.

That ensures one will generate every rule, in their shortest exemplars, before continuing on to the next shortest, etc.

nikomatsakis commented 8 years ago

OK, so there's probably something horribly wrong with it, but the rustypop crate now builds without any shift/reduce conflicts. I haven't actually tried RUNNING the code it produces, of course, and I have all empty actions, so it will only yield a "true/false" result. Plus I need to adapt the rustc tokenizer. But it seems like progress. :)

Enough progress that it may be possible to start parallelizing the work (debugging shift/reduce conflicts is kind of a serial task...).

Of course, I also expect that as soon as we try using this grammar we'll find that I did some bone-headed things that resolved all conflicts by just not parsing anything at all, or something like that.

Anyway, I plan to write up a blog post about the approach I took, since it made use of a number of LALRPOP features to try and pare down duplication. I also took the liberty of ripping out various bits of obsolete syntax from the existing rust-parser.y, and made a few arbitrary judgement calls about dubious bits of our grammar. ;)

nikomatsakis commented 8 years ago

Oh, I should mention that it generates a 500MB .rs file. Definitely need to do some work reducing the size of LALRPOP's output (@fhahn has actually been pursuing the most immediately obvious strategy, and I've been meaning to implement various other cool optimizations that I've read about...)

nikomatsakis commented 8 years ago

Hmm, now I see some more conflicts. So maybe premature. But still, getting close I think. :)

matklad commented 8 years ago

Another tool which can be used for the canonical grammar is antlr4. I have not used it myself, but I think it should be mentioned in this thread.

nagisa commented 8 years ago

I believe we used to use antlr4 before the yacc-based grammar got merged.

matklad commented 8 years ago

Looks like only the lexer was implemented in antlr4: https://github.com/rust-lang/rust/tree/29bd9a06efd2f8c8a7b1102e2203cc0e6ae2dcba/src/grammar

jorendorff commented 8 years ago

I used antlr4 to make https://github.com/jorendorff/rust-grammar and it has a few drawbacks:

antlr4 is too lax about ambiguity: if two productions match a program, antlr simply selects whichever production appeared first in the grammar. So my grammar is almost certainly full of ambiguities I don't know about.
antlr4 doesn't support lookahead except by a gruesome hack
There are places where we really want a production to take boolean arguments, instead of having two productions, like assign_expr and assign_expr_no_struct (the production for if-statements is IF expr_no_struct block to force parentheses around extremely strange conditions like if Range { start: a, stop: b }.count() > 0 { do_something() }). antlr4's support for arguments to productions is unusable, which is why my grammar has a second complete copy of the expression grammar tagged with _no_struct.

DemiMarie commented 8 years ago

@jorendorff I suggest using a macro preprocessor or build step to reduce some of the code duplication.

alexcrichton commented 8 years ago

cc https://github.com/rust-lang/rust/issues/15880 (we'll want a bot for this)

matklad commented 8 years ago

Hey, and what about lexer / parser split?

Perhaps we should create a canonical lexical structure grammar before jumping onto the grammar for the whole language?

cc @dns2utf8

nagisa commented 8 years ago

@matklad I’d consider formalising lexer an inherent part of grammar formalisation.

That being said, unlike the grammar, which has been extended significantly over time, lexer has stayed considerably constant (i.e. is a different problem space) and the reference is still pretty good at capturing the lexical structure of the language.

matklad commented 8 years ago

I’d consider formalising lexer an inherent part of grammar formalisation.

The point is that lexer can be formalized before the rest of the grammar, so it is a good independent first step. Having an executable semi declarative specification of the lexer would help for technical reasons:

We will be able to add grammar testing infrastructure and make sure that it is executed during tests.
Having a tool which takes a rust file and produces tokens and spans will be useful for creating the grammar itself, because you wouldn't need to implement your own lexer. And it is important, because if you test your grammar against a large corpus of Rust code, you'll hit a lot of lexer bugs with a hand written lexer.

is still pretty good at capturing the lexical structure of the language.

It can be better though! There are some corner cases like 16 >> 2 vs collect::<Vec<_>>, raw string literals, self::foo::bar vs self::foo ::bar vs self ::foo::bar. Also, the reference is not executable, so it must have some bugs (like this one)

dns2utf8 commented 8 years ago

I would like to create a complete grammar first. If the src/grammar/*.g4 files covered everything we were able to run a generated lexer against the tests and spot cases where one of them accepts where the other does not and find differences.

I am currently at RustFest.eu if somebody is here too and would like to talk a little about it.

eternaleye commented 8 years ago

@matklad: There are a couple of factors that make doing a lexer separately from the grammar less meaningful.

Lexers aren't an inherent part of grammar formalisms; though most tooling does use them. There are also various advantages to scannerless parsing.
The whole lexer/parser thing works really well for languages that have keywords. It works considerably less well for languages with contextual keywords. Rust, since not too long ago, does.

Point (1) is mostly a question of usefulness, and I admit it may not be compelling. Point (2), however, is a problem.

Contextual keywords aren't (necessarily) a problem for the parser; carefully-chosen, I suspect they may be able to dodge making the grammar context-sensitive rather than context-free (though I don't have any proof of it; I just don't want to feel discouraged that Rust's grammar became a CSG).

However they do conflict with the entire concept of a lexer, which in a deterministic and entirely local manner turns text into tokens. Antlr4 handles this problem with a concept of "lexical modes" which the grammar can drive switching of, but a standalone lexer couldn't use that.

eddyb commented 8 years ago

@eternaleye The contextual keyword problem can be somewhat avoided by having the grammar use identifier tokens with specific values, instead of separate keyword tokens. That is how the implementation actually works, as keywords are merely pre-interned identifiers.

eternaleye commented 8 years ago

@eddyb: Mm, that makes sense. Sadly, it also makes the lexer less clear/useful independent of the grammar :/

eddyb commented 8 years ago

@eternaleye It's what macros interact with, so it's still useful for describing those.

arielb1 commented 8 years ago

@eternaleye

I don't see why. You can equivalently have the lexer have a special token for each keyword, and add a parser production such as

identifier = IDENTIFIER | UNION

This may require some odd finite-ambiguity-resolution hacks in the grammar, but it's not like we don't already have them.

matklad commented 8 years ago

Hey, I've created a pre pre canonical lexer rfc: https://internals.rust-lang.org/t/pre-pre-rfc-canonical-lexer-specification/4099

DemiMarie commented 8 years ago

On Sat, Sep 17, 2016 at 09:19:21AM -0700, eternaleye wrote:

@matklad: There are a couple of factors that make doing a lexer separately from the grammar less meaningful.

Lexers aren't an inherent part of grammar formalisms; though most tooling does use them. There are also various advantages to scannerless parsing.

The whole lexer/parser thing works really well for languages that have keywords. It works considerably less well for languages with contextual keywords. Rust, since not too long ago, does.

Point (1) is mostly a question of usefulness, and I admit it may not be compelling. Point (2), however, is a problem.

Contextual keywords aren't (necessarily) a problem for the parser; carefully-chosen, I suspect they may be able to dodge making the grammar context-sensitive rather than context-free (though I don't have any proof of it; I just don't want to feel discouraged that Rust's grammar became a CSG).

However they do conflict with the entire concept of a lexer, which in a deterministic and entirely local manner turns text into tokens. Antlr4 handles this problem with a concept of "lexer modes" which the grammar can drive switching of, but a standalone lexer couldn't use that.

You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/rust-lang/rust/issues/30942#issuecomment-247785740

One solution is to always emit special tokens for contextual keywords, and have grammar rules that allow them to be used in identifier context in certain places.

nikomatsakis commented 8 years ago

If anyone is interested, I'd like to push more on the rustypop grammar. It handles a number of issues (e.g., the < vs << lookahead problem; contextual keywords; subsets of expressions) using pretty clean techniques (i.e., no precedence hacks or stuff that steps outside standard theory, mostly just macros and careful definition of the set of tokens). I've not had time to actually make it executable and start fixing the various bugs that i'm sure are to be found, however. If that interests you, please contact me on IRC or e-mail!

chriskrycho commented 7 years ago

One quick thing that I'd like to flag up as it occurred to me in the context of working on https://github.com/rust-lang-nursery/reference/issues/9 (aka "Document all the things!"): when this eventually becomes true, we should link it in the reference and other docs as appropriate.

harpocrates commented 7 years ago

I've been developing a Haskell library for parsing and pretty printing Rust source. In the process, I wrote an LALR grammar for Rust. The grammar is meant to be consumed by Haskell's Happy parser generator. AFAIK it is pretty complete - I think it handles pretty much everything on stable and beta (and a lot of stuff on nightly too)

It is based on

what (used to be?) the official ANTLR grammar
whatever information I get from RFCs
the rustc parser source code

Note I'm still in the process of tweaking and testing it. Right now, I've only got a couple hundred manually written test cases.

If this is in any way useful or there is anything I could apply to help with this issue, I'd be glad to - I've been looking for a way to get involved with rustc.

nikomatsakis commented 7 years ago

@harpocrates cool! I've been making slow progress on rustypop as well, though I've not yet taken the time to implement the testing harness. It'd be good to compare notes.

harpocrates commented 7 years ago

@nikomatsakis Sure thing. I'm currently writing what I think is my last set of tests for the parser - basically diff-ing out the output of my parser with the JSON output of rustc -Z ast-json-noexpand. Once I get that done, I'll try my parser on as much Rust code as I can get my hands on. Once that is complete, I'll report back here!

willy610 commented 6 years ago

Hello

Each open source jungle is difficult in the beginning. Therefore, the following posts may be placed in the wrong place. Someone can certainly correct this to the right forum.

Well well

I lack a grammar for Rust. For users. There are fragments in some documents but no cohesive. The grammar does not have to be the basis for compilation but should help to understand the structure of the language RUST.

I am 70 years old and have used RUST for a few months and have done a tool

https://github.com/willy610/bnf2railroad

who reads a grammar in EBNF style and produces so-called railroad views. The views are raw TTY, html with anchors and svg.

I have always thought that these railroad views have been a good complement to other documentation such as examples and small projects. My long experience of C, Smalltalk, SQL, Mathematica, JS and Java has of course helped me to get RUST relatively quickly. RUST is a very ambitious and radical approach to programming.

Upon learning, I have interpreted the grammar files contained in RUST documentation. They are a bit incomplete and do not have the quality that other languages offer. Much is undefined.

Together with the tool there are grammar for Pascal, Smalltalk, Lua, Json, rege and fragments of RUST.

The working method could be:

Create an EBNF file for a section to be documented
Produce output in the form of railroad
Finally, run all the files in a scan to get complete documentation
If necessary, do an analysis using the tool for undefined, unused and duplicate

I have also used the tool to generate better help information for using the tool (parameters of the program)

Perhaps the tool could be useful internally for developers of RUST for, for example, specification, bug report etc

nikomatsakis commented 6 years ago

@harpocrates I'm curious, what is the status of your rust parser? @matklad and I have been talking about trying to get a better effort going here. The rough plan is to start with your grammar (probably converted to LALRPOP and then perhaps clean it up). Also, would you be interested in being involved in any such effort?

harpocrates commented 6 years ago

@nikomatsakis my rust parser is complete. I would be interested in helping out with any effort to convert to LALRPOP. What would be the best way to proceed?

Note that converting to LALRPOP may not be completely trivial. My grammar is currently written for Happy (a Haskell parser generator) and makes use of a handful of Happy features, namely:

parameterized productions (e.g. to reduce the boilerplate of defining many very similar productions for the different classes of expressions in Rust)
pushing back tokens (e.g. a >> token in something like Vec<Vec<i32>>, gets split into > and >)

nikomatsakis commented 6 years ago

@harpocrates looking at your parser, I was curious, how have you tested it? I see a few tests in the repository, but I was wondering if you tested against e.g. the sources of the Rust compiler, or crates.io, or something like that.

nikomatsakis commented 6 years ago

@harpocrates

Note that converting to LALRPOP may not be completely trivial. My grammar is currently written for Happy (a Haskell parser generator) and makes use of a handful of Happy features, namely:

LALRPOP supports parameterized productions. It does not support pushing back tokens, but there is an alternative way to handle >> in any case. (You distinguish "> followed by >" as one token and "> followed by something else" as another kind of token.)

The trickier thing will be precedence, since LALRPOP does not support those sorts of declarations. We ought to be able to refactor the grammar though (I actually converted the existing .y grammar once already, in rustypop, but I figured I would rather start with your grammar since it seems far more complete and better tested.)

The first thing I had planned to do, in any case, is to write a tool to convert your grammar into LALRPOP syntax, and see how LALRPOP feels about it. The next priority would be a testing harness. It's rather late here so I'll have to write more later but I'd love to collaborate, particularly since my personal time is quite limited.

Separately, @matklad and I have been talking about extending LALRPOP with a way to generate default actions that will fire off events suitable for building a generic parse tree, roughly as described in their RFC.

willy610 commented 6 years ago

Hej Niko The status of my product is stable and no bigger issues more than proper UNICOE/UTF processing remains. Today only the ASCII subset is supported.

I really will - and can!? contribute - as I got a lot of time. I’m am retired but works at the computer at least 8 hour a day.

As of writing I have a very tiny end expensive connection to the internet; just via phone as the wlan supplier in this big city of Gothenburg have trouble with just my connection for more than a week. Crazy.

So please pass the git(s) to the most covering project on LALRPOP so I don’t have the surf around too much.

And also I would very much like some kind of use case or scenarios how the function/tool should be used

Looking forward to the work and the result!

Kindly, Willy

On 2018-03-29, at 18:27, Niko Matsakis notifications@github.com wrote:

@harpocrates I'm curious, what is the status of your rust parser? @matklad and I have been talking about trying to get a better effort going here. The rough plan is to start with your grammar (probably converted to LALRPOP and then perhaps clean it up). Also, would you be interested in being involved in any such effort?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Willy Svenningsson Johannas Vag 28 S-425 42 Hisings Karra Sweden +46 0768 22 20 26 willy@fager.st

willy610 commented 6 years ago

Hej I copy and pasted grammar rules from https://doc.rust-lang.org/grammar.html for RUST I did the same for Lua, for Smalltalk I read from the 'Smalltalk-80 The language’ and for Pascal of read the Nicklaus Wirth ’Algorithms + Data Structures = Program’

I have not locked into the RUST compiler source or any projects in crates.io. My vision! with my work was to support writers of language documentation giving them tools to work with - reference - subsets of proper/versioned grammar rules as inclusions of both BNF rules and RailRoad graphs.

I assume that there are no explicit language rules in the compilers sources but the rules appear in some pre-step for generating part of the complier. That why I didn’t digger into the compiler source. So from some versioned preprocess rule definitions there might a possibility to extract rule snippets to be converted to BNF and RailRoad.

But I will look into both the source of RUST compiler and some crates.io and the LALRPOP too.

Kindly, Willy

On 2018-03-29, at 23:25, Niko Matsakis notifications@github.com wrote:

@harpocrates looking at your parser, I was curious, how have you tested it? I see a few tests in the repository, but I was wondering if you tested against e.g. the sources of the Rust compiler, or crates.io, or something like that.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Willy Svenningsson Johannas Vag 28 S-425 42 Hisings Karra Sweden +46 0768 22 20 26 willy@fager.st

ehuss commented 6 years ago

You may also be interested in following rust-lang-nursery/reference#221. The people working on the reference have been making great progress, and the up-to-date grammar (at https://brauliobz.github.io/rust-reference/grammar.html) appears to be getting closer to complete.

nikomatsakis commented 6 years ago

@willy610

The status of my product is stable and no bigger issues more than proper UNICOE/UTF processing remains.

Sorry for disappearing! After writing those messages, I got totally overwhelmed (and this week i'm actually on vacation, so I'll probably be slow to reply again.) I'm a bit confused about which project you are referring to -- do you mean the railroad diagrams you referenced here? If so, that seems like a cool visualization technique, but presumably it requires an EBNF grammar to start? At the moment, that last part is what I am most interesting in obtaining; note though that if we had a working LALRPOP grammar, it "desugars down" to a plain CFG internally, so we ought to be able to use your tool to visualize it.

@ehuss

The people working on the reference have been making great progress, and the up-to-date grammar...appears to be getting closer to complete

That's great! Is it presently being tested? And, if so, how?

willy610 commented 6 years ago

Hej Niko

The project I’m referencing is my project https://github.com/rust-lang/rust/issues/30942#issuecomment-359103426

My focus is to keep the possibility to generate railroad view from grammars of kind EBNF. And have the verification function there (Unused, missing rules etc)

At the moment I’m investing some RUST grammar approaches

The old yacc from RUST source

https://github.com/rust-lang/rust/issues/30942#issuecomment-359103426

From some md files in the Rust nursery

https://github.com/rust-lang-nursery/reference/tree/master/src

From the work by Jason Orendorff. I think he said who wrote the grammar in order to understand RUST when writing the book Programming Rust. What a grammar and what a book!!

https://github.com/jorendorff/rust-grammar/blob/master/Rust.g4

and from nikos older? rusty-pop

https://github.com/nikomatsakis/rustypop/blob/master/src/rusty.lalrpop

but also LALRPOP

https://github.com/lalrpop/lalrpop

A. In most sources I struggled with character set descriptions and all the regeexpr. So I introduced, in my EBNF, 'Character Set Expressions' with set, union, difference and range. It will show up soon.

B. I think Mark Down ’marking’ is insufficient for a BNF. Perhaps generating plain HTML snippets from rules. They can be styled looking like program source. Color, font etc

C. EBNF decorated with types is a must. I’m not there yet

D. Actions in the rules like {} in yacc and => in LARPOP could perhaps be specified in the EBNF as a (parametric) reference to other source like pub Expr: Box = { Expr ExprOp Factor @action(ActionName,Expr), Factor, }; So the @ or similar unused sign could be an element in the syntax.

Clean up grammar from actions and relate them using ’@' instead. The intersection of EBNF and LALRPOP could be ...

E. Or perhaps the best. Have a tool exporting from LALRPOP as EBNF (with type) And the continue with generation railroad and markdown files from that source.

I’m really sorry for working mostly off road.

I will probably not update the above bnf2railroad but add a heavy refactored one as an other project

On 2018-04-17, at 12:35, Niko Matsakis notifications@github.com wrote:

@willy610

The status of my product is stable and no bigger issues more than proper UNICOE/UTF processing remains.

Sorry for disappearing! After writing those messages, I got totally overwhelmed (and this week i'm actually on vacation, so I'll probably be slow to reply again.) I'm a bit confused about which project you are referring to -- do you mean the railroad diagrams you referenced here? If so, that seems like a cool visualization technique, but presumably it requires an EBNF grammar to start? At the moment, that last part is what I am most interesting in obtaining; note though that if we had a working LALRPOP grammar, it "desugars down" to a plain CFG internally, so we ought to be able to use your tool to visualize it.

@ehuss

The people working on the reference have been making great progress, and the up-to-date grammar...appears to be getting closer to complete

That's great! Is it presently being tested? And, if so, how?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Willy Svenningsson Johannas Vag 28 S-425 42 Hisings Karra Sweden +46 0768 22 20 26 willy@fager.st

nagisa commented 6 years ago

The grammar that's within reference (the markdown files) is way overdue for an update and you are not the first to notice it. In fact -- this RFC happened specifically because of it.

Replacing the grammar definitions in the reference markdown files with something else (such as railroads) would be super awesome, but before we can do that we need a complete grammar definition anyway.

On Thu, Apr 19, 2018, 00:10 Willy Svenningsson notifications@github.com wrote:

Hej Niko

The project I’m referencing is my project https://github.com/rust-lang/rust/issues/30942#issuecomment-359103426

My focus is to keep the possibility to generate railroad view from grammars of kind EBNF. And have the verification function there (Unused, missing rules etc)

At the moment I’m investing some RUST grammar approaches

The old yacc from RUST source

https://github.com/rust-lang/rust/issues/30942#issuecomment-359103426

From some md files in the Rust nursery

https://github.com/rust-lang-nursery/reference/tree/master/src

From the work by Jason Orendorff. I think he said who wrote the grammar in order to understand RUST when writing the book Programming Rust. What a grammar and what a book!!

https://github.com/jorendorff/rust-grammar/blob/master/Rust.g4

and from nikos older? rusty-pop

https://github.com/nikomatsakis/rustypop/blob/master/src/rusty.lalrpop

but also LALRPOP

https://github.com/lalrpop/lalrpop

A. In most sources I struggled with character set descriptions and all the regeexpr. So I introduced, in my EBNF, 'Character Set Expressions' with set, union, difference and range. It will show up soon.

B. I think Mark Down ’marking’ is insufficient for a BNF. Perhaps generating plain HTML snippets from rules. They can be styled looking like program source. Color, font etc

C. EBNF decorated with types is a must. I’m not there yet

D. Actions in the rules like {} in yacc and => in LARPOP could perhaps be specified in the EBNF as a (parametric) reference to other source like pub Expr: Box = { Expr ExprOp Factor @action(ActionName,Expr), Factor, }; So the @ or similar unused sign could be an element in the syntax.

Clean up grammar from actions and relate them using ’@' instead. The intersection of EBNF and LALRPOP could be ...

E. Or perhaps the best. Have a tool exporting from LALRPOP as EBNF (with type) And the continue with generation railroad and markdown files from that source.

I’m really sorry for working mostly off road.

I will probably not update the above bnf2railroad but add a heavy refactored one as an other project

On 2018-04-17, at 12:35, Niko Matsakis notifications@github.com wrote:

@willy610

The status of my product is stable and no bigger issues more than proper UNICOE/UTF processing remains.

Sorry for disappearing! After writing those messages, I got totally overwhelmed (and this week i'm actually on vacation, so I'll probably be slow to reply again.) I'm a bit confused about which project you are referring to -- do you mean the railroad diagrams you referenced here? If so, that seems like a cool visualization technique, but presumably it requires an EBNF grammar to start? At the moment, that last part is what I am most interesting in obtaining; note though that if we had a working LALRPOP grammar, it "desugars down" to a plain CFG internally, so we ought to be able to use your tool to visualize it.

@ehuss

The people working on the reference have been making great progress, and the up-to-date grammar...appears to be getting closer to complete

That's great! Is it presently being tested? And, if so, how?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Willy Svenningsson Johannas Vag 28 S-425 42 Hisings Karra Sweden +46 0768 22 20 26 willy@fager.st

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/30942#issuecomment-382530349, or mute the thread https://github.com/notifications/unsubscribe-auth/AApc0klOHg4m2tWYN9CZhGDnfrqNIprlks5tp6vSgaJpZM4HF_MM .

ehuss commented 6 years ago

That's great! Is it presently being tested? And, if so, how?

I don't know, I don't think there is anything formal set up. I think @brauliobz is doing most of the work, and he once mentioned that he was using Antlr4 to test.

harpocrates commented 6 years ago

@nikomatsakis I'm sorry for the long overdue response.

@harpocrates looking at your parser, I was curious, how have you tested it? I see a few tests in the repository, but I was wondering if you tested against e.g. the sources of the Rust compiler, or crates.io, or something like that.

I have a script that automatically scrapes large files from repos under the rust-lang organization. I use rustc -Z ast-json -Z parse-only as an oracle to tell me if my parser's output is correct. Any difference in AST outputs causes tests to fail. Last time I ran the script, it collected (and ran tests) on upwards of a million lines of Rust code.

The next priority would be a testing harness. It's rather late here so I'll have to write more later but I'd love to collaborate, particularly since my personal time is quite limited.

I should have more free time in the coming months and I'd be happy to help on whatever you guys need the most help on (porting over a concrete grammar, generally improving LALRPOP, etc.)

rust-lang / rust

Canonical Rust grammar distinct from parser (tracking issue for RFC #1331) #30942