[WIP] Implement (possibly?) the entire Rust grammar.

eddyb commented 6 years ago

Tested at https://github.com/rust-lang/rust/commit/f6e9a6e41cd9b1fb687e296b5a6d4c6ad399f862, after git submodule update --recursive --init:

Out of 13485 Rust files tested:
* 1078 parsed fully and unambiguously
* 12166 parsed fully (but ambiguously)
* 188 parsed partially (only a prefix)
* 50 didn't parse at all (lexer error?)

The "partial / prefix parse" bit pretty much means there's a syntax error somewhere within the file.

(click to see the exhaustive list of files that didn't parse)

``` src/test/ui/similar-tokens.rs src/test/ui/parser/issue-1655.rs src/test/ui/parser/duplicate-visibility.rs src/test/ui/parser/removed-syntax-fixed-vec.rs src/test/ui/parser/no-unsafe-self.rs src/test/ui/parser/lifetime-semicolon.rs src/test/ui/parser/issue-30318.rs src/test/ui/parser/virtual-structs.rs src/test/ui/parser/brace-after-qualified-path-in-match.rs src/test/ui/parser/issue-33413.rs src/test/ui/parser/issue-21153.rs src/test/ui/parser/default.rs src/test/ui/parser/paamayim-nekudotayim.rs src/test/ui/parser/removed-syntax-with-2.rs src/test/ui/parser/type-parameters-in-field-exprs.rs src/test/ui/parser/doc-before-identifier.rs src/test/ui/parser/macro-bad-delimiter-ident.rs src/test/ui/parser/pat-tuple-4.rs src/test/ui/parser/bad-name.rs src/test/ui/parser/extern-crate-unexpected-token.rs src/test/ui/parser/removed-syntax-record.rs src/test/ui/parser/pat-tuple-5.rs src/test/ui/parser/issue-20711.rs src/test/ui/parser/removed-syntax-field-let.rs src/test/ui/parser/raw-byte-string-literals.rs src/test/ui/parser/removed-syntax-enum-newtype.rs src/test/ui/parser/issue-27255.rs src/test/ui/parser/attr.rs src/test/ui/parser/issue-41155.rs src/test/ui/parser/bounds-lifetime-where-1.rs src/test/ui/parser/attr-bad-meta.rs src/test/ui/parser/doc-before-rbrace.rs src/test/ui/parser/raw-byte-string-eof.rs src/test/ui/parser/macros-no-semicolon-items.rs src/test/ui/parser/issue-32505.rs src/test/ui/parser/pat-tuple-1.rs src/test/ui/parser/bounds-type-where.rs src/test/ui/parser/doc-before-fn-rbrace.rs src/test/ui/parser/bound-single-question-mark.rs src/test/ui/parser/use-ends-with-mod-sep.rs src/test/ui/parser/issue-20711-2.rs src/test/ui/parser/impl-parsing.rs src/test/ui/parser/multiline-comment-line-tracking.rs src/test/ui/parser/raw-str-unbalanced.rs src/test/ui/parser/bind-struct-early-modifiers.rs src/test/ui/parser/attr-before-eof.rs src/test/ui/parser/removed-syntax-extern-const.rs src/test/ui/parser/import-from-path.rs src/test/ui/parser/inner-attr-after-doc-comment.rs src/test/ui/parser/bad-pointer-type.rs src/test/ui/parser/issue-6610.rs src/test/ui/parser/removed-syntax-mode.rs src/test/ui/parser/trait-pub-method.rs src/test/ui/parser/removed-syntax-mut-vec-expr.rs src/test/ui/parser/removed-syntax-uniq-mut-ty.rs src/test/ui/parser/struct-field-numeric-shorthand.rs src/test/ui/parser/doc-before-attr.rs src/test/ui/parser/where-clauses-no-bounds-or-predicates.rs src/test/ui/parser/column-offset-1-based.rs src/test/ui/parser/inverted-parameters.rs src/test/ui/parser/doc-before-struct-rbrace-2.rs src/test/ui/parser/trait-pub-assoc-ty.rs src/test/ui/parser/issue-24375.rs src/test/ui/parser/issue-15980.rs src/test/ui/parser/associated-types-project-from-hrtb-explicit.rs src/test/ui/parser/extern-no-fn.rs src/test/ui/parser/trait-pub-assoc-const.rs src/test/ui/parser/bounds-lifetime-2.rs src/test/ui/parser/better-expected.rs src/test/ui/parser/removed-syntax-fn-sigil.rs src/test/ui/parser/not-a-pred.rs src/test/ui/parser/doc-before-mod-rbrace.rs src/test/ui/parser/pat-lt-bracket-7.rs src/test/ui/parser/doc-before-struct-rbrace-1.rs src/test/ui/parser/removed-syntax-closure-lifetime.rs src/test/ui/parser/attrs-after-extern-mod.rs src/test/ui/parser/issue-17904-2.rs src/test/ui/parser/issue-24780.rs src/test/ui/parser/removed-syntax-mut-vec-ty.rs src/test/ui/parser/paren-after-qualified-path-in-match.rs src/test/ui/parser/issue-32214.rs src/test/ui/parser/recover-enum2.rs src/test/ui/parser/doc-before-eof.rs src/test/ui/parser/bounds-lifetime-1.rs src/test/ui/parser/doc-after-struct-field.rs src/test/ui/parser/attr-dangling-in-mod.rs src/test/ui/parser/doc-before-semi.rs src/test/ui/parser/issue-33455.rs src/test/ui/parser/import-glob-rename.rs src/test/ui/parser/import-from-rename.rs src/test/ui/parser/removed-syntax-field-semicolon.rs src/test/ui/parser/extern-foreign-crate.rs src/test/ui/parser/import-glob-path.rs src/test/ui/parser/use-as-where-use-ends-with-mod-sep.rs src/test/ui/parser/class-implements-bad-trait.rs src/test/ui/parser/pat-lt-bracket-2.rs src/test/ui/parser/attr-bad-meta-3.rs src/test/ui/parser/issue-10392-2.rs src/test/ui/parser/attr-dangling-in-fn.rs src/test/ui/parser/regions-out-of-scope-slice.rs src/test/ui/parser/removed-syntax-static-fn.rs src/test/ui/parser/trait-bounds-not-on-impl.rs src/test/ui/parser/unsized.rs src/test/ui/parser/empty-impl-semicolon.rs src/test/ui/parser/attr-bad-meta-2.rs src/test/ui/parser/inner-attr.rs src/test/ui/parser/doc-before-extern-rbrace.rs src/test/ui/parser/issue-19096.rs src/test/ui/parser/issue-19398.rs src/test/ui/parser/pat-ref-enum.rs src/test/ui/parser/issue-32446.rs src/test/ui/parser/trait-object-lifetime-parens.rs src/test/ui/parser/bounds-type.rs src/test/ui/parser/issue-17904.rs src/test/ui/parser/pat-lt-bracket-1.rs src/test/ui/parser/issue-17718-const-mut.rs src/test/ui/parser/removed-syntax-ptr-lifetime.rs src/test/ui/parser/trait-object-bad-parens.rs src/test/ui/parser/multitrait.rs src/test/ui/parser/extern-expected-fn-or-brace.rs src/test/ui/structs/struct-field-init-syntax.rs src/test/ui/structs/struct-duplicate-comma.rs src/test/ui/structs/struct-missing-comma.rs src/test/ui/tuple/tuple-struct-fields/test.rs src/test/ui/invalid/invalid-variadic-function.rs src/test/ui/issues/issue-20616-5.rs src/test/ui/issues/issue-49257.rs src/test/ui/issues/issue-34222-1.rs src/test/ui/issues/issue-20616-4.rs src/test/ui/issues/issue-20616-6.rs src/test/ui/issues/issue-46186.rs src/test/ui/issues/issue-20616-7.rs src/test/ui/issues/issue-20616-3.rs src/test/ui/issues/issue-39616.rs src/test/ui/issues/issue-20616-2.rs src/test/ui/issues/issue-20616-1.rs src/test/ui/issues/issue-48636.rs src/test/ui/issues/issue-20616-8.rs src/test/ui/issues/issue-20616-9.rs src/test/ui/issues/issue-44021.rs src/test/ui/issues/issue-45296.rs src/test/ui/issues/issue-28433.rs src/test/ui/issues/issue-43196.rs src/test/ui/issues/issue-49040.rs src/test/ui/issues/auxiliary/issue-21146-inc.rs src/test/ui/self/self-vs-path-ambiguity.rs src/test/ui/traits/trait-alias.rs src/test/ui/parse-error-correct.rs src/test/ui/resolve/issue-54379.rs src/test/ui/missing/missing-block-hint.rs src/test/ui/pub/pub-restricted-non-path.rs src/test/ui/pub/pub-ident-struct.rs src/test/ui/pub/pub-ident-fn-or-struct.rs src/test/ui/pub/pub-ident-fn-or-struct-2.rs src/test/ui/pub/pub-restricted-error-fn.rs src/test/ui/pub/pub-restricted-error.rs src/test/ui/pub/pub-ident-fn.rs src/test/ui/pub/pub-ident-fn-2.rs src/test/ui/pub/pub-restricted.rs src/test/ui/attrs-with-no-formal-in-generics/attrs-with-no-formal-in-generics-1.rs src/test/ui/attrs-with-no-formal-in-generics/attrs-with-no-formal-in-generics-3.rs src/test/ui/attrs-with-no-formal-in-generics/attrs-with-no-formal-in-generics-2.rs src/test/ui/macros/macro-attribute.rs src/test/ui/exclusive-range/exclusive_range_pattern_syntax_collision3.rs src/test/ui/error-codes/E0585.rs src/test/ui/error-codes/E0586.rs src/test/ui/extern/extern-const.rs src/test/ui/impossible_range.rs src/test/ui/associated-types/issue-36499.rs src/test/ui/type/type-ascription-instead-of-statement-end.rs src/test/ui/did_you_mean/issue-48492-tuple-destructure-missing-parens.rs src/test/ui/did_you_mean/issue-41679-tilde-bitwise-negation-attempt.rs src/test/ui/did_you_mean/issue-54109-and_instead_of_ampersands.rs src/test/ui/did_you_mean/bad-assoc-pat.rs src/test/ui/did_you_mean/multiple-pattern-typo.rs src/test/ui/did_you_mean/bad-assoc-ty.rs src/test/ui/did_you_mean/issue-40006.rs src/test/ui/did_you_mean/issue-46718-struct-pattern-dotdotdot.rs src/test/ui/bad/bad-crate-name.rs src/test/ui/trait-alias-fail.rs src/test/ui/obsolete-syntax-impl-for-dotdot.rs src/test/ui/feature-gates/feature-gate-extern_prelude.rs src/test/ui/syntax-trait-polarity.rs src/test/run-pass/macros/auxiliary/macro-include-items-expr.rs src/test/run-pass/macros/auxiliary/macro-comma-support.rs src/test/run-pass/shebang.rs src/tools/clang/test/Sema/renderscript.rs src/tools/rustfmt/tests/coverage/target/comments.rs ```

AFAICT, none of those should parse, but I haven't double-checked against libsyntax yet. Also, it might be useful to do the reverse (to make sure we don't parse more than rustc).

In order to get as much as possible to parse, I did include unstable syntax. However, we might be able to separate it out, or even just comment it out for now.

There are 3 blacklisted files, because they're unfeasibly large to parse currently:

I might end up using something similar to them to benchmark improvements to lykenware/gll.

eddyb commented 6 years ago

I may have come up with a way to separate stable and unstable features: adjoining grammars - that is, making gll::Grammar::extend, when it sees rules with the same name, | them together, instead of panicking (or, worse, overwriting one of the definitions with the other).

eddyb commented 6 years ago

cc @petrochenkov @jseyfried @Manishearth @nikomatsakis @nrc @pnkfelix AFAIK all of you also have some level of interest in Rust's grammar, so it's be great if you could look over and point anything out, either where I've been too specific, too general, or just comments that should be added to the grammar, about various quirks, peculiarities, etc.

I should probably also spend some time reading previous attempts at a Rust grammar, to see if they hold some information I might be missing here.

Note, however, that this particular PR is not an at attempt at disambiguating and fully restricting the grammar, to one specific parse tree (even if it does that 8% of the time already).

(there's also the option of removing some of the restrictions, in unambiguous situations; I mean, it's almost 2019 and we still haven't standardized on fully general CFG parsing everywhere?!)

matklad commented 6 years ago

Am I correct that this grammar does not reflect peculiarities around "restrictions" ? That, that some expressions must be terminated by ; in statments and ',' in match arms, and that conditions can't contain struct literals? What's the plan with dealing with them?

eddyb commented 6 years ago

@matklad I think that's a long-term question, that this WG should figure out, eventually. I wanted to get something out the door ASAP, so we can start experimenting with checking for consistency with libsyntax and syn, start writing tests for all the nonterminals, etc.

@eternaleye suggested using boolean grammars to indicate such things, but even though it would be possible to extend the GLL implementation to handle conjunction and negation, natively, I would prefer if we waited for something based on rewriting the grammar to a CFG (similar to what you'd do with propagating the restriction down through grammar templates, but automated).

We might introduce templates into the notation, but I'd rather not make a mess of things with it.

If we end up with two grammars (because this initial one is pretty small and could be kept around anyway), I would want to have some sort of naive "inclusion check", that would ensure that the refined grammar parses strictly less than the simple, more general, one.

petrochenkov commented 6 years ago

I usually assign myself to PRs/issues I need to do something with, to not forget about them. What GitHub team I need to be included into to be able to do it in this repo? Or perhaps this repo should be included to the list of repos accessible to rust-lang-nursery/compiler team?

eddyb commented 6 years ago

@petrochenkov Oh that's annoying, this is a different org than rust-lang, so I can't share teams between them. I'll try to figure something out though. EDIT: done, @rust-lang-nursery/wg-grammar should now be a team.

eddyb commented 6 years ago

@ehuss General note: this grammar doesn't include "restrictions" or any disambiguations in general. It has all the structure of Rust, but not all the constraints.

Practical declaractive constraints and disambiguations are an open question that this WG will have to come up with a solution for.

The official implementation has often picked arbitrary constraints, in the name of being able to make decisions early, while complicating the formalism. Encoding those ad-hoc rules in a more formal setting may very well be needed, but I would prefer to start from and maintain a "clean slate".

eddyb commented 6 years ago

General note about unstable features: I plan to actually split them out into a subdirectory and have a system for merging rules (I mean, |-ing them together is enough). The comments are still really helpful though!

We could even have one file per feature-gate, instead of duplicating the file structure of the stable grammar.

Centril commented 6 years ago

@eddyb From the POV of language design (i.e. changing the grammar later) it seems more readable to me to have the feature gates integrated in the grammar where they naturally change things because feature gates sometimes affect multiple different parts of the grammar at the same time... I think a sort of pragma system might work well here, but for now we should just add comments saying that some alternation is unstable...

eddyb commented 6 years ago

@Centril I don't see how having them separate would hamper that? You just put them where they belong, and maybe simplify the definition to not include old special-cases.

eddyb commented 6 years ago

General note: I assumed optionally-trailing-separator separated lists were X* % Sep Sep?. And we planned to make that more ergonomic by adopting the X* %% Sep syntax (also from Perl6).

But now I'm not sure whether X* %% Sep is X* % Sep Sep? or (X+ %% Sep)? in Perl6. We appear to use the latter in Rust, but I don't want to make %% do that unless it's consistent with where we're taking the syntax from (i.e. Perl6 regexes/grammars).

EDIT: @solson tried it out, and it looks like Rust is different than Perl6's %% syntax (which parses lone commas)! I wonder what languages also have trailing commas everywhere and what they chose. The sad thing is that even if we add (X+ %% Sep)? syntax, that's no longer a list, it's an optional list.

EDIT2: I've been informed that the language Perl6 doesn't have lone commas (and some are speculating a bug might be letting X* %% "," parse ","). So we could make a pragmatic decision about X* %% Sep and say it doesn't allow lone Sep.

Centril commented 6 years ago

@eddyb Maybe I'm not entirely understanding the suggestion... but I think having something like:

ExprKind
    = Literal:LITERAL
    | Try:{ expr:Expr "?" }
    | #[feature = "async"]   -- invent concrete syntax corresponding to this...
      AsyncBlock:{ "async" block:Block }
    | #[feature = "try_block"]
      TryBlock:{ "try" block:Block }
    | #[feature = "generators"]
      Yield:{ "yield" value:Expr? }
    | Closure:{
        #[feature = "generators"]
        generator_static:"static"?
        #[feature = "async"]
        asyncness:"async"?
        by_val:"move"? "|" args:ClosureArg* % "," ","? "|" { "->" ret_ty:Type }? body:Expr
      }
    ;

with the semantics that #[feature = ...] will strip out the next alternation or the next thing in a "concatenation"... would be nice.

It's nice because you can see the context to which the gated alternation applies and the other things so you can understand this better.

Centril commented 6 years ago

General note: I assumed optionally-trailing-separator separated lists were X* % Sep Sep?. And we planned to make that more ergonomic by adopting the X* %% Sep syntax (also from Perl6).

But now I'm not sure whether X* %% Sep is X* % Sep Sep? or (X+ %% Sep)? in Perl6. We appear to use the latter in Rust, but I don't want to make %% do that unless it's consistent with where we're taking the syntax from (i.e. Perl6 regexes/grammars).

I would personally advocate higher order grammars and just writing List(X, Sep)" and such things; it does make things more readable and you have to remember the concrete syntax of the GLL notation less.

eddyb commented 6 years ago

What I mean is you'd have something like this in an async.g file:

ExprKind |=
    AsyncBlock:{ "async" block:Block };

(not an imperative |=, but @Centril mentioned on Discord a plain = could be confusing)

eddyb commented 6 years ago

Oops, I forgot to mark this as WIP (so let's not land it for now). I still want to work on the GLL side to improve the experience here.

petrochenkov commented 5 years ago

So, I started reading this at last (now at generics.g).

Two issues are immediately noticeable:

The grammar is written in terms of multi-character operators, so things requiring token splitting like double borrows &&EXPR cannot be expressed. Unless the parser generator automatically desugars things like "EXPR" "&&" "EXPR" into EXPR "&join" ("&join" | "&nojoin") "EXPR", and "&" EXPR into ("&join" | "&nojoin") "EXPR"?
The list notation with trailing separators. All the lists currently use DELIM_OPEN ELEM* % SEP SEP? DELIM_CLOSE, but that should accept undesired single separator DELIM_OPEN SEP DELIM_CLOSE?

CAD97 commented 5 years ago

The generated parser accepts a TokenStream and handles processing the Punct tokens and enforcing their Spacing to not accept [Punct { ch: '&', spacing: Alone, .. }, Punct { ch: '&', spacing: Alone, .. }] as "&&" and to accept [Punct { ch: '&', spacing: Joint, .. }, Punct { ch: '&', spacing: Alone }] as "&" "&".

(lykenware/gll side note: how are unbalanced groups handled (error reported) in the TokenStream generator? Not relevant for us though.)

List notation has been noted before, intent is to add a notation ELEM* %% SEP that behaves like Rust lists do. (Though, do we want to prohibit a lone trailing comma? Something to consider sometime down the line, I suppose.)

petrochenkov commented 5 years ago

One thing I noticed is that the grammar is somewhat relaxed compared to what accepted by libsyntax parser (and libsyntax parser in general accept semantically invalid constructions if they can fit into AST - that forms the "language accepted under cfg(FALSE)"), but relaxed to different extent, so it's hard to decide what exactly to report as an issue and what not. (So I didn't report anything that's more relaxed than the real parser.)

qmx commented 5 years ago

in the spirit of improving visibility and making progress I'm going to fix the conflicts now and merge this PR.

qmx commented 5 years ago

closing this as #17 is merged

rust-lang / wg-grammar

[WIP] Implement (possibly?) the entire Rust grammar. #13