Closed eddyb closed 5 years ago
I may have come up with a way to separate stable and unstable features: adjoining grammars - that is, making gll::Grammar::extend
, when it sees rules with the same name, |
them together, instead of panicking (or, worse, overwriting one of the definitions with the other).
cc @petrochenkov @jseyfried @Manishearth @nikomatsakis @nrc @pnkfelix AFAIK all of you also have some level of interest in Rust's grammar, so it's be great if you could look over and point anything out, either where I've been too specific, too general, or just comments that should be added to the grammar, about various quirks, peculiarities, etc.
I should probably also spend some time reading previous attempts at a Rust grammar, to see if they hold some information I might be missing here.
Note, however, that this particular PR is not an at attempt at disambiguating and fully restricting the grammar, to one specific parse tree (even if it does that 8% of the time already).
(there's also the option of removing some of the restrictions, in unambiguous situations; I mean, it's almost 2019 and we still haven't standardized on fully general CFG parsing everywhere?!)
Am I correct that this grammar does not reflect peculiarities around "restrictions" ? That, that some expressions must be terminated by ;
in statments and ',' in match arms, and that conditions can't contain struct literals? What's the plan with dealing with them?
@matklad I think that's a long-term question, that this WG should figure out, eventually.
I wanted to get something out the door ASAP, so we can start experimenting with checking for consistency with libsyntax
and syn
, start writing tests for all the nonterminals, etc.
@eternaleye suggested using boolean grammars to indicate such things, but even though it would be possible to extend the GLL implementation to handle conjunction and negation, natively, I would prefer if we waited for something based on rewriting the grammar to a CFG (similar to what you'd do with propagating the restriction down through grammar templates, but automated).
We might introduce templates into the notation, but I'd rather not make a mess of things with it.
If we end up with two grammars (because this initial one is pretty small and could be kept around anyway), I would want to have some sort of naive "inclusion check", that would ensure that the refined grammar parses strictly less than the simple, more general, one.
I usually assign myself to PRs/issues I need to do something with, to not forget about them.
What GitHub team I need to be included into to be able to do it in this repo?
Or perhaps this repo should be included to the list of repos accessible to rust-lang-nursery/compiler
team?
@petrochenkov Oh that's annoying, this is a different org than rust-lang
, so I can't share teams between them. I'll try to figure something out though.
EDIT: done, @rust-lang-nursery/wg-grammar should now be a team.
@ehuss General note: this grammar doesn't include "restrictions" or any disambiguations in general. It has all the structure of Rust, but not all the constraints.
Practical declaractive constraints and disambiguations are an open question that this WG will have to come up with a solution for.
The official implementation has often picked arbitrary constraints, in the name of being able to make decisions early, while complicating the formalism. Encoding those ad-hoc rules in a more formal setting may very well be needed, but I would prefer to start from and maintain a "clean slate".
General note about unstable features: I plan to actually split them out into a subdirectory and have a system for merging rules (I mean, |
-ing them together is enough).
The comments are still really helpful though!
We could even have one file per feature-gate, instead of duplicating the file structure of the stable grammar.
@eddyb From the POV of language design (i.e. changing the grammar later) it seems more readable to me to have the feature gates integrated in the grammar where they naturally change things because feature gates sometimes affect multiple different parts of the grammar at the same time... I think a sort of pragma system might work well here, but for now we should just add comments saying that some alternation is unstable...
@Centril I don't see how having them separate would hamper that? You just put them where they belong, and maybe simplify the definition to not include old special-cases.
General note: I assumed optionally-trailing-separator separated lists were X* % Sep Sep?
.
And we planned to make that more ergonomic by adopting the X* %% Sep
syntax (also from Perl6).
But now I'm not sure whether X* %% Sep
is X* % Sep Sep?
or (X+ %% Sep)?
in Perl6.
We appear to use the latter in Rust, but I don't want to make %%
do that unless it's consistent with where we're taking the syntax from (i.e. Perl6 regexes/grammars).
EDIT: @solson tried it out, and it looks like Rust is different than Perl6's %%
syntax (which parses lone commas)! I wonder what languages also have trailing commas everywhere and what they chose.
The sad thing is that even if we add (X+ %% Sep)?
syntax, that's no longer a list, it's an optional list.
EDIT2: I've been informed that the language Perl6 doesn't have lone commas (and some are speculating a bug might be letting X* %% ","
parse ","
).
So we could make a pragmatic decision about X* %% Sep
and say it doesn't allow lone Sep
.
@eddyb Maybe I'm not entirely understanding the suggestion... but I think having something like:
ExprKind
= Literal:LITERAL
| Try:{ expr:Expr "?" }
| #[feature = "async"] -- invent concrete syntax corresponding to this...
AsyncBlock:{ "async" block:Block }
| #[feature = "try_block"]
TryBlock:{ "try" block:Block }
| #[feature = "generators"]
Yield:{ "yield" value:Expr? }
| Closure:{
#[feature = "generators"]
generator_static:"static"?
#[feature = "async"]
asyncness:"async"?
by_val:"move"? "|" args:ClosureArg* % "," ","? "|" { "->" ret_ty:Type }? body:Expr
}
;
with the semantics that #[feature = ...]
will strip out the next alternation or the next thing in a "concatenation"... would be nice.
It's nice because you can see the context to which the gated alternation applies and the other things so you can understand this better.
General note: I assumed optionally-trailing-separator separated lists were
X* % Sep Sep?
. And we planned to make that more ergonomic by adopting theX* %% Sep
syntax (also from Perl6).But now I'm not sure whether
X* %% Sep
isX* % Sep Sep?
or(X+ %% Sep)?
in Perl6. We appear to use the latter in Rust, but I don't want to make%%
do that unless it's consistent with where we're taking the syntax from (i.e. Perl6 regexes/grammars).
I would personally advocate higher order grammars and just writing List(X, Sep)"
and such things; it does make things more readable and you have to remember the concrete syntax of the GLL notation less.
What I mean is you'd have something like this in an async.g
file:
ExprKind |=
AsyncBlock:{ "async" block:Block };
(not an imperative |=
, but @Centril mentioned on Discord a plain =
could be confusing)
Oops, I forgot to mark this as WIP (so let's not land it for now). I still want to work on the GLL side to improve the experience here.
So, I started reading this at last (now at generics.g
).
Two issues are immediately noticeable:
&&EXPR
cannot be expressed.
Unless the parser generator automatically desugars things like "EXPR" "&&" "EXPR"
into EXPR "&join" ("&join" | "&nojoin") "EXPR"
, and "&" EXPR
into ("&join" | "&nojoin") "EXPR"
?DELIM_OPEN ELEM* % SEP SEP? DELIM_CLOSE
, but that should accept undesired single separator DELIM_OPEN SEP DELIM_CLOSE
?The generated parser accepts a TokenStream
and handles processing the Punct
tokens and enforcing their Spacing
to not accept [Punct { ch: '&', spacing: Alone, .. }, Punct { ch: '&', spacing: Alone, .. }]
as "&&"
and to accept [Punct { ch: '&', spacing: Joint, .. }, Punct { ch: '&', spacing: Alone }]
as "&" "&"
.
(lykenware/gll side note: how are unbalanced groups handled (error reported) in the TokenStream generator? Not relevant for us though.)
List notation has been noted before, intent is to add a notation ELEM* %% SEP
that behaves like Rust lists do. (Though, do we want to prohibit a lone trailing comma? Something to consider sometime down the line, I suppose.)
One thing I noticed is that the grammar is somewhat relaxed compared to what accepted by libsyntax parser (and libsyntax parser in general accept semantically invalid constructions if they can fit into AST - that forms the "language accepted under cfg(FALSE)
"), but relaxed to different extent, so it's hard to decide what exactly to report as an issue and what not.
(So I didn't report anything that's more relaxed than the real parser.)
in the spirit of improving visibility and making progress I'm going to fix the conflicts now and merge this PR.
closing this as #17 is merged
Tested at https://github.com/rust-lang/rust/commit/f6e9a6e41cd9b1fb687e296b5a6d4c6ad399f862, after
git submodule update --recursive --init
:The "partial / prefix parse" bit pretty much means there's a syntax error somewhere within the file.
(click to see the exhaustive list of files that didn't parse)
``` src/test/ui/similar-tokens.rs src/test/ui/parser/issue-1655.rs src/test/ui/parser/duplicate-visibility.rs src/test/ui/parser/removed-syntax-fixed-vec.rs src/test/ui/parser/no-unsafe-self.rs src/test/ui/parser/lifetime-semicolon.rs src/test/ui/parser/issue-30318.rs src/test/ui/parser/virtual-structs.rs src/test/ui/parser/brace-after-qualified-path-in-match.rs src/test/ui/parser/issue-33413.rs src/test/ui/parser/issue-21153.rs src/test/ui/parser/default.rs src/test/ui/parser/paamayim-nekudotayim.rs src/test/ui/parser/removed-syntax-with-2.rs src/test/ui/parser/type-parameters-in-field-exprs.rs src/test/ui/parser/doc-before-identifier.rs src/test/ui/parser/macro-bad-delimiter-ident.rs src/test/ui/parser/pat-tuple-4.rs src/test/ui/parser/bad-name.rs src/test/ui/parser/extern-crate-unexpected-token.rs src/test/ui/parser/removed-syntax-record.rs src/test/ui/parser/pat-tuple-5.rs src/test/ui/parser/issue-20711.rs src/test/ui/parser/removed-syntax-field-let.rs src/test/ui/parser/raw-byte-string-literals.rs src/test/ui/parser/removed-syntax-enum-newtype.rs src/test/ui/parser/issue-27255.rs src/test/ui/parser/attr.rs src/test/ui/parser/issue-41155.rs src/test/ui/parser/bounds-lifetime-where-1.rs src/test/ui/parser/attr-bad-meta.rs src/test/ui/parser/doc-before-rbrace.rs src/test/ui/parser/raw-byte-string-eof.rs src/test/ui/parser/macros-no-semicolon-items.rs src/test/ui/parser/issue-32505.rs src/test/ui/parser/pat-tuple-1.rs src/test/ui/parser/bounds-type-where.rs src/test/ui/parser/doc-before-fn-rbrace.rs src/test/ui/parser/bound-single-question-mark.rs src/test/ui/parser/use-ends-with-mod-sep.rs src/test/ui/parser/issue-20711-2.rs src/test/ui/parser/impl-parsing.rs src/test/ui/parser/multiline-comment-line-tracking.rs src/test/ui/parser/raw-str-unbalanced.rs src/test/ui/parser/bind-struct-early-modifiers.rs src/test/ui/parser/attr-before-eof.rs src/test/ui/parser/removed-syntax-extern-const.rs src/test/ui/parser/import-from-path.rs src/test/ui/parser/inner-attr-after-doc-comment.rs src/test/ui/parser/bad-pointer-type.rs src/test/ui/parser/issue-6610.rs src/test/ui/parser/removed-syntax-mode.rs src/test/ui/parser/trait-pub-method.rs src/test/ui/parser/removed-syntax-mut-vec-expr.rs src/test/ui/parser/removed-syntax-uniq-mut-ty.rs src/test/ui/parser/struct-field-numeric-shorthand.rs src/test/ui/parser/doc-before-attr.rs src/test/ui/parser/where-clauses-no-bounds-or-predicates.rs src/test/ui/parser/column-offset-1-based.rs src/test/ui/parser/inverted-parameters.rs src/test/ui/parser/doc-before-struct-rbrace-2.rs src/test/ui/parser/trait-pub-assoc-ty.rs src/test/ui/parser/issue-24375.rs src/test/ui/parser/issue-15980.rs src/test/ui/parser/associated-types-project-from-hrtb-explicit.rs src/test/ui/parser/extern-no-fn.rs src/test/ui/parser/trait-pub-assoc-const.rs src/test/ui/parser/bounds-lifetime-2.rs src/test/ui/parser/better-expected.rs src/test/ui/parser/removed-syntax-fn-sigil.rs src/test/ui/parser/not-a-pred.rs src/test/ui/parser/doc-before-mod-rbrace.rs src/test/ui/parser/pat-lt-bracket-7.rs src/test/ui/parser/doc-before-struct-rbrace-1.rs src/test/ui/parser/removed-syntax-closure-lifetime.rs src/test/ui/parser/attrs-after-extern-mod.rs src/test/ui/parser/issue-17904-2.rs src/test/ui/parser/issue-24780.rs src/test/ui/parser/removed-syntax-mut-vec-ty.rs src/test/ui/parser/paren-after-qualified-path-in-match.rs src/test/ui/parser/issue-32214.rs src/test/ui/parser/recover-enum2.rs src/test/ui/parser/doc-before-eof.rs src/test/ui/parser/bounds-lifetime-1.rs src/test/ui/parser/doc-after-struct-field.rs src/test/ui/parser/attr-dangling-in-mod.rs src/test/ui/parser/doc-before-semi.rs src/test/ui/parser/issue-33455.rs src/test/ui/parser/import-glob-rename.rs src/test/ui/parser/import-from-rename.rs src/test/ui/parser/removed-syntax-field-semicolon.rs src/test/ui/parser/extern-foreign-crate.rs src/test/ui/parser/import-glob-path.rs src/test/ui/parser/use-as-where-use-ends-with-mod-sep.rs src/test/ui/parser/class-implements-bad-trait.rs src/test/ui/parser/pat-lt-bracket-2.rs src/test/ui/parser/attr-bad-meta-3.rs src/test/ui/parser/issue-10392-2.rs src/test/ui/parser/attr-dangling-in-fn.rs src/test/ui/parser/regions-out-of-scope-slice.rs src/test/ui/parser/removed-syntax-static-fn.rs src/test/ui/parser/trait-bounds-not-on-impl.rs src/test/ui/parser/unsized.rs src/test/ui/parser/empty-impl-semicolon.rs src/test/ui/parser/attr-bad-meta-2.rs src/test/ui/parser/inner-attr.rs src/test/ui/parser/doc-before-extern-rbrace.rs src/test/ui/parser/issue-19096.rs src/test/ui/parser/issue-19398.rs src/test/ui/parser/pat-ref-enum.rs src/test/ui/parser/issue-32446.rs src/test/ui/parser/trait-object-lifetime-parens.rs src/test/ui/parser/bounds-type.rs src/test/ui/parser/issue-17904.rs src/test/ui/parser/pat-lt-bracket-1.rs src/test/ui/parser/issue-17718-const-mut.rs src/test/ui/parser/removed-syntax-ptr-lifetime.rs src/test/ui/parser/trait-object-bad-parens.rs src/test/ui/parser/multitrait.rs src/test/ui/parser/extern-expected-fn-or-brace.rs src/test/ui/structs/struct-field-init-syntax.rs src/test/ui/structs/struct-duplicate-comma.rs src/test/ui/structs/struct-missing-comma.rs src/test/ui/tuple/tuple-struct-fields/test.rs src/test/ui/invalid/invalid-variadic-function.rs src/test/ui/issues/issue-20616-5.rs src/test/ui/issues/issue-49257.rs src/test/ui/issues/issue-34222-1.rs src/test/ui/issues/issue-20616-4.rs src/test/ui/issues/issue-20616-6.rs src/test/ui/issues/issue-46186.rs src/test/ui/issues/issue-20616-7.rs src/test/ui/issues/issue-20616-3.rs src/test/ui/issues/issue-39616.rs src/test/ui/issues/issue-20616-2.rs src/test/ui/issues/issue-20616-1.rs src/test/ui/issues/issue-48636.rs src/test/ui/issues/issue-20616-8.rs src/test/ui/issues/issue-20616-9.rs src/test/ui/issues/issue-44021.rs src/test/ui/issues/issue-45296.rs src/test/ui/issues/issue-28433.rs src/test/ui/issues/issue-43196.rs src/test/ui/issues/issue-49040.rs src/test/ui/issues/auxiliary/issue-21146-inc.rs src/test/ui/self/self-vs-path-ambiguity.rs src/test/ui/traits/trait-alias.rs src/test/ui/parse-error-correct.rs src/test/ui/resolve/issue-54379.rs src/test/ui/missing/missing-block-hint.rs src/test/ui/pub/pub-restricted-non-path.rs src/test/ui/pub/pub-ident-struct.rs src/test/ui/pub/pub-ident-fn-or-struct.rs src/test/ui/pub/pub-ident-fn-or-struct-2.rs src/test/ui/pub/pub-restricted-error-fn.rs src/test/ui/pub/pub-restricted-error.rs src/test/ui/pub/pub-ident-fn.rs src/test/ui/pub/pub-ident-fn-2.rs src/test/ui/pub/pub-restricted.rs src/test/ui/attrs-with-no-formal-in-generics/attrs-with-no-formal-in-generics-1.rs src/test/ui/attrs-with-no-formal-in-generics/attrs-with-no-formal-in-generics-3.rs src/test/ui/attrs-with-no-formal-in-generics/attrs-with-no-formal-in-generics-2.rs src/test/ui/macros/macro-attribute.rs src/test/ui/exclusive-range/exclusive_range_pattern_syntax_collision3.rs src/test/ui/error-codes/E0585.rs src/test/ui/error-codes/E0586.rs src/test/ui/extern/extern-const.rs src/test/ui/impossible_range.rs src/test/ui/associated-types/issue-36499.rs src/test/ui/type/type-ascription-instead-of-statement-end.rs src/test/ui/did_you_mean/issue-48492-tuple-destructure-missing-parens.rs src/test/ui/did_you_mean/issue-41679-tilde-bitwise-negation-attempt.rs src/test/ui/did_you_mean/issue-54109-and_instead_of_ampersands.rs src/test/ui/did_you_mean/bad-assoc-pat.rs src/test/ui/did_you_mean/multiple-pattern-typo.rs src/test/ui/did_you_mean/bad-assoc-ty.rs src/test/ui/did_you_mean/issue-40006.rs src/test/ui/did_you_mean/issue-46718-struct-pattern-dotdotdot.rs src/test/ui/bad/bad-crate-name.rs src/test/ui/trait-alias-fail.rs src/test/ui/obsolete-syntax-impl-for-dotdot.rs src/test/ui/feature-gates/feature-gate-extern_prelude.rs src/test/ui/syntax-trait-polarity.rs src/test/run-pass/macros/auxiliary/macro-include-items-expr.rs src/test/run-pass/macros/auxiliary/macro-comma-support.rs src/test/run-pass/shebang.rs src/tools/clang/test/Sema/renderscript.rs src/tools/rustfmt/tests/coverage/target/comments.rs ```AFAICT, none of those should parse, but I haven't double-checked against
libsyntax
yet. Also, it might be useful to do the reverse (to make sure we don't parse more thanrustc
).In order to get as much as possible to parse, I did include unstable syntax. However, we might be able to separate it out, or even just comment it out for now.
There are 3 blacklisted files, because they're unfeasibly large to parse currently:
libcore/unicode/tables.rs
issues/issue-29466.rs
issues/issue-29227.rs
I might end up using something similar to them to benchmark improvements to
lykenware/gll
.