stan-dev / stanc3

The Stan transpiler (from Stan to C++ and beyond).
BSD 3-Clause "New" or "Revised" License
143 stars 47 forks source link

Simplify parser #614

Open nhuurre opened 4 years ago

nhuurre commented 4 years ago

Simplify parser

The parser distinguishes between var_decl, top_var_decl, and top_var_decl_no_assign even though the AST does not. There's no logical need for the grammar to have three separate rules; semantic check could handle the potential syntax errors. For instance, Stan does not allow tilde-statements outside of the model block but there's no special statement_no_tilde rule in the grammar for other blocks. So this fails to parse

parameters {
  real x;
  x ~ std_normal(); // only decls allowed
}

but this parses successfully and fails in semantic check

parameters {
  real x;
}
generated quantities {
  x ~ std_normal(); // any statement parses
}

Parser errors are defined in parser.messages. That file is not easy to work with. Currently it is incomplete because I didn't know I was supposed add some error states there in PR #573 The parser error messages have no context other than the parser state that failed. We decided to fix #469 by changing the error message. The message relevant is here: https://github.com/stan-dev/stanc3/blob/458a7933728debdb8b3ce14e5af6960faef95513/src/frontend/parser.messages#L1392-L1402 The parser state 661 corresponds to any unexpected symbol in data or parameters block. As far as I know the only way to give a specific error for a semicolon is by creating a new parser state with the UNREACHABLE hack. On the other hand, if this were handled in Semantic_check the change would be trivial.

So I'm proposing changing the parser to accept any program that AST can represent and moving all other checks to Semantic_check.

@rybern what do you think?

rybern commented 4 years ago

Yeah, interesting question.

One thing I should mention is that I did some refactoring of those declaration rules in #561. Those changes should substantially reduce their redundancy, and certainly we could do more simplifying/clarifying.

I agree with all of your points about the shortcomings of the parser setup, especially error messages. (Side note: we should really add a check to Jenkins for completeness of parser.messages).

Ultimately we should put the logic wherever its easiest to read, to update, and to make it obviously correct (plus efficient when that's a concern). To those ends, here are some (all very similar) design principals that I like:

So, in general, I prefer to put as much logic as possible into (declarative) parsing+type constraints and as little as possible in validation. That isn't to say that parser.mly is necessarily the right place to put that logic, the real point is that the constraints get encoded into the data structures we build. You could even argue that Semantic_check is a parsing step because it's building up a typed program from an untyped one, but it's checking many more things than are encoded in the output.

Coming back to the specific points you raised:

As far as I know the only way to give a specific error for a semicolon is by creating a new parser state with the UNREACHABLE hack. On the other hand, if this were handled in Semantic_check the change would be trivial.

To make sure I understand, you mean that if we parsed general statements in the parameters block, it would be easy to throw a meaningful error in semantic check when we find a non-declaration statement? That's fair.

So I'm proposing changing the parser to accept any program that AST can represent and moving all other checks to Semantic_check.

I agree that it's reasonable that Menhir should work on the same level as the AST can encode.

However, I think that if we move logic into OCaml (like Semantic_check), we should make the types more expressive so that that logic doesn't just become validation. That might even mean another intermediate representation before the AST, and possibly improving the type of the AST. Short of that refactor, I think I would prefer to restrict what the AST can encode rather than move logic from Menhir into Semantic_check.

A side point to consider is that moving too much logic out of Menhir might mean we have to reimplement some of its error handling capability, which might get cumbersome again.