ollef / Earley

Parsing all context-free grammars using Earley's algorithm in Haskell.
BSD 3-Clause "New" or "Revised" License
361 stars 24 forks source link

Extensible languages? #28

Closed liamoc closed 7 years ago

liamoc commented 8 years ago

Hi,

I know this might be impossible with this library directly, but I was wondering how to allow the user to declare their own mixfix identifiers in the document being parsed. In parsec, or something, because the parsers are monadic I can simply keep updating my table as I parse more information, but as this library is strictly only for context-free languages, it makes this approach not feasible.

I guess I'd have to partially parse the file and grab all the declarations first, and then parse the document properly with earley? Is there an easier way?

sboosali commented 8 years ago

I feel like it would have to be staged (like you say), but you can parse both with Earley.

ollef commented 8 years ago

Hey Liam!

This is a good question!

Spiros is right. I'll add my experience as well:

Staging seems to be the norm in the languages that I've looked at (GHC has a separate pass to fix-up the operator precedence after parsing, Agda uses parallel parsing combinators to parse mixfix identifiers after having done a "rough parse" with Happy).

Basically you can have a first parsing pass that parses all expressions as e.g. lists of identifiers, but otherwise gives you the rough structure of the file (i.e. grab all declarations as you say), and then when all identifiers are known, do the mixfix parsing.

liamoc commented 8 years ago

Thanks for your replies.

Not related to the original question, but I wonder how I could accomplish alignment sensitive parsing with Earley? Presumably I'd have to preprocess somehow. My tricks in parsec here also relied on monadic parsing.

liamoc commented 8 years ago

A rough idea would be,

Run a lexer first that annotates each token with its x position, determine each level of alignment used in the program in an initial pass and generate a unique production for each one in the grammar monad. But that sounds very painful.

ollef commented 8 years ago

Yeah, I think preprocessing could work. One idea that I've seen before that doesn't sound too painful is to insert generated indent and unindent tokens, basically inserting the explicit braces that the indentation "denotes".

If you're doing multi-pass parsing as discussed above, it might also make sense to do the first pass with a monadic parser like Parsec which could then handle indentation.

phadej commented 7 years ago

@ollef check https://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf and https://pdfs.semanticscholar.org/cd8e/5faaa60dfa946dd8a79a5917fe52b4bd0346.pdf

My gut feeling is that it won't be too hard to add that to Earley, but it might turn out to be quite intrusive.

ollef commented 7 years ago

Looks interesting, thanks! I probably won't be adding anything like that in the foreseeable future, but contributions would be welcome. :)