j-mie6 / design-patterns-for-parser-combinators

A Really Cool Calculator written in Haskell... or is it?
BSD 3-Clause "New" or "Revised" License

In-Place Lexing #5

Open j-mie6 opened 3 years ago

j-mie6 commented 3 years ago

Although it's often not specified by the grammar, parsers should be careful to consider how whitespace is consumed and how different tokens are separated lexically. This can get really messy though, and in the worst case a parser writer might be tempted to scatter whitespace-handling logic all over the parser!

Dealing with tokens while parsing is intrusive to the overall structure of the parser and introduces clutter.

Most of the time, parsers handle this by separating lexing and parsing into two stages. The lexer handles whitespace and distinguishes the tokens, and then the parser finds structure in the token stream. With parser combinators, however, this isn't always the case. Often, we want to incorporate the two stages into a single parser. This allows token selection to depend on which part of the grammar is being explored: this is really helpful for distinguishing SUBTRACT, NEGATE, and negative integer literals, for instance!

Surely there has to be a way of leveraging reusable higher-order combinators to abstract the lexical parts of the parser nicely and uniformly?

j-mie6 commented 3 years ago

After some thought, I think the cleanest way of going about this is to split the parser into two modules: Lexer and Parser. The lexing module then contains any primitive token combinators, as well as exposing token and fully combinators, properly abstracting whitespace away from the main parser (which need not be aware of it).

The idea is that token atomically attempts a parser and consumes any whitespace immediately after it. On the other hand, fully consumes the initial whitespace and ensures the end of input is reached once the parser is done.
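As a sketch of what that might look like, here is one possible shape for token and fully, written against a throwaway Maybe-based Parser rather than whatever combinator library the project actually uses (the names whitespace, satisfy, and number are illustrative, not part of any real API):

```haskell
import Control.Applicative (Alternative (..))
import Data.Char (isDigit, isSpace)

-- A throwaway backtracking parser; failure (Nothing) never consumes input,
-- so every parser here is already "atomic" in the sense described above.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser (fmap (\(x, rest) -> (f x, rest)) . p)

instance Applicative Parser where
  pure x = Parser (\s -> Just (x, s))
  Parser pf <*> Parser px = Parser $ \s -> do
    (f, s')  <- pf s
    (x, s'') <- px s'
    Just (f x, s'')

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser (\s -> p s <|> q s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser go
  where
    go (c:cs) | p c = Just (c, cs)
    go _            = Nothing

eof :: Parser ()
eof = Parser (\s -> if null s then Just ((), s) else Nothing)

whitespace :: Parser ()
whitespace = () <$ many (satisfy isSpace)

-- token: parse the payload, then eat the whitespace that follows it
token :: Parser a -> Parser a
token p = p <* whitespace

-- fully: eat the leading whitespace, parse, and insist on end of input
fully :: Parser a -> Parser a
fully p = whitespace *> p <* eof

-- an example token built on top of the abstraction
number :: Parser Int
number = token (read <$> some (satisfy isDigit))
```

With this shape, runParser (fully number) "  42  " succeeds even though the grammar for number never mentions whitespace at all.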

j-mie6 commented 3 years ago

When it comes to keywords, it's best not to use token . string, since this doesn't ensure that the keyword we are trying to parse isn't a prefix of another, longer token (e.g. an identifier: think "if" versus "iffy"). Instead, a bespoke keyword combinator can be exposed by the Lexer API.
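One way that check can look, again sketched against a throwaway Maybe-based Parser rather than the project's real combinators (string, notFollowedBy, and token are all defined inline here purely for illustration):

```haskell
import Data.Char (isAlphaNum, isSpace)

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

-- match an exact string at the front of the input
string :: String -> Parser String
string kw = Parser $ \s -> case splitAt (length kw) s of
  (pre, rest) | pre == kw -> Just (kw, rest)
  _                       -> Nothing

-- succeed (consuming nothing) only if the next character fails the predicate
notFollowedBy :: (Char -> Bool) -> Parser ()
notFollowedBy p = Parser $ \s -> case s of
  (c:_) | p c -> Nothing
  _           -> Just ((), s)

-- as before: parse the payload, then eat trailing whitespace
token :: Parser a -> Parser a
token p = Parser $ \s -> do
  (x, rest) <- runParser p s
  Just (x, dropWhile isSpace rest)

-- keyword: the word must not be followed by an identifier character,
-- so keyword "if" rejects "iffy" instead of matching its prefix
keyword :: String -> Parser ()
keyword kw = token $ Parser $ \s -> do
  (_, s') <- runParser (string kw) s
  runParser (notFollowedBy isAlphaNum) s'
```

The notFollowedBy guard is what token . string lacks: it turns "matched these characters" into "matched this whole token".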