j-mie6 opened 3 years ago
After some thought, I think the cleanest way of going about this is to split up into two modules: `Lexer` and `Parser`. The lexing module then contains any primitive token combinators, as well as exposing the `token` and `fully` combinators, properly abstracting whitespace away from the main parser (which need not be aware of it). The idea is that `token` atomically tries to parse something and handles any whitespace immediately after it. On the other hand, `fully` makes sure to consume the very initial whitespace and ensures the end of file is reached at the very end.
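A minimal sketch of these two combinators, using `ReadP` from `base` rather than the library's actual API (the definitions are assumptions matching the description above, not the real implementation):

```haskell
import Text.ParserCombinators.ReadP

-- Sketch: `token` runs a parser and then eats any trailing
-- whitespace, so the grammar rules never have to mention spaces.
-- (ReadP explores all parses, so no explicit `attempt` is needed here;
-- in a backtracking-by-request library, `p` would be wrapped atomically.)
token :: ReadP a -> ReadP a
token p = p <* skipSpaces

-- Sketch: `fully` consumes the very initial whitespace and demands
-- end-of-input, so the whole input must be one well-formed phrase.
fully :: ReadP a -> ReadP a
fully p = skipSpaces *> p <* eof
```

With these, `readP_to_S (fully (token (string "hi"))) "  hi  "` parses the padded input completely.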
When it comes to keywords, it's best not to use `token . string`, since this doesn't ensure that the keyword we are trying to parse is not the prefix of another, longer token (i.e. an identifier). Instead, a bespoke `keyword` combinator can also be exposed by the `Lexer` API.
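One way such a `keyword` combinator might look, again sketched with `ReadP` (the definition is an assumption; the key point is the negative lookahead after the word):

```haskell
import Data.Char (isAlphaNum)
import Text.ParserCombinators.ReadP

-- Assumed definition from the sketch above: a token eats trailing whitespace.
token :: ReadP a -> ReadP a
token p = p <* skipSpaces

-- Sketch: match the word itself, then check (without consuming) that no
-- identifier character follows, so `keyword "if"` cannot succeed on the
-- prefix of the identifier "iffy".
keyword :: String -> ReadP ()
keyword kw = token $ do
  _    <- string kw
  rest <- look
  case rest of
    (c:_) | isAlphaNum c || c == '_' -> pfail
    _                                -> return ()
```

Here `keyword "if"` succeeds on `"if x"` but fails outright on `"iffy"`, which `token (string "if")` would wrongly accept.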
Although it's often not specified by the grammar, parsers should be careful to consider how whitespace is consumed and different tokens separated lexically. This can get really messy though, and in the worst case a parser writer might be tempted to insert whitespace handling logic all over the parser!
Most of the time, parsers handle this by separating lexing and parsing into two separate stages: the lexer handles whitespace and distinguishes the tokens, and then the parser finds structure in the token stream. With parser combinators, however, this isn't always the case. Often, we want to incorporate the two stages into a single parser. This allows token selection to depend on which part of the grammar is being explored: this is really helpful for distinguishing `SUBTRACT`, `NEGATE`, and negative integer literals, for instance!

Surely there has to be a way of leveraging reusable higher-order combinators to abstract the lexical parts of the parser nicely and uniformly?
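To illustrate that single-pass point, here is a toy expression grammar (a `ReadP` sketch; the grammar and names are invented for illustration) in which the very same `-` character is read as subtraction between operands but folded into a negative literal in atom position:

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Assumed helper, as in the earlier sketches: eat trailing whitespace.
token :: ReadP a -> ReadP a
token p = p <* skipSpaces

-- In literal position, a leading '-' plays the NEGATE role and is
-- folded straight into the number.
literal :: ReadP Int
literal = token $ do
  sign <- option id (char '-' *> pure negate)
  sign . read <$> munch1 isDigit

-- Between operands, '-' plays the SUBTRACT role instead; because
-- lexing and parsing are one pass, the grammar position decides.
expr :: ReadP Int
expr = literal `chainl1` (token (char '-') *> pure (-))

parseExpr :: String -> [(Int, String)]
parseExpr = readP_to_S (skipSpaces *> expr <* eof)
```

On `"10 - -3"` the first `-` is subtraction and the second is negation, so the parse yields `13`; a separate context-free lexer would have to commit to one reading of `-` up front.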