mrkkrp / megaparsec

Industrial-strength monadic parser combinator library
Other
919 stars 87 forks source link

Add non-eager variants for `sepBy` combinators #401

Closed johnchandlerburnham closed 4 years ago

johnchandlerburnham commented 4 years ago

In this stack overflow issue: https://stackoverflow.com/questions/60222934/avoid-parsing-last-separator-with-sepby

Suppose we have

foo :: Parser a -> Parser b -> Parser c -> Parser (Maybe c)
foo p sep q = sepBy p sep >> optional (sep >> q)

For foo (string "a") (string " ") (string "b"), we might expect "a a b" to parse, but it does not, since sepBy eagerly descends into another sep *> p iteration when it sees the space character.

This can be solved by adding a try to sepBy1 to make the noneager variants:

sepBy' :: Alternative m => m a -> m sep -> m [a]
sepBy' p sep = sepBy1' p sep <|> pure []

sepBy1' :: Alternative m => m a -> m sep -> m [a]
sepBy1' p sep = (:) <$> p <*> many (try $ sep *> p)

in Text.Megaparsec.Combinator. Ideally with inlining and docs explaining the eager behavior of ordinary sepBy. Happy to send a PR for this.

mrkkrp commented 4 years ago

Here is an example how to do it:

p :: Parser (String, String)
p = (,)
  <$> sepBy (char 'a') (try (char 's' <* notFollowedBy (char 'o')))
  <*> string "something"
mrkkrp commented 4 years ago

I believe this is called "non-greedy", not "non-eager".

johnchandlerburnham commented 4 years ago

I thought "greedy" and "eager" were synonyms, but I may be mistaken.

Using notFollowedBy is fine when you know the parser that follows, but I'm not sure it's general if you have:

p :: Parser a -> Parser (String, a)
p q = (,)
  <$> sepBy (char 'a') (try (char 's' <* notFollowedBy undefined))
  <*> q

My specific use case for non-greedy sepBy' was similar to this. I'm using Megaparsec to write the Haskell implementation of the Formality proof language, and ran into an issue on the case expression parser: https://github.com/moonad/Formality-Haskell/blob/353383df1202c7467f3842663fc6eacf3ee5c79f/src/Parser/Lang.hs#L311

case x
| true => 1
| false => 2
: Number

Essentially, I had to have my space consumer parser as the separator between different case variants (e.g. | true => 1), but not consume any space after the last variant if the type annotation is omitted. Would be infeasible to use notFollowedBy, since the next parser could be many many different things depending on context.

Anyway, I already solved my issue using the non-greedy sepBy1', but I thought I'd offer to upstream a PR here since it seemed like something other people have also been confused by. No problem whatsoever if that's not of interest, big fan of your library!

mrkkrp commented 4 years ago

It is not general, you need to tune it according to the specific grammar you want to parse.

The problem is that non-greedy versions require try and all the combinators like this come from the parser-combinators package which works with any Applicative / Alternative / Monad instances and doesn't know anything about try (which is often a good thing because the package can be used with other parsing libraries and it is easier to reason about combinators that cannot use try).

Your idea is not unreasonable, it's just there is an endless amount of combinators that we can have and I need to decide what to include and what not to include and where to draw the line.

johnchandlerburnham commented 4 years ago

Okay, not wanting to have the combinators depend on try makes sense to me. Thanks!

flip111 commented 3 years ago

Here is an example how to do it:

p :: Parser (String, String)
p = (,)
  <$> sepBy (char 'a') (try (char 's' <* notFollowedBy (char 'o')))
  <*> string "something"

Here notFollowedBy (char 'o') is needed to not match on the second letter of "something". In case you want to parse something else than "something", perhaps a choice <|> of things, than your notFollowedBy parser also becomes more complicated. Is there anything at all that can be done not having to figure out the notFollowedBy part? From a high-level view it doesn't seem impossible because regex can also figure it out itself how to do a non-greedy match.