mrkkrp / megaparsec

Industrial-strength monadic parser combinator library
Other
919 stars 87 forks source link

Greedy combinators #513

Open bristermitten opened 1 year ago

bristermitten commented 1 year ago

This may be an XY problem, but I'm having an issue with sepEndBy not being greedy enough, and is hiding errors.

For context, I'm parsing a haskell-like language with the following syntax

module Main
import Prelude
let main = println "Hello world!"

The code to do this is pretty simple:

module' :: Parser (Module Frontend)
module' = dbg "module'" $ do
    header <- optional . try $ header
    let _name = maybe (ModuleName ("Main" :| [])) fst header
    skipNewlines
    imports <- import' `sepEndBy` many newline
    declarations <- declaration _name `sepEndBy` many newline

    pure $
        Module
            { _moduleName = _name
            , _moduleExposing = maybe ExposingAll snd header
            , _moduleImports = imports
            , _moduleDeclarations = declarations
            }

header :: Parser (ModuleName, Exposing MaybeQualified)
header = do
    -- module Name exposing (..)
    symbol "module"
    moduleName' <- lexeme Parse.moduleName
    exposing' <- exposing
    pure (moduleName', exposing')

Using module' <* eof as the "entrypoint" parser.

The problem is, if declaration fails for some reason, the error thrown doesn't propagate up. Instead, sepEndBy just returns an empty list, and the parser succeeds. This results in some very opaque error messages. For example, if I omitted the body for the let declaration: let main = I'd expect to see an error saying something akin to Unexpected end of input, expecting expression. Instead, the module' succeeds with an empty list of declarations, and the eof causes an unintuitive and vague error message:

unexpected 'l'
expecting "import", end of input, or newline

(I'm not actually sure why it says this and not something like Unexpected let ..., expecting eof, but that's not the main problem here)

I have dbg'd and declaration throws the expected error, the problem is with sepEndBy.

My question is, is there an alternative to sepEndBy that will "greedily" parse the declarations? I still want to allow empty modules so sepEndBy1 won't do, but if a let is seen it should keep parsing and fail if necessary, rather than backtracking and returning an empty list.

Thanks for any help!

bristermitten commented 1 year ago

I wrote my own combinator

sepEndBy' :: Parser a -> Parser sep -> Parser [a]
sepEndBy' p sep = do
    x <- try p
    xs <- many (try (sep >> p))
    pure (x : xs)

and it seems to work, but not sure if this is an optimal way

mrkkrp commented 1 year ago

Can you show declaration? I think the issue here is that when declaration fails it completely backtracks (I don't know why, do you wrap the whole thing with a try by chance?) What you want is to "commit" early enough to parsing a declaration, that is, you need your declaration parser to consume input so that it cannot backtrack anymore. This way sepEndBy won't be able to succeed in the way you are observing and you should get a more helpful and localized error message. Your custom combinator sepEndBy' forces at least one occurrence of p (in that it is similar to sepEndBy1) and so it appears to solve the problem for the first declaration in your input, however you may find that the second and later declarations still suffer from the same problem.

bristermitten commented 1 year ago

That explanation makes sense, thanks for your response. How could I make it commit?

declaration currently looks like this:

declaration :: ModuleName -> Parser (Declaration Frontend)
declaration = liftA2 (<|>) defDec letDec

defDec :: ModuleName -> Parser (Declaration Frontend)
defDec modName = do
  symbol "def"
  name <- NVarName <$> lexeme varName
  symbol ":"
  ty <- type'
  let annotation = TypeAnnotation name ty
  let declBody = ValueTypeDef (Just annotation)
  pure (Declaration modName name declBody)

letDec :: ModuleName -> Parser (Declaration Frontend)
letDec modName = do
  (name, patterns, e) <- L.nonIndented sc letRaw
  let value = Value e patterns Nothing
  pure (Declaration modName (NVarName name) value)

letRaw :: Parser (MaybeQualified VarName, [Frontend.Pattern], Frontend.Expr)
letRaw = do
    ((name, patterns), e) <- optionallyIndented letPreamble element
    pure (name, patterns, e)
  where
    letPreamble = do
        symbol "let"
        name <- lexeme varName
        patterns <- sepBy (lexeme pattern') sc
        symbol "="
        pure (name, patterns)

To allow the syntax def <name> : <type> and let <name> = <body> with some messy indentation handling ;) Let me know if you need to see any more code. Thanks!

mitchellwrosen commented 1 year ago

@knightzmc Check out the headed-megaparsec package for some inspiration. The technique used there, which I quite like and often copy around to various projects, is to have a parser either return a value as normal (with the usual alternative semantics) or return another parser (which effectively "commits" to the current branch, and returns the parser to run on the remainder of the input).

However, I do wonder whether another package is even necessary. Could megaparsec itself just expose a commit/cut combinator which throws away the failure continuations of a parser? (Or perhaps there's a different way of committing that's already present?)

bristermitten commented 1 year ago

@mitchellwrosen I had a look at that package, seems to solve my problem! But I agree that there should be an easier way of doing this which doesn't require external libraries (and a lot of boilerplate converting to and from Parsec and HeadedParsec everywhere)

mrkkrp commented 1 year ago

From what I see in your code it should work as it is, but it is still incomplete, so I cannot be sure. @knightzmc Can you provide a repository with complete source code and an example input? I could then give it a try and it would perhaps be clearer what is going on there.

BlueNebulaDev commented 1 year ago

I'm having a similar issue.

I'd like to parse either a list of key-value pairs (separated by :), or a single value. For instance:

Here is a very minimal, simplified grammar. It doesn't support white spaces and both keys and values can only be identifiers. But it's enough to show the issue.

type KeyValue = (String, String)
data Val = KeyValList [KeyValue] | Val String deriving Show

ident :: Parser String
ident = some letterChar

keyVal :: Parser KeyValue
keyVal = do
    k <- ident
    ":"
    v <- ident
    pure (k, v)

prog :: Parser Val
prog = try (KeyValList <$> keyVal `sepEndBy1` ",") <|> (Val <$> ident)

This thing handles correctly the two examples I showed above, and reports the correct error for invalid inputs like a:x,b:err or or err or.

However the error it reports for an input like a:x,b:y,c or a:x,b:y,err or,c:z is not what I would like to see:

Input: "a:x,b:y"
Output: KeyValList [("a","x"),("b","y")]
✅

Input: "x"
Output: Val "x"
✅

Input: "a:x,b:err or"
1:10:
  |
1 | a:x,b:err or
  |          ^
unexpected space
expecting ',', end of input, or letter
✅

Input: "err or"
1:4:
  |
1 | err or
  |    ^
unexpected space
expecting end of input or letter
✅

Input: "a:x,b:y,c"
1:2:
  |
1 | a:x,b:y,c
  |  ^
unexpected ':'
expecting end of input or letter
❌

Input: "a:x,b:y,err or,c:z"
1:2:
  |
1 | a:x,b:y,err or,c:z
  |  ^
unexpected ':'
expecting end of input or letter
❌

When the input is a:x,b:y,c or a:x,b:y,err or,c:z, I would like to see an error after the last , that parsed correctly.

Is there any way to get Megaparsec to report the error I wish to see?

mrkkrp commented 1 year ago

@BlueNebulaDev with your definition of keyVal it cannot backtrack, since once it consumes at least one letter of ident it is already committed. Therefore somewhere after that final , it indeed fails, like you expect, but because you also have try around (KeyValList <$> keyVal `sepEndBy1` ",") that whole part backtracks to the very beginning of input. Next, ident gets a chance to run and fails with the error that you observe. You need to remove try from your definition of prog and instead think when it makes sense for keyVal to commit. It looks like as soon as there is : we can be sure that it should be a key-value pair so:

keyVal :: Parser KeyValue
keyVal = do
    k <- try (ident <* ":")
    v <- ident
    pure (k, v)

Try this and it should behave as you want in all four cases.

Still waiting a complete example from @knightzmc which should be fixable without resorting to any third-party libraries, but I'm happy to be proven wrong.

BlueNebulaDev commented 1 year ago

Thanks! After following your suggestion, the errors my parser is reporting are much better. I'm not sure I understand why a:x,b:err or gave the expected error even without this change though. I'll try to play with dbg to understand exactly what happens.

mrkkrp commented 1 year ago

"a:x,b:err or" resulted in a better error message because the sepEndBy1 combinator did not match the separator "," after fully consuming the second key-value pair. This way sepEndBy1 as a whole succeeded and then you probably have eof somewhere, which is actually what produces the error that you see.

bristermitten commented 1 year ago

Hey apologies for the delay. This code is pretty outdated now but if you're still happy to take a look I'd appreciate it!

https://github.com/ElaraLang/elara/blob/3de6a66f82a86a45726dc3b7aa1286bd7aaa6209/src/Elara/Parse/Declaration.hs