mrkkrp / megaparsec

Industrial-strength monadic parser combinator library
Other
919 stars 87 forks source link

(<|>) is not associative #412

Open gelisam opened 4 years ago

gelisam commented 4 years ago

Here is an example demonstrating that we get different error messages depending on how we parethesize a <|> b <|> c.

import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char.Lexer

type Parser = Parsec Void String

sym :: String -> Parser ()
sym s = do
  _ <- symbol (pure ()) s
  pure ()

-- consumes one character, then rolls back
a :: Parser ()
a = try (sym "a" >> empty) <?> "a"

-- fails immediately (on the given input)
b :: Parser ()
b = sym "b"

-- succeeds immediately
c :: Parser ()
c = pure ()

-- |
-- >>> parseTest (leftAssociative >> sym "d") "aaa"
-- 1:1:
--   |
-- 1 | aaa
--   | ^
-- unexpected 'a'
-- expecting 'd'
leftAssociative :: Parser ()
leftAssociative = (a <|> b) <|> c

-- |
-- >>> parseTest (rightAssociative >> sym "d") "aaa"
-- 1:1:
--   |
-- 1 | aaa
--   | ^
-- unexpected 'a'
-- expecting 'b' or 'd'
rightAssociative :: Parser ()
rightAssociative = a <|> (b <|> c)
gelisam commented 4 years ago

Personally I would prefer if the error message was expecting a, 'b', or 'd', because the user has three ways to try to fix the problem: they can fix the input so that a well-formed a begins at that character (which would require the user to understand why "aaa" is not a valid a even though it begins with an 'a'), so that a valid sym "b" begins at that character (which would require that character to be a 'b'), or so that a valid sym "d" begins at that character (which would require that character to be a 'd').

But since we wouldn't want e.g. a = try (sym "a" >> sym "z") to cause the error message to incorrectly suggest replacing the first 'a' with a 'z', I understand that it's probably easier not to include the suggestions which involve a try which has consumed characters. Maybe sym "a" sets a flag indicating that it has consumed a character, causing try to clear the suggestions when it rolls back, but try doesn't clear that flag, thus causing sym "b"'s suggestion to also get cleared?

Ailrun commented 4 years ago

The problem is caused by hint/error distinction and how error merge works.

(try (sym "a" >> empty) <|> sym "b") <|> pure ()
===> ((error [] at 1) <|> sym "b") <|> pure ()
===> ((error [] at 1) <|> error ["b"] at 0) <|> pure ()
===> (error [] at 1) <|> pure () -- the errors are merged (the second is ignored because of its offset)
===> hint [] -- since @pure ()@ succeeds, it tries to promote the error to a hint, and ignores it because of its offset.

try (sym "a" >> empty) <|> (sym "b" <|> pure ())
===> (error [] at 1) <|> (sym "b" <|> pure ())
===> (error [] at 1) <|> (error ["b"] at 0 <|> pure ())
===> (error [] at 1) <|> hint ["b"] -- since @pure ()@ succeeds, the second error is promoted to a hint
===> hint ["b"]
mrkkrp commented 4 years ago

This is a tricky one. Would be nice to have it fixed, but I'm not sure how.

gelisam commented 4 years ago

How about keeping track of both the error message at the latest-consumed position and at the rightmost-observed position? Here is a long explanation of why I think it's a good idea, a prototype, and a proof that my prototype's implementation of (<|>) is associative: https://gist.github.com/gelisam/bca49348986f119457837cdbfa746cd4

mrkkrp commented 4 years ago

@gelisam Thank you! I'll get to reading this soon.

mrkkrp commented 4 years ago

Sorry for the delay. I did not forget. Will try to look at this today.

gelisam commented 4 years ago

I'm not in a hurry, take your time!

mrkkrp commented 4 years ago

OK, I looked at this today. @gelisam I appreciate your write up and the propotype, but what we want here is one minimal change to the existing code base, a change that would not affect performance or anything else. I added your test to the test suite in the choice-associativity branch.

Looking at the two cases it seems to me that the "right associative" one works as it should. "Left associative" should do better. More precisely, the problem seems to be that it discards the error at offset 0 completely while it potentially still can be useful later (when the error is converted to hints). I think it could be preserved somehow "just in case" and then restored. But well, it is just an idea. A ParseError can have only one offset to begin with, so with current code base this path seems impossible. Let's face it too, it is nice to have simpler representation for parse errors. There doesn't seem to be another way to save that error at offset 0.

In the end, Parsec-like parser are stateful monsters. The current situation is the result of the compromises that we've made over time and in most cases the library works nicely. I'd welcome a PR which attempts to solve this in an non-intrusive way, but I myself will not allocate time for working on this issue because it is not clear how we could do better without a major re-write.

Lev135 commented 1 year ago

More precisely, the problem seems to be that it discards the error at offset 0 completely while it potentially still can be useful later (when the error is converted to hints).

Maybe we should convert errors with less offsets into hints (and store hints not only for ok branches, but also for error)? If I understood the problem correctly, it should be fixed:

(try (sym "a" >> empty) <|> sym "b") <|> pure ()
===> ((error [] at 1) <|> sym "b") <|> pure ()
===> ((error [] at 1) <|> error ["b"] at 0) <|> pure ()
===> (error [] at 1, hint ["b"]) <|> pure () -- the errors are merged (the second is converted to hint because of its offset)
===> hint ["b"]

try (sym "a" >> empty) <|> (sym "b" <|> pure ())
===> (error [] at 1) <|> (sym "b" <|> pure ())
===> (error [] at 1) <|> (error ["b"] at 0 <|> pure ())
===> (error [] at 1) <|> hint ["b"] -- since @pure ()@ succeeds, the second error is promoted to a hint
===> hint ["b"]