Using parsers with monad transformers

neongreen commented 9 years ago

(See the previous discussion here.)

ParsecT lets us combine parsing with things such as keeping state, accessing configuration, etc. However, any monad you embed into ParsecT won't be affected by Parsec's backtracking properties:

{-# LANGUAGE FlexibleContexts #-}

import Control.Monad.State
import Text.Parsec hiding (State)

trying s = modify (++ ["trying " ++ s])

test :: ParsecT String () (State [String]) Char
test = (trying "'a'" >> char 'a')
   <|> (trying "'b'" >> char 'b')

main = print $ flip execState [] $ runParserT test () "" "b"
-- prints:
--   trying 'a'
--   trying 'b'

In this example having non-backtracking state was desirable. Sometimes having backtracking state is desirable – for instance, my usecase is parsing a list of rules, where each rule is merely a named list of patterns, and patterns can reference earlier-defined patterns (which is why all parsed patterns are added to state); if a single pattern in a rule can't be parsed I want to throw the rule away completely and forget that I ever parsed it, which means forgetting about all successfully-parsed patterns in the rule. (Sorry if the example is unclear or seems contrived.)

One way to have backtracking state is to use Parsec's built-in “user state”. Unfortunately, this solution isn't extensible, as Parsec provides no built-in writer, reader, or any other useful monad you could think of.

Another way is to use

StateT MyStateType Parser a

which, if you aren't familiar with StateT, is another way of saying

MyStateType -> Parser (a, MyStateType)

which hopefully makes it clearer why this variant would preserve backtracking; if it doesn't, look at the definition of <|> for StateT:

m <|> n = StateT $ \ s -> runStateT m s <|> runStateT n s

Here both actions are applied to the same state, so it's impossible that both parsers' modifications would end up in the final state.

At the moment this solution isn't viable either, because all Parsec's parsers work in ParsecT and can't be used in StateT. So, we have to use lift:

{-# LANGUAGE FlexibleContexts #-}

import Control.Applicative
import Control.Monad.State
import Text.Parsec hiding (State, (<|>))
import Text.Parsec.String

trying s = modify (++ ["trying " ++ s])

test :: StateT [String] Parser Char
test = (trying "'a'" >> lift (char 'a'))
   <|> (trying "'b'" >> lift (char 'b'))

main = do
  let Right res = parse (flip execStateT [] test) "" "b"
  mapM_ putStrLn res
-- prints:
--   trying 'b'

However, we can't just define char = lift . primitiveCharParser because that would make char unusable in stacks of 2 and more transformers. So, instead we might have to create a new mtl-style class (perhaps MonadParsec?) instead. A similar class already exists in parsers (see Parsing, CharParsing); however, we likely won't be able to reuse parsers because parsers depends on parsec/attoparsec утв because parsers has different naming conventions etc:

I'm not against reinventing the wheel if it's easier then buying one and gives better results. For example Megaparsec have different naming conventions to parse various categories of characters, etc. Not sure we could reuse this “as is”.

— mrkkrp

mrkkrp commented 9 years ago

@neongreen, You have proposed many useful things, without these propositions Megaparsec would be different. I consider it significant contribution to the project. I would like to include you into the list of contributors, should I include you as "Artyom"?

neongreen commented 9 years ago

@mrkkrp, yep (and thanks). Just “Artyom” would be ambiguous, but since we're on Github anyway, you could just add a link to my Github nick:

Artyom (@neongreen)

mrkkrp commented 9 years ago

@neongreen, Can you point out where (<|>) is defined for StateT as:

m <|> n = StateT $ \ s -> runStateT m s <|> runStateT n s

?

neongreen commented 9 years ago

@mrkkrp, here <|> is defined as mplus, and here mplus is defined as follows:

m `mplus` n = StateT $ \ s -> runStateT m s `mplus` runStateT n s

So, yeah, the original definition uses mplus instead of <|>, but I didn't want to introduce extra baggage when pasting, so I rewrote it with <|>.

mrkkrp commented 9 years ago

@neongreen, I see. I will try to define mtl-style MonadParsec (although in Megaparsec we have more low-level functions than in Parsec (parsers is modeled after Parsec I guess) because I decided that we could define some of them clearer on lower level).

I'll try to eliminate user state, so user will need to combine monads to get backtracking state. This should work. Changes will live in monad-transformer branch until it's clear that everything is all right.

mrkkrp commented 9 years ago

I will also make combinators more general, many of them can work with Applicative and Alternative instances, not necessarily with MonadParsec instances.

mrkkrp commented 9 years ago

I had to make other minor changes to make it work smoothly. Also, it seems we could put functions to get current state (parser state, position in input stream, etc) into the class too. This way most part of the library will be defined for any instance of MonadParsec. Tests will need to be corrected too. This requires more work then it seemed.

mrkkrp commented 9 years ago

OK, this is almost done. I want to re-work some tests for Text.Megaparsec.Prim though. I think it's good idea to add tests that check backtracking abilities of StateT MyStateType Parser a monad combination. Once it's done I'll organize and push my changes.

@neongreen, I'll ask you to try the improvements and see if Megaparsec is flexible enough now.

mrkkrp commented 9 years ago

@neongreen, please try out code in monad-transformer branch. I also ask you to write one or two tests that you deem realistic and practical to test backtracking state achieved by the method you mention. I suggest you first test this locally and tell me about results, then I merge this into master. Then you can open a pull request adding the tests.

neongreen commented 9 years ago

Okay, here go some observations.

<?> should be moved out of the class, I think. Alternatively, there should be a MINIMAL pragma saying that only one of {label, <?>} is needed. Justification: I got bit by this when I was defining an instance of MonadParsec, forgot about both label and <?>, GHC didn't warn me (because they both have default definitions), and as the result I spent 10m trying to find out why my parser was going in an endless loop.

We need instances for MonadParsec s (WriterT w m) c and so on. Here's a sample instance for WriterT:

instance (Monoid w, MonadParsec s m c) => MonadParsec s (WriterT w m) c where
 unexpected x        = lift $ unexpected x
 eof                 = lift $ eof
 token x y           = lift $ token x y
 tokens x y z        = lift $ tokens x y z
 getParserState      = lift $ getParserState
 updateParserState f = lift $ updateParserState f
 try           (WriterT m) = WriterT $ try m
 label x       (WriterT m) = WriterT $ label x m
 lookAhead     (WriterT m) = WriterT $ lookAhead m
 notFollowedBy (WriterT m) = WriterT $
   notFollowedBy (fst <$> m) >>= \x -> return (x, mempty)

The same instance in parsers has {-# INLINE ... #-} for all parsers, so maybe we should inline all the things too. I'll try to benchmark it.

It's more annoying now to write generic parsers. Before:
```
whitespace :: Parser ()
whitespace = void (some (char ' '))
```
After:
```
whitespace :: MonadParsec s m Char => m ()
whitespace = void (some (char ' '))
```
Small as the difference may be, it might push people towards reusing less code than they used to, and it's bad. I'd like both Text.Megaparsec.String and Text.Megaparsec.Text to export a Parsing that would be respectively MonadParsec String m Char and MonadParsec Text m Char under the hood (after all, they already export different Parsers that are named the same). This would make the code look like this:
```
whitespace :: Parsing m => m ()
whitespace = ...

inParens :: Parsing m => m a -> m a
inParens = ...
```

neongreen commented 9 years ago

Oh, and another observation: I had to write

warnParse :: WarnParser a -> SourceName -> Text -> Either ParseError (a, [String])
warnParse p src s = parse (runWriterT p) src s

and it feels like it's something that should've been done automatically, but I have no idea how to actually accomplish this.

mrkkrp commented 9 years ago

I agree with points 1 and 2. As for 3, I think end user can define Parsing type or something like that if he thinks it's necessary. Mostly I don't see any problem in MonadParsec s m Char => m () stuff.

I currently don't think it should be done automatically. You're trying to run code inside monad and usually if you have stack of monads you need to nest functions like parse and runWriterT accordingly. Some functions like runStateT will need to accept additional arguments like initial state, so I don't think all these cases can be handled in uniform (non-hackish) way.

neongreen commented 9 years ago

Some functions like runStateT will need to accept additional arguments like initial state, so I don't think all these cases can be handled in uniform (non-hackish) way.

Okay, you're right.

As for 3, I think end user can define Parsing type or something like that if he thinks it's necessary.

Sure, but at least in my case I'm probably going to be defining Parsing literally every time I use Megaparsec with a custom stack, and unlike parse it can be done generically in a non-hackish way. I guess GenParser was included into original parsec for the same reason. (By the way, am I right that GenParser shouldn't be in the monad-transformer branch at all?)

mrkkrp commented 9 years ago

@neongreen, OK, I'll add Parsing and GenParser is indeed should disappear, just overlooked that.

mrkkrp commented 9 years ago

@neongreen, To write something like this:

type Parsing m = MonadParsec C.ByteString m Char

You need to enable ConstraintKinds extension. Because it's synonym for a constraint, not type. This once more makes me think that this is not entirely necessary.

I think we will do it this way: for now Parsing won't be added. Then in future releases this might be reconsidered. After all it saves not much typing.

neongreen commented 9 years ago

Okay.

mrkkrp commented 9 years ago

@neongreen, There will be two instances for WriterT: Lazy.WriterT and Strict.WriterT, by the way.

mrkkrp commented 9 years ago

@neongreen, about inlining: honestly, I think GHC will inline all this stuff anyway. So, you probably won't be able to find any performance difference.

mrkkrp commented 9 years ago

@neongreen, I've committed all the changes, so you can try again. Is there anything else you want to tell me?

neongreen commented 9 years ago

I've committed all the changes, so you can try again.

Okay, thanks. I'll start writing tests now.

neongreen commented 9 years ago

@mrkkrp, I have just realised that Parsec's lookAhead and Megaparsec's lookAhead have different semantics – Parsec's lookAhead reverts changes made by the parser to backtracking state, Megaparsec's lookAhead doesn't. It won't matter in most cases when lookAhead is used; however, it might be useful to have a version of lookAhead that adheres to the intuitive definition of “run the parser but don't let it change anything whatsoever”. Would it be easy to write? (I don't know Parsec's internals well enough to answer this question on my own. It may even be impossible, I don't know.)

mrkkrp commented 9 years ago

@neongreen, parsers seems to be brainy about the stuff. How does lookAhead work there?

neongreen commented 9 years ago

@mrkkrp, I already checked – it isn't being brainy in this particular case, unfortunately.

import Text.Parser.LookAhead
import Text.Parser.Combinators
import Text.Parsec.String
import Text.Parsec (parse, runParser, Parsec, setState, getState)
import qualified Control.Monad.State as State

tParsers = parse (State.evalStateT p 0) "" ""
  where p :: State.StateT Integer Parser Integer
        p = do
          State.put 0
          _ <- lookAhead (State.put 1 >> eof)
          State.get

tParsec = runParser p 0 "" ""
  where p = do
          setState 0
          _ <- lookAhead (setState 1 >> eof)
          getState

> tParsers
Right 1

> tParsec
Right 0

mrkkrp commented 9 years ago

@neongreen, ah I see they just use Parsec's version without reimplementing it.

instance (Parsec.Stream s m t, Show t) => LookAheadParsing (Parsec.ParsecT s u m) where
  lookAhead = Parsec.lookAhead

mrkkrp commented 9 years ago

@neongreen, I think Megaparec should revert backtracking state, you're right. I'll see how to fix it. Make sure you include something in your tests to check this feature in future.

mrkkrp commented 9 years ago

@neongreen, Have you tried with simple alternatives à la p <|> n? Does it work as expected?

neongreen commented 9 years ago

@mrkkrp, yep, it works alright. E.g. the modified prop_user_backtrack test (that already was there) passes:

prop_user_backtrack :: Integer -> Integer -> Property
prop_user_backtrack n m = runParser (State.evalStateT p 0) "" "" === Right n
  where p = do
          State.put n
          (State.put m >> fail "failed") <|> return ()
          State.get

mrkkrp commented 9 years ago

OK, first of all, Megaparsec's and Parsec's lookAhead are essentially identical:

-- Megaparsec

pLookAhead :: ParsecT s m a -> ParsecT s m a
pLookAhead p = ParsecT $ \s _ cerr eok eerr ->
  let eok' a _ _ = eok a s mempty
  in unParser p s eok' cerr eok' eerr

-- Parsec

lookAhead :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m a
lookAhead p =
    ParsecT $ \s _ cerr eok eerr -> do
        let eok' a _ _ = eok a s (newErrorUnknown (statePos s))
        unParser p s eok' cerr eok' eerr

Another thing is Megaparsec currently has no built-in user state. But this should not matter. As your little experiment demonstrates, while Parsec's lookAhead generally works OK, that version with StateT transformer doesn't work so great. So I think the problem is in definition of lookAhead for StateT.

mrkkrp commented 9 years ago

@neongreen, This means that all these definitions of WriterT, StateT, etc. in parsers are not necessarily correct with respect this sort of corner case. We should perhaps correct them in Megaparsec instead of copying them.

Maybe we should also open an issue on GitHub page of parsers library.

mrkkrp commented 9 years ago

@neongreen, We also should tests all these definitions too, I think. Don't worry about that, I will fix it and later add new tests myself.

mrkkrp commented 9 years ago

@neongreen, Try with this definition (with TupleSections extension):

instance (MonadPlus m, MonadParsec s m t) =>
         MonadParsec s (L.StateT e m) t where
  label n       (L.StateT m) = L.StateT $ \s -> label n (m s)
  try           (L.StateT m) = L.StateT $ try . m
  lookAhead     (L.StateT m) = L.StateT $ \s ->
    (,s) . fst <$> lookAhead (m s)
  notFollowedBy (L.StateT m) = L.StateT $ \s ->
    notFollowedBy (fst <$> m s) >> return ((),s)
  unexpected                 = lift . unexpected
  eof                        = lift eof
  token  f e                 = lift $ token  f e
  tokens f e ts              = lift $ tokens f e ts
  getParserState             = lift getParserState
  updateParserState f        = lift $ updateParserState f

If this passes the test I will commit necessary corrections.

abooij commented 9 years ago

I don't have anything to add, but it looks like you two are doing pretty good work - thanks!

mrkkrp commented 9 years ago

@neongreen, I checked it and it seems to work OK. I've committed the changes so this bug should be eliminated by now.

neongreen commented 9 years ago

@mrkkrp, I have the following (passing) tests that use lookAhead, notFollowedBy, and <|>:

-- Monad transformers + Parser

prop_StateT :: Integer -> Integer -> Property
prop_StateT n m = checkParser (State.evalStateT p 0) (Right (n+m)) ""
  where p = do
          State.put n
          State.modify (+ m)
          State.get

prop_StateT_backtrack :: Integer -> Integer -> Property
prop_StateT_backtrack n m = checkParser (State.evalStateT p 0) (Right n) ""
  where p = do
          State.put n
          (State.put m >> fail "failed") <|> return ()
          State.get

-- See <https://github.com/mrkkrp/megaparsec/issues/27#issuecomment-141785141>.
prop_StateT_lookAhead :: Integer -> Integer -> Property
prop_StateT_lookAhead n m = checkParser (State.evalStateT p 0) (Right n) ""
  where p = do
          State.put n
          lookAhead (State.put m >> eof)
          State.get

prop_WriterT :: String -> String -> Property
prop_WriterT pre post = checkParser (Writer.runWriterT p) result "abx:"
  where logged_letter = do
          x <- letterChar
          Writer.tell [x]
          return x
        logged_colon = do
          x <- char ':'
          Writer.tell [x]
          return x
        p = do
          Writer.tell pre
          cs <- Writer.censor (map toUpper) $
                  many (try (logged_letter <* notFollowedBy logged_colon))
          Writer.tell post
          _ <- logged_letter
          _ <- logged_colon
          return cs
        result = Right ("ab", pre ++ "AB" ++ post ++ "x:")

Not claiming the last one is particularly realistic :)

I have a question: how should label be tested, if at all? I.e. can Megaparsec's new error reporting system somehow interfere with monad transformers?

mrkkrp commented 9 years ago

@neongreen, There are two cases when label has effect:

When parser p succeeds without consuming input and next parser n fails immediately after p without consuming input. “Label” of p will be displayed in the error message.
When labeled parser fails without consuming input.

See existing prop_label for example. Although it's a bit dense in that it tests many behaviors at once, so you should not necessarily take it as base for new tests.

Also, prop_StateT has nothing to do with MonadParsec, it tests features of StateT, which we can safely assume work OK. Two other tests for StateT are OK. Please add one more to test notFollowedBy, I think it's easy. Other functions all should be tested too, preferably…

I advise you enable coverage statistics and see what else you can test. It's a good idea to test everything, although it's more than I can ask from you. Try at least test all functions in instance declaration of StateT. Also, you can have two copies of the same tests, one for lazy version of state monad and another one for strict version.

Coverage dropped because of these new changes and I think we should restore it at least on 80 % level. Initially I naively thought that these “boilerplate” definitions are not necessary to test, but after this issue with lookAhead I think everything should be tested.

mrkkrp commented 9 years ago

I suppose this may be closed now.

mrkkrp / megaparsec

Using parsers with monad transformers #27