ollef / Earley

Parsing all context-free grammars using Earley's algorithm in Haskell.
BSD 3-Clause "New" or "Revised" License
363 stars 24 forks source link

Use Data.Loc rather than Int #10

Closed mitchellwrosen closed 9 years ago

mitchellwrosen commented 9 years ago

Would it be possible to tag parsed items with a Data.Loc.L rather than an Int? Having both the column and row number is very useful.

More specifically,

allParses :: (forall s. ST s (Result s e i a)) -> ([(a, Int)], Report e i)

would become

allParses :: (forall s. ST s (Result s e i a)) -> ([L a], Report e i)

Currently, I'm lexing with https://hackage.haskell.org/package/lexer-applicative, which returns a stream of L-tagged lexemes anyway, and then I'm awkwardly unpacking and re-packing the locations into the parsed items. Here's a small example:

-- Token
data Token 
    = TkInt Int
    | ...

-- Lexer
lexer :: String -> [L Token]

-- Syntax
data Expr 
    = EInt Int
    ...

-- Parser
expr :: Prod r String (L Token) (L Expr)
expr = (\L loc n -> L loc (EInt n)) <$> intLit
   <|> ...

intLit :: Prod r String (L Token) (L Int)
intLit = (\L loc (TkInt n) -> L loc n) <$> satisfy isIntLit <?> "int"
  where
    isIntLit (L _ (TkInt _)) = True
    isIntLit _ = False

(apologies if the above doesn't typecheck, I just wrote it by hand)

If Earley annotated parsed items with L then I could simply forget all the work that the lexer did by mapping unLoc over the stream of tokens, so I'd have productions like Prod r String Token Expr rather than Prod r String (L Token) (L Expr). I wonder if there's a better way to integrate these two libraries.

ollef commented 9 years ago

Thanks for your input. That would indeed be useful. It would be much easier to create nice (e.g. clang-style) error messages if that information was available.

The Earley library doesn't know enough about the input token type to tell where new lines start and end, so it's not able to construct a Loc on its own as it is. For that to work it would have to always work on Char or L tokentype (which is what you're doing). Though both seem useful, it is needlessly restrictive to commit to using only one of those.

Can we can find a way of doing it that does not sacrifice generality?

For what it's worth, your current approach seems good. It allows you to put location information in subexpressions, which you couldn't do with the proposed change to the library. You may also be able to get the information that you want by looking at the location of the first token in the unconsumed field of the Report (though the report is about how far it was able to parse before failing and not where individual parses were obtained).

mitchellwrosen commented 9 years ago

I realized this morning that my original proposal makes no sense, as you point out (the parser has no source information). I think all I need are some simple combinators, which might not even belong in Earley itself:

locSymbol :: Eq t => t -> Prod r e (L t) (L t)
locSymbol = symbol . L NoLoc

locSatisfy :: (t -> Bool) -> Prod r e (L t) (L t)
locSatisfy p = satisfy (p . unLoc)

This works because the Eq instance of L ignores its Loc

ollef commented 9 years ago

That looks like a nice approach. I'll close this issue now, but if you do come up with a better way to do it, we can revisit it.