ollef / Earley

Parsing all context-free grammars using Earley's algorithm in Haskell.
BSD 3-Clause "New" or "Revised" License
361 stars 24 forks source link

WIP Disambiguation production #63

Open expipiplus1 opened 1 year ago

expipiplus1 commented 1 year ago

Definitely WIP at the moment,

The thrust of it is that the continuations are explicitly [a] -> [b] now, which for most productions is just a map, however for disambiguation, this structure is exposed.

Currently only nonterminals work, as, for example, the Pure case runs the continuation once for every individual element. What needs to change would be to maintain a list of a, and defer applying args. I had a quick look at jiggling things around towards that end without luck.

expipiplus1 commented 1 year ago

If this would be desirable even for just nonterminals, then I'd be happy to neaten things up from the state it's in at the moment.

expipiplus1 commented 1 year ago

Ok, I neatened it up a little and limited disambiguation to nonterminals.

Motivating example parsing an expression without precedence

-- λ> :main "A + B * C * G"
-- Ambiguous
-- ├╴ ((A+B)*(C*G))
-- ├╴ *
-- │  ├╴ Ambiguous
-- │  │  ├╴ ((A+B)*C)
-- │  │  └╴ (A+(B*C))
-- │  └╴ G
-- └╴ +
--    ├╴ A
--    └╴ Ambiguous
--       ├╴ (B*(C*G))
--       └╴ ((B*C)*G)

I'm sure there's still some neatening + testing to do!

expipiplus1 commented 1 year ago

As a further simplification to the implementation, I wonder if we could just make NonTerminal :: !(r e t a) -> !(Prod r e t ([a] -> [b])) -> Prod r e t b...

expipiplus1 commented 1 year ago

Ah, instead of making disambiguate operate in Grammar we can make the nonterminal 'live' during parse with something like this:

    Disamb p d -> do
      r <- mkRule p
      scont' <- newConts =<< newSTRef [Cont C.id d args scont]
      parse (State (NonTerminal r (pure id)) C.id pos scont' : ss) env

I don't think this should be a performance disadvantage compared to running in Grammar, however it feels bad putting pure id as the right prod for NonTerminal (I guess this would be solved by the previous comment). This also means that we end up calling the disamb function with an empty list some of the time which is also a little odd.

ollef commented 1 year ago

This is a neat idea. I had a little think about the API and implementation:

If we had

expose :: Prod r e t a -> Prod r e t [a]

and

conceal :: Prod r e t [a] -> Prod r e t a

(but perhaps with better chosen names), then it seems like we should be able to implement

disambiguate :: ([a] -> [b]) -> Prod r e t a -> Prod r e t b
disambiguate f = conceal . fmap f . expose

without it needing to be in grammar. This also seems to solve https://github.com/ollef/Earley/issues/62, because we can implement

constraint :: (a -> Bool) -> Prod r e t a -> Prod r e t a
constraint = disambiguate . filter

This is possible if we tweak the Prod type e.g. as follows (along the lines of one of your comments):

data Prod r e t a where
  -- Applicative.
  Terminal    :: !(t -> Maybe a) -> !(Prod r e t (a -> [b])) -> Prod r e t b
  NonTerminal :: !(r e t a) -> !(Prod r e t (a -> [b])) -> Prod r e t b
  Pure        :: [a] -> Prod r e t a
  -- Monoid/Alternative. We have to special-case 'many' (though it can be done
  -- with rules) to be able to satisfy the Alternative interface.
  Alts        :: ![Prod r e t a] -> !(Prod r e t (a -> [b])) -> Prod r e t b
  Many        :: !(Prod r e t a) -> !(Prod r e t ([a] -> [b])) -> Prod r e t b
  -- Error reporting.
  Named       :: !(Prod r e t a) -> e -> Prod r e t a

because then we get

expose :: Prod r e t a -> Prod r e t [a]
expose (Terminal b p)    = Terminal b (fmap (fmap pure) p)
expose (NonTerminal r p) = NonTerminal r (fmap (fmap pure) p)
expose (Pure x)          = Pure [x]
expose (Alts as p)       = Alts as (fmap (fmap pure) p)
expose (Many p q)        = Many p (fmap (fmap pure) q)
expose (Named p n)       = Named (expose p) n

conceal :: Prod r e t [a] -> Prod r e t a
conceal (Terminal b p)    = Terminal b $ fmap (fmap concat) p
conceal (NonTerminal r p) = NonTerminal r $ fmap (fmap concat) p
conceal (Pure as)         = Pure $ concat as
conceal (Alts as p)       = Alts as $ fmap (fmap concat) p
conceal (Many p q)        = Many p $ fmap (fmap concat) q
conceal (Named p n)       = Named (conceal p) n

but I have not tried updating the other modules (parser etc).

Maybe it breaks down somewhere.

expipiplus1 commented 1 year ago

I like expose and conceal a lot! Much nicer than smashing it all into [a] -> [b].

It makes me worry a little bit from a performance point of view to be passing lists around everywhere in Prod and I feel like there's basically no chance of any fusion taking place in parse.

Just as an aside, @ollef, did you ever experiment with using Data.Array.ST for anywhere lists are used in parse? In fact, I wonder if this would be just slightly better done with Seq (or even Tree) given all the concatting... Must benchmark after

expipiplus1 commented 1 year ago

mmm, I think that keeping this to nonterminals might be the pragmatic thing to do, with your scheme above it doesn't seem possible to push Pure fs down into the leaves of Alts for example.

Attaching to nonterminals only seems important wrt parse as written, as they're the place where new continuations are made which act as "join" points in the continuation chain.

expipiplus1 commented 1 year ago

I haven't thought it through, but I wonder if it would be possible to push disambiguations all the way down to Alts, and have it be Alts :: ![Prod r e t a] -> !(Prod r e t ([a] -> b)) -> Prod r e t b

expipiplus1 commented 1 year ago

hmm, I've got a testcase where:

This works, and d is called with a list of length 2:

  xx <-
    disambiguate d $
      (Foo <$ anyToken)
        <|> (Bar <$ anyToken)

This doesn't work, d is called twice with a list of length 1 and the whole parse returns 2 results:

  xx <-
    disambiguate d $
      (Foo <$ many anyToken)
        <|> (Bar <$ anyToken)

Sadly I've long since flushed the context needed to easily understand what's going on :(

...

Oh, I still see several todos in the code, I guess I'm just hitting one of these cases past-me foresaw.

...

On a hunch I removed the first case in simplifyCont and this seems to sort things for at least this example, don't quite understand why though yet.