gleachkr / Carnap-Old

An interactive proof checker that runs in the browser
http://gleachkr.github.io/Carnap/
GNU General Public License v3.0

Optimize prooftree parser #29

Closed gleachkr closed 9 years ago

gleachkr commented 9 years ago

The prooftree parser appears to be causing lag in the editor. Compiling with -O2 mostly fixes it, but it'd be better to have a sensible algorithm.

JakeEhrlich commented 9 years ago

So it is always best to work from the grammar on these things. Here is what I understand the grammar to be right now. I assume 'formula' is an already-defined non-terminal; lowercase names denote non-terminals, and the all-caps names denote terminals (tokens). If you are unfamiliar with grammars, just read '->' as 'can have the form'. The '*' is called a Kleene closure and is the same thing as Parsec's 'many' combinator. It turns out indentation isn't needed in this grammar. This stuff might be old news to you though.

rule -> formula justification NEWLINE
      | "Show" ":" formula NEWLINE rule* close justification NEWLINE
close -> ":" | "/" | "close"
justification -> ID int_list
int_list -> INT (",")? int_list
          | INT

In order to get to where this is helpful, you need to have a sort of lexical phase first. I'll post back with what I think about that in a bit.

JakeEhrlich commented 9 years ago

I'd like to stick to the above grammar as much as possible. It's really clean. But we want to be able to handle partial proofs, which complicates things. Inserting things into the token stream is a good and well-tested way of taking care of these kinds of parsing issues. There really isn't a good theory of context-sensitive grammars (which is what we want here), so the trick is to take the most direct route to being context-free that we can. Inserting missing information is a common way to do this (see Go, Python, and I think even Haskell). That said, the lexing trick is really open to change, whereas I think we should stick pretty closely to the above grammar. The grammar is of course open to change as well, but we should probably be a bit more hesitant about going that way.

So...

The lexer should maintain, as its state, a stack of whitespace counts (one entry per open indentation level). It should be a parser that returns a list of tokens, where a token is a special type that we define.

So if I understand the docs right, our lexer type should be the following: ParsecT s [Int] Identity [Token]

And our parser type will vary, but will look like this: ParsecT [Token] () Identity ProofTree
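As a minimal sketch (not from the original comment) of how a parser over a [Token] stream consumes individual tokens, one can wrap Parsec's tokenPrim; satisfyTok and pInt are hypothetical names, and Token (sketched further down in this thread) is assumed to have a Show instance:

import Text.Parsec
import Data.Functor.Identity (Identity)

-- Match a single Token from the stream using a selector function.
-- The dummy position function never advances, so errors won't point at a
-- useful location; a real lexer would pair each Token with a SourcePos.
satisfyTok :: (Token -> Maybe a) -> ParsecT [Token] () Identity a
satisfyTok match = tokenPrim show (\pos _ _ -> pos) match

-- For example, a parser for the INT terminal of the grammar above.
pInt :: ParsecT [Token] () Identity Integer
pInt = satisfyTok $ \t -> case t of
    TInt n -> Just n
    _      -> Nothing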

Then we can write a monad combinator that composes these two kinds of parsers to get something of the following type: ParsecT s () Identity ProofTree

Also note that I leave the type of the stream for the lexer open. ByteStrings are way faster in some cases, so leaving it open to use ByteStrings might be beneficial. I really doubt that's the issue here, but leaving it open can't hurt.
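Here is a minimal sketch of such a combinator, under the assumptions above; composeParsers is a hypothetical name, Token and ProofTree are the types discussed in this thread, and neither error positions nor the outer input are handled carefully.

import Text.Parsec
import Data.Functor.Identity (Identity)

-- Run the lexer over the remaining input, then run the token parser over
-- the resulting token list, surfacing either failure through the outer
-- parser. The outer input is not consumed and inner error positions are
-- flattened into a message; good enough for a sketch.
composeParsers :: Stream s Identity Char
               => ParsecT s [Int] Identity [Token]       -- lexer
               -> ParsecT [Token] () Identity ProofTree  -- token parser
               -> ParsecT s () Identity ProofTree
composeParsers lexer parser = do
    name  <- sourceName <$> getPosition
    input <- getInput
    case runParser lexer [] name input of
        Left err   -> parserFail (show err)
        Right toks -> case runParser parser () name toks of
            Left err   -> parserFail (show err)
            Right tree -> return tree

Because the stream type s stays open here, the same combinator works whether the raw input is a String or a ByteString.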

I think the above grammar should be followed closely because, even implemented in the most naive way, it is still LL(2), and with one tiny optimization it is LL(1), which is optimal. So, to account for partial proofs, a dummy close and justification should be inserted when the indentation decreases.

The allowed tokens are as follows: all the tokens for a formula, plus NEWLINE, "Show", ":", "/", "close", ID, INT, and ",".

A good token type might be something like:

data Token = TInt Integer   -- for all integers
           | TId String     -- for all identifiers
           | TSlash         -- for "/"
           | TColon         -- for ":"
           | TShow          -- for "Show"
           | TComma         -- for ","
           | TClose         -- for "close"
           | TOp String     -- for all operators
           | TLParen        -- for '('
           | TRParen        -- for ')'
           | TNewline       -- for end of line

If you just write a parser for each of these tokens and 'or' them together using <|>, it should work. The only tricky case is NEWLINE, which should look something like this:

parseNewline :: Stream s m Char => ParsecT s [Int] m [Token]
parseNewline = do
    newline
    s <- many space
    stk <- getState
    -- emit one dummy ": undefined\n" (TColon, TId "undefined", TNewline)
    -- for every stack entry deeper than the new line's indentation
    let additions = concatMap (const [TColon, TId "undefined", TNewline])
                              (takeWhile (> length s) stk)
    modifyState $ dropWhile (> length s)
    return $ additions ++ [TNewline]

I just wrote that code without type checking or anything, so it may still have some kind of error in it, but it should convey the basic idea: add as many ": undefined\n" closings as are needed to close out the subproofs above. Also, I'm only handling the case where the indentation decreases here. There is also the case where the whitespace increases, and the state has to be modified to account for that.
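For that increasing case, here is a hedged sketch, not from the original thread, with a hypothetical name and the same Token type as above: when the new line is indented more deeply than the top of the stack, push its width (a subproof just opened); otherwise close out any deeper levels exactly as parseNewline does.

import Text.Parsec

parseNewline' :: Stream s m Char => ParsecT s [Int] m [Token]
parseNewline' = do
    newline
    w   <- length <$> many space
    stk <- getState
    if w > topOr 0 stk
        -- deeper indentation: a subproof just opened, so push its width
        then putState (w : stk) >> return [TNewline]
        else do
            -- shallower or equal: close out every deeper level
            let additions = concatMap (const [TColon, TId "undefined", TNewline])
                                      (takeWhile (> w) stk)
            putState (dropWhile (> w) stk)
            return (additions ++ [TNewline])
  where
    topOr d []      = d
    topOr _ (x : _) = x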

gleachkr commented 9 years ago

Commit ab2e4fc150828ff0632bfd7ded06d6ecd7505bae introduces a kind of lexing-parsing division of labor that should be more efficient and robust than the old parser. It'll recover in more cases where there's something malformed buried in the proof, and it'll let you be much more flexible in how you choose to indent things (just make sure that subproofs are more indented than the show line that introduces them).

It's not whitespace-agnostic, but there have been some architectural changes that will make mixing and matching proof parsers and formula parsers easier, for when the whitespace-agnostic parser is done.