ollef / Earley

Parsing all context-free grammars using Earley's algorithm in Haskell.
BSD 3-Clause "New" or "Revised" License

document implementation #12

Open sboosali opened 9 years ago

sboosali commented 9 years ago

I want to contribute a few ideas I have for new features. but after reading the source, I still can't understand how parse works.

in http://www.nltk.org/book/ch08-extras.html (2 Chart Parsing and 2.7 The Earley Algorithm), the concepts seem to be:

  1. edges between tokens, complete or non-complete
  2. rules, which add new edges given the current edges
  3. a chart, a structure that holds edges
  4. strategies, the sets of rules that differentiate chart parsers from each other

but I don't see these in the types (State/Cont/Conts/Result). if they're not really there, how should I think about the types? is there a blog post or paper I could read that more closely reflects the core ideas behind this package?

also, it helps to know about:

  1. any patterns this package uses, or tricks. e.g. why the nested reference in Conts:
     contsArgs :: !(STRef s (Maybe (STRef s (Results s a))))

is it for performance, for correctness? Everything is there for a reason, but it's hard to read for a noob like me.

thanks :-)

ollef commented 9 years ago

I'm aware that the internals aren't exactly easy to follow, and I've been meaning to write something up about it. This issue just might be the push I need to actually do it. :)

Some quick remarks about the things you mentioned: Both State and Cont represent Earley states. They differ in the types, since we're in a typed setting and are constructing the results as we parse. Cont needs an argument, i.e. a parse result, to become a State.

The double references look a bit peculiar in Haskell, but are there for a reason. I'll start with the one in Rule: We only want a rule to be expanded once per position in the input. In traditional Earley parsing this is implicit in the usage of sets of states, since adding duplicates is then a no-op. We can't hold sets of states because states contain things of unknown type and are thus not in Eq or Ord.

So we instead make sure to only expand each rule once per position, which has the same effect. When a rule is expanded, we also have to keep track of a continuation, i.e. the productions to parse after a successful parse of the rule. But since we only expand a rule once per position, what happens when we encounter a rule a second time at the same position? This is where these pointers come in. At this point we look up the rule's continuation pointer (i.e. dereference the outer pointer layer), and add ourselves to the list of continuations (i.e. update the inner pointer). Now any states (e.g. from the first time the rule was expanded) that refer to this continuation (the inner pointer) are automatically updated!

At each position we have to make sure that the outer reference points to a fresh continuation pointer. This is what the reset computation does.

We have a similar situation when there are alternatives, i.e. one state that branches into two states. In this case both branches will have a continuation that is the same (e.g. c in (a <|> b) <*> c). If both branches produce a result at the same position we want to merge the branches --- otherwise we would suddenly have several states at the same position (at c in the example) and parsing would become exponential. This is what the contsArgs field that you referred to is for; it works roughly the same way as the rule case. The inner reference gives us a way to inject more results into the continuation state even when that state has already been processed at a position.
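The two pointer layers can be sketched in miniature (a toy model with plain Ints standing in for continuations, not the library's actual code):

```haskell
import Control.Monad.ST
import Data.STRef

-- one "position": reset creates an empty outer ref; the first expansion
-- installs an inner list; a second encounter at the same position finds
-- the inner list through the outer ref and adds to it in place
demo :: [Int]
demo = runST $ do
  outer <- newSTRef Nothing          -- reset: nothing expanded here yet
  inner <- newSTRef [1]              -- first expansion: one continuation
  writeSTRef outer (Just inner)
  mShared <- readSTRef outer         -- second encounter: dereference the outer layer
  case mShared of
    Just shared -> modifySTRef shared (2 :)  -- add ourselves to the inner list
    Nothing     -> pure ()                   -- (would expand the rule instead)
  readSTRef inner                    -- anything holding `inner` now sees both
```

Everything that kept a reference to the inner list from the first expansion observes the second continuation for free, which is the whole point of the extra indirection.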

sboosali commented 9 years ago

oh yeah that helps.

feel free to link me to any drafts. sooner or later I'll learn how it works (it's just a few lines too). until then, I can provide "fresh eyes".


ollef commented 8 years ago

I've started doing this. https://github.com/ollef/Earley/blob/master/docs/implementation.md

Please let me know if anything needs clarification or if there are mistakes, typos, or whatever.

Most of it is background so far, but there is already some meaty information in there.

sboosali commented 8 years ago

this is amazing!


greglearns commented 8 years ago

The implementation.md file helps a bunch. However, is there any explanation of how to write the grammars themselves? You have the examples, but don't really explain what the different symbols are used for or how to think about writing a grammar from scratch, or how to handle whitespace effectively, etc. I'd really like to use this! (thank you, by the way, for your work so far!)

sboosali commented 8 years ago

(this implementation file helps explain the internals only.)

have you used other parser combinator libraries? the API is similar. also, what's a grammar you'd like to define? maybe I can help, and we can then address these issues in the documentation.


greglearns commented 8 years ago

I've done parsing in Ruby, Java, and C -- I'm pretty new to Haskell (sorry! I know that complicates things)

Your offer to help is very kind. I'd be happy to help out in terms of documentation if you wish once I understand things better, so that I can give back.

Here's an example of a line I'd like to parse, and the resulting structure I'd like to produce (ultimately as JSON)

  input = " .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"

  var expected = {
    numberOfInitialDots: 2,
    boxed_word: 'next',
    boxed_letter: 'x',
    text: 'some note',
    tags: {
      other: '',
      for: 'bob, mary and (betty or larry)'
    },
    numbers: [
      { value: 6, letter: 'm' },
      { value: 12, letter: '' },
      { value: 40, letter: 'h' }
    ]
  }


sboosali commented 8 years ago

cool.

here's a tutorial for parsing log files with more explanation, using a different parser combinator library:

https://www.fpcomplete.com/school/starting-with-haskell/libraries-and-frameworks/text-manipulation/attoparsec

(I think we should link to general parser combinator tutorials from the readme).

for your problem, the "Haskell way" is first to define a custom type. we construct this type as soon as possible whenever consuming data, from a parser or a database or whatever, and then print/serialize that data back again at the very end. the benefits being that it's much easier to manipulate structured data than raw text, and also easier to manipulate typed data than json.

{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import GHC.Generics (Generic)
import Data.Aeson (ToJSON)  -- from the `aeson` package

data GregLine = GregLine
 { numberOfInitialDots :: Int
 , boxed_word          :: String
 , boxed_letter        :: Char
 , text                :: String
 , tags                :: [Tag]
 , numbers             :: [Number]
 } deriving (Generic, ToJSON) -- this one line lets us serialize our datatype into json with `encode`.

type Tag = (String, Maybe String)   -- the Maybe provides explicit optionality...

type Number = (Int, Maybe Char)     -- ...and will also simplify the parsers.  
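to make the target concrete, here's that record hand-populated with the values from your JSON `expected` (the aeson deriving is left out so the snippet stands alone; I've mapped the valueless #other tag to Nothing, which is an assumption about what you want):

```haskell
-- the same types as above, minus the aeson machinery
data GregLine = GregLine
 { numberOfInitialDots :: Int
 , boxed_word          :: String
 , boxed_letter        :: Char
 , text                :: String
 , tags                :: [Tag]
 , numbers             :: [Number]
 } deriving (Show, Eq)

type Tag    = (String, Maybe String)
type Number = (Int, Maybe Char)

-- the value your example line should parse to
expected :: GregLine
expected = GregLine
 { numberOfInitialDots = 2
 , boxed_word          = "next"
 , boxed_letter        = 'x'
 , text                = "some note"
 , tags                = [ ("other", Nothing)
                         , ("for", Just "bob, mary and (betty or larry)") ]
 , numbers             = [ (6, Just 'm'), (12, Nothing), (40, Just 'h') ]
 }
```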

Haskell has many "combinator libraries". in particular, "parser combinator libraries" like this one. the basic idea is that the library exports a type (call it Combinator), some "primitives" that construct values of that type (like aPrimitive :: String -> Combinator), and "combinators" that transform values of that type (like aCombinator :: Integer -> Combinator -> Combinator). if we wanted to build a simple "arithmetic combinator library" (sounds cooler than it is), the primitives might be 0 and 1, and the combinators might be + and *.
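that arithmetic example is small enough to write out in full: one exported type, two primitives that construct it, two combinators that transform it.

```haskell
-- a complete "arithmetic combinator library"
newtype Arith = Arith Int deriving (Show, Eq)

zero, one :: Arith                       -- the primitives
zero = Arith 0
one  = Arith 1

plus, times :: Arith -> Arith -> Arith   -- the combinators
plus  (Arith a) (Arith b) = Arith (a + b)
times (Arith a) (Arith b) = Arith (a * b)

-- building bigger values out of primitives, combinator-style
three :: Arith
three = (one `plus` one) `plus` one
```

parser combinator libraries have exactly this shape; the primitives match tokens, and the combinators sequence, repeat, and choose between parsers.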

so first, let's import the primitives from Text.Earley and the combinators from Control.Applicative (the latter is in the standard library, yay code reuse!)

import Text.Earley
import Control.Applicative (many, some, optional, (<$>), (<*>), (<*), (*>))  -- https://hackage.haskell.org/package/base-4.8.1.0/docs/Control-Applicative.html

if we specialize the type signatures to Prod, we get:

optional :: Prod r e t a -> Prod r e t (Maybe a)
many     :: Prod r e t a -> Prod r e t [a]
some     :: Prod r e t a -> Prod r e t [a]
(<*)     :: Prod r e t a -> Prod r e t b -> Prod r e t a
(*>)     :: Prod r e t b -> Prod r e t a -> Prod r e t a

we can read them as:

optional  -- zero or one, or regex "?"
many      -- zero or more, or regex "*"  
some      -- one or more, or regex "+"
(<*)      -- parse the left and keep its result, then parse the right and ignore its result 
(*>)      -- parse the left and ignore its result, then parse the right and keep its result 

we can read the type constructor Prod r e t as "consumes some tokens and then returns", so an expression pInt :: Prod r e t Int "consumes some tokens and then returns an Int". thus, if we wanted to increment the result of a parse, we couldn't write (+1) pInt, since that would increment the parser, not the number parsed. we'd write (+1) <$> pInt instead.
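since these combinators come from Control.Applicative, they work on any Alternative parser, not just Earley's Prod. here's a runnable demonstration using ReadP from base (shipped with GHC); the behaviors carry over to Prod unchanged:

```haskell
import Control.Applicative (many, some, optional, (<*), (*>))
import Data.Char (digitToInt, isDigit)
import Text.ParserCombinators.ReadP (ReadP, char, satisfy, readP_to_S)

-- run a parser, keeping only parses that consume the entire input
run :: ReadP a -> String -> [a]
run p s = [ a | (a, "") <- readP_to_S p s ]

exMany, exSome :: [String]
exMany = run (many (char 'a')) "aa"        -- zero or more: succeeds
exSome = run (some (char 'a')) ""          -- one or more: no parse of empty input

exOptional :: [Maybe Char]
exOptional = run (optional (char 'x')) ""  -- zero or one: succeeds with Nothing

exKeepLeft, exKeepRight :: [Char]
exKeepLeft  = run (char 'a' <* char 'b') "ab"  -- keeps the 'a'
exKeepRight = run (char 'a' *> char 'b') "ab"  -- keeps the 'b'

-- and <$> "interprets" a parse result without touching the parser itself
exInc :: [Int]
exInc = run ((+ 1) . digitToInt <$> satisfy isDigit) "4"
```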

anyways, following is the basic scaffold for your problem, using the library. it reads like BNF, hopefully. or more like yacc, where you can insert "interpretations" around productions.

(the g stands for "grammar" and the p stands for "parser". I haven't tried type checking this)

-- | can parse @" .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"@
gGregLine :: Grammar r (Prod r e Char GregLine)                  -- the tokens are characters
gGregLine = do
 pWhitespace          <- rule$ some (symbol ' ')
 pNumberOfInitialDots <- rule$ length <$> some (symbol '.')      -- length "interprets" the parse result
                                                                 -- of "one or more" dots
 pBoxed_word          <- rule$ _                                 -- uses <* and *>
 pBoxed_letter        <- rule$ _
 pText                <- rule$ _
 pTags                <- rule$ _                                 -- uses optional
 pNumbers             <- rule$ _            
 pGregLine            <- rule$ GregLine <$> (pWhitespace *> pNumberOfInitialDots)  -- skip initial whitespace 
                                        <*> pBoxed_word
                                        <*> pBoxed_letter
                                        <*> pText
                                        <*> pTags
                                        <*> pNumbers

 return pGregLine

let me know if you're unfamiliar with "Applicative"s. if so, for now, you can think of f <$> x <*> y as similar to normal pure function application f x y, only "with effects". here, the "effects" are "matching tokens and parsing them into some value".
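for a feel of it without any parsing, here's the same shape with Maybe, where the "effect" is possible failure instead of token matching:

```haskell
-- plain application vs. applicative application
plainSum :: Int
plainSum = (+) 1 2                     -- ordinary function application

apSum, apFail :: Maybe Int
apSum  = (+) <$> Just 1 <*> Just 2     -- both "effects" succeed
apFail = (+) <$> Just 1 <*> Nothing    -- one argument fails, so the whole application fails
```

with parsers it's the same story: if any sub-parser fails to match, the whole f <$> x <*> y fails.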

and you can ignore rule for now; it just says its argument is a grammatical production with a left-hand side, rather than just a grammatical expression with only a right-hand side.

for example, here's a single-digit number parser annotated with its types (digitToInt comes from Data.Char; plain read wouldn't typecheck here, since read wants a String but satisfy isDigit produces a Char):

pInt <- rule$ digitToInt <$> satisfy isDigit
        |     |              |_____________|
        |     |              (:: Prod r e Char Char)
        |     |____________________________|
        |     (:: Prod r e Char Int)
        |__________________________________|
        (:: Grammar r (Prod r e Char Int))

by the way, how did you come across this package?

and let me know if you need more info.

ollef commented 8 years ago

@sboosali: I've added you as a collaborator to the repository so you can add this to the documentation if you're up for it. I suggest adding it to the docs directory and linking it somewhere appropriate in the README. I can help flesh it out when I find the time. A minor issue from a quick look is that the types of a few of the combinators are incorrect: the list and maybe types should be "pushed into" the return types.

sboosali commented 8 years ago

thanks, will do (the type typo is noted).

also, I'll try to clearly separate "new to parser combinators" from "new to this package".