ollef / Earley

Parsing all context-free grammars using Earley's algorithm in Haskell.
BSD 3-Clause "New" or "Revised" License
361 stars 24 forks source link

Generate all members of a grammar? #20

Closed copumpkin closed 7 years ago

copumpkin commented 8 years ago

Intuitively, it feels like it should be possible for your library to generate all strings (as a potentially infinite list) that a grammar would parse, as well as all parsed values that those strings would generate. Is that something you've considered or would consider adding?

sboosali commented 8 years ago

i have some code that "refines" a grammar, ie Prod, (or rather, see below) into a "Maybe FiniteGrammar" then enumerates all strings in that finite grammar.

  1. getting the order right for infinite granmars might take some work, like a breadth first search traversal, but doesnt seem too hard.
  2. more importantly, terminals are currently stored as predicates (String -> Bool), not Strings. thus, we cant know what stringa Terminal t can parse.

(we could add a "OneTerminal :: String -> (String -> a) -> Prod a" case, and change the symbol smart constructor to use that?)

(or, maybe some dsl for defining predicates that can be "show"n? this seems hard. ive looked at Data.List.Split in the aplit package, but its core type isnt showable, being a function).

anyway, thats why in my project i have a type (call it My.Prod) with a simple conversion function that takes it into an Earley.Prod (as well as the FiniteGrammar mentioned). there are some tricks i have (like an Alt of Terminals is "optimized" to a Terminal with the predicates or'd together).

unfortunately, but it (1) may lose some efficiency, where distinct calls are made to a predicates and (2) of course, it does lose expressivity, as i cant match against isUpper (without an explicit IsUpper case in My.Prod).

On Monday, February 22, 2016, Daniel Peebles notifications@github.com wrote:

Intuitively, it feels like it should be possible for your library to generate all strings (as a potentially infinite list) that a grammar would parse, as well as all parsed values that those strings would generate. Is that something you've considered or would consider adding?

— Reply to this email directly or view it on GitHub https://github.com/ollef/Earley/issues/20.

(this message was composed with dictation: charitably interpret typos)Sam Boosalis

ollef commented 8 years ago

I've been thinking about doing something like this for testing, to basically have tests that first generate arbitrary grammars and then test that against random (or all) strings in the language. I used to do something like that in my old Grempa library.

If the token type is finite I think we should be able to do it without too many tricks. I'll look into it.

phadej commented 8 years ago

I guess token don't need to be finite (e.g. Enum token, Bounded token), better to take an explicit [token]

phadej commented 8 years ago

or optionally have Monad m => m token -> Grammar .... -> m [token] generator!

danr commented 7 years ago

Use case: test the grammar for ambiguities.

phadej commented 7 years ago

Generally whether CFG is ambigious is undecidable, but you indeed could QuickCheck it. Yet, Earley's power is that it can deal with ambiguous grammars.

ollef commented 7 years ago

I think we could also do an exhaustive test up to some fixed input length and get some more confidence than random tests would give us.

ollef commented 7 years ago

There's some code for this here: https://github.com/ollef/Earley/tree/Generator

Does anyone want to volunteer some testing and/or tests?

ollef commented 7 years ago
language romanNumeralsGrammar "IVX"
= [(0,""),(1,"I"),(5,"V"),(10,"X"),(20,"XX"),(11,"XI"),(15,"XV"),(6,"VI"),(9,"IX"),(4,"IV"),(2,"II"),(3,"III"),(19,"XIX"),(16,"XVI"),(14,"XIV"),(12,"XII"),(7,"VII"),(21,"XXI"),(25,"XXV"),(30,"XXX"),(31,"XXXI"),(35,"XXXV"),(8,"VIII"),(13,"XIII"),(17,"XVII"),(26,"XXVI"),(29,"XXIX"),(24,"XXIV"),(22,"XXII"),(18,"XVIII"),(36,"XXXVI"),(39,"XXXIX"),(34,"XXXIV"),(32,"XXXII"),(23,"XXIII"),(27,"XXVII"),(33,"XXXIII"),(28,"XXVIII"),(37,"XXXVII"),(38,"XXXVIII")]

Pretty fun stuff!

copumpkin commented 7 years ago

@ollef is the "IVX" above the set of terminals to explore the grammar over, due to the token -> Bool predicate thing mentioned earlier?

ollef commented 7 years ago

@copumpkin: Yeah, that's it!

copumpkin commented 7 years ago

Oh that's super cool, so it's actually finite in this case

ollef commented 7 years ago

It's finite unless you give it some more tokens to work with (apparently).

TODO:

phadej commented 7 years ago

@ollef have you tried it with nasty grammars? left-recursive, right-recursive, ridiculously ambiguous?

sboosali commented 7 years ago

@phadej yeah, for termination?

what i did before was just to first validate the grammar as finite (by checking recursively for "non-finite" cases like many), and then return a Maybe [t].

also, known-infinite (or unknown-to-be-finite) grammars could be represented by an opaque symbol (possible the (<?>) if present), if it's needed. like a list of "Either (Maybe e) t" or something.

On Wed, Feb 1, 2017 at 10:12 PM Oleg Grenrus notifications@github.com wrote:

@ollef https://github.com/ollef have you tried it with nasty grammars? left-recursive, right-recursive, ridiculously ambiguous?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ollef/Earley/issues/20#issuecomment-276877032, or mute the thread https://github.com/notifications/unsubscribe-auth/ACNoMRQ5h1R-WPrX_lMO7M1Dg5OJFELMks5rYXPBgaJpZM4HfJGl .

ollef commented 7 years ago

@phadej: I haven't done any principled testing, just tried a few examples. Those are good starting points, thanks! 👍

@sboosali: Hmm, I don't quite understand what you mean. Don't we want this to work even if a grammar generates an infinite language? What we're after, I think, is that the language generation is productive, such that you can take the number of results you want in infinite cases.

It'll likely have roughly the same restrictions as the parser by the way, so degenerate grammars like https://github.com/ollef/Earley/blob/master/examples/VeryAmbiguous.hs will loop because they produce circular results.

phadej commented 7 years ago

@sboosali for productivity. For finite languages you could use regex-applicative (and even regular languages aren't always finite).

sboosali commented 7 years ago

@olle (right, productivity makes sense. I was just thinking about finite subgrammars.)

are the results "breath first" (or broadly, not too biased in some way)? otherwise, they might not be representative of the language.

also, what use cases did you have in mind? maybe we could give a basic sanity check for ambiguity by running the parser on the "language"? or even check the stream for duplicates, depending on how its generated?

(not sure about any of this, havent studied cfg's that much).

On Wed, Feb 1, 2017 at 11:14 PM Oleg Grenrus notifications@github.com wrote:

@sboosali https://github.com/sboosali for productivity. For finite languages you could use regex-applicative (and even regular languages aren't always finite).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ollef/Earley/issues/20#issuecomment-276885065, or mute the thread https://github.com/notifications/unsubscribe-auth/ACNoMUP5Xk11aJ1e8tEgbXU_W_f32jZKks5rYYJsgaJpZM4HfJGl .

ollef commented 7 years ago

The results are roughly in the same order as from the parser, so ordered by input length, and within the same input length shuffled around a bit (governed by the internals of the implementation).

I'm sure people can find more uses, but here are some that I can think of:

ollef commented 7 years ago

This is included in the 0.12.0.0 release. Thanks for reminding me!