Idea: String generation using Context Free Grammars

Centril commented 6 years ago

We can already generate Strings using RegexStrategy with type-3 grammars (regular expressions). However; those are sometimes not enough. If you want to test natural language processing APIs, some programming language parser, or just some format that is more complex than regular expressions, you need something richer.

A fortunate property of type-2 grammars (context free grammars) is that they can still be generated, even when they are ambiguous. Being able to generate from ambiguous grammars is helpful for NLP. Another fortunate part of proptest is how alternation in CFGs translate nicely to union strategies... To allow the user to specify the probability of each alternation, we can use PCFGs.

We should consider some BNF formalism that is:

Is understood by most people familiar with BNF There's:
- LBNF which has redundant information which is useful for generation of parsers and ASTs, but which we don't care about for generation of strings.
- ABNF
- EBNF
Some formalisms write production = definition ;, some write: production : definition ; and some write: production ::= definition ;.
Which can be extended with probabilities such that the resulting formalism is a superset.
Is ergonomic to write in (deals with tedious tasks such as repetition, optionality, separators, terminators well...).

We can then define a strategy producer:

fn cfg_strategy(depth: usize, grammar: &str) -> CfgStrategy { .. }

where CfgStrategy : Strategy<Value = String> (or Vec<u8>). The depth parameter controls the recursion depth when generating from recursive CFGs, which is necessary since CFGs can generate infinite words.

I have currently no idea how complex the implementation for this would be... It could reuse the logic for generating from regexes since regexes can often be embedded in CFGs... I'd like to try out an implementation and find out... If it turns out to be too complex, maybe we could add a proptest-extras crate but keep it in tree for maintenance purposes?

Thoughts?

AltSysrq commented 6 years ago

It definitely sounds useful.

I think it may be better for it to be a macro which defines a strategy function rather than something that parses a string, both so the syntax gets checked at compile time and so that each grammar element can be accessed separately (e.g., you might want to just test arbitrary expressions in some part of a language compiler rather than only complete source files).

I'm not sure how recursion can be handled well here. You not only need to control depth, but avoid the exponential size expansion that comes with lists containing lists containing lists, which is why prop_recursive() needs so many hints to help it along.

AltSysrq commented 6 years ago

Actually, I guess if the grammar elements are not straight strategy functions, i.e. the string generator is given something more akin to an AST, recursion won't be as much of an issue since the entire value could be generated in one shot so there would be no issues keeping state between the nodes in the grammar.

Centril commented 6 years ago

I think it may be better for it to be a macro which defines a strategy function rather than something that parses a string,

Interesting :) I think it needs proc macros to do this well..

Of the top of my head, I see some problems with using macros:

I don't think there's a good way to load a formalism from a file, say json.bnf, give it to the macro and produce a strategy from that... Unless you somehow get the macro to translate a file path to a token-stream and use that stream internally. At the very least this requires proc macros.
We may not get to encode a complete BNF formalism if parts of the BNF grammar are not valid tokens. But this could be wrong / needs to be tested :)
It is not possible to adjust the grammar based on run-time information (probably does not arise very often...).

both so the syntax gets checked at compile time

That does sound useful. This could also be done with const fn, but they are currently unstable :(

each grammar element can be accessed separately (e.g., you might want to just test arbitrary expressions in some part of a language compiler rather than only complete source files).

That does sound useful... So one strategy generating function would be generated for each production in the grammar then? Another solution would be to specify what the entry point production is... that way you can test arbitrary parts as well.

Actually, I guess if the grammar elements are not straight strategy functions, i.e. the string generator is given something more akin to an AST, recursion won't be as much of an issue since the entire value could be generated in one shot so there would be no issues keeping state between the nodes in the grammar.

Yeah, my basic (very unpolished) idea was to do something akin to:

Parse &str to an AST describing PCFGs.
Remove epsilons and self-referential cycles from AST.
Move the AST into CfgStrategy.
On .current(), generate String up to a desired depth (which we can shrink..) from the AST.
On .simplify(), we simplify the AST itself.

This could potentially be quite slow?

jgrillo-grapl commented 4 years ago

I learned the other day that Hypothesis has an extension which does this: https://hypothesis.readthedocs.io/en/latest/extras.html#hypothesis-lark

May be relevant to this issue.

proptest-rs / proptest

Idea: String generation using Context Free Grammars #61