Add a generator for recognized regular languages

pmetzger commented 1 month ago

This is a very long term suggestion; it is not urgent and very much a wish list item. This suggestion is currently pretty fuzzy and I haven't thought through all the implications, but I wanted to get it down where I wouldn't forget it.

The nature of regular languages is such that it is possible to use a regular grammar not only to recognize a language but also to generate random strings in the recognized language.

When debugging, it is obvious when a grammar fails to recognize a thing that the user is expecting to be recognized, but much less obvious when the grammar recognizes something the user really didn't expect, because a programmer is unlikely to be feeding the grammar examples of things they didn't realize would be recognized.

It would be interesting, for debugging and testing purposes, to generate examples of what a given language recognizes, both to see if anything unexpected shows up, and to produce examples for automated regression testing.

It might also be interesting to use random "fuzzing" techniques to generate strings that are close to recognized strings but aren't members of the recognized set.

Note that it is very very difficult to prove that two regular grammars recognize the same language, but it is trivial to show that on a generated large set of examples that they recognize and fail to recognize the same strings. As an alternative to producing a pre-defined set of examples for testing grammar equivalence during refactoring of grammars, one might use "fuzzing" techniques to look for strings that are recognized (or not recognized) by one grammar but not recognized (or recognized) by the other.

skvadrik commented 1 month ago

I wonder if you have seen the --skeleton option, I added it long ago in order to verify codegen optimizations (to check that the strings recognized before the optimizations are recognized in the same way after optimization - I think what I needed to check was DFA minimization correctness). It's very similar to what you are suggesting - re2c generates a set of example strings that together form a path cover of the DFA.

Also we do have a fuzzer contributed by @trofi: look in the fuzz/ subdirectory (there're a few fuzzers there, e.g. https://github.com/skvadrik/re2c/blob/master/fuzz/regex_tdfa/check.hs, and new ones can be created by copy-pasting this one and adapting it to one's needs). It's based on the haskell QuickCheck fuzzing library, but it should not be difficult to port it to some other language.

I used fuzzer with skeleton extensively to capture a lot of bugs during my work on TDFA - it was one most useful tool that I ever had for testing. As long as you have something to compare (two algorithms, binaries, options, etc.) you can just generate skeleton data with one and feed it into the other and see if it fails, and vice versa (and do that on an infinite number of regexps supplied by the fuzzer). But for this tool I would have much lower confidence that my algorithms are correct. :)

pmetzger commented 2 weeks ago

So those are really useful as created for debugging re2c itself. I'm kind of interested in tools for helping users debug the grammars they are writing. It might be that they could be used for both however.

skvadrik / re2c

Add a generator for recognized regular languages #493