Serialize Counterexamples (a la Hypothesis)?

CodaFi commented 6 years ago

Our sister framework Hypothesis has a feature they call "The Database" where they serialize counterexamples to disk then run them before they start the actual testing loop.

I have some gut reactions to this, so I've laid out some problems that needs to be addressed first

[ ] Attaching mutable external state to tests

I'm most worried about this. Hypothesis goes to quite a lot of effort to make sure their example database is consistent and useful and even then they break it every so often between releases. On the one hand, it is incredibly useful - especially while practicing TDD - to write a property and have the framework just track your counterexamples, but there's got to be a less flaky way.

[ ] Universal serialization

Python kind of has it easy - everything is a key-value store at the end of the day, so everything serializes for free: pretty much all of the python middleware frameworks are capable of automatically deriving what would need to be Codable and Decodable instances for data.

[ ] Decreased test example diversity

The expected user model when a test fails now becomes

fail test
serialize counterexamples
rerun test with only counterexamples

This would seem to encourage growing test suites that exist solely to run through counterexamples which is not the point of property testing.

felix91gr commented 6 years ago

Regarding the how-to

Hmm... I think in order to have less flaky storage, we could do something like this:

Store the seeds of a counterexample. By remembering the parameters used for init, we could re-create the counterexamples even if binary serialization fails. Maybe storing in a Codable-derived JSON, this could work as well.
What if the class, struct or whatever mashup the counterexample was changes? If we are able to reliably track changes to them, we could at least notify the user and ask for input to update the seeds.

The problem with this would be the tracking involved. It’d require some compiler-level semantic knowledge, right? Or at least access to a data structure that represents the objects modeled in code and their relationships.

Regarding the point of property testing

I agree with you. But, just like the Generators in SwiftCheck allow, couldn’t we have the generators create new, unkown instances, as well as the counterexamples? I mean giving it a fixed set (the counterexamples) + an unbounded set (the rest of the domain).

And if the counterexamples are so many that they actually clog the generators, we could just give the c-ex set a weight and thus tell the main generator that it needs to take only as many counterexamples as it needs, and not all of them.

CodaFi commented 6 years ago

Store the seeds of a counterexample.

Would work, but would be terribly inefficient. Suppose I gather a large database of a hundred failing seeds, each of which fails test 99 of 100 or thereabouts. That's a little less than 10000 runs of the property testing block just to reproduce them all.

If we are able to reliably track changes to them

The problem with this would be the tracking involved. It’d require some compiler-level semantic knowledge, right? Or at least access to a data structure that represents the objects modeled in code and their relationships.

Now we're serializing the user's API as well? We don't have the capability to even introspect which function we're called from (hence the DSL). This is 110% the domain of a compiler plugin given the current state of Swift.

I agree with you. But, just like the Generators in SwiftCheck allow, couldn’t we have the generators create new, unkown instances, as well as the counterexamples?

A generator of counterexamples still needs to be created somehow. At the end of the day, something has to do the grunt work of coming up with them and storing them somewhere. Or, better yet, coming up with a consistent procedure for generating failures.

And if the counterexamples are so many that they actually clog the generators, we could just give the c-ex set a weight and thus tell the main generator that it needs to take only as many counterexamples as it needs, and not all of them.

Which kinda defeats the point of serializing them all in the first place. Hypothesis also keeps track of "interesting" test cases in their database and they specifically do this so they don't clog up the pipes. It just feels like a waste of effort on their part.

felix91gr commented 6 years ago

Hmm. And then... what do you think? Should this exist? How useful is it in practice?

CodaFi commented 6 years ago

For context, I watched this talk from one of our users who brought this up as a desirable feature. It sounds like a good and convenient thing to have if the right infrastructure is in place.

felix91gr commented 6 years ago

Hmm. Yes. I think I see what you mean.

What about the following?

We archive every counterexample. It is called The Regression Archive.
The counterexamples only store the tuple (Input, Property) that failed when they were discovered. We don’t want to run — as you said — 99 tests on one input which only broke the 100th.
We offer two modes of testing: Properties and Regressions.
Properties is the usual SwiftCheck run.
Regressions are run separately as a Unit Test suite.

And considering this to address its shortcomings:

The Regressions archive is bound to become redundant with time. In two ways:
- You’ll reach a point when you’ll know that your Properties passing imply some of the regressions passing as well. This should be detected automatically, if at all possible.
- Some of the Regressions will be redundant between themselves. The Archive needs to be concise, therefore we should eliminate “semantic duplicates” if possible.

Since adressing this automatically would be difficult, there should be a manual editor of the Archive where we show the amount of times a regression test has passed and group them by property or something.

Those are my thoughts atm. I think it’s a nice idea, but it feels like it should be separated from the Property Testing in order to be meaningful and to keep both clean.

Also, I’m assuming here that there is a way to ensure consistency or to warn of issues. Maybe there is a protocol that we could make that would enforce some kind of consistency throught time...? Idk. That part might as well be a compiler plugin 🤔

typelift / SwiftCheck

Serialize Counterexamples (a la Hypothesis)? #248

Regarding the how-to

Regarding the point of property testing