RFC: Quickcheck-like testing in snarkyjs

mitschabaude commented 1 year ago

Motivation

In order to ensure that critical operations like signing transactions with mina-signer work correctly, we need to include test cases with a wide diversity of different examples (https://github.com/o1-labs/snarkyjs/issues/712), including negative tests for invalid examples.

We could keep writing these test cases by hand: https://github.com/o1-labs/snarkyjs/blob/756733f8e5edc91d2f84d36e9c4dc77001c2c9ac/src/mina-signer/src/test-vectors/accountUpdate.ts#L6-L12 But that comes at a cost:

Since the structures that should be tested -- various kinds of transactions -- are very large, creating non-trivial examples takes a lot of time and code.
Some of the structures (zkApp transactions) change frequently. Every time this happens, hard-coded examples need to be updated. This would be extremely annoying if we had more than 1-2 hand-written examples.

This RFC proposes a solution to the problem of testing a large variety of transactions in an automated way.

Proposal: Home-grown quickcheck

This proposal is partially implemented here: https://github.com/o1-labs/snarkyjs/pull/718

We propose to add our own framework for specifying random generators of complex structures. This builds on two observations:

Transactions are built out of only a handful of simple "leaf" types, like "uint64", which are combined in arrays and records.
It's easy to create random generators for the leaf types
With the right API, it becomes easy to combine leaf random generators into generators for higher-level structures. In part, this can even be automated because we have a system of auto-generating a description of the structure of some complex types, from OCaml code: https://github.com/o1-labs/snarkyjs/blob/756733f8e5edc91d2f84d36e9c4dc77001c2c9ac/src/provable/gen/js-layout.ts

The core random generator type looks like this:

type Random<T> = { next(): T };

Example simple implementation for a boolean:

const boolean: Random<boolean> = { next: () => Math.random() > 0.5 };

// or, slightly prettified using a constructor
const boolean: Random<boolean> = Random(() => Math.random() > 0.5);

(note: our actual implementation exclusively uses secure randomness instead of Math.random())

Example for what combining generators can look like:

// some combined type
type AccountBalance = { address: PublicKey, balance: Uint64 };

// generator for combined type
const AccountBalance: Random<AccountBalance> = Random.record({
  address: Random.publicKey,
  balance: Random.uint64
};

Creating random examples just means calling .next():

let example = AccountBalance.next();

Examples of useful combinators are record, map, array, oneof, constant, reject. All of them are trivial to build.

Generators for types like public key, are easy to build on top of our existing custom cryptography:

let privateKey = Random(PrivateKey.random);
let publicKey = map(privateKey, sk => PrivateKey.toPublicKey(sk));

Important consideration: Often, great testing value comes from checking a few typical edge cases, like a 0, or the max uint64, or an empty array. Therefore, core generators are carefully built to produce a number of special values with much increased probability.

For positive numbers, we will typically use a log-uniform distribution, where a number of size $2^n$ has a similar probability to be generated, for every n. In other words, small values are much more likely than large ones.

Here are some interfaces we propose for generating positive numbers:

nat(10); // positive number from 0 to n, bias towards special values like 0, 1, n; log uniform
uint32, uint64; // log uniform positive bigints with predefined max value; bias for special values
field, scalar; // special range that fits our finite fields; bias for special values; otherwise, _uniformly_ distributed (not log) as is typically the case for hashes etc

Structures like arrays take a size parameter which can be either a number or -- more useful -- a number generator:

array(field, nat(1000)); // array of field elements, of size up to 1000

Note that because of the log uniform distribution, most generated arrays will be small, even if we allow a fairly high max value like 1000. The bias towards values like 0 in nat ensures that empty arrays will be among the generated cases, which is always useful.

Test runner

The core need outlined above is basically covered by only having generators for transaction structures in place. However, it's only a small step from there to add a convenient test runner which will check assertions against 10-100 random examples:

test(publicKey, (pk, assert) => {
  let g = PublicKey.toGroup(pk);
  assert(Group.isOnCurve(g), "public key can be converted to a valid curve point");
});

Proposal: the test runner just tries to be smart and run at least 10 tests (or 20?), then check how long the tests took, and if its fast enough to fit in 100ms, runs up to 100 tests.

Remark: currently, our assert is much less polished than something like expect (from Jest, but available stand-alone), which is able to display deep diffs between objects in a pretty way. We plan to often just use a good solution like expect inside a test run. Throwing an error is equivalent to failing an assert: the test is stopped and the counter-example printed, along with the error / assert message

It should also be possible to specify hand-picked test cases that have to be used with 100% probability. The intention is to achieve this via the normal Random<T> interface, by specifying a generator that is guaranteed to generate some specific values first. Example for a test where we definitely want to test an all-zeros array, and an all-ones array:

let array_: Random<bigint[]> = withHandPicked(
  array(uint32, nat(10)),
  [0n, 0n, 0n],
  [1n, 1n, 1n]
);

test(array_, (...));

Invalid cases

We propose a pattern where most of the core types have generators for invalid values (like, out of range for numbers) attached right next to them:

Random.uint64 // valid uint64
Random.uint64.invalid // numbers out of uint64 range

Combinators like array and record will have an "invalid" version where they intelligently generate just slightly invalid examples, like a record where one of the fields has an invalid value. This is typically much more useful than testing cases where all values are invalid. They can do so by inspecting the opional .invalid property of their input generators.

If we want fine-grained control over the type of invalidity we are testing, of complex structures, this will be easiest to achieve with map:

const InvalidNonceAccountUpdate = map(AccountUpdate, Random.uint32.invalid, (update, nonce) => {
  update.body.preconditions.account.nonce.value = { lower: nonce, upper: nonce };
  return update;
});

The test runner is planned to support a runner for negative tests where every input is expected to either throw or fail some assertion:

// only succeeds if _each_ test case ends negative (throws or exhibits a falsy assert)
testNegative(Random.signedPayment.invalid, (payment, assert) => {
  assert(verifyPayment(payment));
});

Alternative approaches

Two main alternatives emerged during the research on this problem:

Building on top of a well-designed quickcheck-style library: https://github.com/dubzzz/fast-check
- Pro: battle-tested, includes convenience capabilities that we won't implement, like shrinking and replayable seeds.
- Con: harder to specialize towards our needs than home grown solution
- integration with derivers is an unknown
- custom generators like publicKey would have to be built by extending a complex class
- invalidity testing that's useful for us doesn't seem to be built in
- Con: scope way beyond what we need, at the cost of complexity
- Conclusion: the pros are just nice to haves; the cons are too expensive
Exposing existing quickcheck generators for transactions from OCaml, and using those
- Pro: battle-tested
- Pro: already solved the non-trivial problem of generating transactions that are valid / invalid (from the point of view of the tx logic)
- Con: additional work of writing and maintaining bindings
- converters from and to JS values for each operation
- maintain matching types on the TS side (types / nice to use surface is often > half the work)
- Con: loses the fast iteration speed and easy remixability of a JS solution
- new binding for new generator; new binding for new combinator; instead of just let newStuff = map(something, () => {..})
- Con: harder to parse stack traces; much harder to debug when test fails
- Con: overhead cost of separate repo, seperate language, separate build step, separate branches
- Conclusion: cons are too costly of justifying for simple immediate use cases like mina-signer (where transactions don't need to be valid against tx logic). We want a JS solution either way for quick iteration, arbitrary use cases.
- Could be handy if more complex use cases are needed and turn out to pose difficulties when implementing in JS

mitschabaude commented 1 year ago

Update: I think I'll tweak the generator interface to implement a way to reset / start or to create a fresh instance, in order to implement generators with memory -- like those that generate a fixed set of examples first

mitschabaude commented 1 year ago

Open question re: invalid testing:

Is it enough to test invalid cases which have the correct type? Or should there be a mode of passing in random garbage instead of values with the proper type? Almost like a fuzzing mode?

nicc commented 1 year ago

This is very helpful! 🤸

My reflexive response is that we shouldn't reinvent this wheel but you do make a solid argument for building a new thing. I would love to hear what others think about the ongoing effort to build and maintain it.

Anecdotally, I’ve had terrible experiences leaning into generative testing just because I thought it was cool. I don't mean to suggest that we're doing this here! I just want to share my mistakes 😅 ...

The main issue I experienced was that we weren't careful with the kind of tests that required generated data. We used it everywhere and it became a huge pain to generate sequential interactions that made sense and actually tested the flows we were after. We'd constantly generate something, then have to write a bunch of very specific functions that generated a second thing based on the first, and so on. Of course this can be done properly or at least better than we did it, we just stumbled into it without realising there's a fair amount of domain modelling to be done. So we unded up reusing the actual generators but still had this proliferation of interdependencies between generated values and it became a huge hassle to constantly constrain the generators on a case-by-case basis to exercise the top-level use cases we were after. It certainly wasn't a case of "let the test suite generate test cases for you, yay". I suspect this was due to using generated data for high-level tests without properly building a kind of "domain simulator". Unit-level tests were a lot easier.

To be fair, that was in untyped languages. I suspect that something like quickcheck works because it leverages a solid type system to lift you into this property-based assertion at the function level - e.g. reverse reverse. I think there's a notable difference between that - declaring properties such as invariants on functions - and hand-crafting tests of indeterminate scope using generated data. So I do worry about the maintainability of passing individual generators into a higher-order function where you wire up a test by hand. Mostly for the reasons stated above - actually generating usable data sets as required by high-level tests can become a whole little domain model on its own.

Do we have a sense of which kinds of tests would use generators and which ones would not? I think that would help me assess this decision.

mitschabaude commented 1 year ago

Thanks @nicc, these are some very interesting learnings! Definitely the kind of global, domain-specific interdependencies needed in complex structures like transactions where on my mind since I started this, and I came to a clear picture about the small amount of domain modelling that will be needed for my immediate use cases in mina-signer. Similar to what you described, I anticipate this to be done in a custom map function where we override some of the initial (naive, locally-random) fields with good values. For signing tests, this only has to be done for a single account update field (callDepth), as far as I'm aware.

Do we have a sense of which kinds of tests would use generators and which ones would not? I think that would help me assess this decision.

I think given your experiences, we should be mindful of when to use this -- specifically in cases where we need many examples but not too many / too complex operations on those. It could also be that sometimes we only want to uses generated values and not the test runner.

There are two kinds of tests where I definitely want to use them right away:

generating many examples of simple transactions: payments / stake delegations
generating many examples of zkapp transactions

In both cases, the main use cases are to check that

hashing / signing these is consistent between different implementations
invalid inputs are rejected with the expected error

But I can think of other use cases right away as well, like

low-level crypto tests where the generated values are elliptic curve points and field elements
- this is classic property testing. like, we check that some expected algebraic law holds
- for things like this, there's no big advantage over what we do now, it's just a more convenient syntax for running something on many random examples

I think there's a notable difference between that - declaring properties such as invariants on functions - and hand-crafting tests of indeterminate scope using generated data. ... actually generating usable data sets as required by high-level tests can become a whole little domain model on its own.

I thought that quickcheck has pretty much the same problem - that its hard to craft useful data sets / generators for complex processes.

I pondered making the API more like quickcheck (input == function that returns a boolean, instead of an arbitrary function which can do arbitrary checks), but it's just less flexible and would incur a higher cost of generating inputs, because you need to run many small tests, each with many inputs. (note - generating some of our inputs, like public keys, is quite costly)

mitschabaude commented 1 year ago

TL;DR - there are some use cases where we definitely need the random generation (for variety and maintainability of the tests).

And apart from those, we don't need to lean into this framework where it doesn't have a clear purpose

mitschabaude commented 1 year ago

Updated type interface which allows random generators with memory (https://github.com/o1-labs/snarkyjs/pull/718/commits/5e9c5a95fe29f341ca93ab4333a73a2babb95559)

type Random<T> = { create(): () => T }

there's one more layer of function call: the create() function creates one instance of the generator. The instance is just a function that returns values, like the next() function before.

mitschabaude commented 1 year ago

Update: this RFC can now be viewed in full action in a test where random payments and delegations are tested: https://github.com/o1-labs/snarkyjs/pull/718/commits/d0f009741578b59278114ddcb4070e80ceec972b

Fun fact: This test would have caught the same bug (binprot serialization of numbers) that Joseph caught during the mina-signer audit. (I had to cherry-pick the binprot fixes to make it work.) The bug would've been uncovered not by writing any new test logic, but just by auto-generating a wider variety of inputs than we had previously hard-coded. Which demonstrates the point of doing all this pretty well :D

nicc commented 1 year ago

Thanks @mitschabaude, this is sounding good. Hope I haven't kept you waiting on a reply!

we should be mindful of when to use this -- specifically in cases where we need many examples but not too many / too complex operations on those Very much agreed and this makes me a lot more comfortable with the approach. This is a matter of discipline, which is easy to say and harder to do so we should just remember this decision and be willing to pull back on generative tests early if we see signs that it's becoming cumbersome.

I also like the examples you gave. That seems like appropriate use of generative testing to me.

I thought that quickcheck has pretty much the same problem - that its hard to craft useful data sets / generators for complex processes. That's interesting! I guess it substantiates that the real focal point is the scope of the test. Type-inferred generators allow us to declare properties about functions more directly / more "functionally" but it's really about the volume and interdependence of data where things go wrong. A hypothesis anyway. Very interesting.

Let's go ahead with it on the understanding that we keep an eye on the things we've surfaced here. I'll approve the PR and link back to this. Thanks Gregor!

bkase commented 1 year ago

I think we'll need another mini-RFC before we go forward re-implementing the quickcheck generators for zkapp transactions given the pain the protocol team went through to build valid generators for those. Is that alright @mitschabaude ?

mitschabaude commented 1 year ago

I think we'll need another mini-RFC before we go forward re-implementing the quickcheck generators for zkapp transactions given the pain the protocol team went through to build valid generators for those. Is that alright @mitschabaude ?

@bkase yes, I can describe that in a small paragraph, or show it with code. But the answer is, the protocol team had much more complex needs: generating sets of valid transactions that each pass the tx logic and can be applied to a ledger. Our need for mina-signer is just that the transactions have to be syntactically correct on a basic level. The signing code doesn't check any preconditions or balances.

Therefore, our needs can be addressed very easily, but an RFC might be warranted if we'd want to use the framework for complex testing like the protocol team.

nicc commented 1 year ago

Thanks @bkase, good call. I can buy into the rationale that it's reasonable to generate simple, isolated transactions within mina-signer tests but duplication of any significant complexity would definitely present a problem. Very much agreed that we should take note of input from the protocol team

jspada commented 1 year ago

This sounds like extremely useful work. There are two things that I think are important requirements, which may already be covered, but I will note them just in case

Special common test cases such as zero, max, max + 1 should always be covered. We do not want to leave it to randomness that these tests are present and could (even with very low probability) be missing.
All tests should be easily reproducible to the developer should any of them fail.

mitschabaude commented 1 year ago

Closing this because it's implemented!

o1-labs / o1js