Reduce bit twiddling in `run_wr_checks()`

Each wraparound check result is computed by taking the dot product of the encoded gradient and a random vector. To generate the vectors, we iterate over the output of an XOF two bits at a time. This minimizes the amount of XOF output we need to generate, but keeping track of where we are in the bit slice costs a fair bit of bit twiddling under the hood.

This computation requires iterating over the encoded gradient for each wraparound tests. We can avoid some bit twiddling if we can avoid splitting bytes of XOF output across iterations.

Divide the gradient into chunks of at most four elements, and for each chunk, consume a byte of XOF output (8 bits per byte, 2 bits per element). This strategy wastes a few bits per test if the gradient length is not a power of four. Overall we end up generating

chunk_count(4, dimension) * NUM_WR_CHECKS

bytes of XOF output instead of

chunk_count(4, dimension * NUM_WR_CHECKS)

However, this trade-off appears to be worth it: In Daphne's benchmarks, this saves 17% on encode time.

This is a breaking change because it changes the order in which we consume the output of the XOF.

junyechen1996 / draft-chen-cfrg-vdaf-pine

Reduce bit twiddling in `run_wr_checks()` #96