First set of benchmarks that run do so more slowly than they should.

When running sealbench, we noticed some odd results: keygen for poly degree 2048 was faster than for 1024, which makes no sense. We reversed the order in which the benchmarks run, and this corrected the anomaly, but now 32768 runs more slowly than it should.

Finally, we kept the default order and ran 1024 twice. The second batch of results is significantly faster than the first (2-6x faster!):

---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
n=1024 / log(q)=27 / KeyGen / Secret/iterations:10                          392 us          245 us           10
n=1024 / log(q)=27 / KeyGen / Public/iterations:10                          516 us          495 us           10
n=1024 / log(q)=27 / BFV / EncryptSecret/iterations:10                      939 us          939 us           10
n=1024 / log(q)=27 / BFV / EncryptPublic/iterations:10                      569 us          566 us           10
n=1024 / log(q)=27 / BFV / Decrypt/iterations:10                            140 us          140 us           10
n=1024 / log(q)=27 / BFV / EncodeBatch/iterations:10                       30.7 us         30.5 us           10
...
n=1024 / log(q)=27 / KeyGen / Secret/iterations:10                         74.6 us         74.2 us           10
n=1024 / log(q)=27 / KeyGen / Public/iterations:10                          153 us          152 us           10
n=1024 / log(q)=27 / BFV / EncryptSecret/iterations:10                      344 us          342 us           10
n=1024 / log(q)=27 / BFV / EncryptPublic/iterations:10                      456 us          452 us           10
n=1024 / log(q)=27 / BFV / Decrypt/iterations:10                           59.0 us         58.7 us           10
n=1024 / log(q)=27 / BFV / EncodeBatch/iterations:10                       12.6 us         12.4 us           10

I suspect the first set of experiments to run is incurring page faults while loading code into memory, whereas the second batch simply pulls instructions from the ICache (or at least from memory without faulting...). There may also be some data cache hits if the allocator is reusing memory.
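One way to check the page-fault hypothesis would be to count faults around each batch. The following is only a sketch, assuming a POSIX system where `getrusage` is available; `FaultCounts` and `page_faults_during` are hypothetical names for illustration and are not part of SEAL or sealbench.

```cpp
#include <sys/resource.h>

#include <functional>

// Hypothetical helper (POSIX only), not part of SEAL or sealbench.
struct FaultCounts
{
    long minor; // page faults served without disk I/O
    long major; // page faults that required disk I/O
};

// Returns how many page faults the process incurred while running `work`.
// Error handling for getrusage is omitted to keep the sketch short.
FaultCounts page_faults_during(const std::function<void()> &work)
{
    struct rusage before{};
    struct rusage after{};

    getrusage(RUSAGE_SELF, &before);
    work();
    getrusage(RUSAGE_SELF, &after);

    return { after.ru_minflt - before.ru_minflt,
             after.ru_majflt - before.ru_majflt };
}
```

If the first n=1024 batch reports noticeably more faults than the second, that would support the explanation above; if the counts are similar, something else (for example CPU frequency scaling) may be responsible.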

I propose running the first batch of benchmarks once during precomputation, without printing the results, to prime everything into the memory hierarchy.
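The proposal could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual sealbench code; `time_with_warmup` is a hypothetical helper, and the number of warm-up passes is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper, not part of SEAL or sealbench.
// Runs `work` `warmup_runs` times without timing it, then returns the mean
// wall-clock time in microseconds over `iterations` timed runs.
double time_with_warmup(const std::function<void()> &work,
                        std::size_t warmup_runs,
                        std::size_t iterations)
{
    assert(iterations > 0);

    // Untimed passes: nothing is measured or printed. Their only purpose is
    // to get code and data paged in and cached before measurement starts.
    for (std::size_t i = 0; i < warmup_runs; ++i)
    {
        work();
    }

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
    {
        work();
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

For example, calling `time_with_warmup(keygen, 1, 10)` with a callable `keygen` that wraps the n=1024 secret key generation would execute it eleven times and report the mean of the last ten only.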

microsoft/SEAL issue #625