When running sealbench, we noticed some odd results: keygen for poly degree 2048 was faster than for 1024, which makes no sense. We reversed the order in which the benchmarks run, and this corrected the anomaly, but now 32768 runs more slowly than it should.
Finally, we kept the default order and ran 1024 twice. The second batch of results is significantly faster than the first (up to about 5x in wall time):
---------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------
n=1024 / log(q)=27 / KeyGen / Secret/iterations:10 392 us 245 us 10
n=1024 / log(q)=27 / KeyGen / Public/iterations:10 516 us 495 us 10
n=1024 / log(q)=27 / BFV / EncryptSecret/iterations:10 939 us 939 us 10
n=1024 / log(q)=27 / BFV / EncryptPublic/iterations:10 569 us 566 us 10
n=1024 / log(q)=27 / BFV / Decrypt/iterations:10 140 us 140 us 10
n=1024 / log(q)=27 / BFV / EncodeBatch/iterations:10 30.7 us 30.5 us 10
...
n=1024 / log(q)=27 / KeyGen / Secret/iterations:10 74.6 us 74.2 us 10
n=1024 / log(q)=27 / KeyGen / Public/iterations:10 153 us 152 us 10
n=1024 / log(q)=27 / BFV / EncryptSecret/iterations:10 344 us 342 us 10
n=1024 / log(q)=27 / BFV / EncryptPublic/iterations:10 456 us 452 us 10
n=1024 / log(q)=27 / BFV / Decrypt/iterations:10 59.0 us 58.7 us 10
n=1024 / log(q)=27 / BFV / EncodeBatch/iterations:10 12.6 us 12.4 us 10
I suspect the first set of experiments to run incurs page faults while loading code into memory, whereas the second batch simply pulls instructions from the ICache (or at least from memory without faulting). There may also be some data cache hits if the allocator is reusing memory.
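One way to check the page-fault hypothesis is to count the faults charged to the process around each batch. The following is only a sketch, assuming a POSIX system where `getrusage` is available; `page_faults_so_far` and `page_faults_during` are hypothetical helper names and are not part of sealbench or SEAL.

```cpp
#include <sys/resource.h>  // getrusage (POSIX; assumed available, e.g. Linux or macOS)
#include <functional>

// Hypothetical helper: total page faults (minor + major) charged to this
// process so far, as reported by the OS.
inline long page_faults_so_far()
{
    struct rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt + usage.ru_majflt;
}

// Hypothetical helper: number of page faults incurred while running `work` once.
inline long page_faults_during(const std::function<void()> &work)
{
    const long before = page_faults_so_far();
    work();
    return page_faults_so_far() - before;
}
```

If the first n=1024 batch reports far more faults than the second when each is wrapped in `page_faults_during`, that would support the explanation above; similar counts would point toward caches or the allocator instead.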
I propose running the first batch of benchmarks once during precomputation, without printing the results, to prime everything into the memory hierarchy.
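A minimal sketch of the warm-up idea, written against the C++ standard library only rather than sealbench's actual harness. `TimingResult` and `time_with_warmup` are hypothetical names used for illustration; how this would be wired into the existing benchmark registration is left open.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical result type, not part of sealbench.
struct TimingResult {
    std::size_t warmup_runs;  // untimed runs executed before measuring
    std::size_t timed_runs;   // runs included in the mean
    double mean_us;           // mean wall time per timed run, in microseconds
};

// Runs `work` `warmup_runs` times without timing it, then times `timed_runs`
// further runs and reports the mean wall time per run.
inline TimingResult time_with_warmup(const std::function<void()> &work,
                                     std::size_t warmup_runs,
                                     std::size_t timed_runs)
{
    assert(timed_runs > 0 && "need at least one timed run");

    // Untimed passes: fault in code pages, warm the caches, and let the
    // allocator settle. Their results are discarded.
    for (std::size_t i = 0; i < warmup_runs; ++i) {
        work();
    }

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < timed_runs; ++i) {
        work();
    }
    const auto stop = clock::now();

    const double total_us =
        std::chrono::duration<double, std::micro>(stop - start).count();
    return {warmup_runs, timed_runs, total_us / static_cast<double>(timed_runs)};
}
```

For example, `time_with_warmup(run_keygen_1024, 1, 10)` would mirror the `iterations:10` setting above with a single discarded pass, where `run_keygen_1024` stands in for whatever callable wraps the first benchmark.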