I've been running benchmarks to evaluate potential regressions from some of my PRSS changes and found that the impact of `truncate_from` is quite significant.

Here is the code I used for benchmarking:
```rust
/* use section */

fn make_prss() -> Arc<IndexedSharedRandomness> {
    let prss_setup = Endpoint::prepare(&mut thread_rng());
    let left_peer = KeyExchange::new(&mut thread_rng());
    let right_peer = KeyExchange::new(&mut thread_rng());
    prss_setup
        .setup(&left_peer.public_key(), &right_peer.public_key())
        .indexed(&Gate::default())
}

fn prss_benchmark(c: &mut Criterion) {
    // Set up PRSS outside the measured code.
    let mut group = c.benchmark_group("prss_one_chunk");
    let prss = make_prss();

    // PRSS generates a pair of 16-byte values.
    group.throughput(Throughput::Bytes(32));
    let mut index = 0_u32;
    group.bench_function("BA8", |b| {
        b.iter(|| {
            let data = prss.generate::<(BA8, _), _>(index);
            index += 1;
            black_box(data);
        })
    });
    group.bench_function("BA64", |b| {
        b.iter(|| {
            let data = prss.generate::<(BA64, _), _>(index);
            index += 1;
            black_box(data);
        })
    });
    group.finish();

    let mut group = c.benchmark_group("prss_two_chunks");
    group.throughput(Throughput::Bytes(64));
    group.bench_function("BA256", |b| {
        b.iter(|| {
            let data = prss.generate::<(BA256, _), _>(index);
            index += 1;
            black_box(data);
        })
    });
    group.finish();
}

criterion_group!(benches, prss_benchmark);
criterion_main!(benches);
```
BA8 is not that interesting, but BA64 performs worse than BA256, which was a bit surprising to me: BA256 has 10x the throughput of BA64. I think the difference can be explained by how they implement `FromRandom`.
BA256:

```rust
impl FromRandom for BA256 {
    type SourceLength = U2;

    fn from_random(src: GenericArray<u128, U2>) -> Self {
        (src[0], src[1]).into()
    }
}

impl From<(u128, u128)> for BA256 {
    fn from(value: (u128, u128)) -> Self {
        let iter = value
            .0
            .to_le_bytes()
            .into_iter()
            .chain(value.1.to_le_bytes());
        let arr = GenericArray::<u8, U32>::try_from_iter(iter).unwrap();
        BA256::deserialize_infallible(&arr)
    }
}
```
BA64:

```rust
impl FromRandomU128 for BA64 /* it is actually inside a macro, but that does not matter */ {
    fn from_random_u128(src: u128) -> Self {
        Self::truncate_from(src)
    }
}
```
All boolean arrays implement a generic `truncate_from` operation:

```rust
impl U128Conversions for $name {
    fn truncate_from<T: Into<u128>>(v: T) -> Self {
        let v = v.into();
        let mut val = <Self as SharedValue>::ZERO;
        for i in 0..std::cmp::min(128, $bits) {
            val.set(i, Boolean::from((v >> i & 1) == 1));
        }
        val
    }

    fn as_u128(&self) -> u128 {
        (*self).into()
    }
}
```
For BA256 we sample two chunks from PRSS and set them as the lowest and highest 128 bits with no bit fiddling, while BA64 sets its bits one by one.

We can specialize the `FromRandom` implementations for boolean arrays to set bits in batches. For any type larger than 1 byte, that should make the conversion much more efficient.