private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License

Specialize boolean array from PRSS implementations #1184

Open akoshelev opened 2 weeks ago

akoshelev commented 2 weeks ago

I've been running some benchmarks to evaluate potential regressions from some of my PRSS changes, and found that the impact of truncate_from is quite significant.

Here is the code that I used for benchmarking

/* use section */

fn make_prss() -> Arc<IndexedSharedRandomness> {
    let prss_setup = Endpoint::prepare(&mut thread_rng());
    let left_peer = KeyExchange::new(&mut thread_rng());
    let right_peer = KeyExchange::new(&mut thread_rng());

    let prss = prss_setup.setup(&left_peer.public_key(), &right_peer.public_key())
        .indexed(&Gate::default());

    prss
}

fn prss_benchmark(c: &mut Criterion) {
    // Setup PRSS outside measured code.
    let mut group = c.benchmark_group("prss_one_chunk");
    let prss = make_prss();

    // PRSS generates a pair of 16 byte values
    group.throughput(Throughput::Bytes(32));
    let mut index = 0_u32;
    group.bench_function("BA8", |b| {
        b.iter(|| {
            let data = prss.generate::<(BA8, _), _>(index);
            index += 1;
            black_box(data);
        })
    });
    group.bench_function("BA64", |b| {
        b.iter(|| {
            let data = prss.generate::<(BA64, _), _>(index);
            index += 1;
            black_box(data);
        })
    });
    group.finish();

    let mut group = c.benchmark_group("prss_two_chunks");
    group.throughput(Throughput::Bytes(64));
    group.bench_function("BA256", |b| {
        b.iter(|| {
            let data = prss.generate::<(BA256, _), _>(index);
            index += 1;
            black_box(data);
        })
    });
    group.finish();
}

criterion_group!(benches, prss_benchmark);
criterion_main!(benches);

BA8 is not that interesting, but BA64 performs worse than BA256, which was a bit surprising to me.

[two screenshots of benchmark results, 2024-07-05]

BA256 has 10x the throughput of BA64, and I think this can be explained by how the two types implement FromRandom.

BA256:

impl FromRandom for BA256 {
    type SourceLength = U2;
    fn from_random(src: GenericArray<u128, U2>) -> Self {
        (src[0], src[1]).into()
    }
}

impl From<(u128, u128)> for BA256 {
    fn from(value: (u128, u128)) -> Self {
        let iter = value
            .0
            .to_le_bytes()
            .into_iter()
            .chain(value.1.to_le_bytes());
        let arr = GenericArray::<u8, U32>::try_from_iter(iter).unwrap();
        BA256::deserialize_infallible(&arr)
    }
}
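As a standalone sketch of what that conversion produces (using plain u8 arrays as a stand-in for GenericArray, purely for illustration): the two u128 halves are simply concatenated in little-endian byte order, with no per-bit work.

```rust
// Illustrative stand-in for BA256's From<(u128, u128)>: concatenate the
// little-endian bytes of both halves. [u8; 32] replaces
// GenericArray<u8, U32> only for the purposes of this sketch.
fn concat_le(value: (u128, u128)) -> [u8; 32] {
    let mut out = [0u8; 32];
    out[..16].copy_from_slice(&value.0.to_le_bytes()); // low 128 bits
    out[16..].copy_from_slice(&value.1.to_le_bytes()); // high 128 bits
    out
}

fn main() {
    let bytes = concat_le((1, 2));
    assert_eq!(bytes[0], 1); // lowest byte of the first half
    assert_eq!(bytes[16], 2); // lowest byte of the second half
}
```

The two copy_from_slice calls compile down to straight memory copies, which is why this path is cheap despite producing twice as many bytes as BA64.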

BA64:

impl FromRandomU128 for BA64 {
    // This impl actually lives inside a macro, but that does not matter here.
    fn from_random_u128(src: u128) -> Self {
        Self::truncate_from(src)
    }
}

All boolean arrays implement a generic truncate_from operation:

impl U128Conversions for $name {
    fn truncate_from<T: Into<u128>>(v: T) -> Self {
        let v = v.into();
        let mut val = <Self as SharedValue>::ZERO;
        for i in 0..std::cmp::min(128, $bits) {
            val.set(i, Boolean::from((v >> i & 1) == 1));
        }

        val
    }

    fn as_u128(&self) -> u128 {
        (*self).into()
    }
}

For BA256 we sample two 128-bit chunks from PRSS and set them as the lowest 128 bits and the highest 128 bits, with no bit fiddling. BA64 sets its bits one by one.

We can specialize the FromRandom implementations for boolean arrays to set bits in batches instead. For any type larger than 1 byte, this should be much more efficient.
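A minimal sketch of the idea, using a plain [u8; 8] as a stand-in for BA64's internal bit storage (an assumption of this example, not the actual BA64 representation): the bit-by-bit loop and a masked little-endian copy produce identical bytes, but the latter is a handful of instructions.

```rust
// Mirrors the generic truncate_from: test each of the 64 low bits of the
// source and set the corresponding bit in the output, one at a time.
fn truncate_bit_by_bit(v: u128) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..64 {
        if (v >> i) & 1 == 1 {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}

// Specialized version: truncate to the low 64 bits and copy the bytes in
// one go (effectively a mask plus a memcpy).
fn truncate_in_batches(v: u128) -> [u8; 8] {
    (v as u64).to_le_bytes()
}

fn main() {
    let v = 0x0123_4567_89ab_cdef_fedc_ba98_7654_3210_u128;
    assert_eq!(truncate_bit_by_bit(v), truncate_in_batches(v));
}
```

For bit widths that are not a multiple of 8, the same approach should still work with a final mask on the last byte; the byte-level copy carries the bulk of the win either way.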

benjaminsavage commented 1 week ago

Wow... yeah, setting bit by bit is definitely not going to be as performant as just a memcpy :)