tyurek opened 1 week ago
My guess is that the overhead of the allocation is negligible relative to the cost of the inversion. But a few ideas:

- Add a `map` method to `SharedValueArray`, analogous to `array::map`, to enable converting to `[Scalar; N]` rather than a `Vec`.
- Use `.try_into()` after the `collect()` to convert to `[Scalar; N]`, and check the generated code (see the sketch after this list). I'm not sure whether the compiler would optimize this -- I've definitely seen cases where it doesn't, but I think the vectorization factor here is only 16 and the `.map()` operation is simple, so maybe it will figure it out.
- Make `<Fp25519 as Vectorizable<N>>::Array` an array of `Scalar`s rather than an array of `Fp25519`? There will almost certainly be issues to resolve, so I wouldn't pursue this unless there's a confirmed performance issue and none of the other strategies work.
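A minimal sketch of the `.try_into()` idea (not the actual PR code): it assumes curve25519-dalek's `Scalar` type, and the `u64` input with `Scalar::from` is a made-up stand-in for whatever `Fp25519`-to-`Scalar` conversion the real code performs.

```rust
use curve25519_dalek::scalar::Scalar;

// Sketch: collect into a Vec, then convert to a fixed-size array.
// This still allocates a Vec in the source; the open question in the
// thread is whether the optimizer elides that allocation.
fn to_scalar_array<const N: usize>(values: [u64; N]) -> [Scalar; N] {
    values
        .iter()
        .map(|&v| Scalar::from(v)) // stand-in per-element conversion
        .collect::<Vec<_>>()
        .try_into() // Vec<Scalar> -> [Scalar; N]
        .expect("mapped exactly N elements")
}
```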
I made the same change a few weeks ago and the perf boost was negligible. Hope that it is not the case here.
found the code: https://github.com/private-attribution/ipa/pull/1393
@andyleiserson pointed out that it was not the same thing -- I did batch decompress, while this PR is doing batch invert.
@andyleiserson I went with the `try_into` approach for now. What is your approach for checking generated code like this?

Also, how hard would it be to implement Iterable and the ability to index for `SharedValueArray`?
> What is your approach for checking generated code like this?
Either extract something that can be compiled standalone and use https://rust.godbolt.org/, or use `objdump -d` to inspect my own build. Or, a different way of checking this would be to find or create a suitable unit test or benchmark, and use `cargo flamegraph` to make sure there isn't a big chunk of time going somewhere you don't expect. (However, one problem with reading the flamegraphs is that inlining can obscure where the time is going.)
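For illustration, a standalone variant of the earlier sketch that could be pasted into godbolt, with `N` pinned to the vectorization factor of 16 mentioned above (the function name is made up; build with optimizations, e.g. `-C opt-level=3`):

```rust
use curve25519_dalek::scalar::Scalar;

// If the generated assembly contains a call into the allocator
// (e.g. __rust_alloc), the intermediate Vec was not optimized away.
#[no_mangle]
pub fn convert16(input: &[u64; 16]) -> [Scalar; 16] {
    input
        .iter()
        .map(|&v| Scalar::from(v))
        .collect::<Vec<_>>()
        .try_into()
        .expect("exactly 16 elements")
}
```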
> Also, how hard would it be to implement Iterable and the ability to index for `SharedValueArray`?
It already has `IntoIterator`. The problem with some kinds of iteration and indexing is that `SharedValueArray` can also be a bit vector. I'm also somewhat opposed to indexing into `SharedValueArray` on principle -- vectorized implementations should always be repeating the same operation for each element of the vector.
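To illustrate that principle with made-up types (nothing here is the actual ipa API): a vectorized implementation applies one operation uniformly to every lane, so it never needs to index.

```rust
// Hypothetical: the same operation is repeated for each element of the
// vector, so no per-element indexing is required.
fn double_all<const N: usize>(lanes: [u64; N]) -> [u64; N] {
    lanes.map(|x| x.wrapping_add(x))
}
```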
There's one part of the PRF code that can likely be sped up by using batch inversion of Fp25519. This PR adds a function that calls the Scalar `batch_invert` in curve25519-dalek, which can be used there.

The function I wrote is not memory efficient at all, because I couldn't figure out how to effectively work with `<V as Vectorizable>::Array`. I'd appreciate some suggestions here :)

Using `Vec`s appears to be necessary, as it's part of the dalek `batch_invert` code. To re-implement that code using arrays, we would need to fork dalek to expose `UnpackedScalar` and the pack/unpack methods.
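For context, a rough sketch of the kind of wrapper being described, assuming curve25519-dalek's `Scalar::batch_invert` (available with the `alloc` feature); the surrounding names are stand-ins rather than the actual code in this PR:

```rust
use curve25519_dalek::scalar::Scalar;

// Invert N nonzero scalars with a single field inversion (Montgomery's
// trick, as implemented by dalek). `Scalar::batch_invert` takes a
// mutable slice -- hence the Vec -- replacing each element with its
// inverse and returning the product of all the inverses (ignored here).
fn invert_all<const N: usize>(values: [Scalar; N]) -> [Scalar; N] {
    let mut buf: Vec<Scalar> = values.to_vec();
    Scalar::batch_invert(&mut buf);
    buf.try_into().expect("length unchanged")
}
```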
Even so, this version appears to be faster:

- 2M records in 6:12 https://draft-mpc.vercel.app/query/view/sore-deal2024-11-21T1848
- 4M records in 12:12 https://draft-mpc.vercel.app/query/view/roast-loft2024-11-21T2017

Compared to current main:

- 2M in 6:22 https://draft-mpc.vercel.app/query/view/gaudy-whack2024-11-21T1856
- 4M in 12:27 https://draft-mpc.vercel.app/query/view/beta-porch2024-11-21T2003