tyurek opened 1 week ago
My guess is that the overhead of the allocation is negligible relative to the cost of the inversion. But a few ideas:

- Add a `map` method to `SharedValueArray`, analogous to `array::map`, to enable converting to `[Scalar; N]` rather than a `Vec`.
- Use `.try_into()` after the `collect()` to convert to `[Scalar; N]`, and check the generated code (see the sketch after this list). I'm not sure whether the compiler would optimize this -- I've definitely seen cases where it doesn't, but I think the vectorization factor here is only 16 and the `.map()` operation is simple, so maybe it will figure it out.
- Make `<Fp25519 as Vectorizable<N>>::Array` an array of `Scalar`s rather than an array of `Fp25519`? There will almost certainly be issues to resolve, so I wouldn't pursue this unless there's a confirmed performance issue and none of the other strategies work.
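A minimal sketch of the `.try_into()` idea (not the actual PR code): it assumes curve25519-dalek's `Scalar` type, and the `u64` input with `Scalar::from` is a made-up stand-in for whatever `Fp25519`-to-`Scalar` conversion the real code performs.

```rust
use curve25519_dalek::scalar::Scalar;

// Sketch: collect into a Vec, then convert to a fixed-size array.
// This still allocates a Vec in the source; the open question in the
// thread is whether the optimizer elides that allocation.
fn to_scalar_array<const N: usize>(values: [u64; N]) -> [Scalar; N] {
    values
        .iter()
        .map(|&v| Scalar::from(v)) // stand-in per-element conversion
        .collect::<Vec<_>>()
        .try_into() // Vec<Scalar> -> [Scalar; N]
        .expect("mapped exactly N elements")
}
```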
I made the same change a few weeks ago and the perf boost was negligible. Hope that it is not the case here.
found the code: https://github.com/private-attribution/ipa/pull/1393
@andyleiserson pointed out that it was not the same thing -- I did batch decompress, while this PR is doing batch invert.
@andyleiserson I went with the `try_into` approach for now. What is your approach for checking generated code like this?

Also, how hard would it be to implement Iterable and the ability to index for `SharedValueArray`?
> What is your approach for checking generated code like this?
Either extract something that can be compiled standalone and use https://rust.godbolt.org/, or use `objdump -d` to inspect my own build. Or, a different way of checking this would be to find or create a suitable unit test or benchmark, and use `cargo flamegraph` to make sure there isn't a big chunk of time going somewhere you don't expect. (However, one problem with reading the flamegraphs is that inlining can obscure where the time is going.)
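For illustration, a standalone variant of the earlier sketch that could be pasted into godbolt, with `N` pinned to the vectorization factor of 16 mentioned above (the function name is made up; build with optimizations, e.g. `-C opt-level=3`):

```rust
use curve25519_dalek::scalar::Scalar;

// If the generated assembly contains a call into the allocator
// (e.g. __rust_alloc), the intermediate Vec was not optimized away.
#[no_mangle]
pub fn convert16(input: &[u64; 16]) -> [Scalar; 16] {
    input
        .iter()
        .map(|&v| Scalar::from(v))
        .collect::<Vec<_>>()
        .try_into()
        .expect("exactly 16 elements")
}
```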
> Also, how hard would it be to implement Iterable and the ability to index for `SharedValueArray`?
It already has `IntoIterator`. The problem with some kinds of iteration and indexing is that `SharedValueArray` can also be a bit vector. I'm also somewhat opposed to indexing into `SharedValueArray` on principle -- vectorized implementations should always be repeating the same operation for each element of the vector.
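To illustrate that principle with made-up types (nothing here is the actual ipa API): a vectorized implementation applies one operation uniformly to every lane, so it never needs to index.

```rust
// Hypothetical: the same operation is repeated for each element of the
// vector, so no per-element indexing is required.
fn double_all<const N: usize>(lanes: [u64; N]) -> [u64; N] {
    lanes.map(|x| x.wrapping_add(x))
}
```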
There's one part of the PRF code that can likely be sped up by using batch inversion of Fp25519. This PR adds a function that calls the Scalar `batch_invert` in curve25519-dalek, which can be used there.

The function I wrote is not memory efficient at all, because I couldn't figure out how to effectively work with `<V as Vectorizable>::Array`. I'd appreciate some suggestions here :)

Using `Vec`s appears to be necessary, as it's part of the dalek `batch_invert` code. To re-implement that code using arrays, we would need to fork dalek to expose `UnpackedScalar` and the pack/unpack methods.
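For context, a rough sketch of the kind of wrapper being described, assuming curve25519-dalek's `Scalar::batch_invert` (available with the `alloc` feature); the surrounding names are stand-ins rather than the actual code in this PR:

```rust
use curve25519_dalek::scalar::Scalar;

// Invert N nonzero scalars with a single field inversion (Montgomery's
// trick, as implemented by dalek). `Scalar::batch_invert` takes a
// mutable slice -- hence the Vec -- replacing each element with its
// inverse and returning the product of all the inverses (ignored here).
fn invert_all<const N: usize>(values: [Scalar; N]) -> [Scalar; N] {
    let mut buf: Vec<Scalar> = values.to_vec();
    Scalar::batch_invert(&mut buf);
    buf.try_into().expect("length unchanged")
}
```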
Even so, this version appears to be faster:

- 2M records in 6:12 https://draft-mpc.vercel.app/query/view/sore-deal2024-11-21T1848
- 4M records in 12:12 https://draft-mpc.vercel.app/query/view/roast-loft2024-11-21T2017

Compared to current main:

- 2M in 6:22 https://draft-mpc.vercel.app/query/view/gaudy-whack2024-11-21T1856
- 4M in 12:27 https://draft-mpc.vercel.app/query/view/beta-porch2024-11-21T2003