I've added invVec for Fr and Fp on the dev branch. Could you try it? Thank you very much.
I just had a quick go. The batch inversion works correctly and it's quite fast indeed.
Tested and benchmarked (Apple M2 Max) with a vector of 1000 elements against a naive loop and my js version (Fr-only).
Fr: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 0.30 ms (in-place)
invVec_js: 1.06 ms
naive: 3.59 ms
Fp: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 0.73 ms (in-place)
naive: 6.72 ms
So mcl.invVec for mcl.Fr is about 3x faster than a js-based implementation of the batching algorithm and about 10x faster than a naive loop of individual inversions. The Fp version shows similar ratios.
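A harness of roughly this shape reproduces the measurement (a sketch, not the exact benchmark code; it assumes the dev branch's in-place invVec and mcl-wasm's inv):

const mcl = require('mcl-wasm');

// average the per-run time of fn over `runs` iterations
const benchmark = (label, runs, fn) => {
  const t0 = performance.now();
  for (let i = 0; i < runs; i++) fn();
  console.log(`${label}: ${((performance.now() - t0) / runs).toFixed(2)} ms`);
};

(async () => {
  await mcl.init(mcl.BLS12_381);
  const xVec = Array.from({ length: 1000 }, () => {
    const x = new mcl.Fr();
    x.setByCSPRNG();
    return x;
  });
  // the dev-branch invVec is in-place; inverting twice round-trips
  // the values, so repeated runs stay well-defined
  benchmark('mcl.invVec', 1000, () => mcl.invVec(xVec));
  // naive: one scalar inversion per element
  benchmark('naive', 1000, () => xVec.map((x) => mcl.inv(x)));
})();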
I see the batch inversion is "in place" and that you have added a clone() method. As a suggestion, the invVec method could take a parameter inPlace: boolean to let the user choose the strategy that best suits the use case (this would make sense only if there's a faster way to clone than doing the cloning in user code, which is easy).
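Something along these lines (a hypothetical signature to illustrate the suggestion, not actual mcl-wasm code):

// hypothetical: clone only when the caller asks for copy semantics,
// then run the existing in-place batch inversion on the chosen vector
const invVec = (xVec, inPlace = false) => {
  const vec = inPlace ? xVec : xVec.map((x) => x.clone());
  _invVecInPlace(vec); // stands in for the current in-place WASM routine
  return vec;
};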
My js function always returns a new vector and does not modify the input elements. If I naively clone each element of the input array before calling mcl.invVec, I get these results:
Fr: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 0.51 ms (cloned)
invVec_js: 1.02 ms
naive: 3.61 ms
Fp: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 1.02 ms (cloned)
naive: 6.72 ms
So there is obviously a performance hit of ~60%, but that's still a 2x improvement over the js-only version.
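The "cloned" rows correspond to a pattern like this (assuming the dev branch's in-place invVec):

// clone first so the input vector is left untouched, then invert the clones
const yVec = xVec.map((x) => x.clone());
mcl.invVec(yVec); // batch-inverts yVec in place; xVec is unmodified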
For smaller vectors (say ~256 elements, typical for example in Verkle tries) the benefits are obviously less noticeable but keep the same relative improvements:
Fr: vector size = 256, runs = 1000. Averages:
mcl.invVec: 0.08 ms (in-place), 0.12 ms (cloned)
invVec_js: 0.28 ms
naive: 0.90 ms
Interestingly, the author of the Montgomery WASM library (optimized MSM for BLS12-381) reports that "There's zero benefit in moving a routine like batchInverse to Wasm [...] the inverse function is about the highest level that Wasm should operate on".
BTW as a minor optimization:

const _cloneArray = (x) => {
  const cstr = x.constructor;
  const r = new cstr();
  r.a_ = new Uint32Array(x.a_);
  return r;
};
could be rewritten as
const _cloneArray = (x) => {
  const cstr = x.constructor;
  const r = new cstr();
  r.a_.set(x.a_);
  return r;
};
since new cstr() already creates the a_ array buffer.
I renamed invVec to invVecInPlace and added invVec, which returns the processed value.
https://github.com/herumi/mcl-wasm/commits/dev/
Though I don't see the referenced library, my library has the overhead of copying wasm data to a JavaScript array and vice versa, so invVec may be faster.
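In usage terms the split is (a minimal sketch):

// invVecInPlace: overwrites the elements of xVec, allocates no new objects
mcl.invVecInPlace(xVec);

// invVec: leaves xVec untouched and returns a fresh vector of inverses
const yVec = mcl.invVec(xVec);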
Awesome!
Updated tests:
Fr: vector size = 1000, runs = 1000. Average per run:
- invVecInPlace: 0.31 ms
- invVec: 0.47 ms
- invVec_js: 1.12 ms
- naive: 3.68 ms
Fp: vector size = 1000, runs = 1000. Average per run:
- invVecInPlace: 0.72 ms
- invVec: 0.89 ms
- naive: 6.70 ms
Looks great - thank you so much 🙏
I have a question. This is the current _invVec (which does not change the input vector xVec):
const _invVec = (func, xVec) => {
  const n = xVec.length;
  if (n == 0) return [];
  const cstr = xVec[0].constructor;
  const yVec = Array.from({ length: n }, _ => new cstr());
  const stack = mcl_1.mod.stackSave();
  const yPos = mcl_1.mod.stackAlloc(xVec[0].a_.length * 4 * n); // <====== why??
  const xPos = _sarrayAllocAndCopy(xVec);
  func(yPos, xPos, n);
  // copy from WASM memory into yVec
  _saveArray(yVec, yPos);
  // restore stack and return yVec
  mcl_1.mod.stackRestore(stack);
  return yVec;
};
The yPos stack allocation seems unnecessary because we can do exactly what invVecInPlace does and then save into a new y array. See the following:
const _invVec = (func, xVec) => {
  const n = xVec.length;
  if (n == 0) return [];
  // Do the same as invVecInPlace does
  const stack = mcl_1.mod.stackSave();
  const xPos = _sarrayAllocAndCopy(xVec);
  func(xPos, xPos, n);
  // Now allocate the return vector yVec
  const cstr = xVec[0].constructor;
  const yVec = Array.from({ length: n }, _ => new cstr());
  // copy from WASM memory into yVec
  _saveArray(yVec, xPos);
  // restore stack and return yVec
  mcl_1.mod.stackRestore(stack);
  return yVec;
};
My understanding of the MCL invVecWork implementation is that when x and y refer to the same memory location, no extra memory is needed for y, and I would assume there are fewer memory accesses as well. On the other hand, when x and y are different, we need enough memory for both arrays. That makes sense in C/C++ to support both the in-place behavior and the copy behavior. However, in JS/WASM the actual xVec is not touched at all unless we overwrite its a_ buffer, which we don't if we use a fresh new yVec. So it would seem to make sense to always use the same x and y (the "in-place" behavior in C++) even when the desired behavior is "copy" in JavaScript.
Now, this probably has no measurable impact on either performance or memory (as long as there's enough heap, I guess), but I am interested in the mechanics at play here.
As a consequence, the main culprit for the difference in raw performance between invVec and invVecInPlace (and more generally any mcl-wasm API) is not really the copying of data to and from the WASM heap, but the creation of the Uint32Array storage for each mcl object (noticeable when vectors are big enough).
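That per-object cost is easy to isolate (a sketch, assuming mcl is already initialized; numbers will vary):

// time only the construction of n wrapper objects (one Uint32Array each),
// i.e. the extra work invVec does compared to invVecInPlace
const n = 1000;
const t0 = performance.now();
const yVec = Array.from({ length: n }, () => new mcl.Fr());
console.log(`allocating ${n} Fr objects: ${(performance.now() - t0).toFixed(2)} ms`);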
Thank you for helping with my understanding!
Thank you for the quick comment. You are right. I was in the process of trial and error. https://github.com/herumi/mcl-wasm/commit/b98a45fce7ef72ca49454ad003d1a6cf4ff0b313
It would be very useful for several use cases (e.g. DFT, KZG, Verkle trie multi-proofs, ...) to provide a native batch inversion algorithm such as Montgomery's trick.
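For reference, a minimal sketch of the trick in plain JavaScript, with mul and inv standing in for the field operations (e.g. mcl.mul / mcl.inv):

// Montgomery's trick: n inverses for the cost of one field inversion
// plus 3*(n-1) multiplications (in this form); xVec must contain no zeros
const batchInverse = (xVec, mul, inv) => {
  const n = xVec.length;
  if (n === 0) return [];
  // prefix[i] = x[0] * x[1] * ... * x[i]
  const prefix = [xVec[0]];
  for (let i = 1; i < n; i++) prefix.push(mul(prefix[i - 1], xVec[i]));
  // the single inversion: 1 / (x[0] * ... * x[n-1])
  let acc = inv(prefix[n - 1]);
  // unwind: y[i] = acc * prefix[i-1], then strip x[i] off acc
  const yVec = new Array(n);
  for (let i = n - 1; i > 0; i--) {
    yVec[i] = mul(acc, prefix[i - 1]);
    acc = mul(acc, xVec[i]);
  }
  yVec[0] = acc;
  return yVec;
};

With mcl-wasm this would be invoked as batchInverse(xVec, mcl.mul, mcl.inv).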
A naive javascript implementation using 2n multiplications and 1 inversion outperforms n individual inversions by 2x with n = 1000, so I suspect a native implementation would do much better. If I am not mistaken, MCL already has an optimized invVec function. Many thanks!