herumi / mcl-wasm


Batch Inversion Algorithm #39

Closed atomictag closed 1 month ago

atomictag commented 1 month ago

It would be very useful for several use cases (e.g. DFT, KZG, Verkle trie multi-proofs, ...) to provide a native batch inversion algorithm such as

invVec(vec: Fr[]): Fr[]

A naive javascript implementation using 2n multiplications and 1 inversion outperforms n inversions by 2x with n=1000, so I suspect a native implementation would do much better. If I am not mistaken, MCL already has an optimized invVec function.
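For reference, here is a self-contained sketch of the batch-inversion trick (Montgomery's trick) over BigInt; the modulus `P` and the names `mod`/`inv`/`invVec` are illustrative only and unrelated to mcl's internals. It trades n modular inversions for 1 inversion plus O(n) multiplications:

```javascript
// Hypothetical standalone sketch of Montgomery's batch-inversion trick.
const P = 0xffffffff00000001n; // example prime modulus (Goldilocks), not the BLS12-381 Fr order
const mod = (a) => ((a % P) + P) % P;

// single modular inverse via Fermat's little theorem: a^(P-2) mod P
const inv = (a) => {
  let r = 1n, b = mod(a), e = P - 2n;
  while (e > 0n) {
    if (e & 1n) r = mod(r * b);
    b = mod(b * b);
    e >>= 1n;
  }
  return r;
};

// batch inversion: forward prefix products, one inversion, backward unwind
const invVec = (xs) => {
  const n = xs.length;
  if (n === 0) return [];
  const prefix = new Array(n);
  let acc = 1n;
  for (let i = 0; i < n; i++) {
    prefix[i] = acc;          // product of xs[0..i-1]
    acc = mod(acc * xs[i]);
  }
  let accInv = inv(acc);      // the only inversion in the whole routine
  const ys = new Array(n);
  for (let i = n - 1; i >= 0; i--) {
    ys[i] = mod(prefix[i] * accInv); // = 1 / xs[i]
    accInv = mod(accInv * xs[i]);    // drop xs[i] from the running inverse
  }
  return ys;
};
```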

Many thanks

herumi commented 1 month ago

I've added invVec for Fr and Fp at the dev branch. Could you try it?

atomictag commented 1 month ago

Thank you very much.

I just had a quick go. The batch inversion works correctly and it's quite fast indeed.

Tested and benchmarked (Apple M2 Max) with a vector of 1000 elements against a naive loop and my js version (Fr-only).

Fr: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 0.30 ms (in-place)
invVec_js:  1.06 ms
naive:      3.59 ms

Fp: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 0.73 ms (in-place)
naive:      6.72 ms

So mcl.invVec for mcl.Fr is about 3x faster than a js-based implementation of the batching algorithm and about 10x faster than a naive algorithm. The Fp version shows similar figures.

I see the batch inversion is done "in place" and that you have added a clone() method. As a suggestion, the invVec method could take an inPlace: boolean parameter to let the user choose the strategy that best suits the use case (this would only make sense if there is a faster way to clone than doing the cloning in user code - which is easy).

My js function always returns a new vector and does not modify the input elements. If I naively clone each element of the input array before calling mcl.invVec I get these results:

Fr: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 0.51 ms (cloned)
invVec_js:  1.02 ms
naive:      3.61 ms

Fp: vector size = 1000, runs = 1000. Averages:
mcl.invVec: 1.02 ms (cloned)
naive:      6.72 ms

So there is obviously a performance decrease of roughly 60%, but that is still a 2x improvement over the js-only version.

For smaller vectors (of say ~256 elements - typical for example in Verkle tries) the absolute benefits are obviously less noticeable, but the relative improvements hold:

Fr: vector size = 256, runs = 1000. Averages:
mcl.invVec: 0.08 ms (in-place), 0.12 ms (cloned)
invVec_js:  0.28 ms
naive:      0.90 ms
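For context, the averages above come from a straightforward timing loop; a minimal harness along these lines (hypothetical, not the actual benchmark code) is enough to reproduce the setup:

```javascript
// Minimal timing harness: average wall-clock ms per run.
// `performance.now()` is a global in Node.js >= 16 and in browsers.
const bench = (label, fn, runs = 1000) => {
  fn(); // warm-up so JIT/setup cost doesn't skew the timed runs
  const t0 = performance.now();
  for (let i = 0; i < runs; i++) fn();
  const avg = (performance.now() - t0) / runs;
  console.log(`${label}: ${avg.toFixed(2)} ms`);
  return avg;
};
```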

Interestingly, the author of the Montgomery WASM library (optimized MSM for BLS12-381) reports that: "There's zero benefit in moving a routine like batchInverse to Wasm [...] the inverse function is about the highest level that Wasm should operate on".

atomictag commented 1 month ago

BTW as a minor optimization:

const _cloneArray = (x) => {
    const cstr = x.constructor;
    const r = new cstr();
    r.a_ = new Uint32Array(x.a_);
    return r;
};

could be re-written as

const _cloneArray = (x) => {
    const cstr = x.constructor;
    const r = new cstr();
    r.a_.set(x.a_);
    return r;
};

since new cstr() already allocates the a_ array buffer.
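The difference can be seen with plain typed arrays (no mcl objects involved): `new Uint32Array(src)` allocates a second buffer and discards the one the constructor just made, while `TypedArray.prototype.set` copies into the existing one:

```javascript
const src = new Uint32Array([1, 2, 3, 4]);

// variant 1: the constructor's buffer is thrown away and a new one is allocated
let a_ = new Uint32Array(4);   // stand-in for what `new cstr()` creates
a_ = new Uint32Array(src);     // extra allocation + copy

// variant 2: reuse the buffer the constructor already allocated
const b_ = new Uint32Array(4); // stand-in for what `new cstr()` creates
b_.set(src);                   // copy only, no new allocation
```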

herumi commented 1 month ago

I renamed invVec to invVecInPlace and added invVec, which returns the processed values. https://github.com/herumi/mcl-wasm/commits/dev/

Though I haven't looked at the referenced library, my library has the overhead of copying wasm data to a JavaScript array and vice versa, so invVec may be faster.

atomictag commented 1 month ago

Awesome!

Updated tests:

Fr: vector size = 1000, runs = 1000. Average per run:
- invVecInPlace: 0.31 ms
- invVec:        0.47 ms
- invVec_js:     1.12 ms
- naive:         3.68 ms

Fp: vector size = 1000, runs = 1000. Average per run:
- invVecInPlace: 0.72 ms
- invVec:        0.89 ms
- naive:         6.70 ms

Looks great - thank you so much 🙏

atomictag commented 1 month ago

I have a question:

This is the current _invVec (which does not change the input vector xVec):

const _invVec = (func, xVec) => {
    const n = xVec.length;
    if (n == 0) return [];
    const cstr = xVec[0].constructor;
    const yVec = Array.from({ length: n }, _ => new cstr());
    const stack = mcl_1.mod.stackSave();
    const yPos = mcl_1.mod.stackAlloc(xVec[0].a_.length * 4 * n); // <====== why??
    const xPos = _sarrayAllocAndCopy(xVec);
    func(yPos, xPos, n);
    // copy from WASM memory into yVec
    _saveArray(yVec, yPos);
    // restore stack and return yVec
    mcl_1.mod.stackRestore(stack);
    return yVec;
};

The yPos stack allocation seems unnecessary because we can do exactly what invVecInPlace does and then save the result into a new y array. See the following:

const _invVec = (func, xVec) => {
    const n = xVec.length;
    if (n == 0) return [];
    // Do the same as invVecInPlace does
    const stack = mcl_1.mod.stackSave();
    const xPos = _sarrayAllocAndCopy(xVec);
    func(xPos, xPos, n);
    // Now allocate the return vector yVec
    const cstr = xVec[0].constructor;
    const yVec = Array.from({ length: n }, _ => new cstr());
    // copy from WASM memory into yVec
    _saveArray(yVec, xPos);
    // restore stack and return yVec
    mcl_1.mod.stackRestore(stack);
    return yVec;
};

My understanding of the MCL invVecWork implementation is that when x and y refer to the same memory location, no extra memory is needed for y, and I would assume fewer memory accesses as well. On the other hand, when x and y are different, we need enough memory for both arrays. That makes sense in C/C++ to support both the in-place behavior and the copy behavior. However, in JS/WASM the actual xVec is not touched at all unless we override its a_ buffer - which we don't if we use a fresh new yVec. So it seems to me that it would make sense to always use the same x and y ("in-place" behavior in C++) even when the desired behavior is "copy" in JavaScript.
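The point can be illustrated with a simplified model (plain typed arrays standing in for the WASM stack, not mcl's actual code): overwriting the scratch region in place and then copying the results into fresh JS buffers never mutates the inputs:

```javascript
// "WASM stack": one scratch Uint32Array shared by x and y (the x == y case)
const scratch = new Uint32Array(8);
const xVec = [new Uint32Array([1, 2]), new Uint32Array([3, 4])];

// copy xVec into scratch (the analogue of _sarrayAllocAndCopy)
xVec.forEach((x, i) => scratch.set(x, i * 2));

// transform in place (the analogue of func(xPos, xPos, n); here a dummy op)
for (let i = 0; i < scratch.length; i++) scratch[i] = 100 - scratch[i];

// copy the results out into a fresh yVec (the analogue of _saveArray)
const yVec = xVec.map((_, i) => scratch.slice(i * 2, i * 2 + 2));

// xVec still holds [1,2] and [3,4]: the in-place call never touched it
```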

Now, this probably has no measurable impact on either performance or memory (as long as there is enough heap, I guess), but I am interested in the mechanics at play here.

As a consequence, the main culprit for the difference in raw performance between invVec and invVecInPlace (and, more generally, any mcl-wasm API) is not really the copying of data to/from the heap, but the creation of the Uint32Array storage for each mcl object (noticeable when vectors are big enough).

Thank you for helping with my understanding!

herumi commented 1 month ago

Thank you for the quick comment. You are right. I was in the process of trial and error. https://github.com/herumi/mcl-wasm/commit/b98a45fce7ef72ca49454ad003d1a6cf4ff0b313