hyperledger-solang / solang

Solidity Compiler for Solana and Polkadot
https://solang.readthedocs.io/
Apache License 2.0

Substrate: Solidity version of `circomlib`'s MiMC sponge hash is horribly slow #1361

Open xermicus opened 1 year ago

xermicus commented 1 year ago

While working on the tornado integration test, I came across this Solidity implementation. It was so slow that it wasn't even usable: it broke the Merkle tree implementation for heights greater than 4 or so. Tornado Cash uses a Merkle tree height of 20 (the heaviest offender regarding gas usage is the EC pairing, not this). We should investigate why that is. I don't expect great performance, but it should at least be usable.

contract NativeHasher {
    function MiMCSponge(uint256 xL_in, uint256 xR_in) external pure returns (uint256, uint256) {
        uint n = 220; // 220 rounds in total: one before the loop plus n - 1 inside it
        // first round constant; later ones are derived by hashing it
        bytes32 ci = 0x0fbe43c36a80e36d7c7c584d4f8f3759fb51f0d66065d8a227b688d12488c5d4;
        // BN254 scalar field modulus
        uint q = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001;
        uint xL = xL_in;
        uint xR = xR_in;
        uint t = xL;
        uint b = mulmod(t, t, q);
        uint c = mulmod(b, b, q);
        uint d = mulmod(c, t, q); // t^5 mod q
        xR = addmod(xR, d, q);

        for (uint i = 0; i < n - 1; i++) {
            if (i < n - 2 && i != 0) {
                // derive the next round constant by hashing the previous one
                ci = keccak256(abi.encodePacked(ci));
            } else if (i == n - 2) {
                // the final round uses a zero constant
                ci = bytes32(0);
            }
            (xL, xR) = (xR, xL);
            t = addmod(uint(ci), xL, q);

            b = mulmod(t, t, q);
            c = mulmod(b, b, q);
            d = mulmod(c, t, q); // t^5 mod q
            xR = addmod(xR, d, q);
        }
        // debug output via Solang's print() builtin
        print("{:x}".format(xL));
        print("{:x}".format(xR));
        return (xL, xR);
    }
}
seanyoung commented 1 year ago

mulmod() is implemented with 512-bit arithmetic. This is going to be super slow. I think addmod() uses 257-bit types, but that's from memory.

My guess is that this is an "optimize mulmod()" task.
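
For context, a minimal sketch of why mulmod needs a double-width intermediate, shown at 64/128 bits rather than at the 256/512 bits Solang actually uses (illustrative only, not Solang's code):

    // 64-bit analogue of Solidity's mulmod: (x * y) % k computed without
    // overflow by widening to the double width. Solang has to do the same
    // thing at 256 bits, where the 512-bit intermediate has no native type
    // and gets expanded into many limb operations (or a libcall).
    fn mulmod64(x: u64, y: u64, k: u64) -> u64 {
        assert!(k != 0, "modulus must be non-zero");
        let wide = (x as u128) * (y as u128); // cannot overflow u128
        (wide % (k as u128)) as u64           // remainder fits back into u64
    }

    fn main() {
        let expected = ((u64::MAX as u128 * u64::MAX as u128) % 1_000_003) as u64;
        assert_eq!(mulmod64(u64::MAX, u64::MAX, 1_000_003), expected);
    }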

xermicus commented 9 months ago

Hmm yeah, there is definitely something going on in our mulmod. I benchmarked it against the EVM (which implements it with 512-bit arithmetic too):

mimc/solang             time:   [125.97 ms 126.77 ms 127.62 ms]
                        change: [+2.6047% +3.4181% +4.1788%] (p = 0.00 < 0.05)
                        Performance has regressed.
mimc/evm                time:   [1.2140 ms 1.2171 ms 1.2203 ms]
                        change: [-22.061% -21.294% -20.509%] (p = 0.00 < 0.05)
                        Performance has improved.

Chances are that the Wasm interpreter slows this down 100x, but I'd find that surprising (wasmi is slower than wasmtime, but IIRC not on the order of 100x).

xermicus commented 9 months ago

After some benchmarks, this might actually be mainly caused by interpreting Wasm.

Isolated benchmarks of mulmod and addmod

I isolated the mulmod and addmod builtins inside a benchmark:

    function remainders(
        uint256 xL_in,
        uint256 xR_in
    ) external pure returns (uint256, uint256) {
        uint q = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001;

        for (uint i = 0; i < 256; i++) {
            xL_in = mulmod(xL_in, xR_in, q);
            xR_in = addmod(xL_in, xR_in, q);
        }

        return (xL_in, xR_in);
    }
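
As a side note, an off-chain Rust reference for the same loop can be used to cross-check contract outputs across targets. This is only a sketch and assumes the num-bigint crate; it is not part of the Solang test suite:

    // Off-chain reference for the `remainders` contract function above.
    use num_bigint::BigUint;

    fn remainders(mut xl: BigUint, mut xr: BigUint) -> (BigUint, BigUint) {
        // Same BN254 scalar field modulus as in the contract.
        let q = BigUint::parse_bytes(
            b"30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001",
            16,
        )
        .unwrap();

        for _ in 0..256 {
            xl = (&xl * &xr) % &q; // mulmod
            xr = (&xl + &xr) % &q; // addmod
        }
        (xl, xr)
    }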

The Solidity benchmark yields roughly the following results.

Wasm

remainders/solang       time:   [44.727 ms 44.846 ms 44.959 ms]
                        change: [-2.6571% -2.3218% -1.9669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
remainders/evm          time:   [230.79 µs 231.33 µs 231.98 µs]
                        change: [+0.9611% +1.8369% +2.7484%] (p = 0.00 < 0.05)
                        Change within noise threshold.

rv32e

remainders/solang       time:   [4.7196 ms 5.1727 ms 5.4670 ms]
                        change: [-15.227% -8.1393% -0.6845%] (p = 0.04 < 0.05)
                        Change within noise threshold.
remainders/evm          time:   [226.31 µs 227.77 µs 229.59 µs]
                        change: [-1.3529% -0.7989% -0.2998%] (p = 0.00 < 0.05)
                        Change within noise threshold.

I expect the 32-bit target is penalizing this benchmark further; it should look better once we can benchmark with rv64e.
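
A rough illustration of why the limb width matters: schoolbook multiplication of two n-limb numbers needs about n * n single-limb multiplies, so halving the limb width roughly quadruples the work on the 512-bit intermediate products (illustrative sketch, not Solang's code):

    // Count the single-limb partial products in a schoolbook multiply.
    fn schoolbook_partial_products(operand_bits: u32, limb_bits: u32) -> u32 {
        let limbs = operand_bits / limb_bits;
        limbs * limbs
    }

    fn main() {
        // 256x256-bit multiply decomposed into 32-bit vs 64-bit limbs.
        println!("32-bit limbs: {}", schoolbook_partial_products(256, 32)); // 64
        println!("64-bit limbs: {}", schoolbook_partial_products(256, 64)); // 16
    }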

Naive mulmod and addmod implementations

As an additional experiment, in my risc-v branch I changed the implementation of mulmod (and addmod) to a naive one that uses LLVM IR arithmetic directly:

            let arith_ty = bin.context.custom_width_int_type(512);
            let res_ty = bin.context.custom_width_int_type(256);

            let x = expression(target, bin, &args[0], vartab, function, ns).into_int_value();
            let y = expression(target, bin, &args[1], vartab, function, ns).into_int_value();
            let k = expression(target, bin, &args[2], vartab, function, ns).into_int_value();

            // Z-Extend everything
            let x = bin.builder.build_int_z_extend(x, arith_ty, "x");
            let y = bin.builder.build_int_z_extend(y, arith_ty, "y");
            let k = bin.builder.build_int_z_extend(k, arith_ty, "k");

            // Compute product and remainder
            let m = bin.builder.build_int_mul(x, y, "x * y");
            let r = bin.builder.build_int_unsigned_rem(m, k, "m % k");

            // Because the remainder is always within uint256 bounds this is safe
            bin.builder.build_int_truncate(r, res_ty, "trunc").into()

The unit tests pass. However, for Wasm this doesn't work out of the box, because it now pulls in __multi3, and for Wasm we don't link compiler-rt (for RISC-V I did link it, so it works on the experimental target).
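
For reference, __multi3 is compiler-rt's 128-bit multiply helper, which targets without a wide multiply instruction assemble from narrower partial products. The sketch below shows that schoolbook structure one level down, as a 64x64 -> 128-bit widening multiply built only from 32-bit halves; it is illustrative and not copied from compiler-rt:

    // Widening 64x64 -> 128-bit multiply using only 32-bit halves,
    // mirroring how wide products are assembled from partial products.
    fn widening_mul_u64(a: u64, b: u64) -> (u64, u64) {
        let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
        let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);

        // Four partial products; each fits into a u64.
        let ll = a_lo * b_lo;
        let lh = a_lo * b_hi;
        let hl = a_hi * b_lo;
        let hh = a_hi * b_hi;

        // Sum the middle column, tracking the carry into the high word.
        let mid = (ll >> 32) + (lh & 0xFFFF_FFFF) + (hl & 0xFFFF_FFFF);
        let lo = (ll & 0xFFFF_FFFF) | (mid << 32);
        let hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
        (hi, lo)
    }

    fn main() {
        let (a, b) = (0xDEAD_BEEF_CAFE_BABE_u64, 0x1234_5678_9ABC_DEF0_u64);
        let (hi, lo) = widening_mul_u64(a, b);
        assert_eq!(((hi as u128) << 64) | lo as u128, (a as u128) * (b as u128));
    }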

Implemented this way, the code size increases drastically, since those builtins are now inlined at every call site. Shoving them into their own CFG would defuse that problem.

Benchmark with this alternative implementation:

remainders/solang       time:   [5.6077 ms 5.8882 ms 6.0744 ms]
                        change: [+5.6167% +12.019% +18.950%] (p = 0.00 < 0.05)
                        Performance has regressed.
remainders/evm          time:   [234.09 µs 236.31 µs 239.41 µs]
                        change: [+1.2612% +2.4471% +3.5587%] (p = 0.00 < 0.05)
                        Performance has regressed.

So, our existing implementation is already about 10% faster than that (while also maintaining smaller code size).

Conclusion

Around a 10x performance improvement for free on the 32-bit RISC-V target! I think we should wait and see how this looks on 64-bit RISC-V to decide how much time and effort further optimizing those built-ins would be worth.

Robbepop commented 9 months ago

I just had a chat about the performance numbers with @xermicus and wanted to reiterate here for the public:

xermicus commented 9 months ago

My observation so far is that EVM interpreters usually have basically negligible overhead for compilation and instantiation. By my understanding they only have to construct a list of all valid jump destinations. So more optimizations in the startup phase are always very welcome :)
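
For context, a minimal sketch of that jump-destination analysis (illustrative only, not any particular interpreter's code): a single pass over the bytecode records every JUMPDEST opcode while skipping PUSH immediates, which is why instantiation stays cheap.

    // Collect all valid jump destinations in EVM bytecode: JUMPDEST (0x5b)
    // counts only when it is not inside the immediate data of PUSH1..PUSH32
    // (opcodes 0x60..0x7f, carrying 1..32 immediate bytes).
    fn jump_destinations(code: &[u8]) -> Vec<usize> {
        let mut dests = Vec::new();
        let mut pc = 0;
        while pc < code.len() {
            let op = code[pc];
            match op {
                0x5b => dests.push(pc),                    // JUMPDEST
                0x60..=0x7f => pc += (op - 0x5f) as usize, // skip PUSHn data
                _ => {}
            }
            pc += 1;
        }
        dests
    }

    fn main() {
        // PUSH1 0x5b; JUMPDEST: the 0x5b inside the PUSH data must not count.
        assert_eq!(jump_destinations(&[0x60, 0x5b, 0x5b]), vec![2]);
    }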

Robbepop commented 9 months ago

> My observation so far is that EVM interpreters usually have basically negligible overhead for compilation and instantiation. By my understanding they only have to construct a list of all valid jump destinations. So more optimizations in the startup phase are always very welcome :)

Hearing this, I am looking forward to benchmarks with the new Wasmi version using its lazy compilation. :)