hyperledger-solang / solang

Solidity Compiler for Solana and Polkadot
https://solang.readthedocs.io/
Apache License 2.0

Substrate: Solidity version of `circomlib`'s MiMC sponge hash is horribly slow #1361

Open xermicus opened 1 year ago

xermicus commented 1 year ago

While working on the tornado integration test, I came across this Solidity implementation. It was so slow that it wasn't even usable: it broke the Merkle tree implementation for heights greater than 4 or so. Tornado Cash uses a Merkle tree height of 20 (the heaviest offender regarding gas usage is the EC pairing, not this). We should investigate why that is. I don't expect great performance, but it should at least be usable.

contract NativeHasher {
    function MiMCSponge(uint256 xL_in, uint256 xR_in) external pure returns (uint256, uint256) {
        uint n = 220; // 220 rounds in total: one before the loop plus n - 1 inside it
        // first round constant; later ones are derived by hashing it
        bytes32 ci = 0x0fbe43c36a80e36d7c7c584d4f8f3759fb51f0d66065d8a227b688d12488c5d4;
        // BN254 scalar field modulus
        uint q = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001;
        uint xL = xL_in;
        uint xR = xR_in;
        uint t = xL;
        uint b = mulmod(t, t, q);
        uint c = mulmod(b, b, q);
        uint d = mulmod(c, t, q); // t^5 mod q
        xR = addmod(xR, d, q);

        for (uint i = 0; i < n - 1; i++) {
            if (i < n - 2 && i != 0) {
                // derive the next round constant by hashing the previous one
                ci = keccak256(abi.encodePacked(ci));
            } else if (i == n - 2) {
                // the final round uses a zero constant
                ci = bytes32(0);
            }
            (xL, xR) = (xR, xL);
            t = addmod(uint(ci), xL, q);

            b = mulmod(t, t, q);
            c = mulmod(b, b, q);
            d = mulmod(c, t, q); // t^5 mod q
            xR = addmod(xR, d, q);
        }
        // debug output via Solang's print() builtin
        print("{:x}".format(xL));
        print("{:x}".format(xR));
        return (xL, xR);
    }
}
seanyoung commented 1 year ago

mulmod() is implemented with 512-bit arithmetic. This is going to be super slow. I think addmod() uses 257-bit types, but that's from memory.

My guess is that this is an "optimize mulmod()" task.
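
For context, a minimal sketch of why mulmod needs a double-width intermediate, shown at 64/128 bits rather than at the 256/512 bits Solang actually uses (illustrative only, not Solang's code):

    // 64-bit analogue of Solidity's mulmod: (x * y) % k computed without
    // overflow by widening to the double width. Solang has to do the same
    // thing at 256 bits, where the 512-bit intermediate has no native type
    // and gets expanded into many limb operations (or a libcall).
    fn mulmod64(x: u64, y: u64, k: u64) -> u64 {
        assert!(k != 0, "modulus must be non-zero");
        let wide = (x as u128) * (y as u128); // cannot overflow u128
        (wide % (k as u128)) as u64           // remainder fits back into u64
    }

    fn main() {
        let expected = ((u64::MAX as u128 * u64::MAX as u128) % 1_000_003) as u64;
        assert_eq!(mulmod64(u64::MAX, u64::MAX, 1_000_003), expected);
    }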

xermicus commented 9 months ago

Hmm yeah, there is definitely something going on in our mulmod. I benchmarked it against the EVM (which implements it with 512-bit arithmetic too):

mimc/solang             time:   [125.97 ms 126.77 ms 127.62 ms]
                        change: [+2.6047% +3.4181% +4.1788%] (p = 0.00 < 0.05)
                        Performance has regressed.
mimc/evm                time:   [1.2140 ms 1.2171 ms 1.2203 ms]
                        change: [-22.061% -21.294% -20.509%] (p = 0.00 < 0.05)
                        Performance has improved.

Chances are that the Wasm interpreter slows this down 100x, but I'd find that surprising (wasmi is slower than wasmtime, but IIRC not on the order of 100x).

xermicus commented 9 months ago

After some benchmarks, this might actually be mainly caused by interpreting Wasm.

Isolated benchmarks of mulmod and addmod

I isolated the mulmod and addmod builtins inside a benchmark:

    function remainders(
        uint256 xL_in,
        uint256 xR_in
    ) external pure returns (uint256, uint256) {
        uint q = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001;

        for (uint i = 0; i < 256; i++) {
            xL_in = mulmod(xL_in, xR_in, q);
            xR_in = addmod(xL_in, xR_in, q);
        }

        return (xL_in, xR_in);
    }
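
As a side note, an off-chain Rust reference for the same loop can be used to cross-check contract outputs across targets. This is only a sketch and assumes the num-bigint crate; it is not part of the Solang test suite:

    // Off-chain reference for the `remainders` contract function above.
    use num_bigint::BigUint;

    fn remainders(mut xl: BigUint, mut xr: BigUint) -> (BigUint, BigUint) {
        // Same BN254 scalar field modulus as in the contract.
        let q = BigUint::parse_bytes(
            b"30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001",
            16,
        )
        .unwrap();

        for _ in 0..256 {
            xl = (&xl * &xr) % &q; // mulmod
            xr = (&xl + &xr) % &q; // addmod
        }
        (xl, xr)
    }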

The Solidity benchmark yields roughly the following results.

Wasm

remainders/solang       time:   [44.727 ms 44.846 ms 44.959 ms]
                        change: [-2.6571% -2.3218% -1.9669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
remainders/evm          time:   [230.79 µs 231.33 µs 231.98 µs]
                        change: [+0.9611% +1.8369% +2.7484%] (p = 0.00 < 0.05)
                        Change within noise threshold.

rv32e

remainders/solang       time:   [4.7196 ms 5.1727 ms 5.4670 ms]
                        change: [-15.227% -8.1393% -0.6845%] (p = 0.04 < 0.05)
                        Change within noise threshold.
remainders/evm          time:   [226.31 µs 227.77 µs 229.59 µs]
                        change: [-1.3529% -0.7989% -0.2998%] (p = 0.00 < 0.05)
                        Change within noise threshold.

I expect the 32-bit target is penalizing this benchmark further; it should look better once we can benchmark with rv64e.
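
A rough illustration of why the limb width matters: schoolbook multiplication of two n-limb numbers needs about n * n single-limb multiplies, so halving the limb width roughly quadruples the work on the 512-bit intermediate products (illustrative sketch, not Solang's code):

    // Count the single-limb partial products in a schoolbook multiply.
    fn schoolbook_partial_products(operand_bits: u32, limb_bits: u32) -> u32 {
        let limbs = operand_bits / limb_bits;
        limbs * limbs
    }

    fn main() {
        // 256x256-bit multiply decomposed into 32-bit vs 64-bit limbs.
        println!("32-bit limbs: {}", schoolbook_partial_products(256, 32)); // 64
        println!("64-bit limbs: {}", schoolbook_partial_products(256, 64)); // 16
    }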

Naive mulmod and addmod implementations

As an additional experiment, in my risc-v branch I changed the implementation of mulmod (and addmod) to a naive one that uses LLVM IR arithmetic directly:

            let arith_ty = bin.context.custom_width_int_type(512);
            let res_ty = bin.context.custom_width_int_type(256);

            let x = expression(target, bin, &args[0], vartab, function, ns).into_int_value();
            let y = expression(target, bin, &args[1], vartab, function, ns).into_int_value();
            let k = expression(target, bin, &args[2], vartab, function, ns).into_int_value();

            // Z-Extend everything
            let x = bin.builder.build_int_z_extend(x, arith_ty, "x");
            let y = bin.builder.build_int_z_extend(y, arith_ty, "y");
            let k = bin.builder.build_int_z_extend(k, arith_ty, "k");

            // Compute product and remainder
            let m = bin.builder.build_int_mul(x, y, "x * y");
            let r = bin.builder.build_int_unsigned_rem(m, k, "m % k");

            // Because the remainder is always within uint256 bounds this is safe
            bin.builder.build_int_truncate(r, res_ty, "trunc").into()

The unit tests pass. However, for Wasm this doesn't work out of the box, because it now pulls in __multi3, and for Wasm we don't link compiler-rt (for RISC-V I did link it, so it works on the experimental target).
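
For reference, __multi3 is compiler-rt's 128-bit multiply helper, which targets without a wide multiply instruction assemble from narrower partial products. The sketch below shows that schoolbook structure one level down, as a 64x64 -> 128-bit widening multiply built only from 32-bit halves; it is illustrative and not copied from compiler-rt:

    // Widening 64x64 -> 128-bit multiply using only 32-bit halves,
    // mirroring how wide products are assembled from partial products.
    fn widening_mul_u64(a: u64, b: u64) -> (u64, u64) {
        let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
        let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);

        // Four partial products; each fits into a u64.
        let ll = a_lo * b_lo;
        let lh = a_lo * b_hi;
        let hl = a_hi * b_lo;
        let hh = a_hi * b_hi;

        // Sum the middle column, tracking the carry into the high word.
        let mid = (ll >> 32) + (lh & 0xFFFF_FFFF) + (hl & 0xFFFF_FFFF);
        let lo = (ll & 0xFFFF_FFFF) | (mid << 32);
        let hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
        (hi, lo)
    }

    fn main() {
        let (a, b) = (0xDEAD_BEEF_CAFE_BABE_u64, 0x1234_5678_9ABC_DEF0_u64);
        let (hi, lo) = widening_mul_u64(a, b);
        assert_eq!(((hi as u128) << 64) | lo as u128, (a as u128) * (b as u128));
    }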

Implemented this way, the code size increases drastically, since those builtins are now inlined at every call site. Shoving them into their own CFG would defuse that problem.

Benchmark with this alternative implementation:

remainders/solang       time:   [5.6077 ms 5.8882 ms 6.0744 ms]
                        change: [+5.6167% +12.019% +18.950%] (p = 0.00 < 0.05)
                        Performance has regressed.
remainders/evm          time:   [234.09 µs 236.31 µs 239.41 µs]
                        change: [+1.2612% +2.4471% +3.5587%] (p = 0.00 < 0.05)
                        Performance has regressed.

So, our existing implementation is already about 10% faster than that (while also maintaining smaller code size).

Conclusion

Around a 10x performance improvement for free on the 32-bit RISC-V target! I think we should wait and see how this looks on 64-bit RISC-V to decide how much time and effort further optimizing those built-ins would be worth.

Robbepop commented 9 months ago

I just had a chat about the performance numbers with @xermicus and wanted to reiterate here for the public:

xermicus commented 9 months ago

My observation so far is that EVM interpreters usually have basically negligible overhead for compilation and instantiation. By my understanding they only have to construct a list of all valid jump destinations. So more optimizations in the startup phase are always very welcome :)
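
For context, a minimal sketch of that jump-destination analysis (illustrative only, not any particular interpreter's code): a single pass over the bytecode records every JUMPDEST opcode while skipping PUSH immediates, which is why instantiation stays cheap.

    // Collect all valid jump destinations in EVM bytecode: JUMPDEST (0x5b)
    // counts only when it is not inside the immediate data of PUSH1..PUSH32
    // (opcodes 0x60..0x7f, carrying 1..32 immediate bytes).
    fn jump_destinations(code: &[u8]) -> Vec<usize> {
        let mut dests = Vec::new();
        let mut pc = 0;
        while pc < code.len() {
            let op = code[pc];
            match op {
                0x5b => dests.push(pc),                    // JUMPDEST
                0x60..=0x7f => pc += (op - 0x5f) as usize, // skip PUSHn data
                _ => {}
            }
            pc += 1;
        }
        dests
    }

    fn main() {
        // PUSH1 0x5b; JUMPDEST: the 0x5b inside the PUSH data must not count.
        assert_eq!(jump_destinations(&[0x60, 0x5b, 0x5b]), vec![2]);
    }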

Robbepop commented 9 months ago

> My observation so far is that EVM interpreters usually have basically negligible overhead for compilation and instantiation. By my understanding they only have to construct a list of all valid jump destinations. So more optimizations in the startup phase are always very welcome :)

Hearing this, I am looking forward to benchmarks with the new Wasmi version using its lazy compilation. :)