xermicus opened this issue 1 year ago
mulmod() is implemented with 512-bit arithmetic. This is going to be super slow. I think addmod() uses 257-bit types, but that's from memory.
My guess is that this is an "optimize mulmod()" task.
Hmm yeah, there is definitely something going on in our mulmod. I benchmarked it against EVM (which implements it with 512-bit arithmetic too):
mimc/solang time: [125.97 ms 126.77 ms 127.62 ms]
change: [+2.6047% +3.4181% +4.1788%] (p = 0.00 < 0.05)
Performance has regressed.
mimc/evm time: [1.2140 ms 1.2171 ms 1.2203 ms]
change: [-22.061% -21.294% -20.509%] (p = 0.00 < 0.05)
Performance has improved.
Chances are that the Wasm interpreter slows this down 100x, but I'd find that surprising (wasmi is slower than wasmtime, but IIRC not on the order of 100x).
After some benchmarks, this might actually be mainly caused by interpreting Wasm.
mulmod and addmod

I isolated the mulmod and addmod builtins in a benchmark:
function remainders(
    uint256 xL_in,
    uint256 xR_in
) external pure returns (uint256, uint256) {
    uint q = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001;
    for (uint i = 0; i < 256; i++) {
        xL_in = mulmod(xL_in, xR_in, q);
        xR_in = addmod(xL_in, xR_in, q);
    }
    return (xL_in, xR_in);
}
This yields roughly the following results:
remainders/solang time: [44.727 ms 44.846 ms 44.959 ms]
change: [-2.6571% -2.3218% -1.9669%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
2 (10.00%) high mild
remainders/evm time: [230.79 µs 231.33 µs 231.98 µs]
change: [+0.9611% +1.8369% +2.7484%] (p = 0.00 < 0.05)
Change within noise threshold.
On the 32-bit RISC-V target, the same benchmark yields:
remainders/solang time: [4.7196 ms 5.1727 ms 5.4670 ms]
change: [-15.227% -8.1393% -0.6845%] (p = 0.04 < 0.05)
Change within noise threshold.
remainders/evm time: [226.31 µs 227.77 µs 229.59 µs]
change: [-1.3529% -0.7989% -0.2998%] (p = 0.00 < 0.05)
Change within noise threshold.
I expect the 32-bit target penalizes this benchmark further; it should get better once we can benchmark with rv64e.
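A back-of-the-envelope way to see the 32-bit penalty (my own sketch, not Solang code): schoolbook multiplication of n-limb integers costs n² limb products, so a 256-bit operand is 8 u32 limbs (64 partial products) on a 32-bit target but only 4 u64 limbs (16 partial products) on a 64-bit one:

/// Schoolbook multiplication over 32-bit limbs (little-endian). A 256-bit
/// number is 8 limbs here, so a full 256x256->512 multiply needs 8*8 = 64
/// partial products; with 64-bit limbs it would need only 4*4 = 16.
fn schoolbook_mul(a: &[u32], b: &[u32]) -> Vec<u32> {
    let mut out = vec![0u32; a.len() + b.len()];
    for (i, &ai) in a.iter().enumerate() {
        let mut carry = 0u64;
        for (j, &bj) in b.iter().enumerate() {
            // 32x32 -> 64-bit partial product, accumulated with carry
            let t = out[i + j] as u64 + (ai as u64) * (bj as u64) + carry;
            out[i + j] = t as u32;
            carry = t >> 32;
        }
        out[i + b.len()] = carry as u32;
    }
    out
}

fn main() {
    // 2^32 * 2^32 = 2^64: limbs [0, 1] * [0, 1] = [0, 0, 1, 0]
    assert_eq!(schoolbook_mul(&[0, 1], &[0, 1]), vec![0, 0, 1, 0]);
}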
mulmod and addmod implementations

As an additional experiment, in my risc-v branch I changed the implementation of mulmod (and addmod). It is just a naive implementation using LLVM IR arithmetic directly:
let arith_ty = bin.context.custom_width_int_type(512);
let res_ty = bin.context.custom_width_int_type(256);
let x = expression(target, bin, &args[0], vartab, function, ns).into_int_value();
let y = expression(target, bin, &args[1], vartab, function, ns).into_int_value();
let k = expression(target, bin, &args[2], vartab, function, ns).into_int_value();
// Z-Extend everything
let x = bin.builder.build_int_z_extend(x, arith_ty, "x");
let y = bin.builder.build_int_z_extend(y, arith_ty, "y");
let k = bin.builder.build_int_z_extend(k, arith_ty, "k");
// Compute product and remainder
let m = bin.builder.build_int_mul(x, y, "x * y");
let r = bin.builder.build_int_unsigned_rem(m, k, "m % k");
// Because the remainder is always within uint256 bounds this is safe
bin.builder.build_int_truncate(r, res_ty, "trunc").into()
The unit tests pass. However, for Wasm this doesn't work just like that, because it now pulls in __multi3, and for Wasm we don't link compiler-rt (for RISC-V I did, so it works on the experimental target).
Implemented this way, the code size increases drastically, because those builtins are now inlined at every call site; shoving them into their own CFG would defuse that problem.
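A minimal Rust analogy of that trade-off (Solang's CFGs are codegen-IR functions, not Rust functions, so this is just an illustration):

// Out-of-line: one copy of the double-width sequence, shared by all callers.
#[inline(never)]
fn mulmod_outlined(x: u64, y: u64, k: u64) -> u64 {
    (((x as u128) * (y as u128)) % (k as u128)) as u64
}

// Inlined: the same sequence is stamped out at each call site, which avoids
// call overhead but grows the binary with every use.
#[inline(always)]
fn mulmod_inlined(x: u64, y: u64, k: u64) -> u64 {
    (((x as u128) * (y as u128)) % (k as u128)) as u64
}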
Benchmark with this alternative implementation:
remainders/solang time: [5.6077 ms 5.8882 ms 6.0744 ms]
change: [+5.6167% +12.019% +18.950%] (p = 0.00 < 0.05)
Performance has regressed.
remainders/evm time: [234.09 µs 236.31 µs 239.41 µs]
change: [+1.2612% +2.4471% +3.5587%] (p = 0.00 < 0.05)
Performance has regressed.
So, our existing implementation is already about 10% faster than that (while also maintaining smaller code size).
Around 10x performance improvement for free on the 32-bit RISC-V target! I think we should wait and see how this looks on 64-bit RISC-V before deciding how much time and effort further optimizing those builtins is worth.
I just had a chat about the performance numbers with @xermicus and wanted to reiterate here for the public:

- The benchmarks so far used Wasmi v0.31.0, which is the current stable version, using a stack-machine execution engine.
- Please use v0.32.0-beta.N (most recently v0.32.0-beta.3) in your benchmarks, since this version is a huge update over v0.31.0 with respect to performance improvements for both execution and compilation metrics. This is also the version that is going to be used in pallet_contracts in the future, and therefore the direct "competitor" to PolkaVM for smart contract execution. Since it is still in beta, there might be bugs though.

Generally, there are 3 different steps in benchmarking smart contract performance:

1. Compilation
2. Instantiation
3. Execution

Steps 1. and 2. are the start-up phase, and step 3 is obviously the execution phase. From the benchmarks conducted so far, Wasmi (register) with lazy compilation is slightly faster in steps 1. and 2., and PolkaVM is faster in step 3. However, these numbers are very much WIP and should only provide a rough estimate of what to expect from smart contract benchmarks.
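As a concrete sketch of where those three steps fall, assuming the Wasmi 0.32 API (names from memory, and the "run" export is hypothetical):

use wasmi::{Engine, Linker, Module, Store};

fn bench_contract(wasm: &[u8]) -> Result<(), wasmi::Error> {
    let engine = Engine::default();

    // Step 1: compilation (with lazy compilation enabled, most of this work
    // is deferred until a function body first runs).
    let module = Module::new(&engine, wasm)?;

    // Step 2: instantiation.
    let mut store = Store::new(&engine, ());
    let linker = Linker::<()>::new(&engine);
    let instance = linker.instantiate(&mut store, &module)?.start(&mut store)?;

    // Step 3: execution ("run" is a placeholder export name).
    let run = instance.get_typed_func::<(), ()>(&store, "run")?;
    run.call(&mut store, ())?;
    Ok(())
}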
My expectations: I expect that PolkaVM with 64-bit support should improve performance significantly for some workloads. Also, I see no reason why PolkaVM cannot compile RISC-V as fast as Wasmi with its lazy compilation. Currently, Jan is trying to implement immediate execution (during compilation), but it is equally possible to also implement lazy compilation in PolkaVM, in case the immediate-execution performance improvements aren't as good as hoped.
My observation so far is that EVM interpreters usually have negligible overhead for compilation and instantiation. By my understanding, they only have to construct a list of all valid jump destinations. So more optimizations in the startup phase are always very welcome :)
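For illustration, a minimal sketch of that jump-destination scan, following the generic EVM rule rather than any particular interpreter's code:

/// Collect all valid JUMPDEST offsets in EVM bytecode in a single pass,
/// skipping PUSH immediates (bytes inside a PUSH's immediate data are not
/// valid jump targets). This is essentially the whole "compilation" step
/// of a typical EVM interpreter.
fn valid_jump_dests(code: &[u8]) -> Vec<usize> {
    const JUMPDEST: u8 = 0x5b;
    const PUSH1: u8 = 0x60;
    const PUSH32: u8 = 0x7f;

    let mut dests = Vec::new();
    let mut pc = 0;
    while pc < code.len() {
        let op = code[pc];
        if op == JUMPDEST {
            dests.push(pc);
        } else if (PUSH1..=PUSH32).contains(&op) {
            pc += (op - PUSH1 + 1) as usize; // skip 1..=32 immediate bytes
        }
        pc += 1;
    }
    dests
}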
Hearing this I am looking forward to benchmarks with the new Wasmi version using its lazy compilation. :)
While working on the Tornado integration test, I came across this Solidity implementation. It was so slow that it wasn't even usable: it broke the Merkle tree implementation for heights > 4 or so. Tornado Cash uses a Merkle tree height of 20 (the heaviest offender regarding gas usage is the EC pairing, not this). We should investigate why that is. I don't expect great performance, but it should at least be usable.