`approval-distribution`: process assignments and votes in parallel

sandreim commented 1 year ago

Follow up from some earlier scaling tests with https://github.com/paritytech/polkadot/pull/6247. The initial attempts did made it apparent that approval-distribution and approval-voting are bottlenecks increasing approval checking finality lag in our tests with 250 paravalidators and 50 parachains at the time. That argument still holds true, but only at much larger scale than what we were experimenting at the time. The real issue was https://github.com/paritytech/polkadot/issues/6400 which was solved.

Bumping up to 350+ para validators requires that we make approval-voting and approval-distribution do more work in the same amount of time. The first step is to implement the parallelization changes from https://github.com/paritytech/polkadot/pull/6247 which would result in maxing out cpu usage in approval-voting which sits right now at 80% due to the serial processing and waiting for completion of vote imports issued by approval-distribution. This will enable us to further improve how fast the subsystem churns through work by making approval-voting parallelise signature checks.

Important note: The implementation will make use of internal queueing of messages. We have to be careful to preserve the backpressure so we need these internal queues to be bounded, for example by a fixed number of imports that can be pending at a given time.

eskimor commented 1 year ago

I am still in favor of trying real hard to improve performance/reduce work somehow. Using multiple cores just for signature checking in approval voting ... maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.

sandreim commented 1 year ago

Yes, I created https://github.com/paritytech/polkadot-sdk/issues/729 and if I understand correctly it does exactly what you mean and helps a lot.

I think we should implement improvements at the protocol/cryptography level as well as architectural. While in perfect network conditions maybe the changes suggested in paritytech/polkadot-sdk#729 might be enough to scale to 500+ validators, but if we go higher or upper tranches trigger or disputes happen then we clearly need a solution to be able to handle bursts of assignments and votes efficiently and scale beyond 1 CPU.

rphmeier commented 1 year ago

maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.

FWIW, we just have these constant overheads for every parachain block. The flip side is that they don't get worse when parachains get heavier.

alexggh commented 1 year ago

Spent some time introspecting and profiling the approval voting to understand why we see the high ToF in approval-distribution, here are my findings, in the order of the impact:

We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 2-3 hundreds of micro-seconds(with rare occasions where it takes 2-3millis) , that capes the full capacity of approval-voting assignments/votest processing to about 3k checks. At peak performance that's how many assignments/approval we are processing, so if the messages arrive in the queue at about the same time in the approval-distribution queue, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignments, so there will be messages that keep approval-distribution busy for a few milliseconds, just because processing all the assignments/votes in the message take time to be processed, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items bellow.

Reducing the assignments from this https://github.com/paritytech/polkadot/pull/6782 will definitely help with the number of messages we have to process.
There are other operations that keep the approval-voting main loop busy, occasionally:

handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.
ActiveLeaves - takes about 20 millis every 6 seconds.
Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds

While this operations happen relatively rarely, they do happen at peak time, so they might be just the straw that break the camel's back.

Occasionally(every few seconds), there seems to be situations where approval-distribution is waiting after approval-voting for 3-4 millis, but approval-voting doesn't do any work it is just stuck in the main select my theory is that in those cases the subsystem future doesn't execute because the system is busy executing other tasks. (Do you guys have any other theory) ?

eskimor commented 1 year ago

Great work! On signature checking, do we know how much of it is approvals, vs. assignments?

burdges commented 1 year ago

ToF?

We need better benchmarks in schnorrkel but verifying an approval should take roughly 40 microseconds, so like ed25519, and verifying an assignment should be less than 3*25 + 40 = 115 micro seconds, so in 1 millisecond on one core you'd maybe 25 approvals, or 8 assignments, or 6 of each.

I've some tricks to speed up the VRF parts..

Actually use all the features correctly, ala https://github.com/dalek-cryptography/curve25519-dalek/pull/520
We can prove multiple VRFs with one signature, which helps our average case some, and really helps if attackers push no-shows on multiple candidates.
Thin VRF merges the verification equations, which maybe saves something right now, but definitely makes batching cheaper.

An elligator hash-to-curve should be fast, but we could cache the hash-to-curve if we uniformized them by excluding the validator public key from the hashed-to-curve. As the extreme we could compute a base point table for those uniformized hash-to-curves, which saves 10 microseconds if not batching.

I'd think sr25519 and ed25519 cost the same, so we'd only speed up approvals verification though either batching or else by adopting a considerably larger but non-batchable signature, like Rabin-Williams. We no longer think network throughout bottlenecks us here?

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.

Why 6? An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here? We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.

ActiveLeaves - takes about 20 millis every 6 seconds.

Is this querying approvals data or something about the chain?

Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds

mmap fixes this one. approvals-voting db needs to be serialized in memory for this, but that's not really problematic.

sandreim commented 1 year ago

Spent some time introspecting and profiling the approval voting to understand why we see the high ToF in approval-distribution, here are my findings, in the order of the impact:

Nice work @alexggh

We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 2-3 hundreds of micro-seconds(with rare occasions where it takes 2-3millis) , that capes the full capacity of approval-voting assignments/votest processing to about 3k checks. At peak performance that's how many assignments/approval we are processing, so if the messages arrive in the queue at about the same time in the approval-distribution queue, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignments, so there will be messages that keep approval-distribution busy for a few milliseconds, just because processing all the assignments/votes in the message take time to be processed, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items bellow.

Reducing the assignments from this paritytech/polkadot#6782 will definitely help with the number of messages we have to process.

Yes! But this improvement becomes less useful if we just scale the validators up. It is more useful when we actually need to do more work per validator - more parachains.

There are other operations that keep the approval-voting main loop busy, occasionally:

handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.

ActiveLeaves - takes about 20 millis every 6 seconds.

Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds

While this operations happen relatively rarely, they do happen at peak time, so they might be just the straw that break the camel's back.

Occasionally(every few seconds), there seems to be situations where approval-distribution is waiting after approval-voting for 3-4 millis, but approval-voting doesn't do any work it is just stuck in the main select my theory is that in those cases the subsystem future doesn't execute because the system is busy executing other tasks. (Do you guys have any other theory) ?

I also observed this some time ago. My theory back then:

both approval voting and approval distribution run on their own threads (for obvious reasons), but that doesn't mean anything in terms of priority
it might happen that due to system load, the OS will prefer to schedule any other threads vs the approval voting thread.
we can also observe the high CPU load of the networking stack which I believe also plays a role here

alexggh commented 1 year ago

ToF?

Time of flight and it defines how much time a message arrived in the approval-distribution waits till it gets picked up for processing.

do we know how much of it is approvals, vs. assignments?

They seem to be evenly spread, e.g 1 second snapshot: 1717 assignments adding up to 640ms of processing time and 2040 approvals adding up to 360ms.

We need better benchmarks in schnorrkel but verifying an approval should take roughly 40 microseconds, so like ed25519, and verifying an assignment should be less than 3*25 + 40 = 115 micro seconds, so in 1 millisecond on one core you'd maybe 25 approvals, or 8 assignments, or 6 of each

I'll dig a bit deeper here, but in both my instrumentation on the tracing we have here assignment imports seem to take 200-300 micro seconds and approvals imports between 100 and 200 micro seconds. Also, I noticed rare cases where checking of a RelayVRFDelay assignment takes 1-2 ms.

handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.

Why 6?

It seems we are doing it per candidate, so I think we have more at the same time.

An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?

What problems are you thinking about ?

We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.

Will re-check it.

ActiveLeaves - takes about 20 millis every 6 seconds.

Is this querying approvals data or something about the chain?

Yes, it is asking the runtime for block information a few times.

I've some tricks to speed up the VRF parts..

Actually use all the features correctly, ala https://github.com/dalek-cryptography/curve25519-dalek/pull/520

Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

Are you referring to this https://github.com/paritytech/polkadot/pull/6782, if yes. The answer is no, I wasn't running that PR when I did the measurements.

We no longer think network throughout bottlenecks us here?

The messages are already in the approval-distribution queue, so I would say the network shouldn't be the bottleneck here.

burdges commented 1 year ago

We no longer think network throughout bottlenecks us here?

The messages are already in the approval-distribution queue, so I would say the network shouldn't be the bottleneck here.

If I understand, we do not currently know what performance issues the networking itself brings, so we should not yet assume we can make life harder for the networking either.

Actually use all the features correctly, ala https://github.com/dalek-cryptography/curve25519-dalek/pull/520

Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.

Ask @koute maybe?

Is this querying approvals data or something about the chain?

Yes, it is asking the runtime for block information a few times.

Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?

An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?

What problems are you thinking about ?

I've no idea how the keystore works internally, but unrealted..

I noticed one reason signing might be slow or unpredictable: We do a syscall every time we sign anything in https://github.com/w3f/schnorrkel/blob/master/src/lib.rs#L228 which gives linux an excuse to bump the thread. I'd previously used ThreadRng but removed it to simplify maintenance in https://github.com/w3f/schnorrkel/commit/c333370c164746c6e4c2e74fd9b7a53c72dd10f9 I'll revert that commit, fix rand, and make some other changes.

They seem to be evenly spread, e.g 1 second snapshot: 1717 assignments adding up to 640ms of processing time and 2040 approvals adding up to 360ms.

So 3x and 4x longer than my guestimates for the bare crypto, respectively. We do need better benchmarks on schnorrkel of course, like the rand issue for example.

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

Are you referring to this https://github.com/paritytech/polkadot/pull/6782, if yes. The answer is no, I wasn't running that PR when I did the measurements.

No. We discussed batching together other approvals messages too, but maybe no issue exists yet. If we're already signing 6 around the same time then maybe this saves lots.

koute commented 1 year ago

Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.

It's not. This still requires passing a special flag to the compiler when compiling polkadot so that the SIMD instructions actually get used.

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

The expected speedup thanks to this is somewhere between ~30%-50%, depending on the hardware.

alexggh commented 1 year ago

No. We discussed batching together other approvals messages too, but maybe no issue exists yet. If we're already signing 6 around the same time then maybe this saves lots.

It doesn't exists yet, I think it is this one: https://github.com/paritytech/polkadot-sdk/issues/701

Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?

Yes, we do that.

used ThreadRng but removed it to simplify maintenance in https://github.com/w3f/schnorrkel/commit/c333370c164746c6e4c2e74fd9b7a53c72dd10f9 I'll revert that commit, fix rand, and make some other changes.

Thank you, I'll try to see if I can obtain some numbers with that reverted.

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

@koute Could you point me to the PR, so I can bring it locally and get out some numbers.

koute commented 1 year ago

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

@koute Could you point me to the PR, so I can bring it locally and get out some numbers.

It's merged to curve25519-dalek's main, so if you pull that in it'll automatically work as long as the simd cargo feature is enabled. (Note: there are two backends, and the faster avx512 one requires a nightly compiler; when compiled on a non-nightly compiler it will be silently disabled. The crate will automatically pick the fastest available one at runtime.)

That said, actually integrating it is a little bit more complex because the new version of curve25519-dalek has breaking changes and it needs to be replaced in a few crates (schnorrkel, ed25519-zebra, sp-io), so it's not just a simple job of bumping the deps. I don't know whether you actually want to invest your time into this right now. (Hopefully there will be a new official release of curve25519-dalek soon-ish and then I'll be making the relevant PRs myself to pull it in.)

If you're interested here are some performance numbers from curve25519-dalek's own benchmarks for its two available SIMD backends. I took these on a GCP VM.

Scalar vs AVX2

| testcase | scalar | avx2 | % | |---------------------------------------------------------------------------------|-----------------------|--------------------|---------| | multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 5896.6 | 3886.29 | -34.09% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 11792.0 | 7778.8 | -34.03% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 7843.0 | 5194.8 | -33.77% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 271.05 | 179.74 | -33.69% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 15724.0 | 10432.0 | -33.66% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 1005.10 | 669.14 | -33.43% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 1973.4 | 1315.19 | -33.35% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 3902.8 | 2601.79 | -33.34% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 512.99 | 342.09 | -33.31% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 151.03 | 102.31 | -32.26% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 91.36 | 63.62 | -30.36% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 61.51 | 44.24 | -28.07% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 6677.2 | 4906.0 | -26.53% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 3775.3 | 2794.1 | -25.99% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 3034.5 | 2246.9 | -25.95% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 46.62 | 34.53 | -25.92% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 2184.5 | 1621.3 | -25.78% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 5258.20 | 3903.9 | -25.76% | | multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 99.18 | 77.43 | -21.93% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 616.72 | 481.67 | -21.90% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 1201.6 | 938.67 | -21.88% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 177.74 | 139.12 | -21.73% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 324.31 | 253.95 | -21.70% | | multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 65.02 | 50.92 | -21.68% | | multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 166.7 | 130.68 | -21.61% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 104.01 | 81.73 | -21.42% | | multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 2172.5 | 1712.1 | -21.19% | | multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 566.45 | 446.53 | -21.17% | | multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 301.09 | 237.38 | -21.16% | | multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 3249.5 | 2562.60 | -21.14% | | multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 4320.2 | 3409.60 | -21.08% | | multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 1097.6 | 866.8 | -21.03% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 67.03 | 52.98 | -20.96% | | multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 47.33 | 37.52 | -20.73% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 48.53 | 38.59 | -20.49% | | multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 92.14 | 73.28 | -20.47% | | edwards benches/Constant-time variable-base scalar mul | 44.45 | 35.43 | -20.27% | | multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 155.31 | 124.24 | -20.01% | | multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 59.81 | 47.89 | -19.92% | | multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 59.79 | 47.89 | -19.89% | | multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 89.27 | 71.62 | -19.77% | | multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 44.73 | 35.99 | -19.54% | | multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 44.72 | 35.98 | -19.52% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 39.15 | 31.54 | -19.44% | | multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 3986.4 | 3218.6 | -19.26% | | multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 2007.3 | 1620.9 | -19.25% | | multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2995.6 | 2419.79 | -19.22% | | multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 521.72 | 421.92 | -19.13% | | multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 1015.09 | 821.04 | -19.12% | | multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 277.22 | 224.61 | -18.98% | | multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 6329.90 | 5141.4 | -18.78% | | multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 146.78 | 119.67 | -18.47% | | multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 37.08 | 30.24 | -18.45% | | multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 37.08 | 30.24 | -18.44% | | multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 37.07 | 30.24 | -18.42% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 8438.0 | 6894.3 | -18.29% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 43.61 | 35.93 | -17.60% | | multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2808.39 | 2320.3 | -17.38% | | multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1879.1 | 1553.39 | -17.33% | | multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 3736.60 | 3090.6 | -17.29% | | multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 260.52 | 215.51 | -17.28% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 57.79 | 47.82 | -17.25% | | multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 949.81 | 786.7 | -17.17% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 36.41 | 30.16 | -17.16% | | multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 488.12 | 404.41 | -17.15% | | multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 5834.1 | 4864.0 | -16.63% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 85.91 | 71.62 | -16.63% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 141.72 | 119.42 | -15.74% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 7773.79 | 6565.1 | -15.55% | | edwards benches/Variable-time aA+bB A variable B fixed | 40.69 | 34.62 | -14.91% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 252.32 | 215.53 | -14.58% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 473.06 | 405.01 | -14.39% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1819.19 | 1558.6 | -14.32% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2715.5 | 2328.6 | -14.25% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 3614.5 | 3099.8 | -14.24% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 918.51 | 788.72 | -14.13% | | multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 5413.2 | 4667.6 | -13.77% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 5424.29 | 4683.5 | -13.66% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 7260.2 | 6339.5 | -12.68% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 7231.8 | 6320.5 | -12.60% | | edwards benches/Constant-time fixed-base scalar mul | 11.56 | 12.72 | 9.99% | | montgomery benches/Montgomery pseudomultiplication | 50.93 | 55.64 | 9.25% | | montgomery benches/Constant-time fixed-base scalar mul | 15.57 | 16.56 | 6.36% | | scalar benches/Scalar addition | 23511000.0 | 22964000.0 | -2.33% | | scalar benches/Batch scalar inversion/1 | 13.10 | 13.21 | 0.89% | | scalar benches/Scalar inversion | 12.81 | 12.92 | 0.88% | | scalar benches/Batch scalar inversion/4 | 13.65 | 13.76 | 0.85% | | scalar benches/Batch scalar inversion/8 | 14.36 | 14.48 | 0.81% | | scalar benches/Batch scalar inversion/16 | 15.80 | 15.92 | 0.75% | | scalar benches/Batch scalar inversion/2 | 13.28 | 13.38 | 0.74% | | ristretto benches/Batch Ristretto double-and-encode/16 | 13.30 | 13.21 | -0.68% | | ristretto benches/Batch Ristretto double-and-encode/8 | 8.70 | 8.66 | -0.41% | | scalar benches/Scalar subtraction | 21408000.0 | 21485000.0 | 0.36% | | scalar benches/Scalar multiplication | 91928000.0 | 91662000.0 | -0.29% | | ristretto benches/Batch Ristretto double-and-encode/4 | 6.35 | 6.34 | -0.23% | | edwards benches/EdwardsPoint compression | 3.95 | 3.95 | 0.12% | | ristretto benches/RistrettoPoint decompression | 4.61 | 4.61 | 0.07% | | ristretto benches/Batch Ristretto double-and-encode/1 | 4.58 | 4.58 | -0.06% | | edwards benches/EdwardsPoint decompression | 4.31 | 4.32 | 0.04% | | ristretto benches/Batch Ristretto double-and-encode/2 | 5.18 | 5.18 | -0.01% | | ristretto benches/RistrettoPoint compression | 4.57 | 4.57 | 0.00% |

Scalar vs AVX512

| testcase | scalar | avx512 | % | |---------------------------------------------------------------------------------|-----------------------|--------------------------|---------| | multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 11792.0 | 5123.8 | -56.55% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 5896.6 | 2567.0 | -56.47% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 7843.0 | 3427.9 | -56.29% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 15724.0 | 6908.5 | -56.06% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 3902.8 | 1718.6 | -55.96% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 1973.4 | 872.49 | -55.79% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 1005.10 | 449.67 | -55.26% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 512.99 | 236.33 | -53.93% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 271.05 | 128.61 | -52.55% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 5258.20 | 2529.79 | -51.89% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 3034.5 | 1475.7 | -51.37% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 6677.2 | 3251.79 | -51.30% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 3775.3 | 1854.60 | -50.88% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 151.03 | 75.82 | -49.79% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 2184.5 | 1104.0 | -49.46% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 91.36 | 49.58 | -45.72% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 61.51 | 35.94 | -41.56% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 46.62 | 29.46 | -36.80% | | edwards benches/Constant-time variable-base scalar mul | 44.45 | 29.97 | -32.56% | | multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 4320.2 | 2954.0 | -31.62% | | multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 3249.5 | 2223.6 | -31.57% | | multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 2172.5 | 1490.89 | -31.37% | | multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 1097.6 | 763.06 | -30.48% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 1201.6 | 835.7 | -30.45% | | multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 3986.4 | 2785.29 | -30.13% | | multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2995.6 | 2095.1 | -30.06% | | multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 566.45 | 396.4 | -30.02% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 616.72 | 431.84 | -29.98% | | multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 6329.90 | 4435.2 | -29.93% | | multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 2007.3 | 1406.9 | -29.91% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 8438.0 | 5921.59 | -29.82% | | multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 301.09 | 212.54 | -29.41% | | multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 1015.09 | 719.19 | -29.15% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 324.31 | 230.12 | -29.04% | | multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 3736.60 | 2666.6 | -28.64% | | multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2808.39 | 2007.1 | -28.53% | | multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 166.7 | 119.31 | -28.43% | | multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 5834.1 | 4180.5 | -28.34% | | multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1879.1 | 1347.39 | -28.30% | | multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 521.72 | 374.48 | -28.22% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 7773.79 | 5599.5 | -27.97% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 177.74 | 129.05 | -27.39% | | multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 949.81 | 689.65 | -27.39% | | multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 277.22 | 201.88 | -27.18% | | multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 99.18 | 72.67 | -26.73% | | multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 5413.2 | 3998.3 | -26.14% | | multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 155.31 | 114.76 | -26.11% | | multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 488.12 | 360.78 | -26.09% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 7231.8 | 5371.8 | -25.72% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 7260.2 | 5398.5 | -25.64% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 104.01 | 77.50 | -25.48% | | multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 260.52 | 194.85 | -25.21% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 5424.29 | 4059.99 | -25.15% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 3614.5 | 2708.5 | -25.07% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2715.5 | 2039.8 | -24.88% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1819.19 | 1368.4 | -24.78% | | multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 65.02 | 48.93 | -24.75% | | multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 92.14 | 69.44 | -24.64% | | multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 146.78 | 111.45 | -24.07% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 918.51 | 699.73 | -23.82% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 67.03 | 51.08 | -23.79% | | multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 89.27 | 68.24 | -23.56% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 473.06 | 365.51 | -22.73% | | multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 59.81 | 46.43 | -22.36% | | multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 59.79 | 46.43 | -22.34% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 48.53 | 37.73 | -22.25% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 252.32 | 196.43 | -22.15% | | multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 47.33 | 36.85 | -22.14% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 39.15 | 30.78 | -21.38% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 141.72 | 111.78 | -21.13% | | scalar benches/Scalar inversion | 12.81 | 16.17 | 26.21% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 85.91 | 68.35 | -20.44% | | multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 44.73 | 35.68 | -20.23% | | multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 44.72 | 35.68 | -20.20% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 57.79 | 46.42 | -19.68% | | multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 37.07 | 29.93 | -19.27% | | multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 37.08 | 29.94 | -19.27% | | multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 37.08 | 29.95 | -19.23% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 43.61 | 35.52 | -18.54% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 36.41 | 29.77 | -18.25% | | edwards benches/Variable-time aA+bB A variable B fixed | 40.69 | 33.53 | -17.60% | | montgomery benches/Constant-time fixed-base scalar mul | 15.57 | 17.22 | 10.62% | | scalar benches/Scalar multiplication | 91928000.0 | 99047000.0 | 7.74% | | edwards benches/Constant-time fixed-base scalar mul | 11.56 | 12.40 | 7.29% | | ristretto benches/Batch Ristretto double-and-encode/16 | 13.30 | 14.25 | 7.14% | | montgomery benches/Montgomery pseudomultiplication | 50.93 | 54.34 | 6.68% | | scalar benches/Scalar addition | 23511000.0 | 22191000.0 | -5.61% | | ristretto benches/Batch Ristretto double-and-encode/8 | 8.70 | 9.21 | 5.90% | | ristretto benches/Batch Ristretto double-and-encode/4 | 6.35 | 6.60 | 3.86% | | scalar benches/Batch scalar inversion/16 | 15.80 | 16.32 | 3.29% | | ristretto benches/RistrettoPoint compression | 4.57 | 4.69 | 2.67% | | scalar benches/Batch scalar inversion/8 | 14.36 | 14.72 | 2.50% | | ristretto benches/Batch Ristretto double-and-encode/2 | 5.18 | 5.31 | 2.47% | | ristretto benches/RistrettoPoint decompression | 4.61 | 4.71 | 2.23% | | scalar benches/Batch scalar inversion/4 | 13.65 | 13.95 | 2.18% | | edwards benches/EdwardsPoint decompression | 4.31 | 4.39 | 1.72% | | scalar benches/Batch scalar inversion/1 | 13.10 | 13.31 | 1.61% | | scalar benches/Batch scalar inversion/2 | 13.28 | 13.47 | 1.45% | | ristretto benches/Batch Ristretto double-and-encode/1 | 4.58 | 4.64 | 1.35% | | scalar benches/Scalar subtraction | 21408000.0 | 21139000.0 | -1.26% | | edwards benches/EdwardsPoint compression | 3.95 | 3.95 | 0.05% |

AVX2 vs AVX512

| testcase | avx2 | avx512 | % | |---------------------------------------------------------------------------------|--------------------|--------------------------|---------| | multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 3903.9 | 2529.79 | -35.20% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 2246.9 | 1475.7 | -34.32% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 7778.8 | 5123.8 | -34.13% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 5194.8 | 3427.9 | -34.01% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 3886.29 | 2567.0 | -33.95% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 2601.79 | 1718.6 | -33.95% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 10432.0 | 6908.5 | -33.78% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 4906.0 | 3251.79 | -33.72% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 1315.19 | 872.49 | -33.66% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 2794.1 | 1854.60 | -33.62% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 669.14 | 449.67 | -32.80% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 1621.3 | 1104.0 | -31.91% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 342.09 | 236.33 | -30.92% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 179.74 | 128.61 | -28.45% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 102.31 | 75.82 | -25.88% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 63.62 | 49.58 | -22.06% | | scalar benches/Scalar inversion | 12.92 | 16.17 | 25.11% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 44.24 | 35.94 | -18.76% | | edwards benches/Constant-time variable-base scalar mul | 35.43 | 29.97 | -15.41% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 6320.5 | 5371.8 | -15.01% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 6339.5 | 5398.5 | -14.84% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 6565.1 | 5599.5 | -14.71% | | multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 34.53 | 29.46 | -14.68% | | multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 4667.6 | 3998.3 | -14.34% | | multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 6894.3 | 5921.59 | -14.11% | | multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 4864.0 | 4180.5 | -14.05% | | multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 5141.4 | 4435.2 | -13.74% | | multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 3090.6 | 2666.6 | -13.72% | | multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2320.3 | 2007.1 | -13.50% | | multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 3218.6 | 2785.29 | -13.46% | | multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2419.79 | 2095.1 | -13.42% | | multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 3409.60 | 2954.0 | -13.36% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 4683.5 | 4059.99 | -13.31% | | multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1553.39 | 1347.39 | -13.26% | | multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 2562.60 | 2223.6 | -13.23% | | multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 1620.9 | 1406.9 | -13.20% | | multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 1712.1 | 1490.89 | -12.92% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 3099.8 | 2708.5 | -12.62% | | multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 821.04 | 719.19 | -12.40% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2328.6 | 2039.8 | -12.40% | | multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 786.7 | 689.65 | -12.34% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1558.6 | 1368.4 | -12.20% | | multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 866.8 | 763.06 | -11.97% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 788.72 | 699.73 | -11.28% | | multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 421.92 | 374.48 | -11.24% | | multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 446.53 | 396.4 | -11.23% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 938.67 | 835.7 | -10.97% | | multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 404.41 | 360.78 | -10.79% | | multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 237.38 | 212.54 | -10.46% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 481.67 | 431.84 | -10.35% | | multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 224.61 | 201.88 | -10.12% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 405.01 | 365.51 | -9.75% | | multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 215.51 | 194.85 | -9.59% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 253.95 | 230.12 | -9.38% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 215.53 | 196.43 | -8.86% | | multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 130.68 | 119.31 | -8.70% | | multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 124.24 | 114.76 | -7.63% | | scalar benches/Scalar multiplication | 91662000.0 | 99047000.0 | 8.06% | | ristretto benches/Batch Ristretto double-and-encode/16 | 13.21 | 14.25 | 7.88% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 139.12 | 129.05 | -7.24% | | multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 119.67 | 111.45 | -6.87% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 119.42 | 111.78 | -6.40% | | multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 77.43 | 72.67 | -6.15% | | ristretto benches/Batch Ristretto double-and-encode/8 | 8.66 | 9.21 | 6.33% | | multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 73.28 | 69.44 | -5.24% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 81.73 | 77.50 | -5.17% | | multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 71.62 | 68.24 | -4.72% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 71.62 | 68.35 | -4.58% | | ristretto benches/Batch Ristretto double-and-encode/4 | 6.34 | 6.60 | 4.10% | | multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 50.92 | 48.93 | -3.92% | | montgomery benches/Constant-time fixed-base scalar mul | 16.56 | 17.22 | 4.01% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 52.98 | 51.08 | -3.58% | | scalar benches/Scalar addition | 22964000.0 | 22191000.0 | -3.37% | | edwards benches/Variable-time aA+bB A variable B fixed | 34.62 | 33.53 | -3.16% | | multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 47.89 | 46.43 | -3.06% | | multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 47.89 | 46.43 | -3.05% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 47.82 | 46.42 | -2.93% | | ristretto benches/RistrettoPoint compression | 4.57 | 4.69 | 2.67% | | scalar benches/Batch scalar inversion/16 | 15.92 | 16.32 | 2.52% | | edwards benches/Constant-time fixed-base scalar mul | 12.72 | 12.40 | -2.45% | | ristretto benches/Batch Ristretto double-and-encode/2 | 5.18 | 5.31 | 2.48% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 31.54 | 30.78 | -2.40% | | montgomery benches/Montgomery pseudomultiplication | 55.64 | 54.34 | -2.35% | | multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 38.59 | 37.73 | -2.22% | | ristretto benches/RistrettoPoint decompression | 4.61 | 4.71 | 2.16% | | multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 37.52 | 36.85 | -1.78% | | edwards benches/EdwardsPoint decompression | 4.32 | 4.39 | 1.69% | | scalar benches/Batch scalar inversion/8 | 14.48 | 14.72 | 1.67% | | scalar benches/Scalar subtraction | 21485000.0 | 21139000.0 | -1.61% | | ristretto benches/Batch Ristretto double-and-encode/1 | 4.58 | 4.64 | 1.41% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 30.16 | 29.77 | -1.32% | | scalar benches/Batch scalar inversion/4 | 13.76 | 13.95 | 1.32% | | multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 35.93 | 35.52 | -1.14% | | multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 30.24 | 29.93 | -1.04% | | multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 30.24 | 29.94 | -1.00% | | multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 30.24 | 29.95 | -0.96% | | multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 35.99 | 35.68 | -0.86% | | multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 35.98 | 35.68 | -0.84% | | scalar benches/Batch scalar inversion/1 | 13.21 | 13.31 | 0.72% | | scalar benches/Batch scalar inversion/2 | 13.38 | 13.47 | 0.70% | | edwards benches/EdwardsPoint compression | 3.95 | 3.95 | -0.07% |

burdges commented 1 year ago

I think the lower level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple months regardless, so..

If you want to focus on our real protocol optimization that's fine too. Ask @eskimor what he thinks. We do need to know when the crypto optimizations merge of course, so that we do not mistakenly conclude some protocol optimization did stuff it did not do.

We can otoh quickly cut a schnorrkel version without the syscall if you want to see if that's causing threads to be bumped or whatever.

alexggh commented 1 year ago

I think the lower level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple months regardless, so..

Yes, I agree they are orthogonal. I just went this route because I wanted to understand where we spent most of the time. With that in mind, I think in the order of impact we should do the following:

Implement https://github.com/paritytech/polkadot-sdk/issues/701 because it is clear that if we do less signature checks we should gain some performance.
After we get 1) implemented measure it together with the crypto improvements we have in the pipeline and decide if parallelization of the work is actually what will give use the desired gaines.

Don't you agree @sandreim @eskimor @burdges ?

sandreim commented 1 year ago

In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?

At a scale of 1k paravalidators and 200 parachains, under normal operation with 30 needed approvals and 6 vrf modulo samples, the question we are asking is basically - Can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing) ?

burdges commented 1 year ago

In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?

Yes this was always my question. :)

I suspect this syscall in schnorrkel is a major cost since it lets linux bump the thread.

Can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing) ?

Not really, but we could use those other threads for checking more parachain blocks or whatever.

At a high level, we're not exactly in the shoes of a web2 company who once they parallelize then they can simply throw more hardware at the problem. We have the hardware we told validators they could buy, and upgrading that requires community work.

sandreim commented 1 year ago

TBH, not sure that the syscall would result in the CPU being given to another thread, but we could actually test the 2 scenarios and see what the result is.

Totally agree with you, the thing is that with current architecture we cannot dynamically scale to process stuff on more than 1 CPU to handle bursty workloads. Assuming 1k para valdiators and 200 paras, I think we need to worry about the specs since that means at most 6-7 candidates to check (1 to back, 6 to approve) and 2x times that with async backing. We could add more validators there so the amount of checks per validator goes down, but that means we need to be able to handle more gossip traffic still.

alexggh commented 1 year ago

In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?

At least in approval-voting subsystem from 1 second of useful work it spends around 700 millis doing just the crypto operations for either assignment or approval.

I'll get back when I have some number splits for approval-distribution as well.

alexggh commented 1 year ago

Some measurements regarding where the time is spent in approval-distribution, in average from 1 second of work 750ms is spent waiting for the approval-voting to check the signatures. So, I would say our book-keeping is not where the main part of the work is happening. Example of 1s breakdown at peak :

15K messages coming in from peers
938 assignments checked in 379ms
1373 approvals checked in 338 ms
the rest(~250ms) book-keeping and forwarding message to the network bridge.

One thing I noticed while performing this measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607 but because we process those assignments one by one we are going to fan-out those message into multiple messages sent to the same peer. In average we have 10-15 approvals in those messages, but I see some case where we even get 328 approvals in it all those might translate to a single message for each approval.

I think this inefficiency, is a low-hanging fruit, and should help us with at least reducing the network traffic/load.

sandreim commented 1 year ago

Some measurements regarding where the time is spent in approval-distribution, in average from 1 second of work 750ms is spent waiting for the approval-voting to check the signatures. So, I would say our book-keeping is not where the main part of the work is happening. Example of 1s breakdown at peak :

15K messages coming in from peers

938 assignments checked in 379ms

1373 approvals checked in 338 ms

the rest(~250ms) book-keeping and forwarding message to the network bridge.

Thanks for the numbers @alexggh, it appears that book keeping is not eating a lot, but we might be able to improve there as well.

One thing I noticed while performing this measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607 but because we process those assignments one by one we are going to fan-out those message into multiple messages sent to the same peer. In average we have 10-15 approvals in those messages, but I see some case where we even get 328 approvals in it all those might translate to a single message for each approval.

I think this inefficiency, is a low-hanging fruit, and should help us with at least reducing the network traffic/load.

Sounds like a good idea, it should clearly reduce the notification counts.

alexggh commented 1 year ago

Putting this on-hold in favor of: https://github.com/paritytech/polkadot-sdk/issues/701

alexggh commented 5 months ago

Will be addressed with https://github.com/paritytech/polkadot-sdk/issues/1617

paritytech / polkadot-sdk

`approval-distribution`: process assignments and votes in parallel #732