Closed sandreim closed 2 months ago
I am still in favor of trying real hard to improve performance/reduce work somehow. Using multiple cores just for signature checking in approval voting ... maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.
Yes, I created https://github.com/paritytech/polkadot-sdk/issues/729 and if I understand correctly it does exactly what you mean and helps a lot.
I think we should implement improvements at the protocol/cryptography level as well as architectural. While in perfect network conditions maybe the changes suggested in paritytech/polkadot-sdk#729 might be enough to scale to 500+ validators, but if we go higher or upper tranches trigger or disputes happen then we clearly need a solution to be able to handle bursts of assignments and votes efficiently and scale beyond 1 CPU.
maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.
FWIW, we just have these constant overheads for every parachain block. The flip side is that they don't get worse when parachains get heavier.
Spent some time introspecting and profiling the approval voting to understand why we see the high ToF
in approval-distribution
, here are my findings, in the order of the impact:
We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 2-3 hundreds of micro-seconds(with rare occasions where it takes 2-3millis) , that capes the full capacity of approval-voting
assignments/votest processing to about 3k checks.
At peak performance that's how many assignments/approval we are processing, so if the messages arrive in the queue at about the same time in the approval-distribution queue, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignments, so there will be messages that keep approval-distribution busy for a few milliseconds, just because processing all the assignments/votes in the message take time to be processed, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items bellow.
Reducing the assignments from this https://github.com/paritytech/polkadot/pull/6782 will definitely help with the number of messages we have to process.
There are other operations that keep the approval-voting
main loop busy, occasionally:
handle_actions(issue_approval
) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.
ActiveLeaves - takes about 20 millis every 6 seconds.
Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds
While this operations happen relatively rarely, they do happen at peak time, so they might be just the straw that break the camel's back.
Great work! On signature checking, do we know how much of it is approvals, vs. assignments?
ToF
?
We need better benchmarks in schnorrkel but verifying an approval should take roughly 40 microseconds, so like ed25519, and verifying an assignment should be less than 3*25 + 40 = 115 micro seconds, so in 1 millisecond on one core you'd maybe 25 approvals, or 8 assignments, or 6 of each.
I've some tricks to speed up the VRF parts..
An elligator hash-to-curve should be fast, but we could cache the hash-to-curve if we uniformized them by excluding the validator public key from the hashed-to-curve. As the extreme we could compute a base point table for those uniformized hash-to-curves, which saves 10 microseconds if not batching.
I'd think sr25519 and ed25519 cost the same, so we'd only speed up approvals verification though either batching or else by adopting a considerably larger but non-batchable signature, like Rabin-Williams. We no longer think network throughout bottlenecks us here?
We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?
handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.
Why 6? An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here? We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.
ActiveLeaves - takes about 20 millis every 6 seconds.
Is this querying approvals data or something about the chain?
Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds
mmap fixes this one. approvals-voting db needs to be serialized in memory for this, but that's not really problematic.
Spent some time introspecting and profiling the approval voting to understand why we see the high
ToF
inapproval-distribution
, here are my findings, in the order of the impact:
Nice work @alexggh
- We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 2-3 hundreds of micro-seconds(with rare occasions where it takes 2-3millis) , that capes the full capacity of
approval-voting
assignments/votest processing to about 3k checks. At peak performance that's how many assignments/approval we are processing, so if the messages arrive in the queue at about the same time in the approval-distribution queue, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignments, so there will be messages that keep approval-distribution busy for a few milliseconds, just because processing all the assignments/votes in the message take time to be processed, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items bellow.Reducing the assignments from this paritytech/polkadot#6782 will definitely help with the number of messages we have to process.
Yes! But this improvement becomes less useful if we just scale the validators up. It is more useful when we actually need to do more work per validator - more parachains.
- There are other operations that keep the
approval-voting
main loop busy, occasionally:
- handle_actions(
issue_approval
) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.- ActiveLeaves - takes about 20 millis every 6 seconds.
- Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds
While this operations happen relatively rarely, they do happen at peak time, so they might be just the straw that break the camel's back.
- Occasionally(every few seconds), there seems to be situations where approval-distribution is waiting after approval-voting for 3-4 millis, but approval-voting doesn't do any work it is just stuck in the main select my theory is that in those cases the subsystem future doesn't execute because the system is busy executing other tasks. (Do you guys have any other theory) ?
I also observed this some time ago. My theory back then:
ToF?
Time of flight and it defines how much time a message arrived in the approval-distribution waits till it gets picked up for processing.
do we know how much of it is approvals, vs. assignments?
They seem to be evenly spread, e.g 1 second snapshot: 1717
assignments adding up to 640ms of processing time and 2040
approvals adding up to 360ms.
We need better benchmarks in schnorrkel but verifying an approval should take roughly 40 microseconds, so like ed25519, and verifying an assignment should be less than 3*25 + 40 = 115 micro seconds, so in 1 millisecond on one core you'd maybe 25 approvals, or 8 assignments, or 6 of each
I'll dig a bit deeper here, but in both my instrumentation on the tracing we have here assignment imports seem to take 200-300 micro seconds and approvals imports between 100 and 200 micro seconds. Also, I noticed rare cases where checking of a RelayVRFDelay
assignment takes 1-2 ms.
handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.
Why 6?
It seems we are doing it per candidate, so I think we have more at the same time.
An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?
What problems are you thinking about ?
We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.
Will re-check it.
ActiveLeaves - takes about 20 millis every 6 seconds.
Is this querying approvals data or something about the chain?
Yes, it is asking the runtime for block information a few times.
I've some tricks to speed up the VRF parts..
Actually use all the features correctly, ala https://github.com/dalek-cryptography/curve25519-dalek/pull/520
Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.
We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?
Are you referring to this https://github.com/paritytech/polkadot/pull/6782, if yes. The answer is no, I wasn't running that PR when I did the measurements.
We no longer think network throughout bottlenecks us here?
The messages are already in the approval-distribution
queue, so I would say the network shouldn't be the bottleneck here.
We no longer think network throughout bottlenecks us here?
The messages are already in the approval-distribution queue, so I would say the network shouldn't be the bottleneck here.
If I understand, we do not currently know what performance issues the networking itself brings, so we should not yet assume we can make life harder for the networking either.
Actually use all the features correctly, ala https://github.com/dalek-cryptography/curve25519-dalek/pull/520
Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.
Ask @koute maybe?
Is this querying approvals data or something about the chain?
Yes, it is asking the runtime for block information a few times.
Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?
An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?
What problems are you thinking about ?
I've no idea how the keystore works internally, but unrealted..
I noticed one reason signing might be slow or unpredictable: We do a syscall every time we sign anything in https://github.com/w3f/schnorrkel/blob/master/src/lib.rs#L228 which gives linux an excuse to bump the thread. I'd previously used ThreadRng but removed it to simplify maintenance in https://github.com/w3f/schnorrkel/commit/c333370c164746c6e4c2e74fd9b7a53c72dd10f9 I'll revert that commit, fix rand, and make some other changes.
They seem to be evenly spread, e.g 1 second snapshot: 1717 assignments adding up to 640ms of processing time and 2040 approvals adding up to 360ms.
So 3x and 4x longer than my guestimates for the bare crypto, respectively. We do need better benchmarks on schnorrkel of course, like the rand issue for example.
We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?
Are you referring to this https://github.com/paritytech/polkadot/pull/6782, if yes. The answer is no, I wasn't running that PR when I did the measurements.
No. We discussed batching together other approvals messages too, but maybe no issue exists yet. If we're already signing 6 around the same time then maybe this saves lots.
Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.
It's not. This still requires passing a special flag to the compiler when compiling polkadot so that the SIMD instructions actually get used.
But I have another PR to curve25519-dalek
which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel
.
The expected speedup thanks to this is somewhere between ~30%-50%, depending on the hardware.
No. We discussed batching together other approvals messages too, but maybe no issue exists yet. If we're already signing 6 around the same time then maybe this saves lots.
It doesn't exists yet, I think it is this one: https://github.com/paritytech/polkadot-sdk/issues/701
Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?
Yes, we do that.
used ThreadRng but removed it to simplify maintenance in https://github.com/w3f/schnorrkel/commit/c333370c164746c6e4c2e74fd9b7a53c72dd10f9 I'll revert that commit, fix rand, and make some other changes.
Thank you, I'll try to see if I can obtain some numbers with that reverted.
But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.
@koute Could you point me to the PR, so I can bring it locally and get out some numbers.
But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.
@koute Could you point me to the PR, so I can bring it locally and get out some numbers.
It's merged to curve25519-dalek
's main
, so if you pull that in it'll automatically work as long as the simd
cargo feature is enabled. (Note: there are two backends, and the faster avx512 one requires a nightly compiler; when compiled on a non-nightly compiler it will be silently disabled. The crate will automatically pick the fastest available one at runtime.)
That said, actually integrating it is a little bit more complex because the new version of curve25519-dalek
has breaking changes and it needs to be replaced in a few crates (schnorrkel
, ed25519-zebra
, sp-io
), so it's not just a simple job of bumping the deps. I don't know whether you actually want to invest your time into this right now. (Hopefully there will be a new official release of curve25519-dalek
soon-ish and then I'll be making the relevant PRs myself to pull it in.)
If you're interested here are some performance numbers from curve25519-dalek
's own benchmarks for its two available SIMD backends. I took these on a GCP VM.
I think the lower level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple months regardless, so..
If you want to focus on our real protocol optimization that's fine too. Ask @eskimor what he thinks. We do need to know when the crypto optimizations merge of course, so that we do not mistakenly conclude some protocol optimization did stuff it did not do.
We can otoh quickly cut a schnorrkel version without the syscall if you want to see if that's causing threads to be bumped or whatever.
I think the lower level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple months regardless, so..
Yes, I agree they are orthogonal. I just went this route because I wanted to understand where we spent most of the time. With that in mind, I think in the order of impact we should do the following:
Don't you agree @sandreim @eskimor @burdges ?
In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?
At a scale of 1k paravalidators and 200 parachains, under normal operation with 30 needed approvals and 6 vrf modulo samples, the question we are asking is basically - Can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing) ?
In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?
Yes this was always my question. :)
I suspect this syscall in schnorrkel is a major cost since it lets linux bump the thread.
Can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing) ?
Not really, but we could use those other threads for checking more parachain blocks or whatever.
At a high level, we're not exactly in the shoes of a web2 company who once they parallelize then they can simply throw more hardware at the problem. We have the hardware we told validators they could buy, and upgrading that requires community work.
TBH, not sure that the syscall would result in the CPU being given to another thread, but we could actually test the 2 scenarios and see what the result is.
Totally agree with you, the thing is that with current architecture we cannot dynamically scale to process stuff on more than 1 CPU to handle bursty workloads. Assuming 1k para valdiators and 200 paras, I think we need to worry about the specs since that means at most 6-7 candidates to check (1 to back, 6 to approve) and 2x times that with async backing. We could add more validators there so the amount of checks per validator goes down, but that means we need to be able to handle more gossip traffic still.
In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?
At least in approval-voting subsystem from 1 second of useful work it spends around 700 millis doing just the crypto operations for either assignment or approval.
I'll get back when I have some number splits for approval-distribution
as well.
Some measurements regarding where the time is spent in approval-distribution
, in average from 1 second of work 750ms is spent waiting for the approval-voting
to check the signatures. So, I would say our book-keeping is not where the main part of the work is happening. Example of 1s breakdown at peak :
One thing I noticed while performing this measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607 but because we process those assignments one by one we are going to fan-out those message into multiple messages sent to the same peer. In average we have 10-15 approvals in those messages, but I see some case where we even get 328
approvals in it all those might translate to a single message for each approval.
I think this inefficiency, is a low-hanging fruit, and should help us with at least reducing the network traffic/load.
Some measurements regarding where the time is spent in
approval-distribution
, in average from 1 second of work 750ms is spent waiting for theapproval-voting
to check the signatures. So, I would say our book-keeping is not where the main part of the work is happening. Example of 1s breakdown at peak :
- 15K messages coming in from peers
- 938 assignments checked in 379ms
- 1373 approvals checked in 338 ms
- the rest(~250ms) book-keeping and forwarding message to the network bridge.
Thanks for the numbers @alexggh, it appears that book keeping is not eating a lot, but we might be able to improve there as well.
One thing I noticed while performing this measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607 but because we process those assignments one by one we are going to fan-out those message into multiple messages sent to the same peer. In average we have 10-15 approvals in those messages, but I see some case where we even get
328
approvals in it all those might translate to a single message for each approval.I think this inefficiency, is a low-hanging fruit, and should help us with at least reducing the network traffic/load.
Sounds like a good idea, it should clearly reduce the notification counts.
Putting this on-hold in favor of: https://github.com/paritytech/polkadot-sdk/issues/701
Will be addressed with https://github.com/paritytech/polkadot-sdk/issues/1617
Follow up from some earlier scaling tests with https://github.com/paritytech/polkadot/pull/6247. The initial attempts did made it apparent that
approval-distribution
andapproval-voting
are bottlenecks increasing approval checking finality lag in our tests with 250 paravalidators and 50 parachains at the time. That argument still holds true, but only at much larger scale than what we were experimenting at the time. The real issue was https://github.com/paritytech/polkadot/issues/6400 which was solved.Bumping up to 350+ para validators requires that we make
approval-voting
andapproval-distribution
do more work in the same amount of time. The first step is to implement the parallelization changes from https://github.com/paritytech/polkadot/pull/6247 which would result in maxing out cpu usage inapproval-voting
which sits right now at 80% due to the serial processing and waiting for completion of vote imports issued byapproval-distribution
. This will enable us to further improve how fast the subsystem churns through work by makingapproval-voting
parallelise signature checks.Important note: The implementation will make use of internal queueing of messages. We have to be careful to preserve the backpressure so we need these internal queues to be bounded, for example by a fixed number of imports that can be pending at a given time.