Problem

The PoH service generates a fixed number of hashes (configured in the genesis block as hashes_per_tick) before registering a tick. The service thread sets its CPU affinity, but OS scheduling offers no guarantee that other threads will not be scheduled onto the same core. As a result, different node types (leader/validator) take different amounts of time to compute the same number of hashes, and the network drifts as nodes complete their respective slots at different times. The end result can be leader timeouts and dropped slots.
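For context, the tick mechanism can be sketched as a chained hash loop. This is a minimal Python sketch (the actual service is written in Rust, and the hashes_per_tick value here is illustrative, not taken from a real genesis config):

```python
import hashlib
import time

# Illustrative only: the real value comes from the genesis config (hashes_per_tick).
HASHES_PER_TICK = 12_500

def poh_ticks(seed: bytes, num_ticks: int, hashes_per_tick: int = HASHES_PER_TICK):
    """Chain SHA-256 hashes; after every hashes_per_tick hashes, register a tick."""
    state = hashlib.sha256(seed).digest()
    ticks = []
    for _ in range(num_ticks):
        start = time.perf_counter()
        for _ in range(hashes_per_tick):
            state = hashlib.sha256(state).digest()
        # Wall-clock time per tick varies with whatever else the OS schedules
        # onto the core, which is the source of the drift described above.
        ticks.append((state, time.perf_counter() - start))
    return ticks

ticks = poh_ticks(b"genesis", num_ticks=8)
for i, (digest, elapsed) in enumerate(ticks):
    print(f"tick {i}: {digest.hex()[:16]}... {elapsed * 1e3:.2f} ms")
```

The hash chain itself is deterministic; only the wall-clock time per tick depends on scheduling, which is why nodes agree on the chain but not on when each tick completes.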
Proposed Solution
I tried setting the scheduling policy of the PoH service thread to realtime with the FIFO policy. This helps align PoH timing across all nodes.
Running a 5-node network, I plotted the time it took to do 8 ticks on the cluster.
The following graph is for the tip of master.
This graph is with the scheduling change
It can be seen that PoH ticks are more consistent with the change.
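The change boils down to a sched_setscheduler call on the service thread. A minimal Python sketch of the same system call (the actual change would live in the Rust service; priority 99 mirrors the chrt invocation used in the Challenges section):

```python
import os

def try_set_realtime_fifo(priority: int = 99) -> str:
    """Attempt to move the calling thread into the realtime FIFO class.

    Without superuser privileges (CAP_SYS_NICE) or a nonzero RLIMIT_RTPRIO,
    the kernel refuses the request with EPERM.
    """
    try:
        # pid 0 means "the calling thread"
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return "fifo"
    except PermissionError:
        return "eperm"

print(try_set_realtime_fifo())
```

Run unprivileged this prints "eperm", which is exactly the permissions problem described under Challenges below.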
Challenges
The thread itself is not able to set its own scheduling policy due to lack of privileges; doing so needs superuser privileges. I tried using the thread_priority crate and ran into the permissions issue.
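The EPERM comes from the kernel's realtime rlimit: unless the process has CAP_SYS_NICE, a realtime priority request above RLIMIT_RTPRIO (whose soft limit usually defaults to 0) is refused. A quick way to inspect that limit, sketched in Python:

```python
import resource

# RLIMIT_RTPRIO caps the realtime priority an unprivileged process may request.
# With the usual soft default of 0, any SCHED_FIFO/SCHED_RR request fails with
# EPERM, which matches the permissions error hit by the thread_priority crate.
soft, hard = resource.getrlimit(resource.RLIMIT_RTPRIO)
print(f"RLIMIT_RTPRIO: soft={soft} hard={hard}")
```

Raising this limit (e.g. via /etc/security/limits.conf) is one way to let an unprivileged validator set a realtime policy on its own threads.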
We can use chrt to update the scheduling policy from the shell; it has to be run with sudo. The above graphs were captured using this approach. I added the following to remote-node.sh:

sudo chrt -r -p 99 `ps -eT | grep solana-poh-serv | awk '{print $2}'`
This approach cannot be used for external nodes, as the node boots directly from the native solana-validator program.
Just changing the priority of the thread (using renice) did not help with the problem.
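This is expected: renice only adjusts the nice value inside the default SCHED_OTHER class and never moves a thread into a realtime class, so the CFS scheduler can still preempt the PoH thread. A small Python check illustrating that a nice adjustment leaves the scheduling class unchanged:

```python
import os

# renice only adjusts the nice value inside the default SCHED_OTHER class;
# it never moves a thread into a realtime class, so CFS can still preempt it.
policy_before = os.sched_getscheduler(0)
os.nice(5)  # lowering our own priority needs no privileges
policy_after = os.sched_getscheduler(0)
print(policy_before == policy_after)  # True: nice value changed, class did not
```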