Closed ryoqun closed 2 years ago
Please check https://github.com/solana-labs/solana/pull/21019
Problem
The http worker threads are scheduled with the same priority as all other threads. This can stall the validator pretty easily by indirectly choking the machine through CPU saturation.
When I manually reniced the http threads*, I generally observed a more favorable validator sync status, even under a heavy load of RPC requests.
*:
$ ps -e -T | grep http | awk '{print $2}' | while read pid; do sudo renice -n 20 -p $pid; done
Proposed Solution
Just lower the thread priority of the http workers via some platform-dependent crate.
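A minimal sketch of what this could look like on Linux, assuming the worker lowers its own priority right after it starts. It declares `setpriority`/`getpriority` by hand only to stay dependency-free; a real change would more likely go through the `libc` crate or a dedicated thread-priority crate. The names `lower_current_thread_priority`, `current_thread_nice` and `spawn_low_priority_worker` are hypothetical and are not existing validator APIs. It relies on the Linux-specific behavior that the nice value is a per-thread attribute and that `who = 0` refers to the calling thread. Nice values range from -20 to 19, so the `-n 20` in the manual command above is effectively clamped to 19.

```rust
use std::io;
use std::thread;

// Hand-written FFI declarations for the POSIX calls (assumption: Linux/glibc).
// A real implementation would more likely pull these from the `libc` crate.
extern "C" {
    fn setpriority(which: i32, who: u32, prio: i32) -> i32;
    fn getpriority(which: i32, who: u32) -> i32;
}

// Value of PRIO_PROCESS on Linux.
const PRIO_PROCESS: i32 = 0;

/// Lowers the scheduling priority of the calling thread.
/// On Linux the nice value is per-thread, and `who = 0` means "the caller".
fn lower_current_thread_priority(nice: i32) -> io::Result<()> {
    let rc = unsafe { setpriority(PRIO_PROCESS, 0, nice) };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}

/// Returns the nice value of the calling thread.
fn current_thread_nice() -> i32 {
    unsafe { getpriority(PRIO_PROCESS, 0) }
}

/// Hypothetical helper: spawns a named thread that renices itself to 19
/// (the lowest priority) before running `work`.
fn spawn_low_priority_worker<F, T>(name: &str, work: F) -> io::Result<thread::JoinHandle<T>>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new().name(name.to_string()).spawn(move || {
        // Failing to renice should not be fatal; the worker still runs.
        if let Err(e) = lower_current_thread_priority(19) {
            eprintln!("could not lower thread priority: {}", e);
        }
        work()
    })
}

fn main() {
    let main_nice = current_thread_nice();
    let worker = spawn_low_priority_worker("http-worker-sketch", current_thread_nice)
        .expect("failed to spawn worker");
    let worker_nice = worker.join().expect("worker panicked");
    println!("main thread nice: {}, worker nice: {}", main_nice, worker_nice);
}
```

Raising the nice value needs no special privileges, so this works for an unprivileged validator process. Lowering it back again would require `CAP_SYS_NICE` or a suitable `RLIMIT_NICE`.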
Be a bit careful about lock priority inversion, but this should generally be better than nothing. At a minimum, we just need a dichotomy between critical replay threads (including the accounts background service) and other threads (http workers), and we need to make sure replay threads never depend on the other threads. (Strictly speaking, they still do, via AccountsDB locks.)
Also, as far as I have checked, our RPC generally doesn't lock, except for the obvious ones (like getProgramAccounts). So, this is not super high priority. Anyway, I managed the validator with getConfirmedBlock, so lowering the priority of the serialization done by the http worker threads will be one of the few wins from here.