lightninglabs / neutrino

Privacy-Preserving Bitcoin Light Client

query: implement likelihood sampling based peer scheduling and add work stealing across peers #292

Open · Roasbeef opened this issue 8 months ago

Roasbeef commented 8 months ago

This issue is spun off from some findings we've seen with neutrino over time, either when we have a lot of active peers, or when some of the peers are connected over tor. This'll serve as a master tracking issue for a number of improvements we can make to this area of the code base. I also consider much of this to be preparatory work before we begin parallel block download in btcd in earnest. This was spun out of the discussion in https://github.com/lightningnetwork/lnd/issues/8250.

Sampling Based Peer Ranking

Today the default peer ranking algorithm just re-orders the peers in a map: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/peer_rank.go#L20-L26. Each time we go to schedule something new, we use the ranking to sort a slice of the peers, then attempt to give the work to the next free peer: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L226-L256.

As peers start to fill up, we'll allocate a task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.

To get around this, we should look at using a sampling based method. So we'll map the rankings into a probability that the peer will respond properly instead.
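A minimal sketch of what that sampling step could look like, assuming the existing rankings have already been mapped into positive weights (the `samplePeer` function and `scores` map below are hypothetical, not the current ranking API):

```go
package query

import "math/rand"

// samplePeer picks a peer at random, weighted by its score, so that
// higher-ranked peers are proportionally more likely to be chosen
// while slow peers aren't permanently sorted to the back. The scores
// map and its semantics are illustrative, not neutrino's current API.
func samplePeer(scores map[string]float64) string {
	var total float64
	for _, s := range scores {
		total += s
	}
	if total <= 0 {
		return ""
	}

	// Draw a point in [0, total) and walk the cumulative weights
	// until we cross it.
	target := rand.Float64() * total
	var cum float64
	for addr, s := range scores {
		cum += s
		if target < cum {
			return addr
		}
	}
	return "" // Unreachable while total > 0.
}
```

With this scheme a peer with twice the score is twice as likely to be picked, but slower peers still see occasional traffic, which lets their scores recover if their connection improves.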

Dynamic Ping Tuned Timeouts

Today we have a hard coded 2 second timeout for all requests. This works well on very good network connections, but tor peers may be too slow to reliably respond within that window.

Instead, we should use the ping latency to tune the timeout for a given peer, based on the extrapolated RTT. This can also be fed into the peer ranking itself. We should also refresh this metric in the background (as network jitter can change perceived latency, especially on tor).
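A rough sketch of how the tuning could work, assuming a per-peer smoothed RTT in the spirit of TCP's SRTT (the `pingStats` type, the constants, and the method names are all hypothetical):

```go
package query

import "time"

// pingStats tracks a smoothed RTT estimate for a single peer, in the
// spirit of TCP's SRTT. Everything here is a hypothetical sketch, not
// an existing neutrino type.
type pingStats struct {
	srtt time.Duration // Smoothed RTT (EWMA of ping samples).
}

// observe folds a new ping sample into the smoothed estimate. Using an
// EWMA tracks drifting latency (important on tor) without overreacting
// to a single slow ping.
func (p *pingStats) observe(sample time.Duration) {
	const alpha = 0.125 // Classic TCP SRTT gain.
	if p.srtt == 0 {
		p.srtt = sample
		return
	}
	p.srtt = time.Duration((1-alpha)*float64(p.srtt) + alpha*float64(sample))
}

// timeout derives a per-peer request timeout from the smoothed RTT,
// keeping today's 2 second value as a floor so well-connected peers
// retain a snappy deadline.
func (p *pingStats) timeout() time.Duration {
	const (
		floor      = 2 * time.Second
		multiplier = 4 // Headroom over the observed RTT.
	)
	if t := p.srtt * multiplier; t > floor {
		return t
	}
	return floor
}
```

Calling `observe` from a background ping loop keeps the estimate current as tor circuits and network jitter shift, and the smoothed value can double as an input to the peer ranking itself.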

Work Stealing Workers

We should add work stealing to ensure that fast peers can dynamically take on more load, increasing overall throughput. With work stealing, a peer would check at the top of its loop whether another peer has work that has yet to be claimed, then add that work to its own queue (see the sketch below). The Go scheduler itself uses work stealing to keep all processors utilized: https://rakyll.org/scheduler/.
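A rough sketch of the stealing check itself, assuming each peer has a thread-safe queue (every type and name below is hypothetical):

```go
package query

import "sync"

// queryTask stands in for a unit of work, and peerQueue is a minimal
// thread-safe queue. Both are hypothetical sketches, not neutrino types.
type queryTask struct{ id uint64 }

type peerQueue struct {
	mu    sync.Mutex
	tasks []*queryTask
}

// tryPop claims the next task without blocking. It must be thread safe,
// since the owning worker and a stealing worker may race on the queue.
func (q *peerQueue) tryPop() (*queryTask, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.tasks) == 0 {
		return nil, false
	}
	task := q.tasks[0]
	q.tasks = q.tasks[1:]
	return task, true
}

// stealWork is the check an idle worker could run at the top of its
// loop: scan the other peers' queues and claim an unclaimed task, much
// like the Go runtime scheduler keeps idle Ps busy.
func stealWork(own *peerQueue, peers []*peerQueue) (*queryTask, bool) {
	for _, p := range peers {
		if p == own {
			continue // Never steal from ourselves.
		}
		if task, ok := p.tryPop(); ok {
			return task, true
		}
	}
	return nil, false
}
```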

Along the way, we should modify the work queue to add better type safety via type params.
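A type-parameterized queue could look roughly like this (the `workQueue` name and shape are illustrative, not the current implementation):

```go
package query

import "sync"

// workQueue is a sketch of a type-parameterized queue: the element
// type is fixed at compile time, so consumers get typed tasks back
// without any interface{} assertions.
type workQueue[T any] struct {
	mu    sync.Mutex
	items []T
}

func (q *workQueue[T]) push(item T) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, item)
}

// pop returns the next item, and false when the queue is empty, so
// callers never need a runtime type assertion.
func (q *workQueue[T]) pop() (T, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	var zero T
	if len(q.items) == 0 {
		return zero, false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}
```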

Chinwendu20 commented 6 months ago

As peers start to fill up, we'll allocate a task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.

I do not think so. Currently, tasks are assigned to a peer one at a time, and if the peer takes too long, it times out and the task is reassigned to another worker (peer). In my understanding, with the current setup there is really "no work to steal".

I think the dynamic ping time-out would be useful though, in determining the most "realistic" time for a timeout given the RTT.

So we'll map the rankings into a probability that the peer will respond properly instead.

Currently, peers are awarded a score based on their history of responsiveness, which reduces or increases their likelihood of being assigned a task. Do you think that would suffice?

Along the way, we should modify the work queue to add better type safety via type params.

Care to throw more light on this one?

Roasbeef commented 6 months ago

I do not think so. Currently, tasks are assigned to a peer one at a time, and if the peer takes too long, it times out and the task is reassigned to another worker (peer).

So this is about more active queue management. That timeout is applied by the scheduler, rather than another peer stealing the work directly. What we have right now is a sorted list, and we assign work directly off of it, without being aware of any active load.

Notice how the main loop can block when sending work to a peer if the queue is full: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L237-L258

Instead, it should recognize that the queue is full and allocate the work to another peer. We should also move to a thread-safe queue instead of the current channel based mechanism.
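A minimal sketch of the non-blocking alternative, assuming a mutex-guarded bounded queue (the names and the fixed capacity are hypothetical):

```go
package query

import "sync"

// task stands in for a unit of query work; boundedQueue sketches the
// non-blocking alternative to the blocking channel send in the main
// loop. Both are illustrative, not the real types.
type task struct{}

type boundedQueue struct {
	mu      sync.Mutex
	tasks   []*task
	maxSize int
}

// tryPush enqueues the task only if there is room, returning false on
// a full queue so the caller can reassign the work rather than block.
func (q *boundedQueue) tryPush(t *task) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.tasks) >= q.maxSize {
		return false
	}
	q.tasks = append(q.tasks, t)
	return true
}
```

With `tryPush` returning false instead of blocking, the main loop stays responsive and can immediately offer the task to the next peer in the ranking.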

Check out that link I posted at the end, re how work stealing works in the Go runtime scheduler.

Currently, peers are awarded a score based on their history of responsiveness, which reduces or increases their likelihood of being assigned a task. Do you think that would suffice?

Re-reading this, I think it can be even simpler:

Along the way, we should modify the work queue to add better type safety via type params.

Get rid of all the type assertions: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L215.
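For a concrete contrast, a minimal before/after with hypothetical stand-in types (not the real types behind that line in workmanager.go):

```go
package query

// queryJob is a hypothetical stand-in for a queued unit of work.
type queryJob struct{ index uint64 }

// Before: an interface{}-based container forces a runtime assertion at
// every access, and a wrong type only surfaces as a panic.
func headBefore(items []interface{}) *queryJob {
	return items[0].(*queryJob)
}

// After: with a type parameter the element type is checked at compile
// time, so the assertion (and its panic path) disappears.
func headAfter[T any](items []T) T {
	return items[0]
}
```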