lightningdevkit / ldk-sample

Sample node implementation using LDK
Apache License 2.0
166 stars 94 forks source link

Switch missed tick behavior to delay instead of default burst #108

Closed domZippilli closed 1 year ago

domZippilli commented 1 year ago

We encountered an interesting bug that we fixed and felt we should upstream, as it presents itself in the sample as well.

We observed that seemingly randomly, this loop (or rather, our similar but modified one) would get "hot" and run as fast as possible. This would consume a lot of CPU and other resources, and of course meant a lot of SYN packets being sent somewhere proximal to the peer. These "reconnect storms" could last as little as a few seconds, or up to hours, after which they would spontaneously resolve.

We could observe this because we'd added logging to the reconnect attempts, so all we knew was that some loop was hot, but not which of them. We initially tried rewriting the loop to use combinators with take and maximum attempt values, but this did not fix the problem. This suggested that the outer loop was the issue, but it took us some time to figure out how that could be.

The interval here uses the default missed tick behavior, which is Burst:

https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html

We believe that with a large number of peers offline (we sometimes had >5), this loop could run for longer than the Interval duration of one second. When this happened, the Interval would enter Burst mode and try to re-establish an exact cadence of running this loop on a whole multiple of the interval from the start.

Expected ticks: |     1     |     2     |     3     |     4     |     5     |     6     |
Actual ticks:   | work -----|          delay          | work | work | work -| work -----|

We switched the interval to instead use the Delay behavior, which should cause this loop to wait the interval's duration (1 second) each time tick() is called, which is probably what the author (@TheBlueMatt) intended.

Expected ticks: |     1     |     2     |     3     |     4     |     5     |     6     |
Actual ticks:   | work -----|          delay          | work -----| work -----| work -----|
domZippilli commented 1 year ago

Not sure if there's a permissions thing but I can't see this question I got emailed with a link here:

@domZippilli Out of curiosity, was there a particular reason why you chose MissedTickBehavior::Delay over MissedTickBehavior::Skip?

Not particularly, either will get the job done of not-doing-burst and trying to do this on a specific cadence. I thought the Delay behavior is more like what most people would expect, in that writing a loop like this in a single-threaded program, most people would have a sleep at one end of the loop.

tnull commented 1 year ago

Not sure if there's a permissions thing but I can't see this question I got emailed with a link here:

No, excuse the confusion, after posting I quickly reconsidered and deleted the question as I figured the difference between Skip and Delay is really minuscule and it's not really worth your time explaining why you chose one over the other.

Not particularly, either will get the job done of not-doing-burst and trying to do this on a specific cadence. I thought the Delay behavior is more like what most people would expect, in that writing a loop like this in a single-threaded program, most people would have a sleep at one end of the loop.

Alright, thank you! That's what I figured, too. :)