lf-edge / eve

EVE is Edge Virtualization Engine
https://www.lfedge.org/projects/eve/
Apache License 2.0

Don't let deferred queue cause avalanche send, kickoff within minute #4218

Closed. rouming closed this 2 months ago

rouming commented 2 months ago

Once the deferred queue has stalled due to connectivity problems, don't let an avalanche of messages be sent to the controller from many nodes once connectivity is restored. This patch replaces the immediate timer kick with a kick after a random delay within a reasonable one-minute window.

The patch should help reduce the load on the controller after downtime.
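
For illustration, a minimal Go sketch of the idea (the function names here are hypothetical, not the actual EVE code): when connectivity comes back, the deferred-queue kick is scheduled after a random delay of up to one minute instead of firing immediately, so a large fleet does not hit the controller at the same instant.

```go
package main

import (
	"log"
	"math/rand"
	"time"
)

// drainDeferredQueue is a hypothetical stand-in for the code that
// sends the queued messages to the controller.
func drainDeferredQueue() {
	log.Println("draining deferred queue")
}

// onConnectivityRestored spreads the first send over a random delay
// within one minute instead of kicking the queue immediately.
func onConnectivityRestored() {
	delay := time.Duration(rand.Int63n(int64(time.Minute)))
	log.Printf("connectivity restored, kicking deferred queue in %v", delay)
	time.AfterFunc(delay, drainDeferredQueue)
}

func main() {
	onConnectivityRestored()
	time.Sleep(2 * time.Minute) // keep the demo process alive
}
```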

CC: @milan-zededa

rouming commented 2 months ago

~Heading the office~ Heading to the office, will test there. PS: updated to a more modest statement.

OhmSpectator commented 2 months ago

Is this a fix for an issue observed by a user?

So, if I understand it right, the change is that instead of ticking once immediately after the connection is restored, it will update the ticker with a once-set random interval (random, but then the same for every tick) until all the deferred messages are sent, and then restore the original ticker with exponentially increasing randomized ticks.

It solves the problem of the avalanche send on connection restoration, yep. But it changes the behaviour of handling the deferred queue until it's drained. I don't think it's critical, but I would solve it in a less invasive way: defer the single call of KickTimerNow() with a random timeout, without reconfiguring the main ticker.
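
A rough sketch of that alternative, assuming a one-shot kick function along the lines of KickTimerNow (the wrapper and names below are hypothetical): only the single kick is delayed by a random amount, while the main ticker keeps its existing configuration.

```go
package main

import (
	"log"
	"math/rand"
	"time"
)

// kickTimerNow is a hypothetical stand-in for the existing one-shot
// kick of the deferred queue (KickTimerNow in the discussion).
func kickTimerNow() {
	log.Println("kicking deferred queue once")
}

// deferKick schedules a single kick after a random delay up to maxDelay,
// leaving the main (exponential backoff) ticker untouched.
func deferKick(maxDelay time.Duration, kick func()) *time.Timer {
	delay := time.Duration(rand.Int63n(int64(maxDelay)))
	return time.AfterFunc(delay, kick)
}

func main() {
	deferKick(time.Minute, kickTimerNow)
	time.Sleep(2 * time.Minute) // keep the demo process alive
}
```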

milan-zededa commented 2 months ago

> Is this a fix for an issue observed by a user?

The problem is that if the cloud goes down and then later is restored, it is stormed by all devices publishing their enqueued messages all at the same time. The poor cloud coming back from downtime cannot catch a breath :)

OhmSpectator commented 2 months ago

> The problem is that if the cloud goes down and then later is restored, it is stormed by all devices publishing their enqueued messages all at the same time. The poor cloud coming back from downtime cannot catch a breath :)

Yeah, I understand. I just want to understand if we have already seen it.

milan-zededa commented 2 months ago

> Yeah, I understand. I just want to understand if we have already seen it.

Yes, we see this problem on one of the clusters, which has >10K devices.

OhmSpectator commented 2 months ago

> Yes, we see this problem on one of the clusters, which has >10K devices.

Ok, I got it. So the intent is to fix it ASAP, yep?

milan-zededa commented 2 months ago

> Ok, I got it. So the intent is to fix it ASAP, yep?

Yes, although it will take a long time anyway until a significant number of devices are upgraded to a version containing this patch. And from the EVE user's perspective this is our problem (our = controller provider), not theirs, so they might be reluctant to upgrade.

rouming commented 2 months ago

Converted to draft, want to rework the exponential backoff implementation.
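
For reference, a generic sketch of randomized exponential backoff (not the actual EVE implementation being reworked): each interval doubles up to a cap and is then jittered into the upper half of that range so that many devices do not retry in lockstep.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextBackoff doubles the previous interval up to maxInterval and returns
// a randomized value in [next/2, next) to avoid synchronized retries.
func nextBackoff(current, maxInterval time.Duration) time.Duration {
	next := 2 * current
	if next > maxInterval {
		next = maxInterval
	}
	half := next / 2
	return half + time.Duration(rand.Int63n(int64(half)))
}

func main() {
	interval := time.Second
	for i := 0; i < 6; i++ {
		interval = nextBackoff(interval, 30*time.Second)
		fmt.Println(interval)
	}
}
```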

OhmSpectator commented 2 months ago

As I already mentioned, the PR looks good to me, and the tests are OK, so I would like to merge it. Nevertheless, I'm not sure @eriknordmark is okay with replacing the algorithm.