rouming closed this 2 months ago
~Heading the office~ Heading to the office, will test there. PS: updated to a more modest statement.
Is this a fix for an issue observed by a user?
So, if I understand it right, the change is that instead of ticking once immediately after the connection is restored, it will update the ticker with a randomly chosen but fixed interval (random when picked, then the same for every tick) until all the deferred messages are sent, and then restore the original ticker with exponentially increasing randomized ticks. It solves the problem of the avalanche of sends on connection restoration, yep. But it also changes how the deferred queue is handled until it is drained. I don't think that's critical, but I would solve it in a less invasive way: defer the single call of KickTimerNow() by a random timeout, without reconfiguring the main ticker.
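For illustration, a minimal Go sketch of that less invasive alternative, assuming a KickTimerNow() helper that triggers one immediate pass over the deferred queue (the real function lives in EVE's deferred-queue code; everything else here, including onConnectivityRestored and maxDelay, is hypothetical). Rather than reconfiguring the main ticker, the single kick is simply delayed by a random amount:

```go
package main

import (
	"math/rand"
	"time"
)

// KickTimerNow stands in for the real deferred-queue kick; here it is a stub
// that would trigger one immediate processing pass of the deferred queue.
func KickTimerNow() {
	// ... send the queued messages ...
}

// onConnectivityRestored (hypothetical) delays the usual immediate kick by a
// random duration within maxDelay, so that thousands of devices do not all
// hit the controller at the same instant.
func onConnectivityRestored(maxDelay time.Duration) *time.Timer {
	delay := time.Duration(rand.Int63n(int64(maxDelay)))
	return time.AfterFunc(delay, KickTimerNow)
}

func main() {
	// Spread the first post-reconnect send over up to one minute.
	t := onConnectivityRestored(time.Minute)
	defer t.Stop()
	time.Sleep(2 * time.Minute) // keep the sketch alive long enough for the kick to fire
}
```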
Is this a fix for an issue observed by a user?
The problem is that if the cloud goes down and is later restored, it gets stormed by all devices publishing their enqueued messages at the same time. The poor cloud coming back from downtime cannot catch a breath :)
Yeah, I understand. I just want to understand if we have already seen it.
Yes, we see this problem on one of the clusters, which has >10K devices.
Ok, I got it. So the intent is to fix it ASAP, yep?
Yes, although it will take a long time anyway until a significant number of devices are upgraded to include this patch. And from the EVE user's perspective this is our problem (our = the controller provider), not theirs, so they might be reluctant to upgrade.
Converted to draft, want to rework the exponential backoff implementation.
As I already mentioned, the PR looks good to me and the tests are OK, so I would like to merge it. Nevertheless, I'm not sure @eriknordmark is okay with replacing the algorithm.
Once the deferred queue is stalled due to connectivity problems, don't let an avalanche of messages be sent to the controller from many nodes when connectivity is restored. This patch replaces the immediate timer kick with a timer kick after a random duration within a reasonable one-minute window.
The patch should help reduce the load on the controller after downtime.
CC: @milan-zededa
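To make the expected effect of the randomization in the description above concrete, here is a small self-contained Go simulation (not EVE code; the device count and bucket size are made up for illustration). With an immediate kick all ~10K devices would publish at the same instant, whereas a kick delayed by a random duration within a one-minute window spreads them roughly evenly across that minute:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const devices = 10000     // roughly the cluster size mentioned above
	window := time.Minute     // the randomization window from the patch description
	buckets := make([]int, 6) // count kicks per 10-second slice of the window

	for i := 0; i < devices; i++ {
		// Each device independently picks a random delay within the window
		// instead of kicking its timer immediately on reconnect.
		delay := time.Duration(rand.Int63n(int64(window)))
		buckets[int(delay/(10*time.Second))]++
	}
	for i, n := range buckets {
		fmt.Printf("%2d-%2ds after reconnect: ~%d devices\n", i*10, (i+1)*10, n)
	}
}
```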