jamesmh / coravel

Near-zero config .NET library that makes advanced application features like Task Scheduling, Caching, Queuing, Event Broadcasting, and more a breeze!
https://docs.coravel.net/Installation/
MIT License

Unwanted Clock Drift #384

Status: Closed (closed by InteXX 1 month ago)

InteXX commented 1 month ago

Describe the bug: I'm seeing significant clock drift in an EveryMinute job: as much as a full second of delay in as little as seventy-two hours. I frequently encounter this entry in my website's application logs:

Coravel's scheduler is behind 1 ticks and is catching-up to the current tick

This often appears in groups of five to twelve occurrences, logged with millisecond resolution.

Affected Coravel Feature: Scheduling

Expected behaviour: I'd hoped to see the job fire reliably at the zero-second mark.

Is there something that can be done to mitigate this problem?

jamesmh commented 1 month ago

Coravel uses a Timer under the covers, which doesn't necessarily fire at exactly the right moment. How close it gets generally depends on the resources given to the process (e.g. CPU, or memory pressure that in turn affects CPU), how much load the process or system is under, etc.

This issue has existed ever since the ability to schedule seconds was introduced (something I originally didn't want to do due to complexities it introduces - like this). A few weeks ago a final fix for this issue was introduced.

So yes, there will be times, notably when scheduling to the second, when there will be drift. That's not something Coravel can control. There are actions you can take, such as keeping schedule processing on a dedicated process, container, etc. so that scheduling has dedicated resources. Or make sure that the given container/machine isn't resource-limited.

For example, this commonly occurs in Kubernetes pods that have only a tiny amount of resources allocated to them. The process just doesn't have enough resources to do all the work it needs to do in a timely manner 🤷.

The difference now (with the fix) is that Coravel will "catch up": when the Timer is triggered and there were missed intervals (usually one or a few seconds), Coravel will play back all the missed times and run the schedules that were due.
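To illustrate the catch-up idea, here is a minimal sketch in Python. It is not Coravel's actual implementation (Coravel is written in C#), and the names `ticks_to_replay` and `due_every_minute` are hypothetical. It assumes one tick per whole second and that an every-minute schedule is due when the tick's seconds value is zero.

```python
from datetime import datetime, timedelta


def ticks_to_replay(last_tick: datetime, now: datetime) -> list[datetime]:
    """Return every whole-second tick after last_tick, up to and including now."""
    current = now.replace(microsecond=0)
    tick = last_tick.replace(microsecond=0) + timedelta(seconds=1)
    ticks = []
    while tick <= current:
        ticks.append(tick)
        tick += timedelta(seconds=1)
    return ticks


def due_every_minute(tick: datetime) -> bool:
    """Assumed rule: an every-minute schedule is due at the zero-second mark."""
    return tick.second == 0


# The timer last ran at 11:59:58 and then fired late, at 12:00:01.250.
last = datetime(2024, 1, 1, 11, 59, 58)
now = datetime(2024, 1, 1, 12, 0, 1, 250000)

replayed = ticks_to_replay(last, now)            # 11:59:59, 12:00:00, 12:00:01
due = [t for t in replayed if due_every_minute(t)]
print(due)                                       # the 12:00:00 tick is still honoured
```

In this sketch the late timer callback walks through each missed second in order, so a job due at 12:00:00 still runs, just slightly after its nominal time.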

So at this point, my advice is to take a look at the resources on the process.

You could also try schedule workers for some of your heavier tasks to see if that helps.

InteXX commented 1 month ago

That makes sense, thanks for the detailed explanation.

In my case, at least, I was able to mitigate the problem by backing off of per-second resolution in my app's logic. So it's no longer an issue here, and I'll keep an eye out for it in the future.

And yes—this is an Azure App Service running on the Basic plan, so resource availability likely comes into play.