jamesmh / coravel

Near-zero config .NET library that makes advanced application features like Task Scheduling, Caching, Queuing, Event Broadcasting, and more a breeze!
https://docs.coravel.net/Installation/
MIT License

Scheduler overlaps on Daily schedule when PreventOverlapping is enabled #351

Open RValue1 opened 6 months ago

RValue1 commented 6 months ago

Description: The PreventOverlapping works fine for the minute or hourly schedule but does not seem to work for a daily schedule.

I have a long running task that runs longer than 24 hours. Yet, a new overlapping task is created even though the process is still running.

I prepare the overlap prevention the following way:

```csharp
// This is actually defined in a settings file, but added here as a local variable.
// cronSchedule = "00 03 * * *"            fails
// cronScheduleEveryMinute = "* * * * *"   works
// cronScheduleEveryHour = "00 */1 * * *"  works
string cronSchedule = "00 03 * * *";

builder.Services.UseScheduler(scheduler =>
{
    scheduler.Schedule<MyJob>()
        .Cron(cronSchedule)
        .Zoned(TimeZoneInfo.Local)
        .PreventOverlapping(nameof(MyJob));
});
```
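
For reference, the invocable being scheduled is shaped roughly like this (the 25-hour delay is just a stand-in for the real long-running work):

```csharp
using System;
using System.Threading.Tasks;
using Coravel.Invocable;

public class MyJob : IInvocable
{
    public async Task Invoke()
    {
        // Stand-in for the real work: anything that keeps running past the
        // next daily trigger (here, a bit over 25 hours).
        await Task.Delay(TimeSpan.FromHours(25));
    }
}
```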

Expected behavior

Another task should not be created since the previous one is still running.

Actual behavior

A new task is created which overlaps with the previous one.

Steps to reproduce

Start the app with the cronSchedule below, get it to trigger once, and then change the date and time on your system one day forward while the previous task is still running:

```csharp
string cronSchedule = "00 03 * * *";

builder.Services.UseScheduler(scheduler =>
{
    scheduler.Schedule<MyJob>()
        .Cron(cronSchedule)
        .Zoned(TimeZoneInfo.Local)
        .PreventOverlapping(nameof(MyJob));
});
```

Exception(s) No exception thrown

Coravel version: 5.0.2

.NET version: 6.0

jamesmh commented 5 months ago

Thanks for reaching out + details on the issue. Will get back once I confirm the issue (I have an idea of what it is).

IuryBaranowski commented 3 months ago

Hey guys, did you find a solution for this bug? I'm looking for one as well.

toutas commented 3 months ago

I believe the issue lies with this mutex:

https://github.com/jamesmh/coravel/blob/a0c7a13bff297b5af5bb32036bfa576cd25c2482/Src/Coravel/Scheduling/Schedule/Scheduler.cs#L163C1-L163C21

The lock there is acquired with a 24-hour maximum duration.
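
To make the behaviour concrete, here is a simplified stand-in for that mutex (the names and details are an approximation of the mechanism, not the library's actual code). Because the lock is always requested with a 1440-minute timeout, a lock held by a job that has been running for more than 24 hours is treated as expired and handed out again:

```csharp
using System;
using System.Collections.Generic;

// Simplified, in-memory stand-in for the scheduler's mutex (illustrative only).
public class TimeoutMutex
{
    private readonly Dictionary<string, DateTime> _locks = new();

    public bool TryGetLock(string key, int timeoutMinutes)
    {
        if (_locks.TryGetValue(key, out var acquiredAt)
            && DateTime.UtcNow < acquiredAt.AddMinutes(timeoutMinutes))
        {
            // Still held and not yet expired: overlap is prevented.
            return false;
        }

        // Either the lock is free, or it has been held longer than the timeout
        // (e.g. a daily job running past 24 hours) and is re-acquired anyway.
        _locks[key] = DateTime.UtcNow;
        return true;
    }

    public void Release(string key) => _locks.Remove(key);
}
```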

jamesmh commented 3 months ago

@toutas you're correct. The issue is related to (a) the 24-hour timeout and/or (b) the underlying storage mechanism (e.g. the .NET Core in-memory cache). I suspect there are cases where the cache entry is being evicted too soon, or not being evicted when it should be - which leads to two different issues (this one, and another one where a job that did complete is still holding the mutex).

toutas commented 3 months ago

> @toutas you're correct. The issue is related to (a) the 24-hour timeout and/or (b) the underlying storage mechanism (e.g. the .NET Core in-memory cache). I suspect there are cases where the cache entry is being evicted too soon, or not being evicted when it should be - which leads to two different issues (this one, and another one where a job that did complete is still holding the mutex).

Looking through the code, it seems like the 24-hour timeout would be the sole culprit for this case, as it ensures any locks held for more than 1440 minutes are freed. I'm not really knowledgeable about the dynamics of the solution, but maybe the timeout could have a per-scheduledEvent duration defaulting to 1440 minutes - one that could be set alongside the scheduledEvent.

jamesmh commented 3 months ago

@toutas Here's the way things should work (a rough sketch of this flow follows the list below):

Happy Path

  1. A job runs and grabs the lock if it's available (if not, another run of the same job is still in progress, so this one is skipped to prevent overlap).
  2. Once that job is completed, the lock is released (inside a catch/finally).
  3. Or, if the job throws an exception, the lock is also released.
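
A rough, self-contained sketch of that flow, with the mutex reduced to two delegates (all names here are illustrative, not Coravel's internals):

```csharp
using System;
using System.Threading.Tasks;

public static class OverlapFlowSketch
{
    public static async Task RunWithOverlapPrevention(
        Func<bool> tryGetLock, Action releaseLock, Func<Task> job)
    {
        // 1. Grab the lock if it's available; otherwise another run of the
        //    same job is still in progress and this run is skipped.
        if (!tryGetLock())
        {
            return;
        }

        try
        {
            // 2. Run the invocable.
            await job();
        }
        finally
        {
            // 2/3. Release the lock whether the job completed normally or
            //      threw a (catchable) exception.
            releaseLock();
        }
    }
}
```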

Not So Happy Path

There are cases where certain types of .NET exceptions are not caught by catch blocks, and finally blocks don't get to run either. It sounds crazy - but it is what it is 🙂.

For example, a StackOverflowException would not be caught, the lock would not be released, yet the application might continue since this is performed on a separate thread from the main thread. (I would need to test this to be 100% sure but that's the logic - maybe I should make a demo of this scenario...).
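
A minimal illustration of that failure mode (the recursion here is made up purely to trigger the overflow):

```csharp
using System;

public static class UncatchableExceptionDemo
{
    // The "+ 1" after the recursive call prevents tail-call optimization,
    // so the stack genuinely grows until it overflows.
    private static int Recurse(int depth) => Recurse(depth + 1) + 1;

    public static void Run()
    {
        try
        {
            Recurse(0);
        }
        catch (Exception)
        {
            // Never reached: user code does not get to catch a StackOverflowException.
        }
        finally
        {
            // Not reached either, so a mutex release placed here never runs
            // and the lock stays held.
        }
    }
}
```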

The 24-hour timeout helps in this situation: a bad run of the job won't leave the lock stuck forever.

Also, there's a scenario where a job that is running in a poor state and has been running "too long" can "hog" the lock. Over an extended period (days), you might get a job that just hogs the lock and no further runs ever happen... So this is a trade-off -> should that job just hold the lock forever? Or should there be some play in this?

So there isn't really a black-and-white answer or solution here.

The same applies if/when Coravel supports distributed locks: what if the application crashes before it can release a lock? The same problem comes up...

Potential Solutions

Again, there is no very simple or straightforward solution to the problem.

The quickest thing is to extend the 24-hour timeout to something longer, but that just pushes the issue - or to remove the timeout altogether.

I'd favor removing the timeout - which means invocables that never complete or for some reason hog a mutex just never let it go... and that's a user error which needs to be fixed?

But then - the issue will re-appear for distributed locks - something to consider if we want to keep the interface of the mutex the same but have the underlying storage (e.g. database, redis, etc.) do what it needs to do.

Would appreciate any thoughts or input! 🤷‍♂️

jamesmh commented 3 months ago

There's also the distinction between a normal job and a "long running job". Maybe there's an opportunity to mark an invocable as a "long running job" where, under the covers, we don't time out the mutex for just those jobs... Sounds close to the "configure mutex timeout per invocable" solution - which is my favorite solution at the moment 🤷‍♂️
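
Purely as a hypothetical shape for that idea (neither the extra PreventOverlapping parameter nor ForLongRunningJob exists in Coravel today; they and OtherLongJob are invented here to make the proposal concrete):

```csharp
var cronSchedule = "00 03 * * *";

builder.Services.UseScheduler(scheduler =>
{
    // Hypothetical: per-invocable mutex timeout, defaulting to the current 1440 minutes.
    scheduler.Schedule<MyJob>()
        .Cron(cronSchedule)
        .Zoned(TimeZoneInfo.Local)
        .PreventOverlapping(nameof(MyJob), mutexTimeoutMinutes: 2880);

    // Hypothetical: mark an invocable as long-running so its mutex never times out.
    scheduler.Schedule<OtherLongJob>()
        .Cron(cronSchedule)
        .Zoned(TimeZoneInfo.Local)
        .PreventOverlapping(nameof(OtherLongJob))
        .ForLongRunningJob();
});
```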

toutas commented 3 months ago

> There's also the distinction between a normal job and a "long running job". Maybe there's an opportunity to mark an invocable as a "long running job" where, under the covers, we don't time out the mutex for just those jobs... Sounds close to the "configure mutex timeout per invocable" solution - which is my favorite solution at the moment 🤷‍♂️

Personally, I think a breaking change of some sort is preferable here, as there is no way to know whether an invocable can be long-running unless the user puts a maximum expected time on it. And not requiring it to be set on every invocable will inevitably lead to people discovering this the hard way.

As for the issues with early .NET Core cache evictions, uncaught exceptions and so on: if you are able to reproduce them or get something more tangible (since you aren't certain of the behaviour), I would like to look into it.