hangfire-postgres / Hangfire.PostgreSql

PostgreSql Storage Provider for Hangfire
Other
356 stars 132 forks source link

Distributed Lock is causing issues #329

Open phillijw opened 11 months ago

phillijw commented 11 months ago

The default value of the lock timeout is 10min but if a job takes longer than 10min, it will spawn a new cron job, assuming that the lock timed out. So we set our lock to be more like 6 hours instead. Of course, when the application shuts down, this lock doesn't get cleared which means that job can't run until that lock gets cleared when it expires.

This seems like a bug. Can anyone provide more info on how this works vs how the mssql version works and how to handle it better from my point of view?

dmitry-vychikov commented 11 months ago

Hi @phillijw

I have just tested with a small app on my side. Recurring jobs work fine, and no locks get stuck.

Can anyone provide more info on how this works

If you are asking about Recurring jobs, AFAIR, distributed lock is used to ensure only one worker executes the cron scheduling part. When it is time to trigger a recurring job, it takes the lock, and creates a normal (non-recurring) job in the queue which will do the actual calculation, then releases the lock immediately.

In other words, job execution time and lock timeout are NOT related. This should be working normally without you having to set 6 hours time outs.

phillijw commented 11 months ago

I think the lock is actually coming from the DisableConcurrentExecution attribute. I'm still trying to repro it properly and I think that's the trick. You have to add that attribute to the method to attempt to avoid overlapping jobs.

phillijw commented 11 months ago

Ok, yes, so the main issue I'm seeing is that when I set DisableConcurrentExecution and I stop the process in the middle of a job, that lock doesn't get removed during shutdown. It stays there even after the application restarts and then it won't run the job again until that times out.

dmitry-vychikov commented 11 months ago

@phillijw

Ok, yes, so the main issue I'm seeing is that when I set DisableConcurrentExecution and I stop the process in the middle of a job, that lock doesn't get removed during shutdown. It stays there even after the application restarts and then it won't run the job again until that times out.

It makes sense now.

In terms of distributed locking, hangfire sql and postgres are different. SqlServer uses application locks which are cleared automatically when connection closes. In comparison, Postgres uses a simple table to store locks, which cleans up only inside of code. Postgres is vulnerable to cases when job is interrupted ungracefully.

There were discussions in this repo about improving lock mechanism, but as I understand it got stuck with a major issue which was not solved so far.

Anyway, for your scenario consider the following recommendations:

VmPeter commented 6 months ago

Hi all, just wanted to respond to this!

In comparison, Postgres uses a simple table to store locks, which cleans up only inside of code. Postgres is vulnerable to cases when job is interrupted ungracefully

We're using Hangfire but with MSSQL. We specifically looked into this, because we wanted to make sure app crashes did not leave any locks hanging. MSSQL uses sp_getapplock, which is tied to the session or transaction, so even killing (kill -9, we did not test power-outage, but I hope it suffices) should release lock. https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-getapplock-transact-sql?view=sql-server-ver16

I'm NOT a postgres guy, so this is just me suggesting something (because I want to help).

For postgres I see there's something called pg_try_advisory_lock and pg_advisory_unlock which should work on the session if I read https://www.postgresql.org/docs/9.1/functions-admin.html correctly.

There's also one tied to transaction (see link).

We have also been using DistributedLock nuget from madelson, but the SQL version. There's a postgres version which uses these advisory locks: https://github.com/madelson/DistributedLock/blob/master/docs/DistributedLock.Postgres.md so maybe their approach could be a guiding hand to implementing it the same as the MSSQL locks?

Hope this can help. Take care.