hangfire-postgres / Hangfire.PostgreSql

PostgreSql Storage Provider for Hangfire

Could not place a lock on the resource: Lock timeout #119

Closed urossavelj closed 1 year ago

urossavelj commented 5 years ago

It seems that you already fixed this issue in version 1.4.0, but we're still getting a lot of errors like this every day. We're not sure what is causing it; could it be connected to the Npgsql package, or maybe something else?

Hangfire.PostgreSql.PostgreSqlDistributedLockException: Could not place a lock on the resource 'HangFire:recurring-jobs:lock': Lock timeout.
   at Hangfire.PostgreSql.PostgreSqlDistributedLock.PostgreSqlDistributedLock_Init_Transaction(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options)
   at Hangfire.PostgreSql.PostgreSqlDistributedLock..ctor(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options)
   at Hangfire.PostgreSql.PostgreSqlConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock(JobStorage storage, Func`2 action)
   at Hangfire.Server.RecurringJobScheduler.EnqueueNextRecurringJobs(BackgroundProcessContext context)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

We have the latest version of the Hangfire.PostgreSql package (1.6.0) and Npgsql.EntityFrameworkCore.PostgreSQL version 2.1.2.

Is there any known compatibility problem between those two?

If you need any more specific information I'll try to add it.

jhonnyelhelou91 commented 5 years ago

Hello,

To avoid pushing the same recurring job more than once, a lock is created with a timeout. When another lock with the same resource name already exists, acquisition waits until the timeout expires and you receive this error. Please check the source code for more information: RecurringJobScheduler.cs

I think it is because PostgreSqlDistributedLockException should extend DistributedLockTimeoutException so that lock timeout exceptions are handled properly.

@urossavelj Locks usually remain in the database when the application is not gracefully shut down. If you are debugging locally, make sure to close the application from the CMD window if it is a console application, or the browser if your server is a web application.
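If you need to clear such a stale lock by hand, a query along these lines should do it (a sketch assuming the default hangfire schema and the lock table layout used by this provider; verify the schema and table names against your own database first):

```sql
-- Inspect currently held distributed locks (default schema "hangfire";
-- "lock" is a reserved word, hence the quoting)
SELECT resource, acquired FROM hangfire."lock";

-- Remove a stale lock left behind by a non-graceful shutdown
DELETE FROM hangfire."lock" WHERE resource = 'HangFire:recurring-jobs:lock';
```

Only delete rows you are sure belong to a server that is no longer running; deleting a live lock can let two schedulers run at once.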

urossavelj commented 5 years ago

So this is expected behaviour and nothing to worry about?

We're also having issue #111; will that be fixed soon?

a-beliaev-altatec commented 4 years ago

Encountering the same issue, please help

a-beliaev-altatec commented 4 years ago

Emptying the lock table from hangfire database kinda "fixes" the issue.

xklonx commented 4 years ago

The problem is that 'HangFire:recurring-jobs:lock' blocks the addition of jobs to the queue. It's not about any particular job; it's about the RecurringJobScheduler itself. Here is a LINQPad example: after a couple of minutes I have this lock, and jobs are not queued anymore until the default lock timeout of 10 minutes has passed. hangfire test.zip

It looks like, in my case, the lock is not released because of a timeout, and I don't know what to do about that.

2020-03-28 02:05:10.5794 | Hangfire.Server.RecurringJobScheduler | 35 | TRACE | Recurring job 'CreateSites' is being updated. RecurringJob: (Queue:default;Cron:* * * * *;TimeZoneId:UTC;Job:{"Type":"***","Method":"CreateSites","ParameterTypes":"[]","Arguments":"[]"};CreatedAt:2019-09-13T18:50:38.6914661Z;V:2;Running:yes;LastJobId:833806;LastExecution:2020-03-27T23:04:09.8281272Z;NextExecution:2020-03-27T23:05:00.0000000Z), Changes: (LastExecution:2020-03-27T23:05:10.4896455Z;NextExecution:2020-03-27T23:06:00.0000000Z;LastJobId:833810), NextExecution: (27.03.2020 23:06:00) 
2020-03-28 02:05:10.6947 | Hangfire.Server.RecurringJobScheduler | 35 | WARN | Recurring job 'CreateSites' can not be scheduled due to an exception. 
System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.TimeoutException: Transaction Timeout
   --- End of inner exception stack trace ---
   at System.Transactions.TransactionStateAborted.BeginCommit(InternalTransaction tx, Boolean asyncCommit, AsyncCallback asyncCallback, Object asyncState)
   at System.Transactions.CommittableTransaction.Commit()
   at System.Transactions.TransactionScope.InternalDispose()
   at System.Transactions.TransactionScope.Dispose()
   at Hangfire.PostgreSql.PostgreSqlWriteOnlyTransaction.Commit()
   at Hangfire.Server.RecurringJobScheduler.EnqueueBackgroundJob(BackgroundProcessContext context, IStorageConnection connection, String recurringJobId, DateTime now)
   at Hangfire.Server.RecurringJobScheduler.TryEnqueueBackgroundJob(BackgroundProcessContext context, IStorageConnection connection, String recurringJobId, DateTime now)

xklonx commented 4 years ago

@davidroth Can you tell why TransactionScope usage was hardcoded into PostgreSqlWriteOnlyTransaction, even though the Entity Framework docs (as I understand them) advise using native transactions? https://docs.microsoft.com/en-us/ef/ef6/saving/transactions says: "It will throw exceptions if you issue any DDL and have not enabled distributed transactions through the MSDTC Service. We recommend using the approach outlined in the previous sections instead where possible." And is it OK that the default timeout used there, TransactionSynchronisationTimeout, is so small (500 ms)? Is there no way now to avoid this when I don't want to use it? The main reason is that I want to avoid MSDTC usage. Maybe it would be better, by default, when EnableTransactionScopeEnlistment is false, to use the old (1.6.3) native Npgsql transaction logic?

davidroth commented 4 years ago

@xklonx If EnableTransactionScopeEnlistment is set to false (which is the default for backwards compat reasons), the transaction scope is created using TransactionScopeOption.RequiresNew.

This means that a new transaction is used and therefore there is no enlisting in an existing ambient transaction (if any).

https://docs.microsoft.com/en-us/ef/ef6/saving/transactions It will throw exceptions if you issue any DDL and have not enabled distributed transactions through the MSDTC Service.

I don't know what the implementation of PostgreSqlWriteOnlyTransaction has to do with EF6.

And is it OK that the default timeout used there, TransactionSynchronisationTimeout, is so small (500 ms)? Is there no way now to avoid this when I don't want to use it? The main reason is that I want to avoid MSDTC usage.

The timeout is configurable via TransactionSynchronisationTimeout. The timeout was already there before the TransactionScope was added. See https://github.com/frankhommers/Hangfire.PostgreSql/blob/master/src/Hangfire.PostgreSql/PostgreSqlStorageOptions.cs#L38

I did not come up with the 500ms.
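For anyone landing here, the option davidroth mentions can be set when configuring storage. A sketch (the connection string is a placeholder, and the exact option spelling should be checked against your Hangfire.PostgreSql version):

```csharp
using System;
using Hangfire;
using Hangfire.PostgreSql;

// Sketch: raising the 500 ms default so transaction commits are not cut short.
GlobalConfiguration.Configuration.UsePostgreSqlStorage(
    "Host=localhost;Database=hangfire;Username=app;Password=secret", // hypothetical connection string
    new PostgreSqlStorageOptions
    {
        EnableTransactionScopeEnlistment = false, // the default; shown for clarity
        TransactionSynchronisationTimeout = TimeSpan.FromMinutes(1)
    });
```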

Main reason is i want to avoid MSDTC usage.

Avoiding MSDTC is great. Yet I don't know what your issue is with the current implementation or how it affects your app. Please show me how the current implementation of PostgreSqlWriteOnlyTransaction triggers MSDTC escalation in your application.

May be its better, by default, when EnableTransactionScopeEnlistment is false to use old (1.6.3) native npgsql transaction logic?

IMO no because it would add additional code complexity without any benefit.

xklonx commented 4 years ago

This means that a new System.Transactions transaction (not a Postgres transaction) is used even if I don't need it, and it uses the new 500 ms timeout, which gives me timeout errors (https://github.com/frankhommers/Hangfire.PostgreSql/issues/119#issuecomment-605337122) when RecurringJobScheduler tries to update a job. Because of that, the HangFire:recurring-jobs:lock lock remains and doesn't give RecurringJobScheduler a chance to schedule new recurring jobs (example: https://github.com/frankhommers/Hangfire.PostgreSql/issues/119#issuecomment-605337122). Before that change, TransactionSynchronisationTimeout was used only in another class:

if (executionTimer.Elapsed > _options.TransactionSynchronisationTimeout)
    throw new TimeoutException("SetRangeInHash experienced timeout while trying to execute transaction");

which was executed only after the first try, no matter how long it took. Now the new system transaction always exists after using (var transaction = new TransactionScope(scopeOption, transactionOptions)); you can check it in LINQPad:

var transactionOptions = new TransactionOptions()
{
    IsolationLevel = System.Transactions.IsolationLevel.RepeatableRead,
    Timeout = TimeSpan.FromMilliseconds(500)
};
using (var transaction = new TransactionScope(TransactionScopeOption.RequiresNew, transactionOptions))
{
    Transaction.Current.Dump();
}

The connection is always enlisted by the call _connection.EnlistTransaction(Transaction.Current); here is the code of EnlistTransaction:

public override void EnlistTransaction(Transaction transaction)
{
    if (EnlistedTransaction != null)
    {
        if (EnlistedTransaction.Equals(transaction))
            return;
        try
        {
            if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
                throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})");
        }
        catch (ObjectDisposedException)
        {
            // The MSDTC 2nd phase is asynchronous, so we may end up checking the TransactionInformation on
            // a disposed transaction. To be extra safe we catch that, and understand that the transaction
            // has ended - no problem for reenlisting.
        }
    }

    var connector = CheckReadyAndGetConnector();

    EnlistedTransaction = transaction;
    if (transaction == null)
        return;

    // Until #1378 is implemented, we have no recovery, and so no need to enlist as a durable resource manager
    // (or as promotable single phase).

    // Note that even when #1378 is implemented in some way, we should check for mono and go volatile in any case -
    // distributed transactions aren't supported.

    transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None);
    Log.Debug($"Enlisted volatile resource manager (localid={transaction.TransactionInformation.LocalIdentifier})", connector.Id);
}

PostgreSqlWriteOnlyTransaction has nothing to do with EF, but by default it should be based on Npgsql transactions, like it was before; now it is not. You can check https://stackoverflow.com/questions/1690892/transactionscope-automatically-escalating-to-msdtc-on-some-machines for examples of when TransactionScope escalates to DTC; a simple way to trigger it is just to move your database to another server. I don't understand why everyone who doesn't need TransactionScope must deal with this. It would be a real benefit for users not to have to deal with those timeout exceptions and DTC escalations. @frankhommers can you tell what you think about that?

davidroth commented 4 years ago

@xklonx

This means that a new System.Transactions transaction (not a Postgres transaction) is used even if I don't need it, and it uses the new 500 ms timeout, which gives me timeout errors (#119 (comment)) when RecurringJobScheduler tries to update a job. Because of that, the HangFire:recurring-jobs:lock lock remains and doesn't give RecurringJobScheduler a chance to schedule new recurring jobs (example: #119 (comment)). Before that change, TransactionSynchronisationTimeout was used only in another class.

Well, then we have a timeout issue. This has nothing to do with MSDTC. We could increase the timeout here so that it matches the previous timeout (which was the default NpgsqlConnection timeout).

PostgreSqlWriteOnlyTransaction has nothing to do with EF, but by default it should be based on Npgsql transactions, like it was before; now it is not.

TransactionScope is only a facade. During enlistment, an NpgsqlTransaction is used underneath.

You can check here https://stackoverflow.com/questions/1690892/transactionscope-automatically-escalating-to-msdtc-on-some-machines for example, when transactionscope escalates dtc usage.

I still don't understand why you are talking about MSDTC. The post behind the link is all about SqlConnection/SQL Server/Windows. You can hit MSDTC escalations when using SQL Server on Windows and spanning a transaction scope over multiple active connections. Since this issue is about the Postgres adapter, MSDTC is not a factor here. Sure, distributed transaction escalation exists in this world too, but here it is called "prepared transactions" (https://www.postgresql.org/docs/8.4/sql-prepare-transaction.html).

You still have not demonstrated that a prepared transaction escalation occurs with the current code.

It can be real benefit for users to not deal with those timeout exceptions and dtc escalations.

If the timeout is a problem we could increase or remove it. But this has nothing to do with MSDTC.

xklonx commented 4 years ago

Now I see that VolatileResourceManager uses _localTx = connection.BeginTransaction(ConvertIsolationLevel(_transaction.IsolationLevel)); and Hangfire.SqlServer uses TransactionScope too. As Hangfire does all its work on separate threads, it would be hard to intervene and escalate to a distributed transaction. But I still don't understand the decision to use TransactionScope. The timeout should definitely be increased to the one used in Hangfire.SqlServer: TransactionTimeout = TimeSpan.FromMinutes(1);

davidroth commented 4 years ago

Yep the timeout can be increased. That should help with your issue.

davidroth commented 4 years ago

MR created: https://github.com/frankhommers/Hangfire.PostgreSql/pull/147

vytautask commented 4 years ago

I've merged the PR. It's included in the 1.6.4.2 version. Does it help? Can we close the issue?

xklonx commented 4 years ago

For me, yes. It would be interesting to know whether it helped anyone else.

davidroth commented 4 years ago

Please note that this issue was created long before TransactionScope support was added, so Hangfire.PostgreSql.PostgreSqlDistributedLockException isn't something that was introduced with TransactionScope support. It was just more likely to hit the exception with the new code because of the 500 ms timeout (until MR #147).

jayasurya-jeyakodi commented 4 years ago

I am using 1.6.4.2 and I face the same issue. Is there any way to remove the lock? It is preventing me from queuing new jobs.

xklonx commented 4 years ago

You can remove it from the lock table. Check your logs for the error that prevented the lock from being released, because it will surely come back.

dpisarevskiy commented 4 years ago

Hello! How do I remove all locks on startup? The server can't start while the lock is held: Hangfire.PostgreSql.PostgreSqlDistributedLockException: 'Could not place a lock on the resource 'HangFire:lock:recurring-job:Notifications': Lock timeout.'

frankhommers commented 3 years ago

It's answered above your question:

Hello! How do I remove all locks on startup? The server can't start while the lock is held: Hangfire.PostgreSql.PostgreSqlDistributedLockException: 'Could not place a lock on the resource 'HangFire:lock:recurring-job:Notifications': Lock timeout.'

This:

You can remove it from the lock table. Check your logs for the error that prevented the lock from being released, because it will surely come back.

frankhommers commented 3 years ago

It seems fixed now.

jenswachtel commented 3 years ago

We have the latest versions of Hangfire (1.7.19) and Hangfire.PostgreSql (1.8.1) and still see the same issues. Does anybody have an idea how we can fix this? It's odd; it just suddenly started for us on several independent systems.

natalie-o-perret commented 3 years ago

We have the latest versions of Hangfire (1.7.19) and Hangfire.PostgreSql (1.8.1) and still see the same issues. Does anybody have an idea how we can fix this? It's odd; it just suddenly started for us on several independent systems.

Same, not fixed on our side.

hadevnet commented 3 years ago

Downgrade Hangfire to version 1.7.12 and Hangfire.PostgreSql to version 1.7.1. It works with .NET 5!

JarrodJ83 commented 3 years ago

I am running Hangfire.PostgreSql v1.8.2 and Hangfire.Core v1.7.22, and this is an issue when using TransactionScope. @frankhommers would you mind re-opening this issue? This is a pretty big problem if we cannot enqueue work into Hangfire and update application state within a transaction.

alexrosenfeld10 commented 3 years ago

@frankhommers I'm seeing this too, we should reopen this issue.

cl-msmolcic commented 3 years ago

Same issue here.

akshaybheda commented 3 years ago

Same issue

alexrosenfeld10 commented 3 years ago

I get this on the latest release:

Hangfire.PostgreSql.PostgreSqlDistributedLock - HangFire:recurring-jobs:lock: Failed to lock with transaction 40001: could not serialize access due to concurrent update

alexrosenfeld10 commented 3 years ago

Related to the recent changes / releases? @vytautask

vytautask commented 3 years ago

Related to the recent changes / releases? @vytautask

I do not think so. At least not my changes :)

Let's continue discussion in #191 as I have high hopes this could be solved via already proposed mechanism (only PR is missing 🙄 ).

Frazi1 commented 3 years ago

Related to the recent changes / releases? @vytautask

I do not think so. At least not my changes :)

Let's continue discussion in #191 as I have high hopes this could be solved via already proposed mechanism (only PR is missing 🙄 ).

@vytautask PostgreSqlWriteOnlyTransaction and PostgreSqlDistributedLock set the isolation level separately.

I'm going to fix PostgreSqlWriteOnlyTransaction as a part of #191, however, my change will not affect #119.

It is better to do #119 as a separate PR. It may require that we also adjust the sql query.

vytautask commented 3 years ago

Guys, is this still an issue? I saw that @frankhommers made some changes to how conflicts are handled (https://github.com/frankhommers/Hangfire.PostgreSql/commit/07362ea2c47e358e747772fd119712d077f0d6a6); maybe that fixed it?

Changes should be in the 1.5.8.2 version.

JarrodJ83 commented 3 years ago

I just updated to v1.5.8.3 and this is still an issue for me.

romerobrjp commented 3 years ago

I just updated to v1.5.8.4 and this is still an issue for me.

uberubert commented 3 years ago

Met this issue and updated to v1.5.8.4, and the issue persisted. Luckily this occurred in my localhost test environment. A quick restart of the Postgres docker container and the issue resolved itself.

KhorAMus commented 3 years ago

I get this on the latest release:

Hangfire.PostgreSql.PostgreSqlDistributedLock - HangFire:recurring-jobs:lock: Failed to lock with transaction 40001: could not serialize access due to concurrent update 

I have the same error when I use two servers in one schema in one process. If I close my web app and quickly start it again, the error occurs almost immediately on start (I add some recurring jobs on start). On version 1.4.0 I receive errors in this situation too, but different ones.

When I use SqlServer I have no errors

peteschmitz commented 2 years ago

We started encountering a very similar exception in our production environment over the weekend. There's no immediate rhyme or reason, since our last deployment was over a week ago and our experiences with this PostgreSql plugin have been awesome/stable for several months.

Our stack trace is slightly different from the OP's, referencing state-lock in the exception message (specifically, this line is throwing):

Could not place a lock on the resource 'HangFire:job:3856:state-lock': Lock timeout.
   at Hangfire.PostgreSql.PostgreSqlDistributedLock.PostgreSqlDistributedLock_Init_Transaction(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options)
   at Hangfire.PostgreSql.PostgreSqlDistributedLock..ctor(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options)
   at Hangfire.PostgreSql.PostgreSqlConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Storage.StorageConnectionExtensions.AcquireDistributedJobLock(IStorageConnection connection, String jobId, TimeSpan timeout)
   at Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context)
   at Hangfire.Server.Worker.TryChangeState(BackgroundProcessContext context, IStorageConnection connection, IFetchedJob fetchedJob, IState state, String[] expectedStates, CancellationToken initializeToken, CancellationToken abortToken)

Packages

    <PackageReference Include="Hangfire.AspNetCore" Version="1.7.19" />
    <PackageReference Include="Hangfire.Core" Version="1.7.19" />
    <PackageReference Include="Hangfire.PostgreSql" Version="1.7.1" />

Our services-configurations:

        services.AddHangfire(configuration => configuration
                .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
                .UseSimpleAssemblyNameTypeSerializer()
                .UseRecommendedSerializerSettings()
                .UsePostgreSqlStorage(postgresConnectionString, new PostgreSqlStorageOptions
                {
                    SchemaName = postgresSchemaString,
                    UseNativeDatabaseTransactions = true,
                    InvisibilityTimeout = TimeSpan.FromMinutes(5),
                    QueuePollInterval = TimeSpan.FromMinutes(1),
                    PrepareSchemaIfNecessary = true
                }));

            services.AddHangfireServer(options =>
            {
                options.SchedulePollingInterval = TimeSpan.FromMinutes(1);
                options.HeartbeatInterval = TimeSpan.FromMinutes(1);
                options.ServerCheckInterval = TimeSpan.FromMinutes(3);
            });

Information from our state table was attached as a screenshot (not preserved here).

Are there advisable changes to our configurations that we can adjust to overcome the noise being generated?

We have some constraints with transient package-dependencies, so we're attempting to discover a solution which doesn't involve package updates to Hangfire.PostgreSql.

(Edit: Similar discussion in https://github.com/HangfireIO/Hangfire/issues/1799 )

(Edit 2: Potentially corrected in #188 - release 1.8.4?)
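One configuration knob worth noting for cases like the above: the interval after which this provider treats a held lock as stale is itself an option. A sketch (assuming your Hangfire.PostgreSql version exposes DistributedLockTimeout, which I believe defaults to 10 minutes; verify against your options class):

```csharp
using System;
using Hangfire.PostgreSql;

// Sketch: shorten how long an orphaned lock row can block other servers.
// DistributedLockTimeout is assumed to be available in this options class.
var storageOptions = new PostgreSqlStorageOptions
{
    SchemaName = "hangfire",
    DistributedLockTimeout = TimeSpan.FromMinutes(2) // stale locks expire sooner than the 10-minute default
};
```

Setting it too low risks two servers acquiring the same lock while one is still legitimately working, so treat the value as a trade-off rather than a fix.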

frankhommers commented 2 years ago

It is normal that acquiring a distributed lock isn't always successful.

We could swallow the exception based on the state and the constraint name. What do you think, @vytautask and @peteschmitz? Would that solve the problem?

dmitry-vychikov commented 2 years ago

@frankhommers please check out my suggestion here: https://github.com/frankhommers/Hangfire.PostgreSql/pull/188#issuecomment-918763332

peteschmitz commented 2 years ago

Thank you for the quick response, @frankhommers. After further review on my end, I don't believe changing the logging verbosity level is necessary.

My team's issue may be a result of using an older version of Hangfire.PostgreSql (1.7.0) and, in my opinion, this issue can likely be re-closed.


Some additional information, which I should have reviewed and included in my initial post:

Best, Pete

vytautask commented 2 years ago

If there are some problems still, let's re-open this.

hhyyrylainen commented 2 years ago

I think I just ran into this problem as well in my local dev setup while using dotnet watch run, which causes my code to be recompiled and restarted a lot. Still, it took almost an entire day for this issue to occur, and I haven't seen it before, so it seems very rare. I'm not using the latest version, as I haven't updated to .NET 6 yet.

I had to wait out the roughly 10-minute lock timeout before I could restart my development server process successfully, since I have some recurring jobs that are created on startup, and it failed to start with a PostgreSqlDistributedLockException.

azygis commented 2 years ago

@hhyyrylainen there's the latest version for Npgsql5 as well now: Hangfire.PostgreSql.Npgsql5. Please try that one.

Regarding startup exceptions, probably the only quick fix is deleting the locks manually in the database.

hhyyrylainen commented 2 years ago

I'll update all my NuGet packages to the latest versions once I update to .NET 6. There's potentially one unit-testing problem with Blazor in that, though, which I'm keeping an eye on (and I have some priority features to implement before I undertake major dependency version upgrades). I linked that potential problem in the issue I opened for myself to track updating to .NET 6: https://github.com/Revolutionary-Games/ThriveDevCenter/issues/131

If I start getting this regularly, I'll probably add a catch block around the recurring job starting method I have so that I can at least get the server to start.
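That catch-block idea, sketched out (PostgreSqlDistributedLockException is the exception type from the stack traces above; RegisterRecurringJobs and logger are hypothetical names standing in for your own startup code):

```csharp
using Hangfire.PostgreSql;
using Microsoft.Extensions.Logging;

// Sketch: let the server start even when a stale distributed lock is still held.
try
{
    RegisterRecurringJobs(); // hypothetical: calls RecurringJob.AddOrUpdate(...) for each job
}
catch (PostgreSqlDistributedLockException ex)
{
    // Likely a stale lock from a previous non-graceful shutdown; it expires after
    // the configured distributed lock timeout (10 minutes by default), so log and continue.
    logger.LogWarning(ex, "Recurring job registration skipped; a distributed lock is still held.");
}
```

The jobs would then only be registered on the next clean start, so this trades a startup crash for a delay in picking up recurring-job changes.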

azygis commented 2 years ago

Just to clarify: this package does not target .NET 6, so upgrading it does not have much to do with that. There have been no breaking changes since Hangfire.PostgreSql 1.8.6. In any case, if you wish to proceed with catching the exception, it's entirely up to you.

hhyyrylainen commented 2 years ago

I currently have Hangfire.PostgreSql 1.8.6. I tried updating it before, but I think I ran into a compatibility problem: I can't update past Hangfire.AspNetCore 1.7.26, so I went with the latest Hangfire.PostgreSql version released around the time of that Hangfire.AspNetCore version.

azygis commented 2 years ago

You might have hit the compatibility problem regarding Npgsql5, not Hangfire.AspNetCore.

Since the new Hangfire.PostgreSql 1.9+ now depends on Npgsql 6, you will get runtime exceptions if you try to use it while still on Npgsql 5. That is the reason a "new" package is now being pushed to NuGet, Hangfire.PostgreSql.Npgsql5 (the only difference is a dependency on Npgsql 5 instead of 6).

I see you are using Npgsql.EntityFrameworkCore.PostgreSQL@5.0.10, which itself drags in a dependency on Npgsql 5. If you tried to update only Hangfire.PostgreSql, it wouldn't work. And unfortunately there will never be support for Npgsql 6 with EF Core 5.

Suggestion: switch the Hangfire.PostgreSql dependency to Hangfire.PostgreSql.Npgsql5 for now (at the latest version), and when the time comes for .NET 6, go back to the package without the Npgsql5 suffix.

Another suggestion, as you mentioned: try/catching, but it's really unnecessary tech debt (even if temporary, still a bit of debt).