umbraco / Umbraco-CMS

Umbraco is a free and open source .NET content management system helping you deliver delightful digital experiences.
https://umbraco.com
MIT License

Background job causes "Failed to acquire write lock for id: -333." #15445

Closed · enkelmedia closed this 9 months ago

enkelmedia commented 11 months ago

Which Umbraco version are you using? (Please write the exact version, example: 10.1.0)

12.3.5

Bug summary

We have been working on a migration project from v7 to v12 (around 20k content items and 25k media items) and were almost done when we started to see errors like this when saving content, document types and other "stuff":

Umbraco.Cms.Core.DistributedLocking.Exceptions.DistributedWriteLockTimeoutException, Umbraco.Core, Version=12.3.5.0, Culture=neutral, PublicKeyToken=null: Failed to acquire write lock for id: -333.

at Umbraco.Cms.Persistence.SqlServer.Services.SqlServerDistributedLockingMechanism.SqlServerDistributedLock..ctor(SqlServerDistributedLockingMechanism parent, Int32 lockId, DistributedLockType lockType, TimeSpan timeout)
   at Umbraco.Cms.Persistence.SqlServer.Services.SqlServerDistributedLockingMechanism.WriteLock(Int32 lockId, Nullable`1 obtainLockTimeout)
   at Umbraco.Cms.Core.Scoping.LockingMechanism.ObtainWriteLock(Int32 lockId, Nullable`1 timeout)
   at Umbraco.Cms.Core.Scoping.LockingMechanism.LockInner(Guid instanceId, Dictionary`2& locks, HashSet`1& locksSet, Action`2 obtainLock, Nullable`1 timeout, Int32 lockId)
   at Umbraco.Cms.Core.Scoping.LockingMechanism.EagerWriteLockInner(Guid instanceId, Nullable`1 timeout, Int32[] lockIds)
   at Umbraco.Cms.Core.Scoping.LockingMechanism.EnsureLocks(Guid scopeInstanceId)
   at Umbraco.Cms.Infrastructure.Scoping.Scope.get_Database()
   at Umbraco.Cms.Infrastructure.Persistence.Repositories.Implement.ContentTypeRepositoryBase`1.PersistUpdatedBaseContentType(IContentTypeComposition entity)
   at Umbraco.Cms.Infrastructure.Persistence.Repositories.Implement.ContentTypeRepository.PersistUpdatedItem(IContentType entity)
   at Umbraco.Cms.Core.Cache.FullDataSetRepositoryCachePolicy`2.Update(TEntity entity, Action`1 persistUpdated)
   at Umbraco.Cms.Infrastructure.Persistence.Repositories.Implement.EntityRepositoryBase`2.Save(TEntity entity)
   at Umbraco.Cms.Core.Services.ContentTypeServiceBase`2.Save(TItem item, Int32 userId)
   at Umbraco.Cms.Web.BackOffice.Controllers.ContentTypeController.<PostSave>b__34_1(IContentType type)
   at Umbraco.Cms.Web.BackOffice.Controllers.ContentTypeControllerBase`1.PerformPostSave[TContentTypeDisplay,TContentTypeSave,TPropertyType](TContentTypeSave contentTypeSave, Func`2 getContentType, Action`1 saveContentType, Action`1 beforeCreateNew)
   at Umbraco.Cms.Web.BackOffice.Controllers.ContentTypeController.PostSave(DocumentTypeSave contentTypeSave)
   at lambda_method951(Closure, Object, Object[])
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.SyncObjectResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeActionMethodAsync()
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeNextActionFilterAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeInnerFilterAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeNextExceptionFilterAsync>g__Awaited|26_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)

I first thought it was an issue with some kind of notification handler, so I disabled all of them. No luck. I've noticed that the error takes about 20-30 seconds to "appear".

While profiling SQL Server, some time after the site has started, we see a bunch of queries like this:

exec sp_executesql N'SELECT [umbracoNode].[id] AS [Id], [umbracoNode].[uniqueId] AS [Key], [umbracoNode].[level] AS [Level], [umbracoNode].[path] AS [Path], [umbracoNode].[sortOrder] AS [SortOrder], [umbracoNode].[parentId] AS [ParentId], [umbracoNode].[createDate] AS [CreateDate], [umbracoNode].[nodeUser] AS [CreatorId]
, [umbracoContent].[contentTypeId] AS [ContentTypeId]
, [umbracoDocument].[published] AS [Published], [umbracoDocument].[edited] AS [Edited]
, [umbracoContentVersion].[id] AS [VersionId], [umbracoContentVersion].[text] AS [EditName], [umbracoContentVersion].[versionDate] AS [EditVersionDate], [umbracoContentVersion].[userId] AS [EditWriterId]
, [umbracoDocumentVersion].[templateId] AS [EditTemplateId]
, [pcver].[id] AS [PublishedVersionId], [pcver].[text] AS [PubName], [pcver].[versionDate] AS [PubVersionDate], [pcver].[userId] AS [PubWriterId]
, [pdver].[templateId] AS [PubTemplateId]
, [nuEdit].[data] AS [EditData]
, [nuPub].[data] AS [PubData]
, [nuEdit].[dataRaw] AS [EditDataRaw]
, [nuPub].[dataRaw] AS [PubDataRaw]
FROM [umbracoNode]
INNER JOIN [umbracoNode] [x]
ON (([umbracoNode].[id] = [x].[id]) OR ([umbracoNode].[path] LIKE concat([x].[path],'',%'')))
INNER JOIN [umbracoContent]
ON ([umbracoNode].[id] = [umbracoContent].[nodeId])
INNER JOIN [umbracoDocument]
ON ([umbracoNode].[id] = [umbracoDocument].[nodeId])
INNER JOIN [umbracoContentVersion]
ON (([umbracoNode].[id] = [umbracoContentVersion].[nodeId]) AND [umbracoContentVersion].[current] = @0)
INNER JOIN [umbracoDocumentVersion]
ON ([umbracoContentVersion].[id] = [umbracoDocumentVersion].[id])
LEFT JOIN [umbracoContentVersion] [pcver]
INNER JOIN [umbracoDocumentVersion] [pdver]
ON (([pcver].[id] = [pdver].[id]) AND ([pdver].[published] = @1))
ON ([umbracoNode].[id] = [pcver].[nodeId])
LEFT JOIN [cmsContentNu] [nuEdit]
ON (([umbracoNode].[id] = [nuEdit].[nodeId]) AND [nuEdit].[published] = @2)
LEFT JOIN [cmsContentNu] [nuPub]
ON (([umbracoNode].[id] = [nuPub].[nodeId]) AND ([nuPub].[published] = @3))
WHERE ((([umbracoNode].[nodeObjectType] = @4) AND ([umbracoNode].[trashed] = @5)))
AND (([x].[id] = @6))
ORDER BY [umbracoNode].[level], [umbracoNode].[parentId], [umbracoNode].[sortOrder]',N'@0 int,@1 int,@2 int,@3 int,@4 uniqueidentifier,@5 int,@6 int',@0=1,@1=1,@2=0,@3=1,@4='C66BA18E-EAF3-4CFF-8A22-41B16D66A972',@5=0,@6=33135

I've seen other similar issues like:

https://github.com/umbraco/Umbraco-CMS/issues/14195 and https://github.com/umbraco/Umbraco-CMS/issues/13804

To me, it feels like some background job starts after X seconds and keeps holding the lock on -333, which means that the edit actions can't acquire the lock to save changes. This happens both when I try to save content and when I save document types.

I've tried to disable the "content version cleanup" using ContentVersionCleanupPolicy and maxed out DistributedLockingWriteLockDefaultTimeout, but nothing works. While this background job is processing the content, we can't write to the database.
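For reference, this is roughly how those two settings can be toggled from a composer. This is a minimal sketch assuming Umbraco 12's ContentSettings and GlobalSettings option classes, not our exact project code:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.Configuration.Models;
using Umbraco.Cms.Core.DependencyInjection;

// Minimal sketch: disable content version cleanup and raise the write lock
// timeout. Option class/property names are assumed from Umbraco 12.
public class LockTuningComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        // Equivalent of Umbraco:CMS:Content:ContentVersionCleanupPolicy:EnableCleanup = false
        builder.Services.Configure<ContentSettings>(options =>
            options.ContentVersionCleanupPolicy.EnableCleanup = false);

        // Equivalent of Umbraco:CMS:Global:DistributedLockingWriteLockDefaultTimeout
        builder.Services.Configure<GlobalSettings>(options =>
            options.DistributedLockingWriteLockDefaultTimeout = TimeSpan.FromSeconds(60));
    }
}
```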

I also tried the fix posted by @bergmania here: https://github.com/umbraco/Umbraco-CMS/issues/13804#issuecomment-1840366547 without success. I even disabled all the populators in the code sample, but the background job that is blocking is still running.

I'm not sure what to do next.

Specifics

No response

Steps to reproduce

Create a site with 25k content items and 25k media items, then use that database with a fresh Umbraco installation folder.

Expected result / actual result

Saving should be possible even when background processing is running.

github-actions[bot] commented 11 months ago

Hi there @enkelmedia!

Firstly, a big thank you for raising this issue. Every piece of feedback we receive helps us to make Umbraco better.

We really appreciate your patience while we wait for our team to have a look at this but we wanted to let you know that we see this and share with you the plan for what comes next.

We wish we could work with everyone directly and assess your issue immediately but we're in the fortunate position of having lots of contributions to work with and only a few humans who are able to do it. We are making progress though and in the meantime, we will keep you in the loop and let you know when we have any questions.

Thanks, from your friendly Umbraco GitHub bot :robot: :slightly_smiling_face:

bergmania commented 11 months ago

The Examine indexes are built in the background just after startup on cold boots, and that indeed reads all content. We have been searching for a hanging lock for some time. Can you check if the query actually completes as expected?

enkelmedia commented 11 months ago

Hi!

Thanks for your answer @bergmania!

I was mistaken here; it was not related to Examine, but since the "symptoms" were similar, it was my initial suspicion.

I dug deeper and found out what was going on, and I think that the root cause is probably not a bug but maybe something that could be addressed in a better way.

First I need to describe the setup that we have for development at the moment, which we probably only have because we "always had it like that" and it used to work well on older versions of Umbraco where the load balancing support wasn't great. We have a central database used for development and for CI/Test. Some days we might have two devs working against the db while the test server is also connected. We do have the DisableElectionForSingleServer setting in appSettings, but to be honest I do not think we ever assign any of the machines a "role" as publisher or subscriber, which we probably should. Not sure if this setup is even supported or recommended.

Anyway.

Yesterday we ran a big batch job on the test server that imported content and performed save/publish. The symptoms that I mistook for Examine index population were actually my machine processing all the cache instructions created by the import.

I'm guessing that there were a lot of instructions in the queue, and the central database is hosted with an external provider, which adds some network latency as well. So far things "make sense" given the setup and configuration we have.

It might be a problem that the whole process in CacheInstructionService.ProcessInstructions() locks -333 until all cache instructions have been processed. I'm not sure, but if the idea with the cache instructions is to update the local cache, it doesn't really make sense to take a distributed lock while doing so? If a subscriber machine that has been offline for a while wakes up and re-hydrates its cache, any work in the backoffice would fail until that process is finished.
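For context, -333 is Constants.Locks.ContentTree, and a scope that takes that write lock blocks every other writer (and reader) of the same lock id until the scope completes. A simplified sketch of the pattern (not the actual CacheInstructionService code):

```csharp
using Umbraco.Cms.Core;
using Umbraco.Cms.Infrastructure.Scoping;

// Simplified illustration of a long-running job holding the content tree write
// lock inside a scope; not the actual CacheInstructionService implementation.
public class LongRunningContentJob
{
    private readonly IScopeProvider _scopeProvider;

    public LongRunningContentJob(IScopeProvider scopeProvider)
        => _scopeProvider = scopeProvider;

    public void Run()
    {
        using IScope scope = _scopeProvider.CreateScope();

        // Constants.Locks.ContentTree == -333. While this scope holds the write
        // lock, any other scope requesting the same lock id (for example a
        // backoffice save) waits and eventually throws
        // DistributedWriteLockTimeoutException.
        scope.WriteLock(Constants.Locks.ContentTree);

        // ... process all queued work while the lock is held ...

        scope.Complete();
    } // the lock is released when the scope is disposed
}
```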

In our setup we do not store the NuCache files on disk (we have PublishedSnapshotServiceOptions { IgnoreLocalDb = true }), so we know that the cache will be fresh on each restart.
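For completeness, this is roughly how that option is set (a short sketch assuming the PublishedSnapshotServiceOptions type from the NuCache infrastructure):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Infrastructure.PublishedCache;

// Sketch: skip the local NuCache files on disk so the in-memory cache is
// rebuilt from the database on every cold boot.
public class DisableNuCacheLocalDbComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        builder.Services.Configure<PublishedSnapshotServiceOptions>(options =>
            options.IgnoreLocalDb = true);
    }
}
```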

I think that there might be a couple of things that could be considered here:

  • If IgnoreLocalDb=true, any historical instructions would not have to be processed since the updates are already in the cache. Maybe clear cache instructions at startup or something.
  • Would it be possible to run `CacheInstructionService.ProcessInstructions()` without a distributed lock, so that editors would be able to work in the backoffice while the cache is being populated on the subscriber?

I'm not sure if the particular circumstances around our setup are the "real" problem here or if these ideas might be something that could be helpful for others as well.

Let me know what you think.

Cheers!

chriskarkowsky commented 11 months ago

This is interesting because we are coming to the same conclusion. Another symptom (we believe) of #14195 is that the cache instructions are no longer being processed on our Subscriber instance. Shutting off the NuCache db makes the site unbearably slow, so that is not an option. The only way we have found to fix the Subscriber instance is to restart the webapp :(.

Looking at the Load Balancing documentation it mentions:

The process is as follows:

I would love more insight on this process. It seems that the instructions are saved into the umbracoCacheInstruction table by the Scheduler, and then picked up by the CacheInstructionService on each running Subscriber.
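To see whether a subscriber is falling behind, a rough (unofficial) check is to look at the umbracoCacheInstruction table directly; a small sketch, with the connection string as a placeholder and the table/column names assumed from the default schema:

```csharp
using System;
using Microsoft.Data.SqlClient;

// Rough diagnostic sketch (not an official API): inspect the cache instruction
// backlog directly. The connection string is a placeholder.
const string connectionString = "<umbracoDbDSN>";

using var connection = new SqlConnection(connectionString);
connection.Open();

using var command = new SqlCommand(
    "SELECT COUNT(*) AS InstructionRows, ISNULL(MAX(id), 0) AS LatestId FROM umbracoCacheInstruction",
    connection);

using var reader = command.ExecuteReader();
if (reader.Read())
{
    Console.WriteLine($"Rows in umbracoCacheInstruction: {reader.GetInt32(0)}, latest id: {reader.GetInt32(1)}");
}
```

Watching the latest id over time gives a rough idea of whether the backlog is shrinking while the subscriber syncs.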

Another symptom we have seen is something like the following on the Subscriber:

{"@t":"2023-12-19T00:00:04.9007638Z","@mt":"Failed (will repeat).","@l":"Error","@x":"System.AggregateException: One or more errors occurred. (Exceptions were thrown by listed actions. (Could not find file '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'.))\n ---> System.AggregateException: Exceptions were thrown by listed actions. (Could not find file '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'.)\n ---> System.IO.FileNotFoundException: Could not find file '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'.\nFile name: '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'\n   at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirError)\n   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, UnixFileMode openPermissions, Int64& fileLength, UnixFileMode& filePermissions, Func`4 createOpenException)\n   at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)\n   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share)\n   at Lucene.Net.Store.MMapDirectory.OpenInput(String name, IOContext context)\n   at Lucene.Net.Store.Directory.OpenChecksumInput(String name, IOContext context)\n   at Lucene.Net.Codecs.Lucene46.Lucene46SegmentInfoReader.Read(Directory dir, String segment, IOContext context)\n   at Lucene.Net.Index.SegmentInfos.Read(Directory directory, String segmentFileName)\n   at Lucene.Net.Index.SegmentInfos.FindSegmentsFileAnonymousClass.DoBody(String segmentFileName)\n   at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run(IndexCommit commit)\n--- End of stack trace from previous location ---\n   at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run(IndexCommit commit)\n   at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run()\n   at Lucene.Net.Index.SegmentInfos.Read(Directory directory)\n   at Lucene.Net.Index.IndexWriter..ctor(Directory d, IndexWriterConfig conf)\n   at Examine.Lucene.Directories.SyncedFileSystemDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n   at Examine.Lucene.Directories.DirectoryFactoryBase.<>c__DisplayClass2_0.<Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory>b__0(String s)\n   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)\n   at Examine.Lucene.Directories.DirectoryFactoryBase.Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n   at Umbraco.Cms.Infrastructure.Examine.ConfigurationEnabledDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n   at Examine.Lucene.Directories.DirectoryFactoryBase.<>c__DisplayClass2_0.<Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory>b__0(String s)\n   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)\n   at Examine.Lucene.Directories.DirectoryFactoryBase.Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n   at Examine.Lucene.Providers.LuceneIndex.<>c__DisplayClass1_0.<.ctor>b__0()\n   at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)\n--- End of stack trace from previous location ---\n   at System.Lazy`1.CreateValue()\n   at 
Examine.Lucene.Providers.LuceneIndex.IndexExistsImpl()\n   at Examine.Lucene.Providers.LuceneIndex.IndexExists()\n   at Examine.Lucene.Providers.LuceneIndex.EnsureIndex(Boolean forceOverwrite)\n   at Examine.Lucene.Providers.LuceneIndex.get_IndexWriter()\n   at Examine.Lucene.Providers.LuceneIndex.CreateSearcher()\n   at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)\n--- End of stack trace from previous location ---\n   at System.Lazy`1.CreateValue()\n   at Umbraco.Cms.Infrastructure.Examine.UmbracoContentIndex.PerformDeleteFromIndex(IEnumerable`1 itemIds, Action`1 onComplete)\n   at Umbraco.Cms.Infrastructure.Examine.ExamineUmbracoIndexingHandler.DeferredDeleteIndex.Execute(ExamineUmbracoIndexingHandler examineUmbracoIndexingHandler, Int32 id, Boolean keepIfUnpublished)\n   at Umbraco.Cms.Infrastructure.Examine.DeferredActions.Execute()\n   at Umbraco.Cms.Core.Scoping.ScopeContext.ScopeExit(Boolean completed)\n   --- End of inner exception stack trace ---\n   at Umbraco.Cms.Core.Scoping.ScopeContext.ScopeExit(Boolean completed)\n   at Umbraco.Cms.Infrastructure.Scoping.Scope.<>c__DisplayClass56_0.<RobustExit>g__HandleScopeContext|0()\n   at Umbraco.Cms.Infrastructure.Scoping.Scope.TryFinally(Action[] actions)\n   --- End of inner exception stack trace ---\n   at Umbraco.Cms.Infrastructure.Scoping.Scope.TryFinally(Action[] actions)\n   at Umbraco.Cms.Infrastructure.Scoping.Scope.Dispose()\n   at Umbraco.Cms.Infrastructure.Services.CacheInstructionService.ProcessInstructions(CacheRefresherCollection cacheRefreshers, ServerRole serverRole, CancellationToken cancellationToken, String localIdentity, DateTime lastPruned, Int32 lastId)\n   at Umbraco.Cms.Infrastructure.Sync.DatabaseServerMessenger.Sync()\n   at Umbraco.Cms.Infrastructure.HostedServices.ServerRegistration.InstructionProcessTask.PerformExecuteAsync(Object state)","SourceContext":"Umbraco.Cms.Infrastructure.HostedServices.ServerRegistration.InstructionProcessTask","ProcessId":99,"ProcessName":"dotnet","ThreadId":63,"ApplicationId":"xxxxxxxxxxxxx","MachineName":"xxxxxxxx","Log4NetLevel":"ERROR"}

The CacheInstructionService is running here, and Lucene is trying to access a file within the TEMP directory of the app.

  1. The file does indeed not exist in that directory.
  2. The settings for the Subscriber are ASPNETCORE_UmbracoCMSExamine__LuceneDirectoryFactory = TempFileSystemDirectoryFactory and ASPNETCORE_UmbracoCMSHosting__LocalTempStorageLocation = EnvironmentTemp (set in accordance with the load balancing docs; see the sketch below for a strongly-typed equivalent). This should mean that the Examine index temp files end up outside of the website directory, in the /tmp/ directory, as specified in the backoffice (screenshot: 2023-12-20 at 9:53 AM).
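For reference, the Examine setting can also be expressed as a strongly-typed option from code; a sketch assuming Umbraco 12's IndexCreatorSettings model (the hosting temp-storage setting is left in appsettings/environment form here):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.Configuration.Models;
using Umbraco.Cms.Core.DependencyInjection;

// Sketch: code equivalent of Umbraco:CMS:Examine:LuceneDirectoryFactory.
// (Umbraco:CMS:Hosting:LocalTempStorageLocation = EnvironmentTemp stays as an
// appsettings/environment value in this setup.)
public class SubscriberExamineComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        builder.Services.Configure<IndexCreatorSettings>(options =>
            options.LuceneDirectoryFactory = LuceneDirectoryFactory.TempFileSystemDirectoryFactory);
    }
}
```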

Between this caching issue and consistent Failed to Acquire writelock -333 errors, our production instance is extremely unstable. Users of the backoffice are constantly unable to publish pages due to the "writelock", and when they are able to publish, they are not seeing their changes populated on the Subscriber. The only workaround we have found is to restart everything or KILL processes (like I mentioned in #14195).

It appears to me that Load Balancing is not as fully fleshed out as the documentation says it is. Our only other option at this point is to go back to a single webapp for everything and see if that helps our cause.

bergmania commented 11 months ago

Hi @chriskarkowsky..

TempFileSystemDirectoryFactory should only be used on multi-instance setups (multiple sites on the same filesystem).

Otherwise I would use SyncedTempFileSystemDirectoryFactory, as this should limit the need for a full index rebuild.

chriskarkowsky commented 11 months ago

Hi @bergmania,

The documentation states:

The single instance Backoffice Administrative Web App should be set to use SyncedTempFileSystemDirectoryFactory. The multi-instance Scalable Public Web App should be set to use TempFileSystemDirectoryFactory.

We have one single-instance appservice running for the backoffice, and then we have another appservice for the frontend, which we initially had scaled out to 2 instances. The frontend has TempFileSystemDirectoryFactory, and the backoffice has SyncedTempFileSystemDirectoryFactory. Am I misunderstanding?

bergmania commented 11 months ago

If by 2 instances you mean some Azure magic to scale out, then you are most likely correct. If it is two instances you control, and thereby different webapp folders, then it should work with SyncedTempFileSystemDirectoryFactory.

SyncedTempFileSystemDirectoryFactory basically just falls back and copies the entire index from the webapp folder, if it does not exist in the temp folder, instead of rebuilding everything from the database.

No matter what, when you see the -333 lock, two things are interesting: 1) what holds the lock when it happens, and 2) what is requesting the lock.

Multiple readers are allowed, but write locks are exclusive and force both writes and reads to wait.
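One way to answer both questions while a timeout is happening is to query SQL Server's blocking DMVs and see which session is blocking and which sessions are waiting; a rough sketch using standard DMVs (not an Umbraco API), with the connection string as a placeholder:

```csharp
using System;
using Microsoft.Data.SqlClient;

// Rough sketch: list blocked sessions and their blockers with the SQL they are
// running. An idle blocker (open transaction, no active request) will not show
// up here and needs sys.dm_tran_locks / sys.dm_exec_sessions instead.
const string connectionString = "<umbracoDbDSN>";

const string sql = @"
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS running_sql
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0;";

using var connection = new SqlConnection(connectionString);
connection.Open();
using var command = new SqlCommand(sql, connection);
using var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine(
        $"session {reader["session_id"]} blocked by {reader["blocking_session_id"]} " +
        $"({reader["wait_type"]}, {reader["wait_time"]} ms): {reader["running_sql"]}");
}
```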

chriskarkowsky commented 11 months ago

No matter what, when you see the -333 lock, two things are interesting:

  1. What holds the lock when it happens
  2. What is requesting the lock

We are able to get the "writelock" error pretty consistently in our dev environment after deploying code and then attempting to save a page in the backoffice. I looked at the UmbracoTraceLog.

Note: at no point before this record in the log do I see anything requesting a WriteLock, so I am assuming the answer to #1 is "nothing".

{
    "@t": "2023-12-20T19:21:07.5246351Z",
    "@mt": "Requesting {lockType} for id {id}",
    "@l": "Debug",
    "lockType": "WriteLock",
    "id": -333,
    "SourceContext": "Umbraco.Cms.Persistence.SqlServer.Services.SqlServerDistributedLockingMechanism",
    "ActionId": "...",
    "ActionName": "Umbraco.Cms.Web.BackOffice.Controllers.ContentController.PostSave (Umbraco.Web.BackOffice)",
    "RequestId": "...",
    "RequestPath": "/umbraco/backoffice/umbracoapi/content/PostSave",
    "ConnectionId": "..."
}

.....

{
    "@t": "2023-12-20T19:21:27.7319490Z",
    "@mt": "Exception ({InstanceId}).",
    "@l": "Error",
    "@x": "Microsoft.Data.SqlClient.SqlException (0x80131904): Lock request time out period exceeded.\nThe statement has been terminated.\n   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)\n   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)\n   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)\n   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)\n   at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)\n   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)\n   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)\n   at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String methodName)\n   at Microsoft.Data.SqlClient.SqlCommand.ExecuteNonQuery()\n   at Umbraco.Cms.Infrastructure.Persistence.FaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func)\n   at NPoco.Database.ExecuteNonQueryHelper(DbCommand cmd)\n   at NPoco.Database.Execute(String sql, CommandType commandType, Object[] args)\nClientConnectionId:5xxxxx",
    "InstanceId": "xxxx",
    "SourceContext": "Umbraco.Cms.Infrastructure.Persistence.UmbracoDatabase",
    "ActionId": "4251e73d-6851-44c6-9d85-bdf22ba9bffe",
    "ActionName": "Umbraco.Cms.Web.BackOffice.Controllers.ContentController.PostSave (Umbraco.Web.BackOffice)",
    "RequestId": "...",
    "RequestPath": "/umbraco/backoffice/umbracoapi/content/PostSave",
    "ConnectionId": "...",
    "ProcessId": 99,
    "ProcessName": "dotnet",
    "ThreadId": 16,
    "ApplicationId": "...",
    "MachineName": "..",
    "Log4NetLevel": "ERROR",
    "HttpRequestId": "...",
    "HttpRequestNumber": 16,
    "HttpSessionId": ".."
}

I am gathering that this isn't an issue with the umbracoLock table itself, but rather an issue with a lock on the umbracoLock table :). The SQL query to obtain the lock doesn't complete, causing the error. This is why the only workaround I have found when we get this is to find the blocking process and kill it. I mentioned in the other issue that if you try to run the ObtainWriteLockQuery manually, it will run longer than whatever value SET LOCK_TIMEOUT is set to within the method. If this ObtainWriteLock query constantly times out, users will never be able to Save/Publish.

If this is indeed the case then, much like the OP's situation, the Unable to Obtain WriteLock issue is a symptom of a larger issue with db queries taking a long time. It presents itself when background jobs are running that we can't necessarily see within the logs, and maybe we can get some insight from long-running queries like the OP did.

I understand this issue is hard to debug, so if I can be of help testing anything, I will gladly participate because this issue is wreaking havoc on our production environment.

nul800sebastiaan commented 9 months ago

I see most people in this issue have been active in https://github.com/umbraco/Umbraco-CMS/issues/14195 as well, so I'll close this one as a duplicate. Please continue the conversation on #14195.