Closed: enkelmedia closed this issue 9 months ago
Hi there @enkelmedia!
Firstly, a big thank you for raising this issue. Every piece of feedback we receive helps us to make Umbraco better.
We really appreciate your patience while we wait for our team to have a look at this but we wanted to let you know that we see this and share with you the plan for what comes next.
We wish we could work with everyone directly and assess your issue immediately but we're in the fortunate position of having lots of contributions to work with and only a few humans who are able to do it. We are making progress though and in the meantime, we will keep you in the loop and let you know when we have any questions.
Thanks, from your friendly Umbraco GitHub bot :robot: :slightly_smiling_face:
The Examine indexes are built in the background just after startup on cold boots, and that process indeed reads all content. We have been searching for a hanging lock for some time. Can you check if the query actually completes as expected?
Hi!
Thanks for your answer @bergmania!
I was mistaken here; it was not related to Examine, but since the "symptoms" were similar, it was my initial suspicion.
I dug deeper and found out what was going on, and I think that the root cause is probably not a bug, but maybe something that could be addressed in a better way.
First I need to describe the setup that we have for development at the moment, which is probably just because we "always had it like that" and it used to work well on older versions of Umbraco where the load-balancing support wasn't great. We have a central database used for development and for CI/Test. Some days we might have two devs working against the db while the test server is also connected. We do have the DisableElectionForSingleServer setting in appSettings, but to be honest I do not think we ever assign any of the machines a "role" as publisher or subscriber, which we probably should (a sketch of that is below). Not sure if this setup is even supported or recommended.
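For context, assigning roles explicitly would look something like this (a sketch based on the load-balancing docs; the class names are just illustrative, and the "publisher" machine would return `ServerRole.SchedulingPublisher` instead):

```csharp
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Sync;

// Pins this instance's role instead of relying on election.
public class SubscriberRoleAccessor : IServerRoleAccessor
{
    public ServerRole CurrentServerRole => ServerRole.Subscriber;
}

public class ExplicitServerRoleComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.SetServerRegistrar<SubscriberRoleAccessor>();
}
```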
Anyway. Yesterday we ran a big batch job on the test server that imported content and performed save/publish. The symptoms that I mistook for Examine index population were actually my machine processing all the cache instructions created by the import.
I'm guessing that there were a lot of instructions in the queue, and the central database is hosted with an external provider, which adds some network latency as well. So far things "make sense" given the setup and configuration we have.
It might be a problem that the whole process in CacheInstructionService.ProcessInstructions() holds the -333 lock until all cache instructions have been processed. I'm not sure, but if the idea with the cache instructions is to update the local cache, it doesn't really make sense to take a distributed lock while doing so? If a subscriber machine that has been offline for a while wakes up and re-hydrates its cache, any work in the backoffice would fail until that process is finished.

In our setup we do not store the NuCache files on disk (we have PublishedSnapshotServiceOptions { IgnoreLocalDb = true }, see the sketch after the list below), so we know that the cache will be fresh on each restart.

I think that there might be a couple of things that could be considered here:
- If `IgnoreLocalDb=true`, any historical instructions would not have to be processed, since the updates are already in the cache. Maybe clear cache instructions at startup or something.
- Would it be possible to run `CacheInstructionService.ProcessInstructions()` without a distributed lock, so that editors would be able to work in the backoffice while the cache is being populated on the subscriber?
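For reference, the IgnoreLocalDb setting mentioned above is wired up with a composer along these lines (a minimal sketch; the composer name is made up):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Infrastructure.PublishedCache;

public class NuCacheInMemoryComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        // Skip the local NuCache files on disk so the published cache is
        // rebuilt from the database on every startup.
        builder.Services.Configure<PublishedSnapshotServiceOptions>(options =>
            options.IgnoreLocalDb = true);
    }
}
```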
I'm not sure if the particular circumstances around our setup are the "real" problem here, or if these ideas might be something that could be helpful for others as well.
Let me know what you think.
Cheers!
This is interesting because we are coming to the same conclusion. Another symptom (we believe) of #14195 is that cache instructions are no longer being processed on our Subscriber instance. Shutting off the NuCache db makes the site unbearably slow, so that is not an option. The only way we have found to fix the Subscriber instance is to restart the webapp :(.
Looking at the Load Balancing documentation it mentions:
The process is as follows:
I would love more insight on this process. It seems that the instructions are saved into the umbracoCacheInstruction table from the Scheduler, and then picked up by the CacheInstructionService on each running Subscriber.
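A quick way to gauge that backlog is to count the rows directly (just a sketch; the connection string is a placeholder and I'm assuming the default umbracoCacheInstruction table):

```csharp
using Microsoft.Data.SqlClient;

// Rough diagnostic: how many cache-instruction rows exist, and what the newest id is.
// Point this at the shared Umbraco database.
await using var connection = new SqlConnection("Server=...;Database=Umbraco;...");
await connection.OpenAsync();

await using var command = new SqlCommand(
    "SELECT COUNT(*), ISNULL(MAX(id), 0) FROM umbracoCacheInstruction", connection);

await using var reader = await command.ExecuteReaderAsync();
if (await reader.ReadAsync())
{
    Console.WriteLine($"Instruction rows: {reader.GetInt32(0)}, newest id: {reader.GetInt32(1)}");
}
```

Rows stay in the table until they are pruned, so this is only a rough indicator of how far behind a subscriber might be.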
Another symptom we have seen is something like the following on the Subscriber:
{"@t":"2023-12-19T00:00:04.9007638Z","@mt":"Failed (will repeat).","@l":"Error","@x":"System.AggregateException: One or more errors occurred. (Exceptions were thrown by listed actions. (Could not find file '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'.))\n ---> System.AggregateException: Exceptions were thrown by listed actions. (Could not find file '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'.)\n ---> System.IO.FileNotFoundException: Could not find file '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'.\nFile name: '/home/site/wwwroot/umbraco/Data/TEMP/ExamineIndexes/ExternalIndex/_d.si'\n at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirError)\n at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, UnixFileMode openPermissions, Int64& fileLength, UnixFileMode& filePermissions, Func`4 createOpenException)\n at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)\n at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share)\n at Lucene.Net.Store.MMapDirectory.OpenInput(String name, IOContext context)\n at Lucene.Net.Store.Directory.OpenChecksumInput(String name, IOContext context)\n at Lucene.Net.Codecs.Lucene46.Lucene46SegmentInfoReader.Read(Directory dir, String segment, IOContext context)\n at Lucene.Net.Index.SegmentInfos.Read(Directory directory, String segmentFileName)\n at Lucene.Net.Index.SegmentInfos.FindSegmentsFileAnonymousClass.DoBody(String segmentFileName)\n at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run(IndexCommit commit)\n--- End of stack trace from previous location ---\n at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run(IndexCommit commit)\n at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run()\n at Lucene.Net.Index.SegmentInfos.Read(Directory directory)\n at Lucene.Net.Index.IndexWriter..ctor(Directory d, IndexWriterConfig conf)\n at Examine.Lucene.Directories.SyncedFileSystemDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n at Examine.Lucene.Directories.DirectoryFactoryBase.<>c__DisplayClass2_0.<Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory>b__0(String s)\n at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)\n at Examine.Lucene.Directories.DirectoryFactoryBase.Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n at Umbraco.Cms.Infrastructure.Examine.ConfigurationEnabledDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n at Examine.Lucene.Directories.DirectoryFactoryBase.<>c__DisplayClass2_0.<Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory>b__0(String s)\n at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)\n at Examine.Lucene.Directories.DirectoryFactoryBase.Examine.Lucene.Directories.IDirectoryFactory.CreateDirectory(LuceneIndex luceneIndex, Boolean forceUnlock)\n at Examine.Lucene.Providers.LuceneIndex.<>c__DisplayClass1_0.<.ctor>b__0()\n at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)\n--- End of stack trace from previous location ---\n at System.Lazy`1.CreateValue()\n at Examine.Lucene.Providers.LuceneIndex.IndexExistsImpl()\n at 
Examine.Lucene.Providers.LuceneIndex.IndexExists()\n at Examine.Lucene.Providers.LuceneIndex.EnsureIndex(Boolean forceOverwrite)\n at Examine.Lucene.Providers.LuceneIndex.get_IndexWriter()\n at Examine.Lucene.Providers.LuceneIndex.CreateSearcher()\n at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)\n--- End of stack trace from previous location ---\n at System.Lazy`1.CreateValue()\n at Umbraco.Cms.Infrastructure.Examine.UmbracoContentIndex.PerformDeleteFromIndex(IEnumerable`1 itemIds, Action`1 onComplete)\n at Umbraco.Cms.Infrastructure.Examine.ExamineUmbracoIndexingHandler.DeferredDeleteIndex.Execute(ExamineUmbracoIndexingHandler examineUmbracoIndexingHandler, Int32 id, Boolean keepIfUnpublished)\n at Umbraco.Cms.Infrastructure.Examine.DeferredActions.Execute()\n at Umbraco.Cms.Core.Scoping.ScopeContext.ScopeExit(Boolean completed)\n --- End of inner exception stack trace ---\n at Umbraco.Cms.Core.Scoping.ScopeContext.ScopeExit(Boolean completed)\n at Umbraco.Cms.Infrastructure.Scoping.Scope.<>c__DisplayClass56_0.<RobustExit>g__HandleScopeContext|0()\n at Umbraco.Cms.Infrastructure.Scoping.Scope.TryFinally(Action[] actions)\n --- End of inner exception stack trace ---\n at Umbraco.Cms.Infrastructure.Scoping.Scope.TryFinally(Action[] actions)\n at Umbraco.Cms.Infrastructure.Scoping.Scope.Dispose()\n at Umbraco.Cms.Infrastructure.Services.CacheInstructionService.ProcessInstructions(CacheRefresherCollection cacheRefreshers, ServerRole serverRole, CancellationToken cancellationToken, String localIdentity, DateTime lastPruned, Int32 lastId)\n at Umbraco.Cms.Infrastructure.Sync.DatabaseServerMessenger.Sync()\n at Umbraco.Cms.Infrastructure.HostedServices.ServerRegistration.InstructionProcessTask.PerformExecuteAsync(Object state)","SourceContext":"Umbraco.Cms.Infrastructure.HostedServices.ServerRegistration.InstructionProcessTask","ProcessId":99,"ProcessName":"dotnet","ThreadId":63,"ApplicationId":"xxxxxxxxxxxxx","MachineName":"xxxxxxxx","Log4NetLevel":"ERROR"}
The CacheInstructionService is running here, and Lucene is trying to access a file within the TEMP directory of the app.
Between this caching issue and consistent "Failed to acquire write lock -333" errors, our production instance is extremely unstable. Users of the backoffice are constantly unable to publish pages due to the "writelock", and when they are able to publish, they are not seeing their changes populated on the Subscriber. The only workaround we have found is to restart everything or KILL processes (like I mentioned in #14195).
It appears to me that Load Balancing is not as fully fleshed out as the documentation says it is. Our only other option at this point is to go back to a single webapp for everything, and see if that helps our cause.
Hi @chriskarkowsky..
TempFileSystemDirectoryFactory should only be used in multi-instance setups (multiple sites on the same filesystem).
Otherwise I would use SyncedTempFileSystemDirectoryFactory, as this should limit the need for full index rebuilds.
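For reference, switching the factory is just configuration; something like this should be equivalent to setting Umbraco:CMS:Examine:LuceneDirectoryFactory in appsettings (a sketch; double-check the options class name for your version):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.Configuration.Models;
using Umbraco.Cms.Core.DependencyInjection;

public class ExamineDirectoryComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        // Use the synced temp-file directory factory so the index in the temp
        // folder is seeded from the webapp folder instead of rebuilt from scratch.
        builder.Services.Configure<IndexCreatorSettings>(options =>
            options.LuceneDirectoryFactory = LuceneDirectoryFactory.SyncedTempFileSystemDirectoryFactory);
    }
}
```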
Hi @bergmania,
The documentation states:
The single instance Backoffice Administrative Web App should be set to use SyncedTempFileSystemDirectoryFactory. The multi-instance Scalable Public Web App should be set to use TempFileSystemDirectoryFactory.
We have one single-instance appservice running for the backoffice, and then we have another appservice for the frontend, which we initially had scaled out to 2 instances. The frontend has TempFileSystemDirectoryFactory, and the backoffice has SyncedTempFileSystemDirectoryFactory. Am I misunderstanding?
If by 2 instances you mean some Azure magic to scale out, then you are most likely correct. If it is two instances you control, and thereby different webapp folders, then it should work with SyncedTempFileSystemDirectoryFactory.
SyncedTempFileSystemDirectoryFactory basically just falls back to copying the entire index from the webapp folder, if it does not exist in the temp folder, instead of rebuilding everything from the database.
No matter what, when you see the -333 lock, two things are interesting: 1) what holds the lock when it happens, and 2) what is requesting the lock.
Multiple readers are allowed, but write locks are exclusive and force both writes and reads to wait.
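To illustrate the semantics, this is roughly the pattern services use around the content tree lock (a sketch, not the exact CMS code; -333 is Constants.Locks.ContentTree):

```csharp
using Umbraco.Cms.Core;
using Umbraco.Cms.Core.Scoping;

public class LockExample
{
    private readonly ICoreScopeProvider _scopeProvider;

    public LockExample(ICoreScopeProvider scopeProvider) => _scopeProvider = scopeProvider;

    public void Read()
    {
        using ICoreScope scope = _scopeProvider.CreateCoreScope();
        scope.ReadLock(Constants.Locks.ContentTree);   // shared: many readers at once
        // ... read content ...
        scope.Complete();
    }

    public void Write()
    {
        using ICoreScope scope = _scopeProvider.CreateCoreScope();
        scope.WriteLock(Constants.Locks.ContentTree);  // exclusive: readers and writers wait
        // ... change content ...
        scope.Complete();
    }
}
```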
> No matter what, when you see the -333 lock, two things are interesting:
> - What holds the lock when it happens
> - What is requesting the lock
We are able to get the "writelock" error pretty consistently in our dev environment after deploying code and then attempting to save a page in the backoffice. I looked at the UmbracoTraceLog.
Note: at no point in the log do I see anything requesting a WriteLock before this record, so I am assuming the answer to #1 is "nothing".
{
"@t": "2023-12-20T19:21:07.5246351Z",
"@mt": "Requesting {lockType} for id {id}",
"@l": "Debug",
"lockType": "WriteLock",
"id": -333,
"SourceContext": "Umbraco.Cms.Persistence.SqlServer.Services.SqlServerDistributedLockingMechanism",
"ActionId": "...",
"ActionName": "Umbraco.Cms.Web.BackOffice.Controllers.ContentController.PostSave (Umbraco.Web.BackOffice)",
"RequestId": "...",
"RequestPath": "/umbraco/backoffice/umbracoapi/content/PostSave",
"ConnectionId": "..."
}
.....
{
"@t": "2023-12-20T19:21:27.7319490Z",
"@mt": "Exception ({InstanceId}).",
"@l": "Error",
"@x": "Microsoft.Data.SqlClient.SqlException (0x80131904): Lock request time out period exceeded.\nThe statement has been terminated.\n at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)\n at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)\n at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)\n at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)\n at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)\n at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)\n at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)\n at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String methodName)\n at Microsoft.Data.SqlClient.SqlCommand.ExecuteNonQuery()\n at Umbraco.Cms.Infrastructure.Persistence.FaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func)\n at NPoco.Database.ExecuteNonQueryHelper(DbCommand cmd)\n at NPoco.Database.Execute(String sql, CommandType commandType, Object[] args)\nClientConnectionId:5xxxxx",
"InstanceId": "xxxx",
"SourceContext": "Umbraco.Cms.Infrastructure.Persistence.UmbracoDatabase",
"ActionId": "4251e73d-6851-44c6-9d85-bdf22ba9bffe",
"ActionName": "Umbraco.Cms.Web.BackOffice.Controllers.ContentController.PostSave (Umbraco.Web.BackOffice)",
"RequestId": "...",
"RequestPath": "/umbraco/backoffice/umbracoapi/content/PostSave",
"ConnectionId": "...",
"ProcessId": 99,
"ProcessName": "dotnet",
"ThreadId": 16,
"ApplicationId": "...",
"MachineName": "..",
"Log4NetLevel": "ERROR",
"HttpRequestId": "...",
"HttpRequestNumber": 16,
"HttpSessionId": ".."
}
I am gathering that this isn't an issue with the umbracoLock table itself, but rather an issue with a lock on the umbracoLock table :). The SQL query to obtain the lock doesn't complete, causing the error. This is why the only workaround I have found when we get this is to find the blocking process and kill it. I mentioned in the other issue that if you try to run the ObtainWriteLockQuery manually, it will run longer than whatever value SET LOCK_TIMEOUT is set to within the method. If this ObtainWriteLock query constantly times out, users will never be able to Save/Publish.
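For anyone else chasing this, the kind of query we run against the database to find the blocker looks roughly like this (a sketch; it needs VIEW SERVER STATE permission and the connection string is a placeholder):

```csharp
using Microsoft.Data.SqlClient;

// Lists sessions that are currently blocked, the session blocking them,
// and the SQL text of the blocked request.
const string sql = @"
    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, t.text
    FROM sys.dm_exec_requests r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
    WHERE r.blocking_session_id <> 0;";

await using var connection = new SqlConnection("Server=...;Database=Umbraco;...");
await connection.OpenAsync();

await using var command = new SqlCommand(sql, connection);
await using var reader = await command.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    Console.WriteLine(
        $"session {reader.GetInt16(0)} blocked by {reader.GetInt16(1)} " +
        $"({(reader.IsDBNull(2) ? "no wait_type" : reader.GetString(2))}, {reader.GetInt32(3)} ms): " +
        $"{reader.GetString(4)}");
}
```

From there, killing the blocking session is the same manual workaround mentioned above.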
If this is indeed the case then, much like the OP, the "unable to obtain write lock" issue is a symptom of a larger issue with db queries taking a long time, presenting itself when background jobs run that we can't necessarily see within the logs; maybe we can get some insight from long-running queries like the OP did.
I understand this issue is hard to debug, so if I can be of help testing anything, I will gladly participate because this issue is wreaking havoc on our production environment.
I see most people in this issue have been active in https://github.com/umbraco/Umbraco-CMS/issues/14195 as well, so I'll close this one as a duplicate. Please continue the conversation on #14195.
Which Umbraco version are you using? (Please write the exact version, example: 10.1.0)
12.3.5
Bug summary
We have been working on a migration project from v7 to v12 (around 20k content items and 25k media items) and were almost done when we started to see errors like this when saving content, document types and other "stuff":
I first thought it was an issue with some kind of notification handler so I disabled all of them. No luck. I've noticed that the error takes about 20-30 seconds to "appear".
While profiling the SQL Server, some time after the site has started we see a bunch of these:
I've seen other similar issues like:
https://github.com/umbraco/Umbraco-CMS/issues/14195 and https://github.com/umbraco/Umbraco-CMS/issues/13804
To me, it feels like some background job starts after X seconds and keeps holding that lock on -333, which means that the edit actions can't get the lock to save changes. This happens both when I try to save content and document types.
I've tried to disable the "content version cleanup" using ContentVersionCleanupPolicy and maxed out DistributedLockingWriteLockDefaultTimeout, but nothing works. While this background job is processing the content we can't write to the database. I also tried the fix posted by @bergmania here: https://github.com/umbraco/Umbraco-CMS/issues/13804#issuecomment-1840366547 without success; I even disabled all the populators in the code sample, but the background job that is blocking is still running.
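For completeness, the two settings I tried look roughly like this when configured in code (a sketch; the equivalent appsettings keys under Umbraco:CMS:Content and Umbraco:CMS:Global do the same, and the property names are from memory):

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.Configuration.Models;
using Umbraco.Cms.Core.DependencyInjection;

public class LockTuningComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        // Turn off the scheduled content version cleanup job.
        builder.Services.Configure<ContentSettings>(options =>
            options.ContentVersionCleanupPolicy.EnableCleanup = false);

        // Give writers longer to acquire the distributed write lock before timing out.
        builder.Services.Configure<GlobalSettings>(options =>
            options.DistributedLockingWriteLockDefaultTimeout = TimeSpan.FromSeconds(60));
    }
}
```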
I'm not sure what to do next?
Specifics
No response
Steps to reproduce
Create a site with 25k content items and 25k media items, then use that database in a fresh folder of Umbraco.
Expected result / actual result
Saving should be possible even when background processing is running.