The process cannot access the file '\NuCache\NuCache.Content.db' because it is being used by another process. When using Kubernetes Pods

gabador-mof commented 2 years ago

Which exact Umbraco version are you using? For example: 9.0.1 - don't just write v9

8.18.3

Bug summary

Umbraco version: 8.18.3 Multilingual site containing about 10000 pages. Nucache DB size: 1.4GB

I'm in a bit of a pickle. We have a site that runs on Kubernetes, in a docker container. This is something we've done before and it has worked. The setup is as follows:

1 CMS Pod, 3 Front-end Umbraco Pods 3 Nuxt applications communicating with the Umbraco Front-end pods.

We have with lower data volumes been all right, but since loading in 10000 pages, containing a bunch of baked HTML content, we cannot start up our sites. Each pod takes well over 10 minutes to start, and then we run into a

The process cannot access the file 'C:\inetpub\wwwroot\site\App_Data\TEMP\NuCache\NuCache.Content.db' because it is being used by another process.

Specifics

The issue occurs locally too. My local configuration is the same as the deployed one, except for these two app settings, but since this is Kube and docker, I don't thing they're toooo relevant. Could be wrong.

We have another site running in Kubernetes, and that works fine.

The only difference between those sites is that this one contains more logic, and that we have an Entity framework instance in this one. It is the published content cache that takes a long time to build though. Here is a dot trace output when the app tries to start up in a smaller data set instance:

<add xdt:Transform="Insert" key="Umbraco.Examine.LuceneDirectoryFactory" value="Examine.LuceneEngine.Directories.TempEnvDirectoryFactory, Examine" />
    <add xdt:Transform="Insert" key="Umbraco.Core.LocalTempStorage" value="EnvironmentTemp" />

We don't use examine on the front-end servers, so WE've also implemented this:


public class DisabledExamineComposer : IUserComposer
    {
        public void Compose(Composition composition)
        {
            if (MofConfigurationManager.Current.IsFrontEndServer)
            {
                // replace the default
                composition.RegisterUnique<IUmbracoIndexesCreator, InMemoryExamineIndexFactory>();

                // replace the default populators
                composition.Register<MemberIndexPopulator, DisabledMemberIndexPopulator>(Lifetime.Singleton);
                composition.Register<ContentIndexPopulator, DisabledContentIndexPopulator>(Lifetime.Singleton);
                composition.Register<PublishedContentIndexPopulator, DisabledPublishedContentIndexPopulator>(
                    Lifetime.Singleton);
                composition.Register<MediaIndexPopulator, DisabledMediaIndexPopulator>(Lifetime.Singleton);
            }
        }
    }

SO I'm at a bit of a loss. It doesn't feel like 10000 pages, plus about 5000 more images should do this to the cache. What are we doing wrong, and what can we do to fix the startup time and reduce the lock times?

One other note to mention: On the CMS, when saving pages, we got all these errors while the other pods were starting up:

We have a number of composers in the system that are reading from the cache, and on save, some also validate and update the database in custom tables, though I doubt that entity framework has too much to do with the cache loading, or lock period timeout exceeded errors... To test, I have disabled these to ensure that they don't interfere, but we get the same issue...

We really need to hand this site over soon, and while everything is looking great, we cannot have more than one instance of the application running, and it doesn't look great that we get all these lock timeouts...

Steps to reproduce

See above. You would need to create a container from the site, configure:

An azure Kubernetes Cluster with no persistent storage, D2V3 Node instance with 14GB memory. Deploy one CMS instance, 3 front-end Instances. DB is Azure SQL Server currently on a premium plan & 250 DTU to rule that out Load in 10000 pages containing block List editors and raw HTML We are using Plugins: Nexu, SkyBrud Url Redirect Manager Umbraco Forms, NewsletterStudio

Azure Blob storage for images

Expected result / actual result

Cache rebuilds quickly and app starts up quickly.

p-m-j commented 2 years ago

I'd love to get to the bottom of this, can I start with a few questions?

1) What do you have configured for the appsettings key Umbraco.Core.MainDom.Lock? 2) Do the umbraco containers each have their own instance of NuCache.Content.db or is there any volume magic that could result in multiple processes attempting to write to the file concurrently? 3) Could you add the following element to Config/serilog.config under the appSettings element to see if the more verbose logging adds any aditional clues <add key="serilog:minimum-level:override:Umbraco.Core.Runtime" value="Debug" />?

gabador-mof commented 2 years ago

Hey!

1 SqlMainDomLock 2 each have their own instance of the content DB 3 - Done. Adding now.

Just to reiterate, it takes the same amount of time locally, trying to connect to an azure DB... It just runs and runs and runs...

p-m-j commented 2 years ago

What log messages do you have from the category Umbraco.Core.Sync.DatabaseServerMessenger when booting the site?

In the long term a 10 minute boot for a website seems unnaceptable but to resolve this requires a large refactor to the nucache setup, work is planned here in the backlog but not for v8.

In the short term a 10 minute boot for a reasonably large site sounds comparable to other sites of similar scale so i'm far more concerned about the exceptions writing to the nucache file than the slow boot.

gabador-mof commented 2 years ago

Yeah, I don't mind a 10 minute boot time with a large site, but I can't live with it failing to start. Currently it just falls over starting every time. Incidentally, To see if it had anything to do with the dabasae nucache, I deleted all its contents. What would be a way of rebuilding this, if I cannot get into the CMS?

I'm just fishing for some error messages but there are thousands in the logs, so not sure what's relevant.

gabador-mof commented 2 years ago

This is the only message from the DatabaseServerMessenger

No last synced Id found, this generally means this is a new server/install. The server will build its caches and indexes, and then adjust its last synced Id to the latest found in the database and maintain cache updates based on that Id.

p-m-j commented 2 years ago

As in you truncated the table dbo.umbracoCacheInstruction? That should result in a cold boot as the last synced id recorded on disk doesn't agree with the database.

So just turning the app off and old should result in that table being populated and then the NuCache.Content.db being populated.

The more interesting log messages should probably come from Umbraco.Core.Sync.DatabaseServerMessenger and Umbraco.Core.Runtime

p-m-j commented 2 years ago

No last synced Id found

When the app starts, a file is read {{LocalTempPath}}/DistCache/{{AppDomainId}}-lastsynced.txt this should have the id of the last instruction from dbo.umbracoCacheInstruction that was processed to be compared against the contents of dbo.umbracoCacheInstruction.

In your case that *-lastsynced.txt was not found.

gabador-mof commented 2 years ago

No, I Truncated the [dbo].[cmsContentNu] Table 😬 It rebuilt it on another environment, but on here, I get a heap of messages :

Missing cmsContentNu published content for node 10065, consider rebuilding. and Skip item id=10065, could not find parent id=9487.

And then after 15 minutes, it dies with The process cannot access the file 'C:\inetpub\wwwroot\stylus-be\App_Data\TEMP\NuCache\NuCache.Content.db' because it is being used by another process.

p-m-j commented 2 years ago

Around the time it dies what do you have logged from Umbraco.Core.Runtime

p-m-j commented 2 years ago

Additionally adding <add key="serilog:minimum-level:override:Umbraco.Web.PublishedCache.NuCache" value="Debug" /> along with the one mentioned before <add key="serilog:minimum-level:override:Umbraco.Core.Runtime" value="Debug" /> to Config/serilog.config should give a fuller picture.

gabador-mof commented 2 years ago

This is the error just before the lock that comes up locally: No data for media 3976

[InvalidOperationException: No data for media 3976] Umbraco.Web.PublishedCache.NuCache.DataSource.DatabaseDataSource.CreateMediaNodeKit(ContentSourceDto dto, IContentCacheDataSerializer serializer) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\DataSource\DatabaseDataSource.cs:459 Umbraco.Web.PublishedCache.NuCache.DataSource.<GetAllMediaSources>d__17.MoveNext() in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\DataSource\DatabaseDataSource.cs:327 Umbraco.Web.PublishedCache.NuCache.ContentStore.SetAllFastSortedLocked(IEnumerable1 kits, Boolean fromDb) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\ContentStore.cs:704 Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.LoadMediaFromDatabaseLocked(IScope scope, Boolean onStartup) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:478 Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.LockAndLoadMedia(Func2 action) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:454 Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.<EnsureCaches>b__36_0() in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:277 System.Threading.LazyInitializer.EnsureInitializedCore(T& target, Boolean& initialized, Object& syncLock, Func1 valueFactory) +139 Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.EnsureCaches() in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:210 Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.CreatePublishedSnapshot(String previewToken) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:1109 System.Lazy1.CreateValue() +734

p-m-j commented 2 years ago

That doesn't sound like fun.

might be worth setting the following in appsettings <add key="Umbraco.Web.PublishedCache.NuCache.Serializer" value="JSON" /> (removing any existing entries that set the value to MsgPack).

truncating the cmsContentNu table

restarting the app and looking at the entries in cmsContentNu where nodeId = 3976

p-m-j commented 2 years ago

This is the query that's ran to load all the media content from cmsContentNu.

According to the exception you're getting the result set has null for both EditData and EditDataRaw which is unexpected.

DECLARE @nodeId INT = 3976

SELECT [umbraconode].[id]                    AS [Id],
       [umbraconode].[uniqueid]              AS [Key],
       [umbraconode].[level]                 AS [Level],
       [umbraconode].[path]                  AS [Path],
       [umbraconode].[sortorder]             AS [SortOrder],
       [umbraconode].[parentid]              AS [ParentId],
       [umbraconode].[createdate]            AS [CreateDate],
       [umbraconode].[nodeuser]              AS [CreatorId],
       [umbracocontent].[contenttypeid]      AS [ContentTypeId],
       [umbracocontentversion].[id]          AS [VersionId],
       [umbracocontentversion].[text]        AS [EditName],
       [umbracocontentversion].[versiondate] AS [EditVersionDate],
       [umbracocontentversion].[userid]      AS [EditWriterId],
       [nuEdit].[data]                       AS [EditData],
       [nuEdit].[dataraw]                    AS [EditDataRaw]
FROM   [umbraconode]
       INNER JOIN [umbracocontent]
               ON ( [umbraconode].[id] = [umbracocontent].[nodeid] )
       INNER JOIN [umbracocontentversion]
               ON ( ( [umbraconode].[id] = [umbracocontentversion].[nodeid] )
                    AND [umbracocontentversion].[current] = 1 )
       LEFT JOIN [cmscontentnu] [nuEdit]
              ON ( ( [umbraconode].[id] = [nuEdit].[nodeid] )
                   AND [nuEdit].[published] = 0 )
WHERE  (( ( [umbraconode].[nodeobjecttype] =
            'b796f64c-1f99-4ffb-b886-4bf4bc011a9c' )
          AND ( [umbraconode].[trashed] = 0 ) ))
       AND [umbraconode].[id] = @nodeId
ORDER  BY [umbraconode].[level],
          [umbraconode].[parentid],
          [umbraconode].[sortorder]

p-m-j commented 2 years ago

If cmContentNu is empty toggling back and forth between MsgPack and JSON for appsettings key Umbraco.Web.PublishedCache.NuCache.Serializer forces the table to be repopulated (there is probably a smarter way to make this happen, but this works)

gabador-mof commented 2 years ago

Ok, will try that. If it doesn't work, I might copy back a db that works, from a different env...

gabador-mof commented 2 years ago

and we can then figure out the lock issue with a db that should work and isn't corrupted

gabador-mof commented 2 years ago

So the cmscontentnu table is empty now, and I've changed the serializer type to JSON, and now it's running through the db as follows: Eventually though it will run into an error and then it will fall over and lock the cache.

gabador-mof commented 2 years ago

I'm now trying with a new DB. One that has a correct cmsContentnu.

Is it unusual to have 1.5GB cache? I'm happy to assign the resources to the nodes to run these instances, but it's going to be impossible if this data keeps growing, and we want to pull up new instances of the application. This is a new system, so v9 is not really an option yet...

Still loading after 30 minutes:

p-m-j commented 2 years ago

Is it unusual to have 1.5GB cache

NuCache.Content.db has a copy of the current draft and current published version for every content node in your site for all cultures, how does it compare to the result of executing this on your database sp_spaceused [umbracoPropertyData]

p-m-j commented 2 years ago

it's possible to turn off the NuCache.Content.db file entirely and load straight from your sql server database each boot.

This may result in slower warm boots, but might be worth experimenting to see how it impacts your cold boots when not writing to disk.

This is documented on https://our.umbraco.com/Documentation/Fundamentals/Setup/Server-Setup/Load-Balancing/azure-web-apps-v8#appdomain-synchronization (search for PublishedSnapshotServiceOptions)

gabador-mof commented 2 years ago

yeah, that's almost the same data. I meant is it usual to have so much data in an Umbraco database? In your experience, do other setups compare in startup times?

p-m-j commented 2 years ago

There's definitely some very large umbraco sites out there in the 50-100k documents range.

My personal experience is all with versions < 8.0 so it's hard for me to say how long it should take for a multilingual 10k node v8 site should take to boot.

This is the sort of thing it would be great if we had telemetry on.

gabador-mof commented 2 years ago

I've added the ignorelocaldb, and interestingly, the admin pod started. But the front-end pods do not. There is very strange behaviour going on. I restarted the deployment, the new pods came up. Then, it all started. I edited some pages, it worked fine. I restarted the deployment again, and then the new pods came up again. Fast. Like under a minute.

but the cache one the new pods is broken, and contains the content from before saving the pages.

I hit rebuild memory cache, and now I'm getting DB locks in the CMS when saving pages. I imagine it will take a while. But what's going on with the front-end pods? This is so bizarre...

gabador-mof commented 2 years ago

So on my local instance, the cache is 171 MB... The Cache rebuild and cold startup of my site is till going after 10 minutes. I think something is wrong... It's crazy to think that 5000 pages would take so long to load into memory.

I have block list editors in the site, would that have an effect, on the startup?

p-m-j commented 2 years ago

I have block list editors in the site, would that have an effect, on the startup?

I don't believe so.

This topic has gotten a little off topic now and may need breaking down into multiple issues with steps to reproduce for each.

With the nucache disk cache disabled we are not going to encounter NuCache.Content.db being used by multiple processes / app domains which is a shame as I would like very much to know how to reproduce this.

gabador-mof commented 2 years ago

I am turning it back on locally. I'm also trying to connect to the DB using a different umbraco application that works correctly, and loads the cache correctly in under 5 minutes. Still a bit much, but acceptable. Lock timeouts happen in the CMS when I try to edit content, but that's to be expected, as you want to ensure that the cache rebuilds before you start making changes to the db etc... So this is another website, running the same version of Umbraco, on the same database. One takes 3-5 minutes on the 1.5 GB, the other 20

gabador-mof commented 2 years ago

I have a better example, more isolated scenario.

I installed a new version of umbraco, with a bare-bones setup. I connected to the DB.

Connection timeout 120 in connection string.

<httpRuntime requestValidationMode="2.0" enableVersionHeader="false" targetFramework="4.7.2" maxRequestLength="157286400" executionTimeout="10800" fcnMode="Single" />

With this, the CPI maxes out on the DB, each time I try to start the app. This is connecting to the larger DB...

I then get a boot failed, SQL connection timeout.

System.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.\r\nThe statement has been terminated.\n

p-m-j commented 2 years ago

Without steps to reproduce, logs, or a bakpak I'm not sure how I can assist here.

gabador-mof commented 2 years ago

Incidentally, we have a shared local DB, many of us use the same DB. WHen I connect to a standalone DB, both large or small, it works fine. I'm not doing any write operations, just waiting for Umbraco to start up. When I connect to a shared DB, as in multiple instances connecting to the DB, it fails to get a SQL connection. I'm just testing this theory.

gabador-mof commented 2 years ago

I have this as amaster-replica elector: Is this sufficient? All front-end instances should be doing is reading from the database, with this?

public class MasterReplicaElector : IUserComposer
    {
        private readonly ILogger _logger;

        public MasterReplicaElector()
        {

        }

        public void Compose(Composition composition)
        {
            if (MofConfigurationManager.Current.IsFrontEndServer)
            {
                composition.SetServerRegistrar(new FrontEndReadOnlyServerRegistrar(_logger));
            }
            else
            {
                composition.SetServerRegistrar(new MasterServerRegistrar(_logger));
            }
        }
    }

    public class MasterServerRegistrar : IServerRegistrar
    {
        private readonly ILogger _logger;

        public MasterServerRegistrar(ILogger logger)
        {
            this._logger = logger;
        }
        public IEnumerable<IServerAddress> Registrations
        {
            get { return Enumerable.Empty<IServerAddress>(); }
        }
        public ServerRole GetCurrentServerRole()
        {
            //StylusLogger.CurrentLogger.LogWarning<MasterServerRegistrar>("Getting Scheduling Registrar: {0}", "Master");
            return ServerRole.Master;

        }

        public string GetCurrentServerUmbracoApplicationUrl()
        {
            // NOTE: If you want to explicitly define the URL that your application is running on,
            // this will be used for the server to communicate with itself, you can return the
            // custom path here and it needs to be in this format:
            // http://www.mysite.com/umbraco

            return null;
        }
    }

    public class FrontEndReadOnlyServerRegistrar : IServerRegistrar
    {
        private readonly ILogger _logger;

        public FrontEndReadOnlyServerRegistrar(ILogger logger)
        {
            this._logger = logger;
        }

        public IEnumerable<IServerAddress> Registrations
        {
            get { return Enumerable.Empty<IServerAddress>(); }
        }
        public ServerRole GetCurrentServerRole()
        {
            return ServerRole.Replica;
        }
        public string GetCurrentServerUmbracoApplicationUrl()
        {
            return null;
        }
    }

p-m-j commented 2 years ago

All front-end instances should be doing is reading from the database

frontend instances will still need to write to the database, e.g. ~adding a read lock for distributed locking, or~ registering themselves as a MainDom instance in [dbo].[UmbracoKeyValue] when using SqlMainDomLock.

IServerRegistrar is more for which instance is responsible for running scheduled publishing tasks etc.

gabador-mof commented 2 years ago

Ok, so I see for example an admin pod that I've set up in the UmbracoServer marked as isMaster - false and the front-end pod as ismaster = true.

What defines how Umbraco decides what's a master or not?

Edit- I don't see any of my current pods in the UmbracoServer table, and the cacheinstructions table is full. The two computer names are really old. New pods would have to register themselves as hey are pulled up, and Umbraco/Windows has a bug that the pod names are over 15 characters long, due to the deployment/replicaset/pod naming convention. would this have something to do with it?

p-m-j commented 2 years ago

What defines how Umbraco decides what's a master or not?

Master server is just a label on a diagram, it's the instance where you allow login to backoffice for editing.

WRT scheduled publishing it's described here but you're explicitly setting the scheduling publisher role with explicit-master-scheduling-server with your class MasterReplicaElector

Edit: The names make more sense in v9 where it is Publisher/Subscriber instead of Master/Replica

gabador-mof commented 2 years ago

Thanks you for your help so far!

I feel like I'm chasing my tail and there is something fundamentally wrong with something. I don't see any entries in the UmbracoServer table for any of the servers. There were a bunch of old server names, but nothing new, so when I deleted those, none of the cache instructions are going through to the replicas

Adding SQLDomLock to the web config basically maxes out the DB CPU while it's trying to load the cache, and it times out unless I set the Connection timeout to some huge number >240...

So I'm wondering - The fact that we're getting DB locks locally is understandable. It's effectively multiple admin instances.

But on the kubernetes pods, there are two issues - one is that none of the apps write their names into the UmbracoServer table, and the other is that none have an UmbracoApplicationUrl in the UmbracoSettings.Config.

If in the master/replica elector I try and get the pods IP address, and set the SQLMainDomLock, would this ensure that we get the correct list of servers in the server table? Assuming this works, then we won't get the issue of all the pods locking the db and none coming up...

p-m-j commented 2 years ago

I think that when you opt into explict server roles (by registering an implementation of IServerRegistrar) then the DatabaseServerRegistrarAndMessengerComponent doesn't create the TouchServer task which updates the umbracoServer table, see here.

It's not required for Umbraco to work out server roles if you set them explicitly.

p-m-j commented 2 years ago

Can you confirm that with your implementation of IServerRegistrar that only one of the instances returns ServerRole.Master for GetCurrentServerRole

gabador-mof commented 2 years ago

Yes, There's only one instance. During deploy, we only set the admin instance as master in the app configs, and compose based on that...

gabador-mof commented 2 years ago

Ok. So why does CPU Max out when any of the pods try to read the cache? I'm currently getting a CPU maxing out, and

`Boot failed: Umbraco cannot run. See Umbraco's log file for more details.

-> Umbraco.Core.Exceptions.BootFailedException: Boot failed.

-> System.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated. at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction) at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted) at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)

I'm completely at a loss, and There's nothing relevant in the logs. It transpires that the cache rebuild isn't what takes the most amount of time, it's actually connecting to the database, as sometimes it takes very long to actually connect and start doing the work. In my clean install that's what I notices on the local db, that when other people were connected to the DB, it maxed out CPU. So it's the same even with master replica election on the pods, max CPU and boot failed

p-m-j commented 2 years ago

Are the instances fighting for MainDom status?

If you run the following query you should see one entry per instance of the application e.g. in production with 1 admin instance and 3 frontend instances there should be 4 rows for Umbraco.Core.Runtime.SqlMainDom-{{maindomkey}}

select * from umbracoKeyValue where [key] like 'Umbraco.Core.Runtime%'

gabador-mof commented 2 years ago

there are 252 rows in that table... Should it get cleared out? I image all the pods would have written an id in there at some point, but they don't get cleared out?

gabador-mof commented 2 years ago

Why would it work for another client running on kubernetes? Granted, their cache size is much smaller, but ther eare never any issues there? They are on a lower version number for umbraco

p-m-j commented 2 years ago

there are 252 rows in that table... Should it get cleared out?

I don't think it matters that there are dead records floating around and hints that the instances likely aren't fighting each other for MainDom status which is good.

Why would it work for another client running on kubernetes

I have no idea, I can't run these apps or view their logs.

gabador-mof commented 2 years ago

HI Paul,

Can i send you a copy of the code? I can also allow access through the firewall if you wouldn't mind taking a look?

We have a few scheduled tasks running in the background, not sure if they would have anything to do with the slow startup, but they're on a delay, so I don't see how.

Thanks for all the help so far!

Gabor

gabador-mof commented 2 years ago

I've just ran a SQL query to see what statement is running when we get into this tangle. Before it times out, the following sql is running against the DB. This times out. (@0 uniqueidentifier)DELETE FROM cmsContentNu WHERE cmsContentNu.nodeId IN ( SELECT id FROM umbracoNode WHERE umbracoNode.nodeObjectType=@0 )

It takes 8 minutes to run, and after that, the rebuild of the nucache table another 10.

It's interesting to see that it tries to rebuild an already built DB cache table. Why not just use it, if it's already built?

There is nothing in the UmbracoCacheInstruction table. Do you know if the UmbracoCacheInstruction table ever gets cleared out? What would cause a full DB cache rebuild?

Does the above help bring us any closer to the root of the issue?

If it helps, here are some log entries of the process:

Gabo Logs.txt r

p-m-j commented 2 years ago

It's interesting to see that it tries to rebuild an already built DB cache table. Why not just use it, if it's already built?

The main reason for that to occur is when the configured value for the appsettings key Umbraco.Web.PublishedCache.NuCache.Serializer doesn't match the value storedfor that same key in the umbracoKeyValue table in the database which seems likely as we have already mentioned toggling the value in this issue.

If you're talking about local dev where you and other folks are sharing a database this could be compounded by having differing config as it will keep having to be rebuilt depending on who ran the app last.

gabador-mof commented 2 years ago

Oh I see, ok, well we must stop chaning the serializer, and will just leave it with the standard msgpk serializer.

gabador-mof commented 2 years ago

SO what can we do to ensure that a sql timeout doesn't cause the published cache staying in a locked state? I feel like this is a fundamental problem with Umbraco...

I'm just trying to understand how we can fix this. The system either starts after a looong 10 minute wait, or the sql query times out, and then the cache file stays in a locked state. Not using the cache file is not an option, as everything is just super slow.

What can we do to try to figure out what's going on? I can send you a clean install of Umbraco that I used for testing, and grant you access to the QA Db. Connecting to this would be like connecting as a replica pod, so you could see what's going on...

Would that be something you could help with?

p-m-j commented 2 years ago

SO what can we do to ensure that a sql timeout doesn't cause the published cache staying in a locked state?

Just so we're on the same page here.

The nucache.content.db file is essentially just a dictionary that's written to the filesystem of the app server.
It's a key value store where the key is the integer id of some content and the value is a serialized representation of the content with that id.
This dictionary is thread safe but not multi process safe.
When an umbraco application starts it opens (or creates) this file and keeps it open until the app is shutting down.
If another process (or app domain for umbraco 8 running on iis with overlapping app domains enabled) attempts to open that file an exception is thrown with the message from the issue title.

Umbraco has an abstraction called MainDom, it's purpose is to prevent multiple processes (or app domains) attempting to read/write that dictionary file at the same time (it's also used elsewhere for lucene indexes etc).
Only one instance of the application should be able to obtain maindom status on the host at any time, and maindom status is required to initialize the published snapshot service.

The system either starts after a looong 10 minute wait

That sounds good other than the fact that it took a looong time, we know that it's slow to cold boot for large sites and fixing it is in our long term goals.

or the sql query times out

Which query times out?

and then the cache file stays in a locked state.

It's supposed to be locked whilst the app is running.

install of Umbraco that I used for testing, and grant you access to the QA Db. Connecting to this would be like connecting as a replica pod, so you could see what's going on...

I'd prefer a bacpac if possible but before we go down that road it'd be good for you to work out which query is timing out.

When you run locally are you also using SqlMainDomLock? what happens if you instead use the default by removing the appsetting with the key Umbraco.Core.MainDom.Lock

gabador-mof commented 2 years ago

HI Paul,

Ok, I understand.

So With Kubernetes, everything runs in its own container, so effectively a little silo. Nothing else should be accessing the nucache file. One the master instance, the nucache file takes a long time to build, but more often than not, it manages it after 10-15 minutes. I had to set the SQL Connection timeout to 600 currently, otherwise I was getting a SQL timeout error, followed by nucache file locked.

The issue is that the DB CPU maxes out trying to start up, especially if it's a cold start. This causes a timeout of ANY sql statement, though I will run a test now, to see what runs, and for how long.

I've just deployed to the QA environment:

Pod gets pulled up, other instance disposed. SQL starts running: (@0 uniqueidentifier)DELETE FROM cmsContentNu WHERE cmsContentNu.nodeId IN ( SELECT id FROM umbracoNode WHERE umbracoNode.nodeObjectType=@0 )

DB CPU jumps to 100%

Bunch of cache rebuild sql tasks follow, and SQL DTU still 100%. Back-end pod not yet starting, as deploy configured to wait after admin pod, for 10 minutes... 10 minutes have passed - back-end pod begins to run. after 1 minutes:

back-end pod begins deleting the content of the nu table, while the other is starting to build it... CPU still at 100%...

BE pod throws a boot failed exception.

20 minutes pass, still running:

Nothing in the logs about any errors.

Basically, it's impossible to craft a deploy strategy, or even to start the application up when this happens. Previously it wasn't even trying to run the cache rebuild, the same thing happened. Anyway. That's the long story - Can I send you a bakpak somehow? I need to understand if there's a bug we're looking at, if I need to change something in what we're doing, throw more processing power at it, or what to do, as we are two weeks away from going live, and I have no trust in Umbraco being able to handle the current workload, let alone future growth...

gabador-mof commented 2 years ago

It seems to me like the more DTU I throw at the DB, the more likely it is to succeed to pull up one server. But say I want two or three??

umbraco / Umbraco-CMS