umbraco / Umbraco-CMS

Umbraco is a free and open source .NET content management system helping you deliver delightful digital experiences.
https://umbraco.com
MIT License
4.49k stars 2.69k forks source link

The process cannot access the file '\NuCache\NuCache.Content.db' because it is being used by another process. When using Kubernetes Pods #12469

Closed gabador-mof closed 1 year ago

gabador-mof commented 2 years ago

Which exact Umbraco version are you using? For example: 9.0.1 - don't just write v9

8.18.3

Bug summary

Umbraco version: 8.18.3 Multilingual site containing about 10000 pages. Nucache DB size: 1.4GB

I'm in a bit of a pickle. We have a site that runs on Kubernetes, in a docker container. This is something we've done before and it has worked. The setup is as follows:

1 CMS Pod, 3 Front-end Umbraco Pods 3 Nuxt applications communicating with the Umbraco Front-end pods.

We have with lower data volumes been all right, but since loading in 10000 pages, containing a bunch of baked HTML content, we cannot start up our sites. Each pod takes well over 10 minutes to start, and then we run into a

The process cannot access the file 'C:\inetpub\wwwroot\site\App_Data\TEMP\NuCache\NuCache.Content.db' because it is being used by another process.

Specifics

The issue occurs locally too. My local configuration is the same as the deployed one, except for these two app settings, but since this is Kube and docker, I don't thing they're toooo relevant. Could be wrong.

We have another site running in Kubernetes, and that works fine.

The only difference between those sites is that this one contains more logic, and that we have an Entity framework instance in this one. It is the published content cache that takes a long time to build though. Here is a dot trace output when the app tries to start up in a smaller data set instance:

image

<add xdt:Transform="Insert" key="Umbraco.Examine.LuceneDirectoryFactory" value="Examine.LuceneEngine.Directories.TempEnvDirectoryFactory, Examine" />
    <add xdt:Transform="Insert" key="Umbraco.Core.LocalTempStorage" value="EnvironmentTemp" />

We don't use examine on the front-end servers, so WE've also implemented this:


public class DisabledExamineComposer : IUserComposer
    {
        public void Compose(Composition composition)
        {
            if (MofConfigurationManager.Current.IsFrontEndServer)
            {
                // replace the default
                composition.RegisterUnique<IUmbracoIndexesCreator, InMemoryExamineIndexFactory>();

                // replace the default populators
                composition.Register<MemberIndexPopulator, DisabledMemberIndexPopulator>(Lifetime.Singleton);
                composition.Register<ContentIndexPopulator, DisabledContentIndexPopulator>(Lifetime.Singleton);
                composition.Register<PublishedContentIndexPopulator, DisabledPublishedContentIndexPopulator>(
                    Lifetime.Singleton);
                composition.Register<MediaIndexPopulator, DisabledMediaIndexPopulator>(Lifetime.Singleton);
            }
        }
    }

SO I'm at a bit of a loss. It doesn't feel like 10000 pages, plus about 5000 more images should do this to the cache. What are we doing wrong, and what can we do to fix the startup time and reduce the lock times?

One other note to mention: On the CMS, when saving pages, we got all these errors while the other pods were starting up:

image

We have a number of composers in the system that are reading from the cache, and on save, some also validate and update the database in custom tables, though I doubt that entity framework has too much to do with the cache loading, or lock period timeout exceeded errors... To test, I have disabled these to ensure that they don't interfere, but we get the same issue...

We really need to hand this site over soon, and while everything is looking great, we cannot have more than one instance of the application running, and it doesn't look great that we get all these lock timeouts...

Steps to reproduce

See above. You would need to create a container from the site, configure:

An azure Kubernetes Cluster with no persistent storage, D2V3 Node instance with 14GB memory. Deploy one CMS instance, 3 front-end Instances. DB is Azure SQL Server currently on a premium plan & 250 DTU to rule that out Load in 10000 pages containing block List editors and raw HTML We are using Plugins: Nexu, SkyBrud Url Redirect Manager Umbraco Forms, NewsletterStudio

Azure Blob storage for images

Expected result / actual result

Cache rebuilds quickly and app starts up quickly.

p-m-j commented 2 years ago

Can I send you a bakpak somehow

Sure, if you can drop me a mail to pmj at umbraco dot com with a link (wetransfer or similar, some blob container)

gabador-mof commented 2 years ago

Thank you! Try it on an s2 instance to see the results best.

p-m-j commented 2 years ago
  1. Downloaded bacpac and restored locally
  2. Checked value of Umbraco.Web.PublishedCache.NuCache.Serializer in database it was JSON so I configured to match in web config.
  3. With no App_Data/TEMP directory nucache.content.db is populated in 12 seconds, entire boot process is less than a minute.
  4. Stopped app and changed Umbraco.Web.PublishedCache.NuCache.Serializer to MsgPack and I saw a boot failed with a timeout for RebuildContentDbCacheLocked like you described.
  5. Tried again and it regenerated all the content in cmsContentNu successfully in messagepack format.
  6. Retried cold boot - nucache.content.db is populated in 5 seconds, entire boot process is less than a minute.

If you like I can send back a bacpac with the cmsContentNu in messagepack format so would match what you have in web.config?

gabador-mof commented 2 years ago

Holy cow that's good news. Can you tell me how to regenerate it? There is a uat dB that I'll need to apply the fix to too. Thank you so much!

On Fri, 27 May 2022, 5:24 pm Paul Johnson, @.***> wrote:

  1. Downloaded bacpac and restored locally
  2. Checked value of Umbraco.Web.PublishedCache.NuCache.Serializer in database it was JSON so I configured to match in web config.
  3. With no App_Data/TEMP directory nucache.content.db is populated in 12 seconds, entire boot process is less than a minute.
  4. Stopped app and changed Umbraco.Web.PublishedCache.NuCache.Serializer to MsgPack and I saw a boot failed with a timeout for RebuildContentDbCacheLocked like you described.
  5. Tried again and it regenerated all the content in cmsContentNu successfully in messagepack format.
  6. Retried cold boot - nucache.content.db is populated in 5 seconds, entire boot process is less than a minute.

If you like I can send back a bacpac with the cmsContentNu in messagepack format so would match what you have in web.config?

— Reply to this email directly, view it on GitHub https://github.com/umbraco/Umbraco-CMS/issues/12469#issuecomment-1139669284, or unsubscribe https://github.com/notifications/unsubscribe-auth/APLMAE6D3AAI3B7IKFR2KKTVMDLKTANCNFSM5WZGII6A . You are receiving this because you authored the thread.Message ID: @.***>

p-m-j commented 2 years ago

You said to try with S2 instance that was for database server?

I restored the backup onto a Basic elastic pool 50 DTU - cold boot locally using the azure sql server took ~2 minutes, populating nucache.content.db from database took 1 minute 15 seconds.

gabador-mof commented 2 years ago

Ok, hang on, I've just read this in more detail. I'm sure that we've tried starting up our qa environment once or twice, and once the app loaded, it was fine. However, startup time is nothing like what you described. Ever. Can you tell me what's in your app settings? Is it a fresh, clean install of the database? Did you have any replicas connecting at the same time?

gabador-mof commented 2 years ago

Yeah, my application still doesn't start up with anything like that speed... Can you send me back the bakpak and the Umbraco instance you have used to perform these tests? I'd like to compare, as my apps currently still take over 5-10 minutes to start up... Also, in my replica instances I still get the following error 4 times out of 5 when starting up the app: 'C:\inetpub\wwwroot\stylus-be\App_Data\TEMP\NuCache\NuCache.Content.db' because it is being used by another process.

I'm sure with an improved startup time this would go away, but I need to try and get to that improved speed, especially since you managed to get such good startup times...

gabador-mof commented 2 years ago

further to the above, when I don't have the following line in the web config:

the application starts, but super sluggish. If I do add it, it doesn't start. I can see the dataRaw populated in the DB, but the application doesn't start - it says request timed out and then published content cache locked...

p-m-j commented 2 years ago

Is it a fresh, clean install of the database?

I tried with a database restored from the bacpac you provided both on a local sql server instance and remotely on a SQL azure basic elastic pool.

Did you have any replicas connecting at the same time?

No but i'd be willing to give that a go next week

Can you send me back the bakpak and the Umbraco instance you have used to perform these tests

I sent a bacpac of the database with cmsContentNu in messagepack format back to the address that mailed me one via wetransfer g.f**@m***.com on friday afternoon.

Application wise I used the web.ui project from this here repo so the web.config looks exactly like this but with a connectionstring set

my apps currently still take over 5-10 minutes to start up

We are bound to have some differences in boot up time, you have windows containers running on azure and i'm running on the metal, it may be that you have an easier time if you vertically scale the resources available for your containers?

Have you resolved the discrepancy between cmsContentNu serialization format config vs database for your qa environment?, that's going to cause a lot of headaches, might be worth sticking to a single instance of the app until the table is rebuilt (or just restore the bacpac I sent where it is completed already for your qa instance).

I still get the following error 4 times out of 5 when starting up the app: 'C:\inetpub\wwwroot\stylus-be\App_Data\TEMP\NuCache\NuCache.Content.db' because it is being used by another process.

Do you have any logs from when this is occuring that you could share?

There is one interesting PR that has made it into v9 and not been backported to v8 that might help here https://github.com/umbraco/Umbraco-CMS/pull/12149

gabador-mof commented 2 years ago

Hi Paul,

Sorry - I meant a clean install of Umbraco, not DB. I guess that's the case though.

Regarding the startup time, I have no limits on CPU, and am using a Standard_E4_v3 instance - 32GB memory, with 2500Mi allocated to each pod. That should be enough, based on current usage. I will increase this to 3000 temporarily and redeploy in a minute, and apply some CPU limits too, to ensure that all is set. Same with requests.

However. I am getting this long start time even on my local machine. I have 32GB of RAM, with no limits, and an i7 processor. I really shouldn't have any issues here...

I have set my applications to msgPack now, and deployed the applications. They took a long time, but then started up. I can see this ins the contentnu table:

image

This used to be json before. Does that mean it's resolved now?

When I deploy a pod, that will never have a nucache file, so it will always be a cold boot right?

Thanks for responding on a sunday man! Really appreciate it!

p-m-j commented 2 years ago

This used to be json before. Does that mean it's resolved now?

You sent a log before that had a line Database NuCache was serialized using JSON. Currently configured NuCache serializer MsgPack. Rebuilding Nucache which reported as completed 17 minutes later, that should no longer be occuring each boot.

When I deploy a pod, that will never have a nucache file, so it will always be a cold boot right?

Technically the cold boot state relates to a discrepancy between that lastsynced.txt file and what's in the umbracoCacheInstruction table and that lastsynced won't exist in temp so yes you will cold boot every time.

nucache.content.db has to be populated from cmsContentNu if either SyncBootState is coldboot or the nucache.content.db file doesn't exist yet unless the config value PublishedSnapshotServiceOptions.IgnoreLocalDb is truthy in which case cmsContentNu contents is loaded directly into memory and nucache.content.db is never created.

p-m-j commented 2 years ago

Have you tried comparing a cold boot (delete the App_Data/TEMP directory) without your application code vs with your application code to compare the duration?

gabador-mof commented 2 years ago

I'll give that a go now again... This is the current log when trying to start up. It's then followed by all the set logs

image

p-m-j commented 2 years ago

Each time you do a deployment you will have a new App_Data/Temp and therefore no nucache.content.db on disk, so I think the IO of populating that file seems redundant, it exists only to speed up future boots by avoiding trips to the database.

I gave it a whirl with PublishedSnapshotServiceOptions.IgnoreLocalDb set to true, and a forced cold boot by deleting App_Data/Temp using the database from before on a local sql server instance.

The result was a cold boot that took in total 9 seconds total of which 6 seconds was loading cmsContentNu. Comparing this to the previous attempt cold booting against a local sql server instance ~1 minute with 12 seconds loading cmsContentNu (and updating the cache file on disk).

gabador-mof commented 2 years ago

Ok. I'm removing the local db from the equation then. I mean one benefit of this might be that if the pod gets restarted, not replaced, then the file would still be there... But let's see.

gabador-mof commented 2 years ago

Same difference. It's just that now the app times out all the time.

This is running (@0 int,@1 int,@2 int,@3 int,@4 uniqueidentifier,@5 int,@6 bigint,@7 bigint)SELECT [umbracoNode].[id] AS [Id], [umbracoNode].[uniqueId] AS [Key], [umbracoNode].[level] AS [Level], [umbracoNode].[path] AS [Path], [umbracoNode].[sortOrder] AS [SortOrder], [umbracoNode].[parentId] AS [ParentId], [umbracoNode].[createDate] AS [CreateDate], [umbracoNode].[nodeUser] AS [CreatorId] , [umbracoContent].[contentTypeId] AS [ContentTypeId] , [umbracoDocument].[published] AS [Published], [umbracoDocument].[edited] AS [Edited] , [umbracoContentVersion].[id] AS [VersionId], [umbracoContentVersion].[text] AS [EditName], [umbracoContentVersion].[versionDate] AS [EditVersionDate], [umbracoContentVersion].[userId] AS [EditWriterId] , [umbracoDocumentVersion].[templateId] AS [EditTemplateId] , [pcver].[id] AS [PublishedVersionId], [pcver].[text] AS [PubName], [pcver].[versionDate] AS [PubVersionDate], [pcver].[userId] AS [PubWriterId] , [pdver].[templateId] AS [PubTemplateId] , [nuEdit].[data] AS [EditData] , [nuPub].[data] AS [PubData] , [nuEdit].[dataRaw] AS [EditDataRaw] , [nuPub].[dataRaw] AS [PubDataRaw] FROM [umbracoNode] INNER JOIN [umbracoContent] ON ([umbracoNode].[id] = [umbracoContent].[nodeId]) INNER JOIN [umbracoDocument] ON ([umbracoNode].[id] = [umbracoDocument].[nodeId]) INNER JOIN [umbracoContentVersion] ON (([umbracoNode].[id] = [umbracoContentVersion].[nodeId]) AND [umbracoContentVersion].[current] = @0) INNER JOIN [umbracoDocumentVersion] ON ([umbracoContentVersion].[id] = [umbracoDocumentVersion].[id]) LEFT JOIN [umbracoContentVersion] [pcver] INNER JOIN [umbracoDocumentVersion] [pdver] ON (([pcver].[id] = [pdver].[id]) AND ([pdver].[published] = @1)) ON ([umbracoNode].[id] = [pcver].[nodeId]) LEFT JOIN [cmsContentNu] [nuEdit] ON (([umbracoNode].[id] = [nuEdit].[nodeId]) AND [nuEdit].[published] = @2) LEFT JOIN [cmsContentNu] [nuPub] ON (([umbracoNode].[id] = [nuPub].[nodeId]) AND ([nuPub].[published] = @3)) WHERE ((([umbracoNode].[nodeObjectType] = @4) AND ([umbracoNode].[trashed] = @5))) ORDER BY [umbracoNode].[level], [umbracoNode].[parentId], [umbracoNode].[sortOrder] OFFSET @6 ROWS FETCH NEXT @7 ROWS ONLY Then it times out, and I re-run again, and then it times out again. Nothing in the logs.

I've just installed a fresh install of Umbraco 8.18.3 and it's still loading after 10 minutes. Rebuilding the cache.

I'm just restoring the db you sent back and will try that with a fresh install... Ok, just did that. App Still starting up after 9 minutes:

CmsContentNu in mspack format, the following script running:

(@0 int,@1 int,@2 int,@3 int,@4 uniqueidentifier,@5 int,@6 bigint,@7 bigint)SELECT [umbracoNode].[id] AS [Id], [umbracoNode].[uniqueId] AS [Key], [umbracoNode].[level] AS [Level], [umbracoNode].[path] AS [Path], [umbracoNode].[sortOrder] AS [SortOrder], [umbracoNode].[parentId] AS [ParentId], [umbracoNode].[createDate] AS [CreateDate], [umbracoNode].[nodeUser] AS [CreatorId] , [umbracoContent].[contentTypeId] AS [ContentTypeId] , [umbracoDocument].[published] AS [Published], [umbracoDocument].[edited] AS [Edited] , [umbracoContentVersion].[id] AS [VersionId], [umbracoContentVersion].[text] AS [EditName], [umbracoContentVersion].[versionDate] AS [EditVersionDate], [umbracoContentVersion].[userId] AS [EditWriterId] , [umbracoDocumentVersion].[templateId] AS [EditTemplateId] , [pcver].[id] AS [PublishedVersionId], [pcver].[text] AS [PubName], [pcver].[versionDate] AS [PubVersionDate], [pcver].[userId] AS [PubWriterId] , [pdver].[templateId] AS [PubTemplateId] , [nuEdit].[data] AS [EditData] , [nuPub].[data] AS [PubData] , [nuEdit].[dataRaw] AS [EditDataRaw] , [nuPub].[dataRaw] AS [PubDataRaw] FROM [umbracoNode] INNER JOIN [umbracoContent] ON ([umbracoNode].[id] = [umbracoContent].[nodeId]) INNER JOIN [umbracoDocument] ON ([umbracoNode].[id] = [umbracoDocument].[nodeId]) INNER JOIN [umbracoContentVersion] ON (([umbracoNode].[id] = [umbracoContentVersion].[nodeId]) AND [umbracoContentVersion].[current] = @0) INNER JOIN [umbracoDocumentVersion] ON ([umbracoContentVersion].[id] = [umbracoDocumentVersion].[id]) LEFT JOIN [umbracoContentVersion] [pcver] INNER JOIN [umbracoDocumentVersion] [pdver] ON (([pcver].[id] = [pdver].[id]) AND ([pdver].[published] = @1)) ON ([umbracoNode].[id] = [pcver].[nodeId]) LEFT JOIN [cmsContentNu] [nuEdit] ON (([umbracoNode].[id] = [nuEdit].[nodeId]) AND [nuEdit].[published] = @2) LEFT JOIN [cmsContentNu] [nuPub] ON (([umbracoNode].[id] = [nuPub].[nodeId]) AND ([nuPub].[published] = @3)) WHERE ((([umbracoNode].[nodeObjectType] = @4) AND ([umbracoNode].[trashed] = @5))) ORDER BY [umbracoNode].[level], [umbracoNode].[parentId], [umbracoNode].[sortOrder] OFFSET @6 ROWS FETCH NEXT @7 ROWS ONLY

image

{"@t":"2022-05-29T19:23:46.9110705Z","@mt":"{StartMessage} [Timing {TimingId}]","StartMessage":"Loading content from database","TimingId":"8b69235","SourceContext":"Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
{"@t":"2022-05-29T19:38:52.7405208Z","@mt":"{EndMessage} ({Duration}ms) [Timing {TimingId}]","EndMessage":"Completed.","Duration":905829,"TimingId":"8b69235","SourceContext":"Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
{"@t":"2022-05-29T19:39:05.7090577Z","@mt":"{StartMessage} [Timing {TimingId}]","StartMessage":"Loading media from database","TimingId":"76037b6","SourceContext":"Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
{"@t":"2022-05-29T19:39:06.0934126Z","@mt":"{EndMessage} ({Duration}ms) [Timing {TimingId}]","EndMessage":"Completed.","Duration":383,"TimingId":"76037b6","SourceContext":"Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
{"@t":"2022-05-29T19:39:14.3740253Z","@mt":"Forcing index {IndexerName} to be unlocked since it was left in a locked state","IndexerName":"InternalIndex","SourceContext":"Umbraco.Examine.ExamineExtensions","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
{"@t":"2022-05-29T19:39:14.3760172Z","@mt":"Forcing index {IndexerName} to be unlocked since it was left in a locked state","IndexerName":"MembersIndex","SourceContext":"Umbraco.Examine.ExamineExtensions","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
{"@t":"2022-05-29T19:39:14.3760172Z","@mt":"Forcing index {IndexerName} to be unlocked since it was left in a locked state","IndexerName":"ExternalIndex","SourceContext":"Umbraco.Examine.ExamineExtensions","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
{"@t":"2022-05-29T19:39:14.3780118Z","@mt":"Starting initialize async background thread.","SourceContext":"Umbraco.Web.Search.BackgroundIndexRebuilder","ProcessId":18592,"ProcessName":"iisexpress","ThreadId":5,"AppDomainId":2,"AppDomainAppId":"LMW3SVC2ROOT","MachineName":"YOGAB","Log4NetLevel":"INFO ","HttpRequestNumber":2,"HttpRequestId":"245a4f3b-bcf1-4005-b809-bcf6a8006b3e"}
p-m-j commented 2 years ago

That log at the end there says it sucessfully loaded nucache for content in 15 minutes and 5 seconds, for media in 0.4 seconds.

There's no sign of any problem in that log output (other than things being slow)

gabador-mof commented 2 years ago

Yeah, but after 15 minutes, and nobody was using the db at the time. This time there wasn't a sql lock, but I literally cannot get any of my applications to start.

What's the next step at this point? I need to look at alternative hosting, to mitigate this issue, but in the long term this is not sustainable. No application takes 15 minutes to start...

Seeing how this is a bog standard Umbraco DB, with a bog standard OOB umbraco install, taking 15 minutes, can you tell me what we need to do here? Thank you!

p-m-j commented 2 years ago

I'm really sorry, I don't have a suggestion to improve performance booting a v8 site other than profiling the infrastructure to try and identify any bottlenecks that can be widened.

Future work to improve the startup duration is not going to target v8.

gabador-mof commented 2 years ago

Look, it's clearly not just a startup time of a site, because if it started up 10 times out of 10, after 7 minutes, I'd be fine with that.

So what you're saying is that there is no fix for this, and for a site that contains 10000 pages, Umbraco is not the right platform... I'm just trying to understand what our options are at this point, as it's a brand spanking new site, and we can't use it. Not even with the most cleanest of Umbraco installations.

Can you please explain how you managed to get such fast startup times over on your side, when I cannot get the same, running (what I assume to be) the same code, on my developer machine? Which version of Umbraco were you using?

p-m-j commented 2 years ago

I restored the a database from the bacpac supplied to both a local instance of SQL sever and a Basic tier elastic pool on azure.

Ran against a blank 8.18.4 codebase (no application specific code)

Specs on my development machine.

CPU: AMD Ryzen 7 3700X Memory: 64 GB DDR4 3200MHz Disk: SATA III ssd ~ 460 MB/s sequential write.

--

I have 32GB of RAM, with no limits, and an i7 processor. I really shouldn't have any issues here...

How long does it take you to cold boot against that database running locally with a blank code base (not in a windows container but on the metal)?

gabador-mof commented 2 years ago

15 minutes. That was on the metal, the logs I sent above...

p-m-j commented 2 years ago

15 minutes. That was on the metal, the logs I sent above...

And that was with no application specific code, and the database also running locally on the metal?

gabador-mof commented 2 years ago

No, the database was running on a S2 instance. It's the same with s3.

gabador-mof commented 2 years ago

I've taken the bulk of the data out, and moved it to a custom table, so the cache should now be around 600MB. THat's smaller than many of our other projects. To be on the safe side, I will upgrade to 8.18.4 too. I have just upgraded to that.

This has brought the application back into the realms of the usable. This is only when using the following params in web.config:

We don't have a serializer set, so defaulting to Json,

connection string connection timeout set to 600

I will try using MSGPack in a minute, just to see what that does, whether it speeds it up or breaks it. Ok, it broke it for a bit, but now all environments are on msgPack serializer, the cache size is 128MB and the site loads in an acceptable time-frame. I can now work with this.

However, I feel like with msgpack, there is no chunking of the data insertion when pushing into the database cache table. This might be why my app started up when using Json, but needed a super-long timeout when using MsgPack. Can you check if this is the case? If so, it would explain why there is a DB timeout when using msgpack and why the file remains locked after. If it was chunked, it would do it in smaller increments and potentially work better with larger sites too.

I could be wrong... It just struck me that the contentNu table size kept growing until the cache was rebuilt, but I ould not see any rows there

Cheers,Gabor

gabador-mof commented 2 years ago

Hi Paul, can you explain this main dom lock when you get a master/replica scenario? What happens when two applications start up at the same time? Say we have sqlMainDomlock key set, how do each of the pods do locking, and why?

(@0 int)UPDATE umbracoLock WITH (REPEATABLEREAD) SET value = (CASE WHEN (value=1) THEN -1 ELSE 1 END) WHERE id=@0

I have this command running occasionally in my SQL, when I get an error in the CMS that Lock Period timeout has been exceeded

p-m-j commented 2 years ago

Sure, but it's probably better to start with stating that I don't think you should be using SqlMainDomLock at all.

SqlMainDomLock solves some issues for running on Azure App Services.

For your setup the default implementation of IMainDomLock which uses named system semaphores should be fine.

V8 specific Answer (some of this is no longer true in v9)

SqlMainDomLock works by adding a record to umbracoKeyValue to take ownership of MainDom status for its MainDomKey where MainDomKey is a hash of a few fields (e.g. hostname) that identify an application running on a host. Two servers with different hostnames will have two different MainDomKeys for the same app.

Before it sticks that record in umbracoKeyValue it adds a distributed lock to the database in umbracoLock to ensure that two instances of the application with the same MainDomKey cannot obtain MainDom status at the same time, this is where you're seeing a timeout apparently (we don't do this anymore in v9, we just use a serialized isolation level instead when adding the record to umbracoKeyValue).

gabador-mof commented 2 years ago

Thank you for the explanation. I know we've gone off topic here, but just trying to understand what's happening in the application.

Sometimes I get an error now That the lock request timeout has been exceeded. This runs, and then it fails. (@0 int)UPDATE umbracoLock WITH (REPEATABLEREAD) SET value = (CASE WHEN (value=1) THEN -1 ELSE 1 END) WHERE id=@0

What causes a lock to be set? Is it possible that a call from the same application can set a lock, that then makes another request by someone else fail to acquire a lock?

p-m-j commented 2 years ago

I failed to answer the "What happens when two applications start up at the same time?" part.

Happy path is

1) Two instances start up on different servers, as they are different servers they have a different MainDomKey so they can both have "MainDom" status. 2) Both instances race to obtain a distributed write lock for adding a record to umbracoKeyValue to prove MainDom status. 3) One wins and adds its record, then releases its write lock 4) The other instance was blocked waiting to obtain a write lock, but now it can obtain its a lock and then add its record to umbracoKeyValue.

Both getting a write lock and adding/updating a maindom record to umbracoKeyValue are tiny queries and should be very fast unless there is bad latency between app server and database.

But obtaining a write lock has a timeout associated, for v8 the default is 5000ms unless configured otherwise via appsettings with the key Umbraco.Core.SqlWriteLockTimeOut, if a lock cannot be obtained in this time an exception is thrown.

p-m-j commented 2 years ago

To reiterate, as you are not running on azure app services you can skip all of this by not using SqlMainDomLock.

gabador-mof commented 2 years ago

I'm not using SqlMainDomLock...

gabador-mof commented 2 years ago

Unless by not specifying a key in the app settings, that's the default...

p-m-j commented 2 years ago

https://github.com/umbraco/Umbraco-CMS/issues/12469#issuecomment-1136765407

gabador-mof commented 2 years ago

I have subsequently removed it as per your suggestion. Sorry for not specifying

p-m-j commented 2 years ago

Distributed locking is used elsewhere in the codebase for other purposes, I imagine things will be a lot smoother if you add <add key="Umbraco.Core.SqlWriteLockTimeOut" value="20000" /> to appsettings.

gabador-mof commented 2 years ago

Ok, I can try this, but my question is what causes the lock, just trying to understand how we can optimize our code, so locking is minimized?

Say when you have a page, and the page saving is not finished due to event handlers and such running in the background... When would a lock be released? Specifically, when these lock errors come up, the above sql statement is running, could it be caused by the page save event still running in the background? Just trying to understand if we have a large page save operation, and multiple people are in the cms, saving other pages, could this lock occur?

p-m-j commented 2 years ago

Locks are obtained when required by our services. e.g. if you call ContentService.SaveAndPublish(aContent) (e.g. when using the backoffice UI) then a distrbitured write lock would be obtained for the content tree, they are released when the operation is completed.

gabador-mof commented 2 years ago

UNderstood. I imagine at this point there would be a cache operation going on in the background as well.

Honestly, thanks for all your help so far, I would have gone into a breakdown by now without it...

p-m-j commented 2 years ago

You're welcome but this sort of chat is probably better suited to the forums, a github discussion or the umbraco discord we are miles off from where we started now :)

gabador-mof commented 2 years ago

So, as an interesting development, there was some content on the home page, that seemed fine, that I deleted, and now everything seems to work like clockwork... Is it possible that if some content exists on the root node that doesn't match the schema, then Umbraco gets into a bit of a slum, and tries to parse that content and fails? And then everything slows down massively?

gabador-mof commented 2 years ago

Hi Paul, me again,

One last question please! If I connect to umbraco with a readonly user when not using SQLMainDomLock, I should be ok? From the replica servers?

Thank you!

p-m-j commented 2 years ago

https://our.umbraco.com/Documentation/Fundamentals/Setup/Server-Setup/Load-Balancing/flexible-advanced#subscriber-servers---read-only-database-access

I think that works as long as you have explicit server roles, at least that’s what the docs suggest