Sites with 100,000+ nodes fail to start with content.nucache locked errors

stevetemple commented 2 years ago

Which exact Umbraco version are you using?

8.17.1

Bug summary

When cold booting a site with 100,000+ nodes I'm getting content.nucache locked errors.

If the content.nucache file is present then it starts up fine, it's the initial generation of the site.

I can reproduce this by pointing an empty v8.17.1 site (just umbraco cms nuget package installed) at the database as well as with the heavily customised site.

This leaves the site in a failed state and it won't then recover.

Specifics

I get the following logs (tweaked to make readable):

2022-01-20T14:18:30.6168581Z    Loading content from database   Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService
2022-01-20T14:20:21.9847993Z    Exception ({InstanceId}).\r\nThe thread has been aborted, because the request has timed out. System.Threading.ThreadAbortException: Thread was being aborted.\r\n   at SNIReadSyncOverAsync(SNI_ConnWrapper* , SNI_Packet** , Int32) 
2022-01-20T14:20:22.0160195Z    Completed   Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService
2022-01-20T14:21:13.1427460Z    Panic, exception while loading cache data.  System.InvalidOperationException: Invalid operation. The connection is closed.\r\n   at System.Data.SqlClient.SqlInternalConnectionTds.ValidateConnectionForExecute(SqlCommand command)

The 2nd log line is actually inside some NPoco db operations, I truncated the very long stack trace,

What I believe is happening is that the connection that starts the site hits the execution timeout, this then redirects the user which generates a ThreadAbortException in the middle of a DB operation, I think this is at some point closing the DB connection which is being used on the thread that is populating the content.nucache db.

We then have it reporting as Completed (it's not) and then a Panic as the connection is closed and it leaves the nucache db files locked. Which then stops the site loading.

If I increase executionTimeout to something very large (I believe the default is 110 seconds) then this doesn't happen but does leave my site with very large executionTimeouts which isn't ideal.

Steps to reproduce

Have a site with very large number of nodes 100,000+ and delete nucache and start it from cold boot.

Expected result / actual result

I think the initial connection timing out shouldn't close the connection and leave the site in a failed state because nucache is now locked.

Ideally it could timeout but the building continues in a background thread and site comes up once that is completed.

duncanhaines commented 2 years ago

I'm seeing the exact same problem on a site, same version of umbraco, 8.17.1 and 85,000+ nodes. I've had to resort to the same fix of increasing executionTimeout;

<httpRuntime executionTimeout="300" />

kjac commented 1 year ago

Hi @stevetemple,

Thank you for reporting this, and sorry for the late response.

We are currently investigating options to mend or replace nucache. It's on the roadmap. Currenly I can't say anything as for a timeline, though.

While we prepare for that, every last bit of input from people experiencing issues with large sites is very much appreciated. So... a few questions come to mind 😄

Are you still experiencing this problem?
Did you try migrating to V10 (there have been some performance improvements along the way)?

stevetemple commented 1 year ago

Hi @kjac

Thanks for the response,

We have this pretty regularly, we've got a workaround of having very large executionTimeouts. It's worse with SQL Azure as that's slower DB generally.
Always flinch slightly when there is a suggestion is to upgrade a large prod site to vLatest as that is a big project, £10,000s worth of effort. We've only relatively recently really got them onto v8, so while that is the plan medium-long term, but not as yet something our client is willing to sign off on.

I suspect there is a possible way of batching some of this stuff to prevent timeouts, something I was going to look at but haven't had the time in the last few months.

osigas commented 1 year ago

Just to chime in on this, we've also got a v8 site with 200,000+ nodes that occasionally gets this issue too. It happened when the application pool recycles on a schedule overnight, but leaves the site in a state where every request ends with a System.IO.IOException: The process cannot access the file '{TEMP}\NuCache\NuCache.Content.db' because it is being used by another process.

So far our only solution has been to extend the exectionTimeout.

stevetemple commented 1 year ago

One thing that we've found that helps is running ALTER INDEX PK_cmsContentNu ON cmsContentNu REBUILD regularly. This index gets fragmented and slow pretty quickly on large sites.

marcoteodoro commented 1 year ago

Hi @stevetemple we are facing some issues with caching and was wondering if this is something you still need to do on v10?

miguelcrpinto commented 1 year ago

@osigas I believe the issue you are having is unrelated to the one reported by @stevetemple.

To minimize downtime, IIS starts a new instance of the web application before the old one is completely stopped. That exception occurs because the new application tries to use NuCache.Content.db while the old application is still using it.

You can avoid those locks by disallowing overlapping rotation in IIS. You can achieve that via the UI or with the following powershell command: Set-WebConfiguration /system.applicationHost/applicationPools/applicationPoolDefaults/recycling/@disallowOverlappingRotation -value "true"

Keep in mind that this might cause more downtime.

You can also disable the IIS periodic restarts (by default it restarts every 29 hours) via the UI or by running the following powershell command: Set-WebConfiguration /system.applicationHost/applicationPools/applicationPoolDefaults/recycling/periodicRestart/@time -value "00:00:00"

umbraco / Umbraco-CMS