umbraco / Umbraco.Cloud.Issues

Public issue tracker for Umbraco Cloud

Sites not responding because of high thread count #72

Closed ronaldbarendse closed 5 years ago

ronaldbarendse commented 5 years ago

We've had multiple sites go down because of an IIS 'full hang': all requests to the application are very slow or time out, with symptoms including detectable request queueing and sometimes 503 Service Unavailable errors when queue limits are reached.

The Kudu portal is still reachable when the site is not responding, and looking in the process explorer there, the thread count is 500+ (normally running sites have around 45-50 threads).

These hangs started at the beginning of March and seem to be related to the ImageProcessor trimCache setting, which is now automatically transformed to true by the Umbraco Cloud deploy script. After manually setting this back to false (which has been the default in the Umbraco CMS codebase for over 2 years: https://github.com/umbraco/Umbraco-CMS/blob/853087a75044b814df458457dc9a1f778cc89749/build/NuSpecs/tools/cache.config.install.xdt), none of the sites had any downtime or high thread counts. However, publishing/deploying an update to the site undoes this change, so it's not a good work-around.

Reproduction

Have trimCache=true in the ImageProcessor cache configuration (config\imageprocessor\cache.config), request images that must be processed to trigger the cache trimming, and wait for the thread count to reach the limit, so IIS has no threads available to process new incoming requests...

I've come this far, but as I can't easily reproduce the issue, I'm not sure this is exactly what is causing the downtime. If this can be verified, the issue should be added to the ImageProcessor project...

For now, it would be nice to:

ronaldbarendse commented 5 years ago

Had 2 sites go down today; both had high thread counts (500+ and 100+) on the w3wp process, but the trimCache setting was false... So that's probably not the cause 🤔

Looking further, we have the ImageProcessor.Web.PostProcessor plug-in installed, which starts processes that might not be killed correctly. I've already uninstalled this plug-in on one site and that site hasn't (yet) gone down...

ronaldbarendse commented 5 years ago

BTW On the second site that was not responding, I saw the thread count go up from 100 to 160 in a couple of minutes.

I've also seen some weird things with log files, where the latest files were not in App_Data\Logs... I recall there was an issue with this at the beginning of March (after the major downtime), but I can't find the links/explanation of what exactly happened.

As we clean up all log4net files older than 3 months in a background task (RecurringTaskBase) every 24 hours, this might explain both the disappearing logs (although both the creation and last write time would have to be over 3 months old) and the 'thing' triggering the spontaneous increase in threads that leads to the site going down.
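
Stripped of the RecurringTaskBase plumbing, the cleanup logic boils down to something like the sketch below (simplified for illustration only; the helper name and file pattern are made up, and the real task runs inside the Umbraco scheduler every 24 hours):

// Simplified sketch of the log cleanup described above (hypothetical helper, not the
// actual task). The real implementation runs via Umbraco's RecurringTaskBase every 24 hours.
using System;
using System.IO;

public static class LogCleanupSketch
{
    public static void DeleteOldLogs(string logDirectory)
    {
        // Anything older than 3 months is considered stale.
        var threshold = DateTime.UtcNow.AddMonths(-3);

        // File pattern is illustrative; it should match the rolled log4net files.
        foreach (var file in Directory.EnumerateFiles(logDirectory, "UmbracoTraceLog.*"))
        {
            // Both creation and last write time have to be past the threshold,
            // so recently rolled or still-written files are left alone.
            if (File.GetCreationTimeUtc(file) < threshold &&
                File.GetLastWriteTimeUtc(file) < threshold)
            {
                File.Delete(file);
            }
        }
    }
}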

KennethJakobsen commented 5 years ago

Hi @ronaldbarendse, thank you for reporting your issue. However, it doesn't seem like this trimCache setting is the issue here, so I'm closing this issue. We will of course keep investigating through your open support case.

ronaldbarendse commented 5 years ago

It indeed looks like it isn't the trimCache setting, but it's also not our own log cleanup task, as a site where I've removed that also went down with a very high (and rapidly increasing) thread count...

So it might be the ImageProcessor.Web.PostProcessor plug-in, as that starts multiple threads for every processed/cached image. Will uninstall that on some projects and report back here 👍

KennethJakobsen commented 5 years ago

Hi @ronaldbarendse, that would be awesome. We are very interested in any issues you may encounter, so we can help prevent them in the future.

ronaldbarendse commented 5 years ago

After a lot of testing, we isolated the problem to having ImageProcessor.Web.PostProcessor installed... Sites with both trimCache true/false and with/without our own log cleaning still became unresponsive, but none of the sites where we uninstalled the PostProcessor have (yet) 🎉

skttl commented 5 years ago

@ronaldbarendse that is interesting. We use PostProcessor on all our sites. Can you post updates if you get it fixed in PostProcessor?

ronaldbarendse commented 5 years ago

@sniffdk also experienced this issue (site unavailable and thread count at 572), but isn't using ImageProcessor.Web.PostProcessor... Even @JimBobSquarePants doubts it's causing this, as it would require a constant input of new image requests (and the processes are killed after a timeout - 5 seconds by default).

We've also recently seen IIS Express randomly stop when developing sites locally, so the problem might have been introduced in a recent Umbraco version. It's all speculation, but maybe all of this can be tied together and point in the right direction.

@KennethJakobsen: is it possible for the support team to check all active requests/threads when this issue occurs, so we at least know what's running in all these threads (e.g. using %windir%\system32\inetsrv\appcmd list requests /elapsed:10000 or by creating a minidump)?

zpqrtbnk commented 5 years ago

Ideally, just to make sure it's not us (Core or Cloud) doing something weird, I would love to get a minidump of a site with 500+ (or even 100+) threads. Just ping me if you can do it. Thanks!

ronaldbarendse commented 5 years ago

@zpqrtbnk I've contacted Eric because a site that's still in development went down today (500+ threads): we agreed he would let the developers have a look and not restart it. So talk to him and please go ahead with making a minidump!

ronaldbarendse commented 5 years ago

I've reproduced the crashing IIS Express process (multiple times, on multiple PCs and with different Umbraco projects): just start the website in debug mode (F5), let the site start up and show the front page, re-save the Web.config (outside of VS, so the site restarts without the process stopping) and repeat saving/reloading until you get the following error:

Managed Debugging Assistant 'FatalExecutionEngineError' : 'The runtime has encountered a fatal error. The address of the error was at 0x738fb3e3, on thread 0x4c1c. The error code is 0xc0000005. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.'

Looking at the thread specified in the error, it's running the following code every time: Microsoft.AspNet.SignalR.Core.dll!Microsoft.AspNet.SignalR.Hubs.ReflectedHubDescriptorProvider.GetTypesSafe.

We also noticed the process crashes most of the time after changing document types or doing a deploy (using the marker file in data\deploy). Umbraco ModelsBuilder is enabled and set to LiveAppData, BTW.

@zpqrtbnk Can you try to reproduce this? It might be related, don't you think?

ronaldbarendse commented 5 years ago

@zpqrtbnk The problems above are caused by updating Microsoft.CodeAnalysis.CSharp to version 3.0 (which is allowed by the Umbraco.ModelsBuilder NuGet package 3.0.10)...

ronaldbarendse commented 5 years ago

We're still experiencing sites going down randomly (all with high thread counts) and have already ruled out the ImageProcessor trimCache setting and ImageProcessor.Web.PostProcessor, as it still happens with different configurations (on/off, installed/uninstalled). As other users have also experienced this, it can't be in our own code/implementations. And it only happens on sites running on Umbraco Cloud...

Umbraco support reported on May 2nd that they were seeing an increase in 503 errors because of an issue in their server setup and were working on a fix. The next scheduled maintenance was on May 13th (https://status.umbraco.io/incidents/t3vhcwkvz029), but that either did not include the fix or at least not one that solved the problem. Also note this isn't related to the incident on May 3rd (https://status.umbraco.io/incidents/1wwq67wmx7q0), as that was caused by an Azure DNS issue (R50C-5RZ).

Looking further, I noticed that the log4net.config file is transformed to add a UCDbErrors appender, and also that the default rollingFile appender is replaced with a custom one:

<appender name="rollingFile" type="UmbracoCloud.log4net.RollingFileAppender.RollingFileAppender, UmbracoCloud.log4net.RollingFileAppender, Version=1.0.0.0, Culture=neutral, PublicKeyToken=5fbb1827f0ea7d2a">

As they're both wrapped in the AsynchronousLog4NetAppender, which starts threads to do the logging asynchronously in the background, it could be that one of these appenders is blocking, causing an increase in threads. It looks like this is a known problem in the Log4Net.Async package: https://github.com/cjbhaines/Log4Net.Async/issues/39.
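
For context, an asynchronous forwarding appender follows roughly this pattern (a generic illustration only, not the actual Log4Net.Async or UmbracoCloud code): logging calls enqueue events and a background thread forwards them to the wrapped appender, so a slow or blocked wrapped appender lets work pile up behind it.

// Generic sketch of an async forwarding appender (illustration only).
using System;
using System.Collections.Concurrent;
using System.Threading;

class AsyncForwardingSketch
{
    private readonly BlockingCollection<string> _events = new BlockingCollection<string>();
    private readonly Thread _worker;

    public AsyncForwardingSketch()
    {
        // One background thread drains the queue and forwards to the wrapped appender.
        _worker = new Thread(Forward) { IsBackground = true };
        _worker.Start();
    }

    // Called from request threads: just enqueues and returns.
    public void Append(string logEvent)
    {
        _events.Add(logEvent);
    }

    private void Forward()
    {
        foreach (var logEvent in _events.GetConsumingEnumerable())
        {
            // If the wrapped appender blocks here, events keep piling up in _events.
            WriteToWrappedAppender(logEvent);
        }
    }

    private void WriteToWrappedAppender(string logEvent)
    {
        // Placeholder for the real appender (e.g. the rolling file appender).
        Console.WriteLine(logEvent);
    }
}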

mikkelhm commented 5 years ago

Hi @ronaldbarendse

I'm unsure about the thread usage of the loggers we are adding, but I'm very interested in your findings. On Cloud we do run some forced transforms, but if you'd like to, you could try rolling with the following log4net.config file on your site. It will prevent the custom loggers from being added and only run with what is defined. The following could be used:

<?xml version="1.0"?>
<log4net>

  <root>
    <priority value="Info" />
    <appender-ref ref="FileAppender" />
  </root>

  <appender name="FileAppender" type="log4net.Appender.FileAppender">
    <file value="App_Data\Logs\CustomUmbracoTraceLog.txt" />
    <appendToFile value="true" />
    <lockingModel type="log4net.Appender.FileAppender+MinimalLock" />
    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value=" %date [P%property{processId}/D%property{appDomainId}/T%thread] %-5level %logger - %message%newline" />
    </layout>
  </appender>

  <!-- Here you can change the way logging works for certain namespaces -->

  <logger name="NHibernate">
    <level value="WARN" />
  </logger>

</log4net>

You can test it, but with the disclaimer that it is totally unsupported, as files will no longer be written asynchronously and the "error logs" feature in the Cloud portal will also no longer work. Any chance you could test it out?

Also, do you have any sites running on version 8? And do you know if they have the same issues? We switched to Serilog in version 8.

ronaldbarendse commented 5 years ago

After having major trouble with a website that heavily uses Examine, I've seen the thread count go up as soon as a re-index started (almost all threads were in a waiting state). Especially when the re-indexing takes a while (e.g. when the site/server is under load or has a lot of items), the thread count keeps increasing and requests time out (probably because they wait for the index to become available and that takes too long).

Looking at the Examine source, the SafelyProcessQueueItems method starts a single task to process the queued items, but all onComplete callbacks are added as task continuations: https://github.com/Shazwazza/Examine/blob/864168c7f019908242dd56197e075edea5ee77cf/src/Examine/LuceneEngine/Providers/LuceneIndex.cs#L727-L776.

And according to the Microsoft documentation:

A continuation is a task that is created in the WaitingForActivation state.

So this could be the cause of the increasing threads (in a waiting state) while the site isn't responding...
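
To illustrate what that WaitingForActivation state looks like, here's a small stand-alone demo (plain .NET, not Examine code): a continuation created with ContinueWith reports WaitingForActivation until its antecedent task completes.

// Stand-alone demo of continuation task states (not Examine code).
using System;
using System.Threading;
using System.Threading.Tasks;

class ContinuationStateDemo
{
    static void Main()
    {
        // Antecedent task that takes a while to finish.
        var antecedent = Task.Run(() => Thread.Sleep(2000));

        // Each onComplete callback attached like this becomes its own Task object...
        var continuation = antecedent.ContinueWith(_ => Console.WriteLine("onComplete ran"));

        // ...which reports WaitingForActivation while the antecedent is still running.
        Console.WriteLine(continuation.Status);

        continuation.Wait();
        Console.WriteLine(continuation.Status); // RanToCompletion
    }
}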

I've now added runAsync="false" to the indexer configuration to test whether the thread count still increases when the site becomes unavailable.

Not sure what causes the actual time-outs though, but I've seen lots of Lucene.Net.Store.AlreadyClosedException exceptions while the site was unavailable:

Lucene.Net.Store.AlreadyClosedException: this IndexReader is closed
   at Lucene.Net.Index.IndexReader.EnsureOpen()
   at Lucene.Net.Index.DirectoryReader.GetFieldNames(FieldOption fieldNames)
   at Examine.LuceneEngine.Providers.LuceneSearcher.GetSearchFields()
   at UmbracoExamine.UmbracoExamineSearcher.GetSearchFields()
   at Examine.LuceneEngine.Providers.BaseLuceneSearcher.CreateSearchCriteria(String type, BooleanOperation defaultOperation)
   at UmbracoExamine.UmbracoExamineSearcher.CreateSearchCriteria(String type, BooleanOperation defaultOperation)
   at Examine.LuceneEngine.Providers.BaseLuceneSearcher.CreateSearchCriteria(String type)
   at Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedMediaCache.GetUmbracoMediaCacheValues(Int32 id)

@Shazwazza Any thoughts on this?

ronaldbarendse commented 5 years ago

You can test it, but with the disclaimer that it is totally unsupported, as files will no longer be written asynchronously and the "error logs" feature in the Cloud portal will also no longer work. Any chance you could test it out?

I'll try this out when I can! I've not seen an overview of all forced transforms in the Cloud documentation (I already talked about this with @sofietoft, but she needs input from the developers first) or a way to disable them... Going from your example, the log4net.config transform adds the UCDbErrors appender below the AsynchronousLog4NetAppender and changes the type of the rollingFile appender, right?

Also, do you have any sites running on version 8? And do you know if they have the same issues? We switched to Serilog in version 8.

All problems are with Umbraco 7 sites (7.12.4 - 7.14.0). Version 8 might solve these problems (different logger, updated Examine/Lucene version, etc.), but in testing we've found a lot of problems and are waiting for those to be fixed in 8.1...

Shazwazza commented 5 years ago

Hi @ronaldbarendse, thanks for the info - I can try to see if I can determine whether that is causing any issues. The thing is that this code has been like this for many, many years without reports of issues, but stranger things have happened - and we all probably know a lot more than we did many years ago ;) SafelyProcessQueueItems creates a new thread when there is no indexing going on; once the queue is written to, it keeps using the same thread to continuously read from the queue until the queue is empty - and during re-indexing the queue will be asynchronously added to. Only once the queue is empty will a continuation thread be created, when processing finishes. But... this line is what I think needs to be looked at (can't remember how all this fits together exactly): https://github.com/Shazwazza/Examine/blob/864168c7f019908242dd56197e075edea5ee77cf/src/Examine/LuceneEngine/Providers/LuceneIndex.cs#L772
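
Roughly, the pattern described here looks like the sketch below (heavily simplified, for illustration only; it is not the actual Examine implementation): a single task drains the queue until it's empty, and the onComplete callbacks are attached to that task as continuations.

// Much-simplified sketch of the queue-draining pattern (illustration only, not Examine code).
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class IndexQueueSketch
{
    private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();
    private readonly object _locker = new object();
    private Task _processing = Task.CompletedTask;

    public void Enqueue(string item, Action onComplete)
    {
        _queue.Enqueue(item);

        lock (_locker)
        {
            if (_processing.IsCompleted)
            {
                // No processing running: start a single task that drains the queue.
                _processing = Task.Run(() => Drain());
            }

            // Every caller's onComplete is attached to that task as a continuation.
            _processing.ContinueWith(t => onComplete());
        }
    }

    private void Drain()
    {
        string item;
        while (_queue.TryDequeue(out item))
        {
            Console.WriteLine("Indexing " + item);
        }
    }
}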

Also, why is it re-indexing and doing so during heavy load? Is this normal re-indexing or something custom?

Shazwazza commented 5 years ago

@ronaldbarendse I've added this to the Examine repo so I can track it there: https://github.com/Shazwazza/Examine/issues/125

ronaldbarendse commented 5 years ago

Also, why is it re-indexing and doing so during heavy load? Is this normal re-indexing or something custom?

I started the re-index as a last resort, as the index might have been corrupted by all the environment restarts, AlreadyClosedExceptions, etc. The site itself wasn't under heavy load at all. The default indexers for content and media were producing these problems, but custom Merchello indexes were also configured on this specific site...

The randomness of this issue makes it very hard to debug/diagnose... As of now, my guess would be that this is something in the logging (as UC has a custom logger, UmbracoCloud.log4net.RollingFileAppender.RollingFileAppender, and Log4Net.Async has an unresolved issue regarding the ParallelForwardingAppender) or in Examine (which starts threads for every queued item and also requires additional configuration to perform well on Azure/UC).

ronaldbarendse commented 5 years ago

FYI: I've found another possible component we use (and which may be specific to our sites) that might be responsible for the waiting/blocking threads: https://github.com/logzio/logzio-dotnet/issues/34.

Shazwazza commented 5 years ago

@ronaldbarendse I've done some research and added notes to the Examine issue here: https://github.com/Shazwazza/Examine/issues/125#issuecomment-518524401

As it is now, I can't see this being an issue with the Examine queue.

Regarding this

After having major trouble with a website that heavily uses Examine, I've seen the thread count go up as soon as a re-index started (almost all threads were in a waiting state). Especially when the re-indexing takes a while (e.g. when the site/server is under load or has a lot of items), the thread count keeps increasing and requests time out (probably because they wait for the index to become available and that takes too long).

Yes, unfortunately in v7, if indexes are not there and need to be rebuilt, the application will block until they are done, because indexes are semi-critical for looking up media on the front-end. In v8 this is not the case. I'm unsure whether the index rebuilding you are referring to happens on startup, but if it does and you have a busy site, then yes, request threads will continue to increase until they are served, and they won't be served until startup completes. It's not ideal, but that's why media is no longer cached in Lucene in v8, and why index rebuilding, if needed, is queued to a background thread after the application starts.

You definitely won't want to run runAsync="false" in production; it will cause problems and should only be used for testing. As for the AlreadyClosedException, you'll also want to check whether your site is in a constant restart/shutdown loop. If it's constantly shutting down you'll get all sorts of problems, and you'll then need to determine why it's doing that (not a fun task).

ronaldbarendse commented 5 years ago

I just wanted to update this issue, as we're still seeing sites go down at random. As HQ has now added monitoring and automatically restarts unresponsive sites within 20 minutes, the downtime is at least kept to a minimum.

As I don't have the permissions to create a minidump of an unresponsive site on UC (and we don't see this issue occurring outside of Umbraco Cloud, e.g. not locally or on other, more traditional, hosting setups), the root cause of all these waiting threads is still unknown.