Closed robertjf closed 2 years ago
Is this on Umbraco Cloud, or on pure Azure? Where is D:\home\site\wwwroot\App_Data\TEMP
- is it on a shared drive, or on a local drive? Is it possible that another instance of the site, running on another server, is trying to access these files? These files can only be accessed by 1 instance at a time - we have mechanism in place to ensure this on one server but not accross servers.
This is in standard Azure on a basic plan with single instance only, there’s no special configuration at all, pretty vanilla Umbraco 8 with hacked up starter kit
The same config principles apply to v8 as v7 for running on Azure. We defo need to get the docs updated (PRs are welcome!) The current azure docs are here https://our.umbraco.com/Documentation/Getting-Started/Setup/Server-Setup/azure-web-apps
for v8 you will need these in appSettings:
<add key="Umbraco.Core.LocalTempStorage" value="EnvironmentTemp" />
<add key="Umbraco.Examine.LuceneDirectoryFactory" value="Examine.LuceneEngine.Directories.SyncTempEnvDirectoryFactory, Examine" />
Examine.LuceneEngine.Directories.TempEnvDirectoryFactory
interestingly enough, there is no examine config files in this website - I hadn't noticed that before... is that normal? Out of the box shouldn't this just work? Do I need to add the files?
There is no examine config files in v8. OOTB, just like v7, you need to adjust some config to make umbraco work with azure. So the above 2 config values are needed. These are equivalent to the v7 appSetting umbracoLocalTempStorage
and the examine config directoryFactory
switch.
The 2 config values are just in your web.config, these are just appSetting
key values in v8
ah, right - good to know :)
so is it right to assume these are going to be automatically added in future, or are they only in certain situations?
No you cannot assume that, Just like deploying v7 to Azure you need to setup specific config for that, it's no different in v8.
If however, you create an Umbraco website from the Azure portal, then yes these should be pre-configured for you but I don't think we have a v8 build on the Azure portal yet.
Thanks for explaining the required config settings.
But in v8, even with the above settings, the exact error happens when you enable Slot Swap in the Azure App Service: before the swap both slots are working ok, but after the Swap the new "Production" slot throws the error every time. A full App Service restart is needed to make things working again.
If you work with a single slot everything is Ok,
That's interesting, will assume the same problem exists in v7 too since it's the same paradigm.
I think the only way around such behavior is to have an option to not have a persisted cache file, or name the cache file based on the AppDomainAppId + MachineName (should be unique among processes).
@zpqrtbnk what do you think here?
Note: We have a site on Umbraco 7.13.2 running on Azure with slot swapping (and the "LocalTempStorage" ="EnvironmentTemp" app setting) and we never run into this problem with the XML cache file. Maybe because it's not open all the time?
NuCache stores its file in Path.Combine(_globalSettings.LocalTempPath, "NuCache")
and therefore respects the <add key="Umbraco.Core.LocalTempStorage" value="EnvironmentTemp" />
setting. The temp path combines the application id and the site name, but not the machine name, as the local temp is supposed to be local, ie per-server.
I don't know much about "slot swaps" in Azure. But, if they are running on the same machine, same application id, same site name, same temp dir then... there might be a collision? That would require some troubleshooting to get it right. But then, as you mention, it's not only NuCache but other things too.
Now indeed, the difference is that v8 locks the NuCache files for as long as it's running where v7 was "using" the Xml file from time to time = the two sites may cohabit (but that was a bad idea).
So... to make it short, I'd like to hear more about Slot Swaps and, on all slots, get the value of HostingEnvironment.SiteName
, HostingEnvironment.ApplicationID
, and %temp%
.
@zpqrtbnk thanks for looking into this. I will post today the requested information about the Slot Swaps and those values.
This is what I got on each slot, before and after the swap:
<!-- STAGE slot, before swap
HostingEnvironment.SiteName = 'terniumcomdev__e951'
HostingEnvironment.ApplicationID = '/LM/W3SVC/1464357220/ROOT'
%temp% = 'D:\local\Temp'
-->
<!-- PRODUCTION slot, before swap
HostingEnvironment.SiteName = 'terniumcomdev__dd91'
HostingEnvironment.ApplicationID = '/LM/W3SVC/1742369212/ROOT'
%temp% = 'D:\local\Temp'
-->
<!-- STAGE slot, after swap
HostingEnvironment.SiteName = 'terniumcomdev__dd91'
HostingEnvironment.ApplicationID = '/LM/W3SVC/1742369212/ROOT'
%temp% = 'D:\local\Temp'
-->
<!-- PRODUCTION slot, after swap
HostingEnvironment.SiteName = 'terniumcomdev__e951'
HostingEnvironment.ApplicationID = '/LM/W3SVC/1464357220/ROOT'
%temp% = 'D:\local\Temp'
-->
And this time the error was not thrown. In a previous execution (with different slots) I got the following:
System.IO.IOException: The process cannot access the file 'D:\local\Temp\UmbracoData\854d45b396372eb551323cc722f33136\NuCache\NuCache.Content.db' because it is being used by another process.
I am positively sure that this setting was in effect when the error was thrown:
<add key="Umbraco.Core.LocalTempStorage" value="EnvironmentTemp" />
but this one was not (may this be the issue?)
<add key="Umbraco.Examine.LuceneDirectoryFactory" value="Examine.LuceneEngine.Directories.SyncTempEnvDirectoryFactory, Examine" />
In my latest test (no error) both settings were in effect.
The Lucene setting has no impact on NuCache. The fact that you are seeing an error such as The process cannot access the file 'D:\local\Temp\UmbracoData\854...136\NuCache\NuCache.Content.db'
indicates that you are indeed using the proper local temp storage setting.
NuCache uses the built-in MainDom mechanism to ensure that only one app domain at a time can own the cache (and the associated files). MainDom uses a machine-wide named lock; the name is built by combining the application id and the application physical path.
(so obviously I should also have asked you the HostingEnvironment.ApplicationPhysicalPath
values...)
A site cannot, in theory, even try to access the NuCache file until it has aquired the MainDom lock. Therefore, for the error to happen... the site must own the machine-wide lock on (app.id, app.path) and yet someone else must lock the files at (app.id, site.name), meaning
Any chance you can get the ApplicationPhysicalPath
values?
And... I don't know enough about "slots" to figure out the physical server thing. Ideas?
Just tested it. The ApplicationPhysicalPath
value is always D:\home\site\wwwroot\
in both slots, before and after the swap.
Regarding the physical server... it is not fully documented, but many posts assume it is the same because all slots share the same App Service Plan = the same set of resources (i.e. 3.75 Gb RAM).
So... assuming that all sites run on the same physical server (for now), that leaves us with app.path.
D:\home\site\wwwroot\
is a kinda virtual path that Azure uses, so basically all apps are always under that path, and that can be annoying in some cases... but in our case, it should indeed make things even more safer.
Did not realize you posted a log file with the original issue - now looking at that file.
@zpqrtbnk - the log file is not mine, so it does not apply to the slot issue. Thanks.
In any case, for some mysterious reason the "process cannot access the file NuCache.Content.db" error has not appeared lately - in all slots swaps during the last three days. I will post more information - including a log - if it does happen again. Thanks for your help!
Thanks for the update.
Even though it's not your log... the significant lines are:
2019-03-19 03:16:49,695 [P9488/D3/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting Umbraco 8.0.0 on RD2818786B7D96. [Timing e8bc0a7]
2019-03-19 03:16:50,759 [P9488/D3/T1] INFO Umbraco.Core.MainDom - Acquired.
2019-03-19 03:17:39,322 [P9488/D3/T17] INFO Umbraco.Core.MainDom - Released (environment)
2019-03-19 03:25:35,501 [P9488/D4/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting Umbraco 8.0.0 on RD2818786B7D96. [Timing 761ae5c]
2019-03-19 03:25:36,074 [P9488/D4/T1] INFO Umbraco.Core.MainDom - Acquired.
2019-03-19 03:27:45,813 [P9488/D4/T16] INFO Umbraco.Core.MainDom - Released (environment)
2019-03-19 03:27:49,388 [P9488/D5/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting Umbraco 8.0.0 on RD2818786B7D96. [Timing 9ce9254]
2019-03-19 03:27:50,970 [P9488/D5/T1] INFO Umbraco.Core.MainDom - Acquired.
2019-03-19 03:28:24,123 [P9488/D5/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booted. (34755ms) [Timing 9ce9254]
2019-03-19 03:28:59,434 [P9472/D2/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting Umbraco 8.0.0 on RD2818786B7D96. [Timing 56405ae]
2019-03-19 03:29:06,148 [P9472/D2/T1] INFO Umbraco.Core.MainDom - Acquired.
2019-03-19 03:29:31,320 [P9472/D2/T1] ERROR Umbraco.Core.Runtime.CoreRuntime - Boot failed. (32637ms) [Timing 56405ae]
Umbraco.Core.Exceptions.BootFailedException: Boot failed. ---> System.IO.IOException: ...
2019-03-19 03:29:46,191 [P9488/D5/T13] INFO Umbraco.Core.MainDom - Released (environment)
Where we see that a new process (9472) starts while the previous one (9488) is running, and acquires the MainDom lock before the previous process releases it, thus hitting an exception when trying to read the cache. That should not be possible...
If anything happens again, thanks for reporting. Meanwhile, I've made a few changes so that in 8.0.2 we log all the important infos (app.path, app.id, etc) when the site boots.
Hi @zpqrtbnk Some more info (the exception is being thrown again in my Azure environment). A log is attached, the same behavior you described happens in the last lines of the log.
Note that in this case:
UmbracoTraceLog.RD2818784FF824.20190409.zip
I'm running v8.0.1 installed via the nuget.org package.
Hope it helps.
@zpqrtbnk - I could further isolate the problem.
I performed the following steps:
So it looks like this is related to the Azure Slot swap - although it is the same machine, for some reason the process starts over without having shut down the previous one.
One last comment about the slot swap process: the "swapped" Web app remains the same, in the same machine .. but the URLs that hit it change (i.e. from contoso-stage.azurewebsites.net to contoso.azurewebsites.net). I don't know if immediately after the switch the Azure infrastructure sends some kind of wakeup call.
Hey - thanks for the details - just FYI I am away at the Barcelona meetup, with little time for this issue - will resume work on it at the end of the week (so don't feel bad if I don't reply).
~Have looked at your logfile and... here is the thing: the new process (P564) is starting for a different application (since you swapped). So P8108 was for the other application. It does not have to release its lock for P564 to acquire it... OTOH the lock has to be released... but that should happen on the other slot. And so... I would love to see the other log. Is this possible?~
~Note that it still does not explain what is going on with the cache. I am currently running some experiments with Azure AppServices to try to get a better understanding of it all.~
Or maybe not - further experimenting.
I have created an Azure App Service hosting a simple MVC web app, with two slots. Had to have one per-slot appSetting in order to force slot-swaps to trigger a restart. Then, have our MainDom system and some temp file lock to the app... and I can swap slots, and see the MainDom lock being properly released, and swaps happening without issues.
I have to accept that something is wrong, considering your log, but for now I am running out of idea.
During the swap, on my test, the old process terminates and then a new process starts. In your log, the new process starts before the old process terminates, but that should not be a problem: this is precisely why we have the MainDom lock.
Is this happening every time you swap, or only from time to time? Is this on a production / live site? To which extend would it be possible for you to run some custom DLLs which would include more tracing / logging?
@zpqrtbnk , The error is no longer happening. By the time I reported this and send you the log, it was happening on every swap.
I did not change anything in Umbraco since then - the same package 8.0.1 is used.
I'm almost sure the difference is that, while trying to workaround the error, I completely disabled Application Insights on Azure - it was enabled by default when I first created the App Service, and I did not opt out of it as I usually do.
To verify if this was the cause I enabled again Application Insights in both slots and something strange happened:
So looking in retrospect, now my thoughts are:
Current situation:
Thanks for your help!
Thanks for keeping up with the detective work ;-) I have tried to experiment with enabling/disabling AppliCation Insights on my test app, but still cannot reproduce the out-of-order lock management that you see.
Happy that it works for you, but it annoys me.
Will lower the priority of this issue, but still, will try to send you a DLL to test, that would log way more details about what is going on. Stay tuned!
This: Debug.zip contains a patched DLL that just logs infos when acquiring the main lock (process id, but also user name, app id and maindom hash) - if you have a moment to try it...
Done. I could reproduce the error. No slot swapping involved, I just enabled Application Insights.
Initial state:
Clean restart to clean up logs
Enable Application Insights
Let me know if I can perform any additional tests
From the log:
2019-04-15 21:07:33,806 [P9404/D2/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting Umbraco 8.0.1. [Timing f0400f6]
2019-04-15 21:07:33,933 [P9404/D2/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting
site 'terniumcomqa__482f',
app '/LM/W3SVC/653304847/ROOT',
path 'D:\home\site\wwwroot\',
server 'RD2818785491E5'.
2019-04-15 21:07:34,696 [P9404/D2/T1] INFO Umbraco.Core.MainDom - DEBUG
server RD2818785491E5
site terniumcomqa__482f
process 9404
user IIS APPPOOL\terniumcomqa__482f
domain 2
appId /LM/W3SVC/653304847/ROOT
appPath D:\home\site\wwwroot\
temp D:\local\Temp
local D:\local\Temp\UmbracoData\935c1c6dfba987b17a7c67a60ce950cb
mdhash d04962ddd4ea323ea555b5a47c5349c8f613ab9e
2019-04-15 21:07:34,714 [P9404/D2/T1] INFO Umbraco.Core.MainDom - Acquiring.
2019-04-15 21:07:34,714 [P9404/D2/T1] INFO Umbraco.Core.MainDom - Acquired.
[...]
2019-04-15 21:09:54,491 [P9404/D2/T15] INFO Umbraco.Core.MainDom - Stopping (environment)
2019-04-15 21:09:54,647 [P9404/D2/T15] INFO Umbraco.Core.MainDom - Released (environment)
[...]
2019-04-15 21:10:02,321 [P9404/D3/T24] INFO Umbraco.Core.Runtime.CoreRuntime - Booting Umbraco 8.0.1. [Timing c4c5f6d]
2019-04-15 21:10:02,446 [P9404/D3/T24] INFO Umbraco.Core.Runtime.CoreRuntime - Booting
site 'terniumcomqa__482f',
app '/LM/W3SVC/653304847/ROOT',
path 'D:\home\site\wwwroot\',
server 'RD2818785491E5'.
2019-04-15 21:10:02,758 [P9404/D3/T24] INFO Umbraco.Core.MainDom - DEBUG
server RD2818785491E5
site terniumcomqa__482f
process 9404
user IIS APPPOOL\terniumcomqa__482f
domain 3
appId /LM/W3SVC/653304847/ROOT
appPath D:\home\site\wwwroot\
temp D:\local\Temp
local D:\local\Temp\UmbracoData\935c1c6dfba987b17a7c67a60ce950cb
mdhash d04962ddd4ea323ea555b5a47c5349c8f613ab9e
2019-04-15 21:10:02,778 [P9404/D3/T24] INFO Umbraco.Core.MainDom - Acquiring.
2019-04-15 21:10:02,778 [P9404/D3/T24] INFO Umbraco.Core.MainDom - Acquired.
[...]
2019-04-15 21:10:12,305 [P9404/D3/T24] INFO Umbraco.Core.Runtime.CoreRuntime - Booted. (10011ms) [Timing c4c5f6d]
[...]
2019-04-15 21:10:16,369 [P1480/D2/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting Umbraco 8.0.1. [Timing aa57fab]
2019-04-15 21:10:16,727 [P1480/D2/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Booting
site 'terniumcomqa__482f',
app '/LM/W3SVC/653304847/ROOT',
path 'D:\home\site\wwwroot\',
server 'RD2818785491E5'.
2019-04-15 21:10:18,852 [P1480/D2/T1] INFO Umbraco.Core.MainDom - DEBUG
server RD2818785491E5
site terniumcomqa__482f
process 1480
user IIS APPPOOL\terniumcomqa__482f
domain 2
appId /LM/W3SVC/653304847/ROOT
appPath D:\home\site\wwwroot\
temp D:\local\Temp
local D:\local\Temp\UmbracoData\935c1c6dfba987b17a7c67a60ce950cb
mdhash d04962ddd4ea323ea555b5a47c5349c8f613ab9e
2019-04-15 21:10:18,852 [P1480/D2/T1] INFO Umbraco.Core.MainDom - Acquiring.
2019-04-15 21:10:18,962 [P1480/D2/T1] INFO Umbraco.Core.MainDom - Acquired.
[...]
2019-04-15 21:10:24,822 [P9404/D2/T20] INFO Umbraco.Web.UmbracoApplicationBase - Application shutdown. Details: ConfigurationChange
[...]
2019-04-15 21:10:31,793 [P1480/D2/T1] INFO Umbraco.Core.Runtime.CoreRuntime - Resolved. (10046ms) [Timing 0ba631e]
2019-04-15 21:10:38,462 [P1480/D2/T1] ERROR Umbraco.Core.Runtime.CoreRuntime - Boot failed. (22127ms) [Timing aa57fab]
Umbraco.Core.Exceptions.BootFailedException: Boot failed. ---> System.IO.IOException: The process cannot access the file
'D:\local\Temp\UmbracoData\935c1c6dfba987b17a7c67a60ce950cb\NuCache\NuCache.Content.db'
because it is being used by another process.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
[...]
2019-04-15 21:10:58,311 [P9404/D3/T19] INFO Umbraco.Core.MainDom - Stopping (environment)
2019-04-15 21:10:58,686 [P9404/D3/T19] INFO Umbraco.Core.MainDom - Released (environment)
We see P9404/D2
booting, acquiring the MainDom lock, running, then releasing the MainDom lock, because it has been notified by the environment that it had to shut down. And then P9404/D3
boots, acquires the MainDom lock, and runs.
And then, P1480/D2
starts and... acquires the MainDom lock, even though P9404/D3
has not been notified of its shutdown and has not released the lock (it does, 20 or 30 seconds later).
The important line here is mdhash d04962ddd4ea323ea555b5a47c5349c8f613ab9e
- that hash is directly used as a name for the named Semaphore
we use to handle the MainDom lock - and these semaphores are supposed to be system-wide. Since the same name is used everywhere... I fail to understand how P1480/D2
can acquire a lock that has not been released.
There has to be something I am not seeing here. Thanks for reporting... will think about it all and try to have an idea!
Note to self: have run some tests of the MainDom class on local machine, with two console apps, and it all works as expected. Note that the app that releases the lock is Signaled (signal)
whereas in the above log, it's always (environment)
meaning... P9404/D3
is not even signaled to stop?!
If I can bother you with another Debug.zip... MANY THANKS for your time here!
@JoseMarcenaro just wondering, do you have any applicationInitialization configuration?
Done. Same secuence as before, same error result. error2-log.zip
@Jeavon you are right. I do have applicationInitialization configuration.
Thanks for the log. What it contains is:
P3488/D7
starts, acquires the lock, runs... P3488/D7
is signaled by the hosting environment that it needs to release the lock... and releasesP3488/D8
starts, acquires the (now released) lock, runs... P3488/D8
requires the lock, it signals P3488/D7
- which does report itP7468/D2
starts, and wants the lockAt that point... P7468/D2
just gets the lock immediately. It signals the current lock holder, P3488/D8
, but P3488/D8
does not report this. It just reports, later on, that it is signaled by the hosting environment, and releases the lock.
So
And then again, this all seems to happen on the very same hosting environment / server. Shared semaphores and events are at system-level so... they should work here. Just can't figure out what is going on :(
@JoseMarcenaro I think that might be a clue for @zpqrtbnk I only asked as I only saw the error once I had added applicationInitialization, could be coincidence but with applicationInitialization IIS is internally within the module starting up a new site
but with applicationInitialization IIS is internally within the module starting up a new site
afraid I don't understand ;-) can you be more... something?
And it's starting internally before shutting down the currently running one, once it gets the 200 then it makes it public to serve requests
@zpqrtbnk to enable locally add within system.webServer element of web.config
<applicationInitialization doAppInitAfterRestart="true">
<add initializationPage="/"/>
</applicationInitialization>
Trying to find out how it works exactly, this seems to be the most info https://docs.microsoft.com/en-us/iis/get-started/whats-new-in-iis-8/iis-80-application-initialization
I thought ApplicationInitialization configuration was used by the Azure infrastructure only when launching a swap, to warm up the stage slot before performing the swap and starting to send requests to it. But that "doAppInitAfterRestart" attribute shows me it might not be the case. In my app, I do not have the attribute set.
<applicationInitialization>
<add initializationPage="/es/index" hostName="www.txdev.ternium.com"/>
</applicationInitialization>
@JoseMarcenaro it is used whenever IIS needs to start/restart. For Web Apps it's correct you are using it. It will be used for things such as Slot Swapping, Scale (cold boots) and server transition (when your site is moved to a new server) and app recycle. Without it, you will not get continuous operation.
From https://docs.microsoft.com/en-us/iis/configuration/system.webserver/applicationinitialization/: doAppInitAfterRestart
Optional Boolean attribute.Specifies that the initialization process is initiated automatically whenever an application restart occurs. Note that this is different than the preLoadEnabled attribute in the application element, which specifies that the initialization process is started after a restart of the application pool. The default value is false.
@JoseMarcenaro have you tried setting it to true?
So... the reason why the new domain starts before the previous one is down, would be because IIS is warming up the new domain, then swaps them, then terminates the old domain. Makes sense.
Nevertheless, in theory, the new domain should signal the old domain, and then wait for the old domain to release the lock, before acquiring it. Whereas... it all looks like the new domain is running in isolation and not seeing the rest of the world...
Going to try and repro by enabling app initialization...
@Jeavon no, but I don't think it will add any useful information. I can run my site without these errors - it now happens only when setting on Application Insights and then everything goes right.
So avoiding the error it is not the priority. Finding more information about how it happens is.
Odd, on my test app it always shuts an app pool before starting another - either I fail at configuring app initialization, or ...
PR: https://github.com/umbraco/Umbraco-CMS/pull/5924
Something is happening in Azure WebApps where the NuCache.Content.db file is locked causing the site to hang. I've attached the log file for reference.
The exception is as follows:
This is on Umbraco 8.0.1 which was upgraded from Umbraco 8.0.0. (attachment will need to be changed from .txt to.json)
UmbracoTraceLog.RD2818786B7D96.20190319.txt