projectkudu / kudu

Kudu is the engine behind git/hg deployments, WebJobs, and various other features in Azure Web Sites. It can also run outside of Azure.
Apache License 2.0
3.12k stars 655 forks source link

App services swap behavior discussion #2583

Closed xt0rted closed 5 months ago

xt0rted commented 7 years ago

A few of us in the azure channel on the asp.net core slack (sign up page) were talking about the swap functionality and trying to figure out exactly how the auto swap setup works vs. what our expectations are of it. Is this under the kudu umbrella and can be talked about here, or is there somewhere else this should be talked about?

davidebbo commented 7 years ago

It's part of App Service and not really part of Kudu, but it's probably ok discussing here. You should start with http://ruslany.net/2015/09/how-to-warm-up-azure-web-app-during-deployment-slots-swap/. @ruslany is the expert on swap, and if there is missing info on that post, he may be able to clarify it.

ctolkien commented 7 years ago

From the linked blog post:

Then the site in the staging slot gets warmed up. To warm up the site an HTTP request is made to the root directory of the site to every VM instance where site is supposed to run. The warm up request has a User-Agent header set to “SiteWarmup”; After warm up has completed the host names for the sites in production and staging slots get swapped.

It would be good to get some clarity around what "after warm up has completed" actually means. I can only assume that it doesn't mean a 2XX status code is returned? In practice, if we perform a swap there can be ~60 seconds whilst the live site is unresponsive (and in some cases, time outs, depending on the site).

This makes deployments reaaally painful. The "two-step" "Swap with Preview" gets around this issue, but it's clumsy to use and isn't really integrated into "auto-swap" or the CI functionality in things like VSTS

xt0rted commented 7 years ago

I have some code that runs on site startup and can be turned on or off with an app setting. This is enabled in my warm up slot, and disabled in my production slot. I had a deploy yesterday where the swap happened without that piece of code running in the warm up slot. This is where some of my confusion starts. The docs seem to indicate that the site will be started up before the swap happens, yet in this instance it wasn't. I've added three endpoints under applicationInitialization but that doesn't seem to have made a difference.

I'm also experiencing what @ctolkien mentioned where the auto swap occurs, and then my site's down for a minute while it starts back up. The whole reason for wanting to use a deployment slot with swap was to prevent that.

Now that I'm using the deployment slot I've also seen my builds go from about 6-8 minutes to 20+ minutes. I'm not sure if that's including the swap time or not, but it seems once the post deploy webhook is fired the swap can take another 5 minutes or so to happen. I have two additional instances of this site running for internal use without deployment slots and those still take 10 minutes or less.

The final thing I've noticed is post swap my initial page load times go up as if the views are being compiled. I'm precompiling my views so I'm not really sure why initial page loads are in the 5 second range. I don't recall this being a problem before the slot & swap was setup.

ruslany commented 7 years ago

@ctolkien - having live site unresponsive during the swap is an unexpected behavior that we should investigate. The idea of the deployment slot swap is that the site that is in production slot is never restarted during the swap. Can you please let me know the name of the site where you have this repro and I'll look into this?

As to your question about what "warmup has completed" means, the current logic is as follows:

  1. We first wait until site's worker process has been restarted and picked up the configuration changes (such as slot app settings and connection strings)
  2. If local cache is enabled then we make a request to a root directory of the site and wait until it returns any status code and then keep polling it until the local cache initialization is complete
  3. If AppInit warmup is enabled then we make a request to a root directory of the site and wait until it returns any status code and then keep polling it until the appinit module initialization is complete.
ruslany commented 7 years ago

@xt0rted - can you please let me know the name of your site and timeframe when you had a repro? I'll check in our logs to understand your site in production slot is being restarted.

xt0rted commented 7 years ago

@ruslany dummy1326 and it's the site with portal in its name. My deploys on 10/4 around 5:04 PM EST and the second around 8:27 PM EST (this is the one where the startup code didn't run in the warm up slot).

ruslany commented 7 years ago

@xt0rted I see that the site in the swap has completed in 2 minutes, but then the production slot was restarted at 2017-10-03 21:21:27.6907435 UTC, which is more than 10 minutes after the swap. The restart is unexpected and is caused by the bug that we have fixed, but have not yet enabled the fix worldwide. I'll enable it everywhere sometime this or next month, but meanwhile you can enable it just for your web app by adding this app setting to EVERY deployment slot:

WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG = 1

@ctolkien - can you try the same and see if it makes any different for your case?

xt0rted commented 7 years ago

@ruslany I added that setting to my site and upon site restart received the following error on first page load, then all subsequent page loads were fine.

System.InvalidOperationException:
   at System.Web.Mvc.DefaultControllerFactory+DefaultControllerActivator.Create (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Web.Mvc.DefaultControllerFactory.GetControllerInstance (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Web.Mvc.DefaultControllerFactory.CreateController (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Web.Mvc.MvcHandler.ProcessRequestInit (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Web.Mvc.MvcHandler.BeginProcessRequest (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Web.Mvc.MvcHandler.BeginProcessRequest (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Web.Mvc.MvcHandler.System.Web.IHttpAsyncHandler.BeginProcessRequest (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Web.HttpApplication.ExecuteStep (System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
Inner exception System.MissingMethodException handled at System.Web.Mvc.DefaultControllerFactory+DefaultControllerActivator.Create:
   at System.RuntimeTypeHandle.CreateInstance (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.RuntimeType.CreateInstanceSlow (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.RuntimeType.CreateInstanceDefaultCtor (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Activator.CreateInstance (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Activator.CreateInstance (mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089)
   at System.Web.Mvc.DefaultControllerFactory+DefaultControllerActivator.Create (System.Web.Mvc, Version=5.2.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)

After the setting was in place I did a new deploy (around 4:07 PM EST). The build time was ~12 minutes and then ~5 minutes after the post deploy hook fired the site swapped. After the swap my first page load took at least 30 seconds. After that my second page load took about 30 seconds as well. Then all page loads seemed to load right away.

Something else I've noticed since setting up the deployment slot is my debug snapshots from app insights are no longer being captured. After setting up auto swap I did go back through and make sure app insights was setup on each slot, but the snapshots still aren't being captured. I'm not sure if this is due to a configuration change or possibly an issue with how that works. I can file a bug report on that repo if this is out of scope of the swapping.

ruslany commented 7 years ago

@xt0rted - Do you get this exception every time after restart of the site? One of the reason the fix hasn't been enabled everywhere is because there maybe side effects from it. If you get it every time then it is definitely caused by this setting.

In our logs I can see that when we started warming up the site by sending HTTP requests to the root directory of the site the latencies were anywhere from 20 to 50 seconds. Then the last ping latency was 2 seconds and after that we've performed the swap.

Then later I see that there were requests made from a browser to /prospective-buyers/signup and to /api/prospective-buyers/create-profile and those took 23 and 4 seconds respectively. After that all subsequent requests took no longer than 2 seconds.

Do you use AppInit module for your site and do you have those two URL paths included in its configuration?

xt0rted commented 7 years ago

@ruslany those requests were from me browsing the site after the deploy. The appinit module is setup with the following settings:

<applicationInitialization>
    <add initializationPage="/" hostName="***-staging.azurewebsites.net" />
    <add initializationPage="/heartbeat" hostName="***-staging.azurewebsites.net" />
    <add initializationPage="/account/login" hostName="***-staging.azurewebsites.net" />
</applicationInitialization>

Nothing special happens on those pages, I just added them in to see if anything changed during deploy.

xt0rted commented 7 years ago

@ruslany as far as I can tell that exception only happened once after first adding the setting, since then I haven't seen it.

ruslany commented 7 years ago

Ok, then it is not conclusive if this exception is the result of adding this new app setting.

Even if the root directory and other pages listed in AppInit module have been warmed up it still may be possible that other pages would take longer on the first hit. If you want them to be warmed up too then I would recommend adding them to the AppInit configuration.

The issue with debug snapshots and deployments slots is something I have not heard about before. We'll need to look into this.

ctolkien commented 7 years ago

The name of the site is dummy90210

OK, doing some testing right now. Doing a bunch of swaps (trying both the swap with preview, and direct swap).

Have had the production site return HTTP502 timeout - and be unresponsive.

We also had an error logged with resetting the slots be logged.

Visualised: image

image

This does not have the WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG = 1 in place - just testing vanilla at the moment.

ruslany commented 7 years ago

@ctolkien - yes, I see in our logs that the site is restarted for the same reason as in @xt0rted's case. The app setting should prevent that.

ctolkien commented 7 years ago

@ruslany - I put that app setting in place. It seems "improved", but still not seamless. We still logged 500 errors for timeouts (last part of graph):

image

And the response time shot up to 30 seconds:

image

ctolkien commented 7 years ago

No idea if related, but there is a stack of data usage in line with when swaps are taking place: image

Also CPU usage spikes right up whilst this is taking place, which I guess makes sense... in that it is having to reboot the staged site - not sure if that is what is impacting the performance of the live site.

xt0rted commented 7 years ago

I triggered a swap with preview and as soon as I did my live site hung for about 20 seconds. When the swap was completed the site then hung again for another 20 seconds or so, second page load hung about that long as well, then the rest of the page loads seemed fine. This was completed around 7:11 PM EST.

I then triggered the swap again to get back to where I started. As soon as I did my live site hung for around 50 seconds on first page load, and 25 seconds on second page load then the rest were fine. Once the swap was completed the site then hung for around 20-30 seconds for first page load and a little less for the second. This was started at 7:14 PM EST and completed at 7:21 PM EST.

During the swaps I'm not seeing any 500 errors, the data in/out seems fine, requests are up some, and the average response time goes up.

ruslany commented 6 years ago

@xt0rted - I am wondering if these latency increases are caused by the fact that both production and staging slots run on the same VM instance so when the site in the staging slot is restarted during the first step of swap with preview it causes more CPU usage on the VM which affects the site in the production slot. Also after the swap is completed we restart the "used to be in production" site which again may increase the CPU usage. I see in the VM logs that CPU spiked around the time you performed the swap with preview.

We are in a process of enabling swap between slots that are located in different app service plans. With that you would be able to place production and staging slots in different VMs so that during the swap they do not affect each other's performance.

xt0rted commented 6 years ago

@ruslany right now kudu is doing my build & deploy, would moving this out to say AppVeyor with webdeploy help at all? That's been my plan all along (already using them for PR builds), I've just wanted to get a two stage deployment in place first.

I have two other sites and some functions running in this VM. None of them are doing much, but I could see about moving the functions out.

I also forgot about Local Cache. None of my sites store anything on disk that needs to be persisted between reboots, so if enabling that could help I can try doing that as well.

ctolkien commented 6 years ago

We are using AppVeyor, it's not making any difference.

We are in a process of enabling swap between slots that are located in different app service plans. With that you would be able to place production and staging slots in different VMs so that during the swap they do not affect each other's performance.

That's going to double my clients costs - they aren't going to wear that :/

xt0rted commented 6 years ago

@ruslany I did another deploy earlier and the same things happened, existing slot restarted, the staging slot started to warm up, then after the swap the new production slot restarted again. The deploy finished around 10:27 AM EST and the swap happened about 5 minutes later. Here's the cpu/memory graph from that time span.

image

We're running on a single S2 plan right now with zero traffic to the site when this deploy happened, and no webjobs of functions running (aside from the queue triggers polling). The only resources in use was kudu doing the build & deploy and I was the only user on the site.

Over the weekend I was looking around the web and I kept finding people asking about this same thing, why is the production slot restarting after swap. There's a number of these questions on Stack Overflow and Server Fault dating from this past July to over 2 years ago. Everyone asking about this seems to think that there should be no downtime and no cold start but runs into the similar issue of the slot restarting after swap, and the responses are just people reiterating what the docs & your blog post says. I've yet to find someone who says they're actually running with this setup working as intended.

PureKrome commented 6 years ago

@ruslany Hi Ruslan! I've also been a heavy user of deployment swaps for a while now - since they came out. We all totally understand the requirement for a restart when you have PINNED application settings. So that's ok.. .. but it feels like someting in the deployment workflow is not working as expected. It's like, something is out of step.

What we expected was a deployment workflow like the following:

If feels like the LB flips over to the new slot SOONER than it should and as such, starts sending requests to that way soon than it should, before the site is really fully deployed, warmed and ready to rock'n'roll.

xt0rted commented 6 years ago

I just stumbled upon the xdt transform wiki page and the http to https redirect rule looks a little different from mine (I don't have the <add input="{WARMUP_REQUEST}" pattern="1" negate="true" /> condition). This has me wondering, could some of the issues we're seeing in regards to warm up be due to not having that condition in the rewrite rule? This Stack Overflow question from a few months ago seems to hint towards this as well.

davidebbo commented 6 years ago

@xt0rted it could be, is it could prevent the warmup requests from actually reaching your app.

ctolkien commented 6 years ago

We run CloudFlare (as does @PureKrome from what I understand) in front of all our sites - and it will redirect all HTTP requests to HTTPS before being sent on to our origin server. HTTP requests never reach us so if the WarmUp Service doesn't follow redirects this would be an explanation for what we're seeing.

PureKrome commented 6 years ago

As what @ctolkien said.

Also, we have other sites that use the URL redirection in the web.config and they do all http => https 301 PERM's.

ruslany commented 6 years ago

@ctolkien , @PureKrome - yes, warmup logic does not follow the HTTP redirects.

xt0rted commented 6 years ago

I've updated my url rewrite rule to include <add input="{WARMUP_REQUEST}" pattern="1" negate="true" />. I think this helps some since I'm not noticing any restarts or unresponsiveness during deploy, but that's as far as it goes towards the original issues. The exact changes I made was adding the rewrite condition, and changing the app init section to a single endpoint of / without any host.

What I've observed over my last few deploys is as follows:

  1. deploy to the staging slot is triggered
  2. staging slot builds & deploys
  3. site warm up is triggered
  4. auto swap is triggered the moment site warm up is finished
  5. the now production slot restarts
  6. 30 seconds later the production slot comes back up
  7. about a minute or two later the now staging slot restarts
  8. the production slot also restarts
  9. 30 seconds later the production slot comes back up

Using the app insights live metrics view I can see the cpu peg at ~45% for a good 30 seconds or so each time the site starts up, so I can confirm with certainty that the staging slot is warming up before the swap happens.

I do have two slot settings, one of which is to keep my webjobs disabled in the staging slot since I don't want them running outside of the production slot. Given this setup, is it possible to achieve a zero downtime deployment as we're expecting?

PureKrome commented 6 years ago

@ruslany Thanks for the reply. So .. if we have a really simple .NET Website with a simple URL redirect rule => any/all HTTP traffic gets redirected to HTTPS (which to me is a pretty damn common scenario) then ..

warmup logic does not follow the HTTP redirects.

which means the warmup logic doesn't apply.

If yes - then this is pretty damn serious.

Would be really nice if there's an option in the Azure portal for us (the developers) to set the warmup URL because then we could :-

ctolkien commented 6 years ago

Yeah, that the warmup modules doesn't follow the standard HTTP status codes is a.) very surprising b.) seemingly undocumented. I would strongly suggest it is a bug - just about everysite these days should be automatically redirecting to HTTPS.

I've just attempted the following:

<rule name="Azure Exception" xdt:Transform="Insert" stopProcessing="true">
          <match url=".*" />
          <conditions logicalGrouping="MatchAll" trackAllCaptures="false">
            <add input="{HTTP_HOST}" pattern="myclient(-\w+)?\.azurewebsites\.net" />
          </conditions>
          <action type="None" />
    </rule>

This stops any further redirection that we have configured when hitting the site from the foo.azurewebsites.net host name.

Have also configured:

<applicationInitialization xdt:Transform="Insert">
      <add initializationPage="/" hostName="myclient.azurewebsites.net" />
    </applicationInitialization>

To hit the azurewebsites.net hostname, so as to not run through CloudFlare.

image

Response time shows at 5 mins...

However, I don't know if I trust that figure, it could be from the "staging" side before promoted to production.

Am going to script out some tests to measure TTFB from a remote location whilst a swap takes place.

ctolkien commented 6 years ago

OK, it works!

I wrote a quick and dirty PowerShell script:

do {
    $elapsed = [System.Diagnostics.Stopwatch]::StartNew()
    Invoke-RestMethod $clientUrl -MaximumRedirection 0 | Out-Null
    Write-Host "Elapsed Time: $($elapsed.Elapsed.ToString())"
    Start-Sleep -Seconds 1
} while($true)

And the results are here: https://gist.github.com/ctolkien/33e646c675f6da2455b1a49d852983fb

TL;DR - a few requests broke 1 second but nothing catastrophic. No timeouts, HTTP500's either.

ctolkien commented 6 years ago

OK wait, my solution shouldn't work:

<applicationInitialization xdt:Transform="Insert">
  <add initializationPage="/" hostName="myclient.azurewebsites.net" />
</applicationInitialization>

That is going to hit the production site, not the staging site (myclient-staging). I would think it's the staging site which needs to be warmed up prior to swapping.

xt0rted commented 6 years ago

@davidebbo where does WARMUP_REQUEST come from? I don't see this documented anywhere aside from being mentioned once in the wiki and in various stack overflow answers. I have that set and my warm up requests are making it through, but the AlwaysOn requests are still showing as getting 301 redirects. According to your SO answers this value should also work for always on, but I'm not seeing that.

davidebbo commented 6 years ago

@ruslany knows best, but I think it's something that comes from IIS and is not new to App Service. @ahmelsayed mentioned it in this old IIS AppInit post. It doesn't seem to be well documented though...

xt0rted commented 6 years ago

I turned auto swap off and I'm working with the manual swap for right now seeing if I can get that working. So far I'm not having any luck with that either.

pasted image at 2017_10_17 12_30 am That's a shot of my CPU after I triggered the swap. The 60-40 second range is when the staging slot was warming up, the 40-25 second range is after the warm up but before the actual swap, then the swap happened, and from 25-0 is when the production slot started back up.

image This is 5-6 minutes later when I refreshed the 3 pages I had open from after the swap. All hung until the ~10 second mark when the cpu died back down, and it's been flat with instant page responses since then.

Upon further investigation I noticed in my log files that the site only ever started up once, but at the exact time of the cpu spike 5 minutes later there were these two groups of entries (not sure if they're both for the production slot, there's 8 log files for this timespan for some reason).

2017-10-17T04:37:10,Verbose,******,-1,SnapshotHelper::TakeSnapshotTimerCallback,
2017-10-17T04:37:10,Verbose,******,-1,SnapshotHelper::TakeSnapshotInternal - no new files in CodeGen,
2017-10-17T04:37:13,Verbose,******,-1,SnapshotHelper::TakeSnapshotTimerCallback,
2017-10-17T04:37:13,Verbose,******,-1,SnapshotHelper::TakeSnapshotInternal - no new files in CodeGen,
2017-10-17T04:40:36,Verbose,******,-1,SnapshotHelper::TakeSnapshotTimerCallback,
2017-10-17T04:40:36,Verbose,******,-1,SnapshotHelper::TakeSnapshotInternal - snapshot is newer than freshness threshold ,

That's all that I see regarding the snapshotter though and the only thing logged during that time.

ruslany commented 6 years ago

@PureKrome @ctolkien - the reason the warmup logic does not follow redirects is that because the warmup bypasses our frontends and goes directly to each worker that hosts the site. The workers do not have knowledge of HTTPS and custom domains, only frontends do. So if a site has rewrite rules to enforce a domain name or https then warmup code won't be able to make requests to the workers after redirect.

The suggestion to define a specific route to hit to initiate a warmup is interesting. I think we can add that capability.

The WARMUP_REQUEST is a server variable set by IIS AppInit module. When the first request to a site is made it starts the appinit module which in turn makes requests to all the URL paths listed in the appinit configuration section in the web.config file. All those requests have the WARMUP_REQUEST server variable set to 1.

ruslany commented 6 years ago

@ctolkien - when a first warmup requests is made to a site's worker process the AppInit module in that worker process starts and it makes internal child requests to all the URL paths listed in the configuration in web.config file. This appinit configuration that you used just means that when it makes HTTP request to the "/" page it will set the HTTP host header to myclient.azurewebsites.net. It does not mean that it will hit the production slot.

ruslany commented 6 years ago

@xt0rted - I assume you are still swapping this site ("dummy1326 and it's the site with portal"). I'll check our logs today to see what's going on.

xt0rted commented 6 years ago

@ruslany correct, same site

ctolkien commented 6 years ago

OK,

So I spent all of yesterday running every scenario and writing up a giant doc of everything I'd learned.. only to discover only one of the deployment slots was running the correct config and it was all for nought.

I have managed 0 impact swaps on an S1 instance. Here you can see a user seamlessly transition from version 722 to 733:

image

Which is:

image

However it's the lead up to this swap that is the issue. I haven't nailed down the exact specifics, but it seems like rebooting the staging slot prior to swapping can put enough load to cause major issues for your production users - with response times peaking up to 30 seconds.

Then again, I've also had the CPU maxed and have no discernible impact on production users:

image

(Couldn't easily get a screen grab of the previous 40 seconds, but all requests were < 1 sec, some much faster).

So as things stand right now - the below is probably all wrong:

<add input="{HTTP_HOST}" pattern="[\w\d-]+\.azurewebsites\.net$" negate="true" />
<add input="{WARMUP_REQUEST}" pattern="1" negate="true" />
<add input="{HTTP_USER_AGENT}" pattern="^AlwaysOn$" negate="true" />

End of the day however, I still don't have seamless deploys. Having requests spike to 30 seconds still isn't good enough. More testing required to determine the secret sauce that had us swapping without requests to prod being majorly impacted.

ruslany commented 6 years ago

@xt0rted - for the swap that happened around at 2017-10-17 04:29:39.3249895 UTC I do not see in our logs any subsequent restarts of the site in production 5 minutes later. I see a couple of requests that took around 20 seconds shortly after the swap. One to the "/" and one to "/marketplace".

I've noticed that in your appinit configuration you only have this url path:

    <applicationInitialization>
      <add initializationPage="/heartbeat" />
    </applicationInitialization>

I recommend to add "/" there too. Without this what happens is that swap warmup makes request to the "/" page and gets redirected to HTTPS because at that point "WARMUP_REQUEST" is not set yet. So it does not actually warmup your application's root URL. If you add "/" there then appinit module will make request to the root and "WARMUP_REQUEST" will be set, so your redirect rule won't be executed.

xt0rted commented 6 years ago

@ruslany the reason I removed / is because that will redirect to the login page. I switched it for /heartbeat since that route always returns a 200 response and doesn't require auth. Should I still include an entry for / or does that not work in my setup?

Also there were no deployments on 10/17, the last one was 10/16/2017 around 11:42 PM EST. I guess my math was slightly wrong, isn't my deploy still about an hour earlier from what you're seeing in the logs?

ruslany commented 6 years ago

I see that there were deployments at the following times: 2017-10-17 03:44:34.6635553 UTC (which is close enough to 11:42 PM EST) 2017-10-17 04:26:33.6181898 UTC

In both cases I do not see the logs that would indicate that the site has been restarted any time soon after it got into production slot.

I would still try to include "/" there anyway. The redirect to login page happens from your application code, not from the rewrite rule. This means that at least some code path in app will be executed during warmup.

npiasecki commented 6 years ago

For what it’s worth, I also ran into the problem of cold starts after swaps after migrating some apps to Azure App Service, and I traced it to an HTTPS redirect rule. The request to / would get redirected to HTTPS, but that happens early enough that the juicy stuff in App_Start never ran, leaving it to the first real user request to wait on it. It made sense when I figured it out, but was not intuitive and will probably only get worse as HTTPS becomes the default.

<rewrite>
  <rules>
    <clear />
    <rule name="Do not force HTTPS for application initialization" enabled="true" stopProcessing="true">
      <match url="(.*)" />
      <conditions>
        <add input="{HTTP_HOST}" pattern="localhost" />
        <add input="{HTTP_USER_AGENT}" pattern="Initialization" />
      </conditions>
      <action type="Rewrite" url="{URL}" />
    </rule>
    <rule name="Force HTTPS" enabled="true" stopProcessing="true">
      <match url="(.*)" ignoreCase="false" />
      <conditions>
        <add input="{HTTPS}" pattern="off" />
      </conditions>
      <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" appendQueryString="true" redirectType="Permanent" />
    </rule>
  </rules>
</rewrite>
xt0rted commented 6 years ago

For a few weeks now I've been doing manual swaps (no auto, no swap with preview). This has been giving me a better experience than the auto swap setup I had, but I haven't tried going back to auto swap since the url rewrite rule was updated to handle warm up requests.

What I'm seeing with this setup is all of my pages have a <200 ms load time until shortly after a swap is initiated. Once initiated the load times shoot up to 20-25 seconds. Based on the logs the site is not restarting when this happens. Instead it seems as if the "warmed up" pages need to be warmed back up. I'm not familiar with the inner workings of asp.net in this regard, but it's almost as if something with the swap initiation is invalidating a low level .net cache. The time it takes for this to happen seems to vary, it can be anywhere from a second or two after initiation on up to about 20 seconds later. Once the pages are "warmed up" again they have <200 ms load times until the swap finally happens.

I've adjusted my warm up end points so /, /account/login, and /account/register are all hit before the swap. The login & registration pages take a couple extra seconds and then they're <200 ms from there on out. Everything behind the login page hits the dB. What I've seen on these pages is there's a 20-30 second load time on most of these and then they settle in at <200 ms. While these pages are hanging the login & registration pages are responsive and continue to have <200 ms load times.

Once I realized this was happening I setup MiniProfiler in production. This is showing me that when I hit refresh after the swap my / page starts loading right away, but there's a 20-25 second hang on the first dB hit. Further investigation has shown that EF caches views/metadata on first hit and that there are some ways to pre-compile that but those resources are years old and I just haven't had a chance to try any of it out. The EF docs even mention this and the built-in apis to work around it, but there's no out of the box solution from EF.

As of now I want to try setting up a /warmup route that simply queries a couple of the dB tables for the first row and to include that in my list of warm up pages. I'm not sure how much time that may cut off from the other pages, but for now that's my next step. Once I see how that goes I'll try moving back to auto swap to see how that works.


As a side note, the profiling I've been doing has really opened my eyes up to just how poor EF's performance is if you're not really careful with it. For instance source.Skip(result.PageIndex * result.PageSize).Take(result.PageSize) doesn't generate a parameterized query, nor do most of my where clauses, while source.Skip(() => skip).Take(() => take) does. For the most part these things aren't bad without load, but they'll definitely become an issue with the resources available on Azure vs. local dev where most of them aren't even noticeable to me. I really wish the EF team had some Roslyn analyzers for these things since they do mention them deep in the docs where very few will see it.

PureKrome commented 6 years ago

@xt0rted very interesting info there Brian!

One thing that I keep seeing is the HTTPS problem. I'm thinking about starting new side projects with no https anymore. I used to always enforce it, etc. and then have the URL-Rewrite rule to force all http to https.

Instead, maybe now started playing with moving this concern to CLOUDFLARE. so the only public way into my website is via CF which terminates the HTTPS connection and then passes the query to my Azure App Service website as http, accepting that the traffic from CF->Azure shouldn't be acceptable to a ManInTheMiddle attack.

This isn't a solution for all, though ... which means this problem still exists and hopefully can be resolved.

ctolkien commented 6 years ago

Instead, maybe now started playing with moving this concern to CLOUDFLARE.

We have CloudFlare do our HTTP -> HTTPS redirection currently but I'd still hesitate to remove HTTP -> HTTPS redirection rule from our site - if we happen to move away from CloudFlare, or it's temporarily disabled, etc. then it's easy to get overlooked.

At the moment, I think the best course of action is to have an IIS rule which matches on Warmup/AlwaysOn requests and then stops processing all further rules.

Canonical hostname, stripping trailing slashes, etc. there are other rules which can impact this too.

This also doesn't help with Always On either, but that is a discussion for another day.

npiasecki commented 6 years ago

While we're on the topic of restarts, the past week in East US my S2 instance has been booted to 3 different hosts, which seems a higher rate than normal (used to only happen about once every couple weeks).

In each case, the sites seem to warm up, but then I get a flurry of "can't write to filesystem" errors in my logs, and the new "Diagnose and solve problems" shows a double restart due to storage latency issues, causing 2-3 minutes of very slow response times. It's as if the system is booting me to a new host for upgrades (fine), but then after the LB switches to the new host, the new host immediately restarts due to latency, causing everything to go sideways for a few minutes.

2017-10-29_10-05-08

Has East US been just having a lot of updates this week?

ruslany commented 6 years ago

I published a blog post that explains in more details how application initialization works and how to fix most common problems related to deployment slots: "Most common deployment slot swap failures and how to fix them"

ctolkien commented 6 years ago

So, what does this do then? image

How is this implemented? What knock-ons does it have? etc...

Seems to have popped up in the last few weeks on the Custom Domains tab on App Services.