projectkudu / kudu

Kudu is the engine behind git/hg deployments, WebJobs, and various other features in Azure Web Sites. It can also run outside of Azure.
Apache License 2.0

App services swap behavior discussion #2583

Closed xt0rted closed 5 months ago

xt0rted commented 7 years ago

A few of us in the azure channel on the asp.net core Slack (sign up page) were talking about the swap functionality, trying to figure out exactly how the auto swap setup works vs. what our expectations are of it. Does this fall under the Kudu umbrella so it can be discussed here, or is there somewhere else this should be raised?

davidebbo commented 6 years ago

@ctolkien yes, this is a brand new feature, and is now the preferred way of forcing HTTPS. It all happens on the Front End server, so there is no need to make any changes to the app itself. It is controlled at the ARM API level by the new httpsOnly flag on the site object.
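
A minimal sketch of flipping that flag from PowerShell with today's Az module (resource names are placeholders; at the time of this comment the AzureRm equivalents would have applied):

# Enable the Front End HTTPS redirect via the site's httpsOnly property.
Set-AzWebApp -ResourceGroupName "my-rg" -Name "my-app" -HttpsOnly $true

# The same property can be set on the raw ARM resource:
$site = Get-AzResource -ResourceGroupName "my-rg" -ResourceType "Microsoft.Web/sites" -ResourceName "my-app"
$site.Properties.httpsOnly = $true
$site | Set-AzResource -Force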

xt0rted commented 6 years ago

Is there some place things like this get announced? I follow Azure/app-service-announcements but didn't see it there.

davidebbo commented 6 years ago

@xt0rted it should have been on there, but with this tracker being new, things don't get posted there consistently enough. Hopefully we'll get better over time :)

ctolkien commented 6 years ago

@davidebbo, can we assume that, since this happens on a server in front of our app, all the app initialising bits discussed above are not impacted, as they run local to the app?

davidebbo commented 6 years ago

@ctolkien correct. The redirection happens before the VM that has your code comes into the picture at all.

xt0rted commented 6 years ago

I switched from the HTTPS URL rewrite rule to the HTTPS Only option in the portal, and I'm noticing some interesting behavior that I wasn't getting before. Post-swap, the staging slot seems to start up, but no Always On requests come in for it according to the requests chart in the portal. If I curl -I https://thesite-staging.azurewebsites.net I get an instant response back, and then the normal ~2 requests every minute or two start back up.

[screenshot: requests chart for the staging slot from the portal]

The spike around 6:40 is when the initial deployment & swap were going on, and right after 7:00 is when the swap finished. Prior to the swap I was consistently getting ~2 requests coming in from the Always On option, but afterwards they stopped.

When I first enabled the HTTPS Only option I triggered a deployment that experienced similar issues post-deployment but before the swap was initiated. In that instance the staging slot never started up pre-swap, which I know for a fact because there's a piece of code that runs in the "staging" environment but not the "production" environment due to a sticky slot setting.

I'm not sure if this happens to the production slot, because I have App Insights set up to poll a couple of endpoints every 5 minutes (since the Always On setting seemed flaky), which would mask it there.

npiasecki commented 6 years ago

@davidebbo Does the HTTPS Only feature live at the slot level? I had it enabled, did a swap, and it was suddenly disabled, so either it does or I'm losing my mind.

ruslany commented 6 years ago

It is not a slot setting. It stays with the site instance during the swap. So if you had it on for the site in the production slot and off for the site in the staging slot, then after a swap it will be off in the production slot and on in the staging slot. You'll need to explicitly enable it on all slots if you want to enforce HTTPS regardless of swap operations.
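
A sketch of doing that across every slot, assuming Set-AzWebAppSlot exposes the same -HttpsOnly switch as Set-AzWebApp (names are placeholders):

# Turn httpsOnly on for the production site and each slot, so a swap
# can never move an "off" value into production.
$rg  = "my-rg"
$app = "my-app"
Set-AzWebApp -ResourceGroupName $rg -Name $app -HttpsOnly $true
foreach ($slot in Get-AzWebAppSlot -ResourceGroupName $rg -Name $app) {
    # Slot names come back as "app/slot"; keep only the slot part.
    $slotName = $slot.Name.Split('/')[-1]
    Set-AzWebAppSlot -ResourceGroupName $rg -Name $app -Slot $slotName -HttpsOnly $true
}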

xt0rted commented 6 years ago

@ruslany @davidebbo I'm seeing mentions of a SnapshotHelper in my log file (this isn't new https://github.com/projectkudu/kudu/issues/2583#issuecomment-337119179), but according to the response I got from snapshothelp@microsoft.com these entries aren't from the App Insights snapshot collector. Do either of you know what causes them? Their appearance in my log almost always corresponds to when I experience my slow requests (during & after swapping), but I can't find anything saying where they're coming from or what's causing them.

The log entries I saw earlier are:

SnapshotHelper::RestoreSnapshotInternal   SUCCESS - File.Copy
SnapshotHelper::RestoreSnapshotInternal   SUCCESS - process
SnapshotHelper::TakeSnapshotTimerCallback
SnapshotHelper::TakeSnapshotInternal   SUCCESS - Process
SnapshotHelper::TakeSnapshotInternal   SUCCESS - File.Copy

npiasecki commented 6 years ago

Continuing in this theme, I switched my sites to use the new HTTPS Only option and nuked my SSL rewrite rules in hopes that this would solve the cold starts I'm seeing during infrastructure upgrades when I get booted to a new VM. (Swaps work just fine, but I get nailed with cold starts when I get moved to a new VM, which Pingdom confirms by reporting it as a period of timeout > 30 s.)

It happened again today around 3:30 Eastern as I got booted for the Server 2016 upgrade. After studying my Application Insights logs, I noticed that my staging slots start once, but my production slots start twice (with different AppDomains), about 2 minutes apart, and Pingdom reported (as did I firsthand, since it was in the middle of the day) slow requests around the timestamps of the last restart.

I had enabled WEBSITE_LOCAL_CACHE_OPTION on my production slots some time ago in an attempt to minimize the number of restarts I experience, and now I feel it could explain why I'm getting hit with these double restarts on infrastructure upgrades.

Is it at all possible that the load balancer switches to the new VM when the normal content share is ready, serves a few requests successfully, and then 1-2 minutes later the local cache is ready, the whole thing restarts, and bam! I see a cold start in production?
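
For anyone checking their own setup: local cache is driven by a plain app setting, so a sketch like this (hypothetical names; note that -AppSettings replaces the whole collection, hence the merge) shows where the option lives:

# Merge the local cache option into the existing app settings.
$app = Get-AzWebApp -ResourceGroupName "my-rg" -Name "my-app"
$settings = @{}
foreach ($s in $app.SiteConfig.AppSettings) { $settings[$s.Name] = $s.Value }
$settings["WEBSITE_LOCAL_CACHE_OPTION"] = "Always"
Set-AzWebApp -ResourceGroupName "my-rg" -Name "my-app" -AppSettings $settings

# At runtime the app can check whether it is already serving from the local
# cache: per the local cache docs, the platform sets the
# WEBSITE_LOCALCACHE_READY environment variable to TRUE once the copy is live.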

pollax commented 5 years ago

Hello, just checking in to see if there are any updates on this issue? We too are experiencing restart issues when swapping slots. Similar setup to what was previously mentioned: web apps that enforce HTTPS, but we also have Cloudflare in front.

I read the blog post by Ruslany and tried to apply it, but I don't think it's working 100% for us, so I'm wondering if there are any updates or what the suggested way forward is. With daily or multiple deploys per day, this becomes a severe issue that we will need to solve very soon.

bsimser commented 5 years ago

Just want to chime in. I have an open premier issue that has been going on for weeks now with our swaps. I've got a Teams call today to do some real-time diagnostics, but on one of our sites (I have a node in Canada East and one in Canada Central, load balanced behind Front Door) I have autoswap turned on. It seems that after we deploy to both nodes (using Octopus) it's literally 10 minutes before the swap "finishes" (even though the logs said it did 10 minutes earlier) and the site shows the newly deployed code. On the other sites in question (the ones I have the support call about) I've turned off autoswap (at Microsoft's request) and am doing a manual stop, start, warmup, etc., and the swap is "better" (down to about 2-3 minutes after we deploy, from 10), but the other issue is that a 2-minute swap overlaps between nodes, so Front Door sees the entire site down for 1-2 minutes. Less than desirable. I'll post any new findings today and information on autoswap vs. manual. These are all just web apps, so no VMs involved.
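
That manual stop/start/warm-up/swap sequence maps roughly onto these Az cmdlets; a sketch with hypothetical names, not the exact script used here:

$rg = "my-rg"; $app = "my-app"

Stop-AzWebAppSlot  -ResourceGroupName $rg -Name $app -Slot "staging"
Start-AzWebAppSlot -ResourceGroupName $rg -Name $app -Slot "staging"

# Warm the slot directly before swapping so production never takes the cold hit.
Invoke-WebRequest -Uri "https://$($app)-staging.azurewebsites.net/" -UseBasicParsing | Out-Null

Switch-AzWebAppSlot -ResourceGroupName $rg -Name $app -SourceSlotName "staging" -DestinationSlotName "production"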

tmmueller commented 4 years ago

@bsimser Any news? We can't figure out how to swap app service slots behind Front Door without downtime. It worked perfectly without Front Door.

bsimser commented 4 years ago

@tmmueller no word yet. There is a default 30-second timeout in Front Door that will cause requests to time out during the swap. It used to be hard coded; the PG team just pushed out a change that lets you set it to something other than 30 seconds, but there's no UI for it, so it has to be done with PowerShell. I haven't tried it because, as I told premier support (many times), this doesn't actually fix the problem of having both nodes go down at the same time. I've asked many times and said I'm willing to manipulate Front Door with PowerShell to take a node down, do the deployment, and bring the node back up (in the backend service), but I haven't found any PowerShell to do this. Support continues to believe that increasing the default timeout beyond 30 seconds will fix this problem, but I don't see how it can.

I've got two systems now: one using a series of steps to deploy the app node by node, and the other just pushing the updated app to the slot and letting autoswap take care of it. Both nodes end up with an overlap of about 2 minutes when they're both down, which IMHO is wrong. And deployment to the autoswap nodes finishes (as far as returning from the PowerShell command goes) but still doesn't actually kick in (i.e. you won't see the new site) until at least 10 minutes after that (and then both nodes kick in on top of each other and the entire system is down for 1-2 minutes with a lovely non-configurable, non-brandable message from Front Door). Nobody in support seems to be able to answer my original questions.
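
For what it's worth, the Az.FrontDoor module does appear to expose a backend's enabledState, which would be one way to drain a node before deploying to it. A sketch, assuming Set-AzFrontDoor accepts the modified object back (hypothetical names; Front Door propagation delays still apply):

# Disable the Canada Central backend, deploy to it, then re-enable it.
$fd = Get-AzFrontDoor -ResourceGroupName "my-rg" -Name "my-frontdoor"
$backend = $fd.BackendPools[0].Backends | Where-Object { $_.Address -eq "my-app-central.azurewebsites.net" }
$backend.EnabledState = "Disabled"
Set-AzFrontDoor -InputObject $fd

# ...deploy to and warm up the drained node here...

$backend.EnabledState = "Enabled"
Set-AzFrontDoor -InputObject $fd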

Still trying to figure out a solution.

tmmueller commented 4 years ago

@bsimser Thanks for the details. We're going to try increasing the sendRecvTimeoutSeconds setting to see if we can at least get it to respond eventually rather than throwing errors at the end of the swap.
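
A sketch of that change, assuming the Az.FrontDoor backend-pools-settings cmdlets (resource names are hypothetical):

# Raise Front Door's send/receive timeout from the default 30 s to 120 s.
$poolSettings = New-AzFrontDoorBackendPoolsSettingObject -SendRecvTimeoutInSeconds 120
Set-AzFrontDoor -ResourceGroupName "my-rg" -Name "my-frontdoor" -BackendPoolsSetting $poolSettings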

cjblomqvist commented 4 years ago

I've had the same issues as described above. More specifically, I don't have issues with random restarts, but I do have issues during swapping. I'm on the P1V2 option, with both slots tied to it. It seems that when the full swap process starts, the first step is to restart the deploy/stage slot, which causes CPU spikes. This in turn causes performance issues for everything on the same App Service plan, since it's the same resources. I've confirmed this by doing the restart manually, which causes similar behavior.

It would be great to be able to isolate the deploy slot resource-wise so that it doesn't affect the production slot while starting/warming up. Without that, slot swapping becomes much less valuable. The only solution I can see is to load balance between two completely separate environments (as in two different App Service plans), so you can deploy one environment at a time. Then again, I'd hoped the point of swapping slots was to not need that...

PS. It's also very strange that the actual swapping takes so much time. 5+ minutes seems unreasonable. It would also be wonderful if the (re)start of an app didn't take that long and use up that many resources, though I understand that's a trickier thing to deal with.

bsimser commented 4 years ago

@cjblomqvist the CPU spikes are worrisome. Like you mentioned, the only option is to separate the nodes. That's what I have set up: I have a P2V2 node in Canada Central with a deployment slot, and the same in Canada East, with Front Door load balancing between the two (pro tip: unless you set the latency sensitivity to something other than 0 it won't round-robin, and the load will come from the nearest data centre, which in the case of Canada means you're probably always getting served from one node only).
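
On that pro tip: the latency sensitivity lives in the backend pool's load balancing settings as additional latency. A sketch, assuming New-AzFrontDoorLoadBalancingSettingObject's -AdditionalLatencyInMilliseconds parameter (names hypothetical; the new setting object still has to be referenced by the backend pool):

# Let backends within 500 ms of the fastest one share traffic instead of
# everything going to the single nearest region.
$lb = New-AzFrontDoorLoadBalancingSettingObject -Name "spread-load" -AdditionalLatencyInMilliseconds 500
Set-AzFrontDoor -ResourceGroupName "my-rg" -Name "my-frontdoor" -LoadBalancingSetting $lb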

The problem with this approach (which still exists to this day) is that whether you use autoswap or start the swap manually, there's an overlap of 1-2 minutes when both sides are down. Luckily this has decreased with the P2V2 nodes, so it's hard to detect (and sometimes we get lucky and there's no downtime), but it's still not 100%. As for load balancing, hopefully at some point the Traffic Manager AzureRm commands will be available for Front Door; then I can just shut down one side, do the deployment to it, then bring the other side down. But again, in all my testing there are delays everywhere when you run PowerShell commands; even though they finish, you still might not see the actual results for a few minutes.

Trying to achieve zero downtime seems almost impossible with the slot swapping behaviour so far for me.

ctolkien commented 4 years ago

1-2 minutes when both sides are down is in no way acceptable, especially when running multiple instances with Front Door!

We need to deploy updates and have them be seamless, even midway through users making purchasing decisions on our site. Like others here, we've discerned that it is the restarting of the staging slots that can wreak havoc on the CPU, which in turn impacts production users.

It would be awesome to, perhaps, set the priority of that task lower so that it doesn't impact production. I don't care if warming up staging takes another 30 seconds.

pollax commented 4 years ago

Another 3 months with no updates. Anyone heard anything?

raaviiqbal commented 3 years ago

We are trying to migrate from Cloud Services (classic) to App Services and I ran into this exact issue: my production slot performance goes totally out of whack during a swap. It seems that the CPU on the staging slot spikes and causes the production response time to increase 10x.

Is there now a way to use two different App Service plans, one for prod and one for staging?

If not, how can people run production sites on App Services? I'm going to have to abort this and keep my site on Cloud Services...

cjblomqvist commented 3 years ago

@raaviiqbal load balancing is most likely your only alternative to avoid the spikes (it will give you basically the same effect as swapping with two different resources, but is less convenient)

raaviiqbal commented 3 years ago

@cjblomqvist gotcha - so the staging swap built into the app service stuff isn't really designed for a production system where it is unacceptable to have impacts to the production response time during swap?

I'm using Azure Front Door, so would it basically work like this (sketched below)?

1) Use an ARM template to deploy an entirely new app service with its own App Service plan
2) Warm it up and run automated tests on it
3) Add it into Front Door as a backend
4) Tell Front Door to route all the traffic to the new backend
5) Destroy the old backend

Sort of like what's described in @philliproux's article here: https://philliproux.com/post/azure-front-door-blue-green-deployments/
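
In Az PowerShell terms the flow might look roughly like this; a sketch under heavy assumptions (hypothetical names and template, and the usual Front Door propagation caveats):

# 1) Stand up the new ("green") app from an ARM template.
New-AzResourceGroupDeployment -ResourceGroupName "my-rg" -TemplateFile .\webapp.json -appName "my-app-green"

# 2) Warm it up directly, bypassing Front Door, before it takes traffic.
Invoke-WebRequest -Uri "https://my-app-green.azurewebsites.net/" -UseBasicParsing | Out-Null

# 3+4) Add it to the backend pool and disable the old ("blue") backend.
$fd = Get-AzFrontDoor -ResourceGroupName "my-rg" -Name "my-frontdoor"
$fd.BackendPools[0].Backends += New-AzFrontDoorBackendObject -Address "my-app-green.azurewebsites.net"
($fd.BackendPools[0].Backends | Where-Object { $_.Address -eq "my-app-blue.azurewebsites.net" }).EnabledState = "Disabled"
Set-AzFrontDoor -InputObject $fd

# 5) Remove the old app once traffic has fully shifted.
Remove-AzWebApp -ResourceGroupName "my-rg" -Name "my-app-blue" -Force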

cjblomqvist commented 3 years ago

Yep, although it might make more sense to keep both up unless you're very budget constrained. @raaviiqbal

raaviiqbal commented 3 years ago

@cjblomqvist @bsimser @pollax @ctolkien - you can now split the app service across App Service plans, such that the prod slot is on a different plan than the staging slot, thereby eliminating the issue of the slot warmup bogging down the CPU and affecting prod traffic. @ruslany cleared this up for me on his blog and I confirmed it has the desired effect.
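
A sketch of that split, assuming Set-AzWebAppSlot's -AppServicePlan parameter (names hypothetical; the second plan has to be in the same resource group, region, and OS type):

# Give the staging slot its own plan so its warmup CPU can't starve production.
New-AzAppServicePlan -ResourceGroupName "my-rg" -Name "my-plan-staging" -Location "Canada Central" -Tier "PremiumV2" -WorkerSize Small
Set-AzWebAppSlot -ResourceGroupName "my-rg" -Name "my-app" -Slot "staging" -AppServicePlan "my-plan-staging"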

jvano commented 5 months ago

Hi

If the problem persists and is related to running it on Azure App Service, please open a support incident in Azure: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request

This way we can better track and assist you on this case.

Thanks,

Joaquin Vano
Azure App Service