Closed AlanParr closed 9 years ago
I created a new entry in the Known Issues page to capture everything that someone needs to know about this issue: https://github.com/projectkudu/kudu/wiki/Known-issues
Previous content:
We deployed Kudu 47.40908.1797.0 last Tuesday (you can see version by going to root of Kudu). Try setting KUDU_EXTENSION_VERSION=47.40826.1772
in the Azure App Settings to go back to the previous build (verify that setting is effective by going to Kudu root again).
This way we can isolate whether the issue is related to the new Kudu, or possibly some unrelated environmental thing.
I've just done a deployment and it's back down to 5 minutes using Kudu 47.40826.1772 so looks like it is most likely related to that.
Are there release notes so I can see what I am missing out on by sticking back a version?
Hmmm, that's concerning, as nothing fundamental has changed. I guess the one thing is the switch to the new msbuild, which also implies the new Roslyn-based compiler, etc...
Can you try going back to the newest Kudu, but set MSBUILD_PATH=D:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe
. That would let us further isolate.
Also, you can try manually building clean from Kudu Console, comparing D:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe
and D:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe
New Kudu with old msbuild is 5 minutes also.
It's concerning that the new msbuild/csc could be that much slower for your solution. Is there anything unusual about it? Can you describe what it's using? e.g. is it MVC, Web API, etc... and what version?
Also, can you share your site name, either directly or indirectly? This may help us investigate.
It's just a bog-standard MVC4 app targeting .Net 4.5. I'm using VS2015 which would be using msbuild 14 and am experiencing no issues locally.
Site is cms-staging-weu-site. I do have a custom deploy.cmd, although from what I can remember, the only thing I changed is setting the solution path.
Is the repo public? If not, any chance you can give me access to the repo such that I could try deploying it on my own test site?
We have test repos with all kind of variations of MVC and framework, but we have not seen this kind of extreme slow perf.
~~Rather oddly, I thought I'd try reverting back to using the default msbuild today to check out the logs on a long build, and it seems to be fine now. Unfortunately, I didn't copy the full trace from when it was going slow before, although the fact that all the time was spent on copying to temp makes me think it was something post-build, so not necessarily msbuild related, perhaps that was a red herring. I'll continue doing deployments today and see if it occurs again.~~
Forget the above, the problem is back again. Will try a few things and repost.
The repo isn't public as it is a commercial application and I'm not sure our account levels at our source hosts would allow me to grant access to an external user. However, if you've got a private repo you could grant me temporary access to, I would be happy to push to that.
Some further testing: It looks like msbuild is running with really high memory usage, see attached screenshot of process explorer.
This is occurring across 3 different site instances, located in West Europe, East US, and East Asia, so it isn't localised.
Two of the deployments that have taken a long time are:
Example of a good deployment:
Comparing deployment logs between deployments 230b and 53a6, when overriding msbuild with msbuild 14, it outputs warnings, which didn't occur in the good deployment. There doesn't seem to be any single item that takes longer, every item just seems to take longer.
It is worth noting that my repository also contains another MVC4 site and an asmx web service. These are deployed in to their own Web Apps on the same Shared server. When the main site is misbehaving, they take about 20 minutes to build. When the main site is using the framework msbuild, these are fine too without having their msbuilds overridden, so I suspect their speed problems are just the result of resource starvation due to whatever problem the main site is having.
I see. So the theory here is that the new Roslyn compiler (VBCSCompiler.exe) is not slower, but it eats up way more memory, causing the machine to thrash and be super slow if it doesn't have enough.
Note that you can monitor the CPU and memory usage of your overall VM over the last 24 hours using https://portal.azure.com/:
This could confirm the theory if the slow builds correlate with high usage.
The CPU usage on the app service plan increased from 50% at 9am, peaking at 98% at 10am, and dropping down slowly after that. This approximately correlates with the big 50 minute build at 10 and a number of less slow, but still slower than they should be, builds either side of this.
On an additional likely related note, I've just had one deployment fail with the following error:
D:\Windows\Microsoft.NET\Framework\v4.0.30319\Microsoft.CSharp.targets(316,9): error MSB6006: "Csc.exe" exited with code -1. [D:\home\site\repository\Product\Production\Model\Model.csproj] Failed exitCode=-1, command="D:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe" "D:\home\site\repository\Product/Production/MVC/MVC.csproj" /nologo /verbosity:m /t:Build /t:pipelinePreDeployCopyAllFilesToOneFolder /p:_PackageTempDir="D:\local\Temp\8d2bdd3aecbffeb";AutoParameterizationWebConfigConnectionStrings=false /p:SolutionDir="D:\home\site\repository\Product\" /p:Environment=staging;Configuration=staging Command 'starter.cmd deploy.cmd' was aborted due to no output nor CPU activity for 61 seconds. You can increase the SCM_COMMAND_IDLE_TIMEOUT app setting (or WEBJOBS_IDLE_TIMEOUT if this is a WebJob) if needed.\r\nstarter.cmd deploy.cmd One or more errors occurred. Command 'starter.cmd deploy.cmd' was aborted due to no output nor CPU activity for 61 seconds. You can increase the SCM_COMMAND_IDLE_TIMEOUT app setting (or WEBJOBS_IDLE_TIMEOUT if this is a WebJob) if needed.\r\nstarter.cmd deploy.cmd
This happened because it was going so slow that Kudu failed to detect activity for 60 seconds and aborted the build.
Well, I think we understand the issue now, and your best bet is to use the pre-Roslyn compiler.
We'll try to bring up the memory usage to the Roslyn team, but it's unlikely to be something they can just fix. They have a completely different architecture from before, and presumably it needs more memory to do its thing :(
I've just compared the csproj for the main site where I am having troubles and the MVC Rest site which is fine, below are the main differences that I can see.
<Import Project="$(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\TypeScript\Microsoft.TypeScript.targets" Condition="Exists('$(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\TypeScript\Microsoft.TypeScript.targets')" />
I am assuming that Visual Studio is not installed on the VM, so this could be our culprit? Any idea what a suitable alternative might be?
So you mean 14.0 for the Rest site and not 4.0? 12.0 is VS 2013, and 14.0 is VS 2015.
VS is not installed, but the various SDKs are. e.g. using Kudu console you can see that the file D:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v14.0\TypeScript\Microsoft.TypeScript.targets
.
But maybe I'm not following your reasoning here. If something needed was missing, things would fail to build. Here, everything builds, but the new toolset eats up more memory. So I don't think the issue relates to missing files.
No, rest site is definitely 4.0, I was confused by it too!
Admittedly, I was grabbing at straws, the presence of VisualStudio in the path just threw me off.
If you or the Roslyn guys want the code base to test with, you can contact me at alan dot parr at rexson dot co dot uk and we can find a way to get it to you. You've just got to promise not to laugh at my poor coding practices though :)
For extra info, I am using the Roslyn compiler locally in VS2015 without issue on x64 Windows 8.1.
Right, 4.0 is VS 2012, 12.0 is VS 2013, and 14.0 is VS 2015. Because we're really good at numbering things ;)
Even though you're using it without issues locally. I would guess that it is also takes much more memory than with the old toolset. But when you have the memory, it's fast and unnoticeable.
I checked to confirm, while building the main site, msbuild is only going up to 20mb.
If you want to more closely replicate what happens in Azure, try to do a Publish to File System (some details here).
Also, it might be hard to catch the peak memory usage from Task Manager, because if it's not out of memory, it may be very brief.
I'll give it a try.
Just gave it a try, worked fine. As you mentioned it might not be visible, but I had my eye on task manager the whole time and it didn't go over 32MB from what I could see.
I don't think we've quite got to the bottom of this one. I am now experiencing the long builds and 100% CPU using the old msbuild. This is affecting all projects, not just the main site. Will do some further tests, including reverting Kudu build again, on Monday.
That's quite unexpected. Where are you seeing the CPU being at 100%? Is that from the Preview Portal?
Right now, CPU usage is very low on your Medium worker, and memory usage is only 69%. Would be interesting to check at the very time you deploy.
Hello David, unfortunately we experiencing the same difficulties as AlanParr. Funny that once you closed this issue we even couldn't test new changes.
We have several resource groups in Southeast Asia and each one contains 9 webapps, they are connected to different branches of our private github repository.
During last 1.5 weeks it's not possible to deploy anything there. We faced the same behaviour with slow performance on our environments, we tried to change our app service plans to more powerful and there is no difference, even on P2 plan.
Also we opened ticket with SR number 115091513162734
5 days ago, unfortunately we didn't get any help, guys just collecting information and that's all.
We tried to make the same environments in neighbor geo-locations, created new app service plans and no one didn't help. Our development process is blocked since we faced that.
We created custom deployment script with next steps:
Each build attempt has the different result (failed or success), building time is always different it could take 30 minutes, 1 hour, and 5-7 hours per one website when normal time is 5-10 minutes.
Last build was successful for two resources after 7 hours of building, others were failed due to time out.
{
"log_time": "2015-09-20T09:25:22.8826931Z",
"id": "",
"message": "Command 'starter.cmd deploy.cmd' was aborted due to no output nor CPU activity for 60 seconds. You can increase the SCM_COMMAND_IDLE_TIMEOUT app setting (or WEBJOBS_IDLE_TIMEOUT if this is a WebJob) if needed.\\r\\nstarter.cmd deploy.cmd",
"type": 1,
"details_url": null
},
{
"log_time": "2015-09-20T09:25:22.9139291Z",
"id": "",
"message": "One or more errors occurred.",
"type": 2,
"details_url": null
},
{
"log_time": "2015-09-20T09:25:22.9295511Z",
"id": "",
"message": "Command 'starter.cmd deploy.cmd' was aborted due to no output nor CPU activity for 60 seconds. You can increase the SCM_COMMAND_IDLE_TIMEOUT app setting (or WEBJOBS_IDLE_TIMEOUT if this is a WebJob) if needed.\\r\\nstarter.cmd deploy.cmd",
"type": 2,
"details_url": null
}
I've created dummy webapp illegal.azurewebsite.net.
One of the websites has next names:
b*****-dev-f***.azurewebsites.net
and another resource group:
b*****-east-asia-dev-f***.azurewebsites.net
Hi Maksim,
I actually investigated your case yesterday based on your ticket, and had some findings which I sent to support team.
TL;DR: the slowness is caused by out-of-memory conditions on the Medium VM. You need to reduce memory usage (suggestions below), or move to a Large.
I’ve identified a few reasons for the high memory usage.
I don’t know what this site is doing, but it has a strange memory usage pattern: it starts low, then quickly climbs to over 2GB (within a minute or two). And then it suddenly drops back to almost nothing (like 10MB). And it instantly starts climbing again. This is happening continually.
2GB usage for one w3wp process is very significant on a VM with a total of 3.5GB.
This is what I mentioned above, and it is a recent regression, in the sense that we recently switched to use the new toolset by default. Please have them try the suggestion below: set MSBUILD_PATH=D:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe in the Azure App Settings.
They have 36 sites running in that App hosting plan (counting staging sites, which are separate sites).
The strange thing is that you have an App Hosting Plan named b***_-dev and another named b**-develop. Yet it seems all the sites are in the first one, even if the site names start with b_-develop-. I wonder if this is unintentional, and you meant to put them in the other Plan, which is currently unused?
If needed, you can contact me directly at david.ebbo AT microsoft
Thanks for you fast reply, David!
Didn't know that you are involved in our ticket. I decided to write here because we didn't receive any information.
You need to reduce memory usage (suggestions below), or move to a Large. We thought this high memory usage caused by continuous build processes.
I just added MSBUILD_PATH=D:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe
variable
to b***-dev
resources and they are successfully built. Looks like this workaround helped.
Build started at 14:14:53 UTC and last website finished at 14:29:57.
They have 36 sites running in that App hosting plan (counting staging sites, which are separate sites). The strange thing is that you have an App Hosting Plan named b**-dev and another named b-develop. Yet it seems all the sites are in the first one, even if the site names start with b*-develop-. I wonder if this is unintentional, and you meant to put them in the other Plan, which is currently unused?
Don't know about 36 websites :smile:
Actually we always used b***-dev
resource group with the same name for hosting plan until faced this problem. b***-develop
was created by request from Microsoft support engineer he asked us to move all "faulty" resources to b***-develop
, probably he thought it could be helpful in this case.
Hi David
I just tried to chart the period of time when I was doing the build on Friday but the portal won't let me change the date range. It was between 16:30 and 17:15 GMT on Friday on the Medium worker.
I've reverted back to old Kudu on the Medium worker and the build is back down to 5 minutes. On Monday, I will try continuously deploying one after the other for as much of the day as I can to see how things stack up.
I will leave my 2 Small workers running the current Kudu build for comparison.
@AlanParr It's really quite surprising that reverting to previous Kudu does more than just reverting to old msbuild. Stay with that for now, but note that it can't be a permanent solution, as old builds don't stay up that long. If needed, we can investigate at the same time sometime next week and I can look up memory usage at the very time you deploy (using new Kudu and old msbuild).
I'm not sold on it being kudu either, I'm thinking more msbuild + something. Hammering with deployments tomorrow should help determine whether or not kudu is that something.
@davidebbo Hi David! I'm working in the same team with @mskutin.
Is that true that web apps in the same App service plan (or resource group) can affect each other's performance?
I'm not quite understand does it affect only two persons (AlanParr and us)? If it's kudu/msbuild/roslyn version issue it should affect everyone, right?
@ebashmakov
Is that true that web apps in the same App service plan (or resource group) can affect each other's performance?
Absolutely. Sharing App Service Plan means sharing VMs. So they can definitely affect each other. However, that's not necessarily the case for resource groups, which can have multiple App Service Plans. Only the App Service Plans is relevant here.
I'm not quite understand does it affect only two persons (AlanParr and us)? If it's kudu/msbuild/roslyn version issue it should affect everyone, right?
That's an interesting question, and to be honest at this point it's still not fully understood. But it's very possible that some builds are people's local machines are taking much more memory than before, but that it doesn't harm anything due to having more memory available.
@davidebbo Thanks for your answers!
So actually there is no solution except scale-up/scale-out, isn't there? I think in this situation we will consider to try web deploy.
I know we are getting out of this issue topic. But could you please help us to understand the purpose of App Service plans? If we create separated App service plan for each Web app could it increase their performance?
Is there something else that we can do to clarify the real reason of this issue?
Scale out probably won't help, as each VM will be running the same code, with the same memory constraints. But scaling up will, as it gives the VM(s) more resources.
The purpose of App Service plans is to run difference sites in different pools of VMs, which you can scale up/out independently of each other.
The higher memory issue with the new compiler doesn't seem to affect all apps. Once we get a good repro that has the issue and that we can run ourselves, we can hopefully get to the next level on investigation.
I've been running some additional tests today, which unfortunately have not cleared things up in the slightest!
I've set up the following configuration: West Europe - Medium, Old MSBuild, Old Kudu 47.40826.1772 East US - Medium, Old MSBuild, Current Kudu (47.40908.1797.0) * Small for Build 1 East Asia - Medium, New MSBuild, Current Kudu (47.40908.1797.0) * Small for Build 1
Build | Start Time (GMT) | West Europe | East US | East Asia |
---|---|---|---|---|
1 | 09:00 | 4 mins 6 secs | 11 mins 24 secs* | 30 mins 4 secs* |
2 | 09:36 | 2 mins 2 secs | 13 mins 1 sec | 52 mins 10 secs |
3 | 10:33 | 2 mins 3 secs | 3 mins 7 secs | 4 mins 52 secs |
4 | 10:56 | 2 mins 1 secs | 2 mins 17 secs | 2 mins 39 secs |
5 | 11:07 | 1 min 57 secs | 1 min 20 secs | 3 mins 20 secs |
As you can see, the first couple of builds are what we have been seeing, running the new MSBuild takes significantly longer than old MSBuild on current Kudu, which in turns takes longer than old MSBuild on old Kudu.
Then from Build 3 onwards, everything is fine on all environments. It is worth noting that before these troubles began, all instances were small. I've just upped them to Medium to try and reduce the impact of the MSBuild problems on the Production sites due to resource contention. The builds are faster than the usual 5 minutes, most likely due to upping the instance sizes.
My current thinking is that reverting to Kudu is a red herring as it doesn't consistently improve things. It may be that changing the Kudu version is resetting something on the instance that is the real root of the problem, perhaps coupled with MSBuild itself, although I am starting to wonder if MSBuild isn't also a red herring.
The only truly common factor seems to be the Azure App Service itself, although if that were the case, I would've thought a great deal more customers would be affected by this.
If I can run any further tests, please let me know. I'm really eager to get to the bottom of this, as I'm sure you are. I'm not looking forward to explaining to my customer why their hosting costs have just doubled overnight!
@AlanParr thanks for doing all this testing!
The key thing to remember is that this issue is not about the new compiler being much slower, but primarily about it taking more memory. What causes it to be slower is if the higher memory usage causes the VM to be low on memory and start thrashing. At that point, everything running on the VM will become much slower.
That's likely why the test results are inconsistent. We you'll want to do is correlate them to the VM memory usage, which you can get from the new Portal (under App Service Plan).
Any change you can share the test repo to a private bitbucket repo and give me access (davidebbo)? Hopefully, I'll then be able to repro the behavior.
I'll get that sorted and email you the details.
@davidebbo We're experiencing the exact same things as @AlanParr for the past week as well as getting random Kudo errors such as:
After resolving the 1st and 2nd issues which have never happened before deployments take at least 30 minutes, the same as @AlanParr. The 3rd issue we haven't corrected yet as it sometimes works and sometimes doesn't, but when it happens EVERY website on the standard websites goes down.
Also we are experiencing the following issues:
@davidebbo Should we also try going back a version for Kudo and see if that works?
In case someone missed the fix that was indicated above, like I did.
In Azure Web Apps portal, go to the Azure App Settings and set in the configuration the following key value setting. This will use the older non Roslyn based compiler.
MSBUILD_PATH=D:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe
Reason as indicated above:
So the theory here is that the new Roslyn compiler (VBCSCompiler.exe) is not slower, but it eats up way more memory, causing the machine to thrash and be super slow if it doesn't have enough.
We'll try to bring up the memory usage to the Roslyn team, but it's unlikely to be something they can just fix. They have a completely different architecture from before, and presumably it needs more memory to do its thing :(
@cthames Number 3 seems to be a symptom of the overall issue of msbuild using too much CPU. In most cases, the CPU activity keeps going so Kudu doesn't timeout. However, sometimes the CPU activity seems to stall for long enough for Kudu to timeout, see this link further up in this issue.
@cthames When all this began I was on a Small instance. As well as the MSBUILD_PATH fix, I've also found moving up to a Medium does sometimes reduce the impact of the issue when it occurs.
@AlanParr Yeah, I just confirmed we're on Standard Small for all 8. We thought we had 4 on Small and 4 on Medium but that doesn't seem to be the case anymore. I'll need to look into but maybe we can't have different instance sizes within the same App Service Plan anymore.
Still two things are a little disconcerting.
@cthames A few things:
Ok, I have some potentially good news. I talked to some compiler people and spent a ton of time trying things. Here are the key finding:
SCM_BUILD_ARGS=/p:UseSharedCompilation=false
(and add your own unrelated things to it if you had any). This disables running the compiler as a server. In my tests, this was a lot faster. And this will let you use all the new compiler features, unlike the MSBUILD_PATH
workaround. I would appreciate if you could all give this a try and report back! And make sure to the yank MSBUILD_PATH
setting.. We can make this the default if it works well.Have performed a few deployments with this change and so far so good. Builds on all instances took between 4 and 6 minutes for the first build, 2-4 for subsequent builds. CPU usage is quite variable, ranging from 14% to a maximum of 24% but it's not spiking as before and doesn't seem to be causing any issues.
Note: Rather than change SCM_BUILD_ARGS on all sites, I added the setting in to my deployment script at the end of the msbuild argument string just before SCM_BUILD_ARGS. Same effect, less work :)
@davidebbo I've confirmed the SCM_BUILD_ARGS=/p:UseSharedCompilation=false
works for us. I also confirmed that the old MSBUILD_PATH was removed. ~5 minutes to build, same as old compiler using the MSBUILD_PATH.
Great, thanks for confirming. I have updated the known issues page with this new guidance.
We are going to update Kudu to pass this flag automatically, hopefully in the next few days. However, if you have a custom deployment scripts, you'll need to make that change in there (or stay with the App Setting).
To be more specific, the fix we will make is to modify kuduscript to always pass /p:UseSharedCompilation=false
to the msbuild command line.
Normally, from pushing to build complete takes about 5 minutes. In the last week or so, it is taking between 15 and 30 minutes. Initially, I thought it was because I had overridden MSBUILD_PATH to use the C#6 msbuild, however I have removed this from one site and the build still takes the same amount of time. Reviewing the Kudu trace log shows it is taking nearly 20 minutes to copy the build output to temp, the solution isn't that big so it really shouldn't take very long.
Relevant snippet of Kudu Trace:
2015-09-14T11:40:23 Copying all files to temporary location below for package/publish: 2015-09-14T11:40:23 D:\local\Temp\8d2bcf7a334374d. 2015-09-14T11:57:47 KuduSync.NET from: 'D:\local\Temp\8d2bcf7a334374d' to: 'D:\home\site\wwwroot'