microsoft / ApplicationInsights-dotnet


Memory leak with Application Insights + asp .net core 3.1 + linux containers #1678

Closed. SidShetye closed this issue 4 years ago.

SidShetye commented 4 years ago

FYI, this is also being tracked by Azure support as case 119121121001290

Repro Steps

Please see my personal repo here for both the sample repo and the pre-built containers. Also note that the restricted way Azure App Service spawns containers prevents you from running dotnet-dump on Azure (an issue with the diagnostics team), so this is reproduced outside Azure with the standard VS2019 ASP.NET Core 3.1 template.

Actual Behavior

The user visible behavior is that Azure App Service recycles the container with the log entry

Stoping site <sitename> because it exceeded swap usage limits.

What happens under the hood is that it takes between 1 and 4 days for the memory leak to exhaust about 1.5 GB of memory. This shows up as high used and low available memory, as reported by the free -w -h Linux tool (which reads /proc/meminfo from the container host).

              total        used        free      shared     buffers       cache   available
Mem:           1789        1258          81           5           8         441         378
Swap:             0           0           0

Eventually the container is killed (exit code 137, low memory).

Disabling Application Insights by not setting the APPINSIGHTS_INSTRUMENTATIONKEY environment variable prevents this leak.
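
For context, the repro wires up the SDK the way the standard template does. A minimal sketch, assuming a plain Startup class; the explicit environment-variable check is illustrative and only there to make the with/without comparison obvious (the SDK also picks up APPINSIGHTS_INSTRUMENTATIONKEY on its own):

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Only register Application Insights when an instrumentation key is present,
        // mirroring the "ikey set => leak, ikey unset => no leak" observation above.
        var ikey = Environment.GetEnvironmentVariable("APPINSIGHTS_INSTRUMENTATIONKEY");
        if (!string.IsNullOrEmpty(ikey))
        {
            services.AddApplicationInsightsTelemetry();
        }

        services.AddControllers();
    }
}
```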

Expected Behavior

No memory leak

Version Info

SDK Version :

  <ItemGroup>
    <PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.12.1" />
    <PackageReference Include="Microsoft.ApplicationInsights.Profiler.AspNetCore" Version="2.1.0-beta1" />
    <PackageReference Include="Microsoft.Extensions.Logging.ApplicationInsights" Version="2.12.1" />
  </ItemGroup>

.NET Version: .NET Core 3.1
How Application was onboarded with SDK: csproj integration
Hosting Info (IIS/Azure WebApps/etc): Azure App Service Linux containers, as well as locally as plain Linux containers

pharring commented 4 years ago

@SidShetye For the purposes of elimination (or confirmation), please can you repeat the test, but without the "Profiler.AspNetCore" package?

SidShetye commented 4 years ago

We've spent a ton of time on this to date (see case 119121121001290 and its master case) and can't really sponsor any additional time. Since we've identified the product causing this and even have a repro repo, the idea is that you (the vendor) debug it from this point on.

The leak could be from anywhere in Application Insights, but what I can add off the top of my head is that we did see an almost 4x memory footprint with the profiler enabled within just 30 seconds of launch (not in containers or on Linux).

[screenshot attached: memory footprint with profiler enabled]

SidShetye commented 4 years ago

@pharring I should mention that the support case 119121121001290's secure file exchange area has memory dumps of this situation. There is a reference dump taken a few minutes after starting, and another after 2 days, when memory consumption is at 1.4 GB. The container eventually died after a total of 3 days of uptime, at which point even a simple ps wouldn't fork.

root@c9aa866c3d4c:/storage# ps aux
-bash: fork: Cannot allocate memory
SidShetye commented 4 years ago

@pharring I'm uploading some of the prior work from our support ticket here for easy access. It's based on the VS2019 template repro repo above (no proprietary bits), so it's safe for public distribution.

Memory dumps: Uploaded to Azure Storage here. It should be accessible for the next 7 days for you to download.

Video of the low-memory situation: not a terribly exciting video, but it's only 90 seconds. https://www.youtube.com/watch?v=4JIufEk2hw8

pharring commented 4 years ago

@xiaomi7732 can you take a look?

noahfalk commented 4 years ago

> Memory dumps: Uploaded to Azure Storage here. It should be accessible for the next 7 days for you to download.

@SidShetye Any chance we can get these dumps back with a longer expiration time? Folks on the AppInsights team asked if I could take a look but at the moment I have access to neither the dumps nor a valid ikey to make progress. If possible the dumps are likely to be a much faster way to get useful info than waiting to get the repro set up and leaking again.

SidShetye commented 4 years ago

> Any chance we can get these dumps back with a longer expiration time?

Traveling and on mobile, but this should work: https://crypterondiag.blob.core.windows.net/reported-issues/appinsight-memory-blowup/2020-02-12_09_14_00%20generic%20aspnetcore3.1%20AppInsights%20memoryleak.zip?sp=r&st=2020-01-02T03:47:31Z&se=2020-05-01T10:47:31Z&spr=https&sv=2019-02-02&sr=b&sig=rv1HBvhY%2FP8EcY%2FFV4XdKff7S6mmLe8u0rNx6Q09YlQ%3D

noahfalk commented 4 years ago

Thanks @SidShetye ! A dev on my team started looking at the dumps today and ruled a few things out but didn't get to a root cause yet. We'll update once we learn more.

evilpilaf commented 4 years ago

This looks related to a similar issue I'm having with a .NET Core 3.1 web API running in Azure App Service. The memory slowly builds up over the course of a day, maxing out the instances and causing the pools to be recycled. Attached is the memory dump we got from production, where we can see that the issue comes from the AddApplicationInsightsTelemetry() method. [screenshot attached: "appinsights leak"]

SidShetye commented 4 years ago

@noahfalk Any update in the last 3 weeks?

noahfalk commented 4 years ago

Hi @SidShetye, one of my coworkers has been working on this for a considerable amount of time these last few weeks and we are continuing to do so; I apologize that it is taking a long time to solve. We are seeing that native memory is leaking, which is much more time-consuming to track down than a managed memory leak. We attempted to use several different memory-leak analysis tools on Linux, but thus far they've either failed to run or the data they produced has made no sense. We are now at the point where we are making custom modifications to a tool called libleak, and we've identified potentially problematic allocation stacks, but the callstack unwind/symbolication from the tool is unusable. We are trying a next set of tool modifications that will let us stop the app under a debugger on the apparently problematic allocations so that we can gain access to higher-quality stack unwinds.

SidShetye commented 4 years ago

thanks @noahfalk

On a side note, it wouldn't hurt to talk to the .NET diagnostics tool team so any custom tool enhancements at your end can flow back to the broader .NET community. It's only a matter of time before other folks also have to debug managed + unmanaged leaks on Linux. Perhaps you can save them that frustration.

noahfalk commented 4 years ago

> it wouldn't hurt to talk to the .NET diagnostics tool team

I think the team you are referring to is my team : ) My immediate coworkers and I own tools such as dotnet-dump, dotnet-trace, and dotnet-counters, as well as all the low-level diagnostic features implemented inside the .NET runtime. @pharring above is on the AppInsights team and he asked if we could help them out with investigating this.

> so any custom tool enhancements at your end can flow back to the broader .NET community

The modifications we are hacking together would be pretty rough for someone else to try repurposing as-is, but I definitely agree with the broader sentiment. I want to stay focused on getting this bug solved for you first, but after that we can take stock of what code/knowledge we gained in the exercise and how we can leverage it for broader community benefit.

SidShetye commented 4 years ago

@noahfalk Ok, got it. Thanks for those tools, hopefully we can open dotnet-dumps within VS/VSCode's GUI some day! A topic for another day though.

Back to this: I wanted to add that the leaks we observed were seen on 2+ core systems with server GC. On a single-core server, or with workstation GC, we didn't observe those leaks. We had mentioned this on the internal support ticket but just noticed that this GitHub issue doesn't mention it.
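
For reference, a quick way to confirm which GC mode a given instance is actually running is to log it at startup. A minimal sketch using standard runtime APIs (the class name is illustrative):

```csharp
using System;
using System.Runtime;

public static class GcModeReport
{
    // Prints whether the process is running server GC and how many cores it sees.
    // Relevant here because the leak was only observed with server GC on 2+ core hosts.
    public static void Print()
    {
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
        Console.WriteLine($"Processor count: {Environment.ProcessorCount}");
    }
}
```

Server GC can also be switched off for an experiment via `<ServerGarbageCollection>false</ServerGarbageCollection>` in the csproj or the `COMPlus_gcServer=0` environment variable, though that is a diagnostic aid rather than a fix.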

noahfalk commented 4 years ago

Hi @SidShetye, a quick update, and thanks for your continued patience on this long-running bug. The custom tooling work we did is paying dividends: we isolated two callstacks with a high suspicion of leaking memory. As of a few days ago we switched modes to root-causing, creating tentative fixes, and doing measurements to assess the impact of the changes. Those tasks are still in progress, but it is very satisfying to have some leads now.

noahfalk commented 4 years ago

Hi @SidShetye, just wanted to give another update on progress. Due to the tireless efforts of my coworker @sywhang, we found a few issues with varying contributions, but the one that appears most problematic plays out like this: [chart attached; described below]

The yellow line is a baseline scenario that we modified slightly from the one you provided, to make the Application Insights service profiler capture traces much more frequently. The blue line is after we applied a tentative fix in the coreclr runtime. The x-axis is seconds, the y-axis is committed memory in MB, and we ran both test cases long enough that we thought the trend was apparent.

Interestingly, we thought the native CRT heap was growing because something was calling malloc() and forgetting to call free(), but that wasn't the culprit. Instead, the malloc/free calls were correctly balanced, but there was a bad interaction between the runtime's allocation pattern and the glibc heap policies that decide when unused CRT heap virtual memory should be unmapped and where new allocations should be placed. We didn't determine why specifically our pattern was causing so much trouble for glibc, but the tentative fix uses a new custom allocator that gets a pool of memory directly from the OS with mmap().

This particular issue is in the coreclr runtime (the service profiler uses a runtime feature to create and buffer the trace), so we'd need to address it via a runtime fix. Our current tentative fix still has worse throughput than the original, so we are profiling and refining it in hopes that we don't fix one problem only to create a new one. We'll also need to meet with our servicing team to figure out what options we have for distributing a fix once we have it.

I'll continue to follow up here as we get more info. In the meantime, we found that setting the environment variable MALLOC_ARENA_MAX to a low value appeared to substantially mitigate the memory usage, so that is a potential option for you to experiment with if you are trying to get this working as soon as possible. Here are a few links to other software projects that were dealing with similar issues, to add a little context:

https://github.com/prestodb/presto/issues/8993
https://github.com/cloudfoundry/java-buildpack/issues/320
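
To try that mitigation in a Linux container, the variable has to be set in the container environment before the process starts (glibc reads it at startup), not from managed code. A sketch under those assumptions; the base image, app name, and the value 2 are illustrative, and on Azure App Service the same variable can be set as an application setting:

```dockerfile
# Sketch: cap the number of glibc malloc arenas for the whole container.
FROM mcr.microsoft.com/dotnet/core/aspnet:3.1
ENV MALLOC_ARENA_MAX=2

WORKDIR /app
COPY ./publish .
# "SampleApp.dll" is a placeholder for the repro app's assembly.
ENTRYPOINT ["dotnet", "SampleApp.dll"]
```

For a quick local test, `docker run -e MALLOC_ARENA_MAX=2 ...` has the same effect.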

SidShetye commented 4 years ago

Thanks for the update, and wow, this runs deep! Perhaps a write-up on the Microsoft devblogs would be great, especially as an advanced follow-up to this piece. The folks root-causing this deserve the credit/recognition.

noahfalk commented 4 years ago

@ank3it can you assign @sywhang? It appears I can remove my own assignment but can't reassign to anyone else (odd). @sywhang has been doing the lion's share of the work, I've just been writing up a little progress here on his behalf.

sywhang commented 4 years ago

Hi @SidShetye,

Sorry it took quite a while for me to try out various approaches to fixing the underlying issue and to decide on a reasonable fix in terms of complexity and performance impact, but the fix finally got merged to the master branch of the dotnet/runtime repository in https://github.com/dotnet/runtime/pull/35924.

Would it be possible for you to try out the latest branch and confirm that we've fixed your issue? I'm now trying to backport the fix to our servicing release of .NET Core 3.1, and having validation from you, as the first customer who reported this issue, would let me go through the backport process for the .NET servicing release.

Thank you!

SidShetye commented 4 years ago

@sywhang I really appreciate the effort in fixing this, but we have already stripped Application Insights out of our stack. Is there a write-up on this? It seems like it was a very tricky issue, so a post-analysis write-up could help others too.

sywhang commented 4 years ago

Thanks @SidShetye for the update! I'll try to get the backport approved regardless so that others won't hit the same issue.

Regarding a write-up, I don't have anything written up on this yet because we're so close to the .NET 5 release snap date and I have a huge stack of work items to go through, but I am definitely planning to share some of the tools I wrote as part of this investigation, as well as a write-up of the investigation, as soon as I have some extra cycles : )

forik commented 4 years ago

We are also encountering memory issues and suspect AI. Below is a screenshot of a diff between two dotnet-gcdump files taken an hour apart. There seems to be a DiagnosticListener leak somewhere inside AI. [screenshot attached]

noahfalk commented 4 years ago

@forik I'd recommend opening a new issue with the information on your bug so it can be better tracked. The new issue template will ask for info specific to your setup and from what you posted so far it appears unlikely that your issue is a duplicate of Sid's issue.

sywhang commented 4 years ago

The fix for this issue has been merged to both the master branch and the .NET Core 3.1 servicing branch (targeting the 3.1.5 servicing release) of the runtime repo. I don't have write access to this repository, but I believe the issue can be closed now.

SidShetye commented 4 years ago

Can’t test for confirmation but I’ll go ahead and close it.

yigitgursoy commented 3 years ago

Still happening in .NET 6.0

[screenshot attached]

saveveltri commented 8 months ago

I can confirm that it's still happening in .NET 6. Does the MALLOC_ARENA_MAX workaround still work well?

noahfalk commented 8 months ago

If you are seeing unbounded memory growth in App Insights profiler running on .NET 6+ then it is probably going to be a bug in a different code path, and potentially a different root cause. I'd recommend creating a new issue so the App Insights team can investigate it.

> Does the MALLOC_ARENA_MAX workaround still work well?

I don't know whether MALLOC_ARENA_MAX will affect the behavior you are seeing. It's possible that it will help if your leak is coming from some other code path that continues to use malloc(). The allocations that were problematic in the scenario we investigated for this bug years ago were converted to use mmap(), so I wouldn't expect those specific allocations to be leaking or to be impacted by MALLOC_ARENA_MAX any longer.