Closed: yahorsi closed this issue 3 years ago.
The suspect is C:\Dev\ApplicationInsights-dotnet\src\Microsoft.ApplicationInsights\Extensibility\Implementation\RichPayloadEventSource.TelemetryHandler.cs
Here are the fields, in the exact order I see them in the dump memory:
var writeMethod = writeGenericMethod.MakeGenericMethod(new { IKey = (string)null, Id = (string)null, Name = (string)null, RootId = (string)null }.GetType());
The only one I can't see is Operation at the beginning. What does this code do, and what happens when it is called many times?
@yahorsi Thanks for the report.
That code path (MakeGenericMethod) is for .NET 4.5 only. You said your app is .NET Core 2.2. So, if that code path is being hit, you've somehow pulled in the .NET 4.5 flavor of Application Insights.
However, there is another version of RichPayloadEventSource that will be used on other platforms (.NET 4.6+ and .NET Core) that doesn't use reflection but will also serialize telemetry in the way you described. That's more likely to be the culprit.
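Roughly, both flavors boil down to a self-describing EventSource write where the anonymous type's property names (IKey, Id, Name, RootId) end up in the serialized payload. A minimal sketch follows -- the source name, event name, keyword value, and method shape are illustrative, not the SDK's actual code:

using System.Diagnostics.Tracing;

// Minimal sketch only: not the real RichPayloadEventSource code.
[EventSource(Name = "Demo-RichPayload")]
sealed class DemoRichPayloadEventSource : EventSource
{
    public static readonly DemoRichPayloadEventSource Log = new DemoRichPayloadEventSource();

    [NonEvent]
    public void OperationStart(string iKey, string id, string name, string rootId)
    {
        if (!IsEnabled()) return; // nothing is serialized unless a listener/session is attached

        var options = new EventSourceOptions
        {
            Keywords = (EventKeywords)0x400,   // "Operations"-style keyword (value assumed here)
            Opcode = EventOpcode.Start,
            Level = EventLevel.Informational
        };

        // Write<T> serializes the anonymous type's property names along with the values,
        // which is why IKey/Id/Name/RootId show up as strings in the payload.
        Write("Operation", options, new { IKey = iKey, Id = id, Name = name, RootId = rootId });
    }
}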
As for why they're piling up in memory, I don't know. These are ETW events (an EventSource) and they should be emitted only when someone is listening (an EventListener) with the right keywords. In this case, these are "Operations" -- requests or dependencies -- things that have duration; they are recorded as pairs of ETW events with opcodes Start and Stop. They are usually used for performance measurement and profiling.
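For illustration, an in-process listener that enables only that kind of keyword could look like this (a sketch; the 0x400/Informational values mirror the provider arguments shown later in this thread, and the source name refers to the demo sketch above, not the SDK's real definitions):

using System;
using System.Diagnostics.Tracing;

// Sketch: enable only the "Operations"-style keyword on the demo source above.
sealed class OperationsListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Demo-RichPayload")
            EnableEvents(source, EventLevel.Informational, (EventKeywords)0x400);
    }

    protected override void OnEventWritten(EventWrittenEventArgs data)
    {
        // Start/Stop pairs for one operation arrive as two separate events.
        Console.WriteLine($"{data.EventName} opcode={data.Opcode}");
    }
}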
The Application Insights Profiler does use these events, and it will enable ETW listeners for the Operations keyword. So that's the most likely reason we're going down this code path. The serialization of these objects to native memory happens inside the EventSource code (in corefx) -- inside WriteEvent.
(I'll have to continue this later... I'm already running late for something else. But I do think there might be something there... We need a good repro of the memory growth. If you can get one outside of Azure, that would also help eliminate a large complicating factor.)
I wonder if ETW/PerfView guru @vancem could give us some advice on how to investigate this further. Currently the issue cannot be reproduced locally; it only reproduces on some of our Azure environments.
We have 3 dumps of the same process (x86, dotnet.exe) taken at different points in time.
The dumps can be found here: https://drive.google.com/file/d/1otuVIB_ZX0CTptkFENOYkPv1jk4IwR5e/view?usp=sharing
@pharring are you on this?
@yahorsi perhaps you can tell what environment and settings you are running with. Even the best of us need some help narrowing down scope before reading dumps =)
By environment I mean: is Application Insights Profiler enabled, for instance? Any custom configuration for Application Insights? Windows version?
.NET Core 2.2.3 running as an Azure App Service. Azure uses the x86 .NET Core runtime, so it is an x86 process running, I believe, on x64 Windows. The Windows version is hard to say; the dump might say more, but again, tomorrow :) Private bytes grow to 1.7 GB and then the app gets restarted (or crashes), not sure how that works.
Windows version?
The one Azure uses :) We do not install or manage it.
By environment I mean: is Application Insights Profiler enabled, for instance?
It should be enabled now. I'll make more screenshots tomorrow morning.
What version of Application Insights? The charts look like the CPU charts here: https://github.com/Microsoft/ApplicationInsights-aspnetcore/issues/690
I believe that was a memory leak too.
CC: @cijothomas
I believe that was a memory leak too
It is definitely a leak. The question is where :) I did my best to catch it in the managed world and failed. As far as I can tell, the managed heap looks OK and is not increasing nearly as much as private bytes grow. As I understand it, we have something that causes the unmanaged heap to grow.
But I might be mistaken.
@yahorsi so what version of the AI SDK do you have? As I said, we fixed a leak in HttpClient in the past -- see the associated issue. And I believe that issue was leaking HTTP connections, which is also unmanaged memory.
From the dump: Windows 10 Version 14393 MP (2 procs) Free x86 compatible; Product: Server, suite: TerminalServer DataCenter SingleUserTS; 10.0.14393.2430 (rs1_release_inmarket_aim.180806-1810)
Microsoft.ApplicationInsights.dll version 2.08.1.22898
Microsoft.ApplicationInsights.AspNetCore.dll version 2.05.1.0
Sorry to leave you hanging. I've been heads down all day on other things. Thanks for all the extra info, @yahorsi.
The extra wrinkle here is how ETW events are captured from within the App Service sandbox. We (App Insights Profiler) have to go through a special service to enable/disable ETW providers. As far as I know, though, that shouldn't cause events to pile up in the sending process. EventSource's WriteEvent code passes pinned memory to the equivalent Win32 API.
I will run some experiments locally to see if I can reproduce the problem. An ASP.NET Core 2.2 app with App Insights 2.8.1 (base) and 2.5.1 (AspNetCore). I'll use a "no-op" telemetry sink, turn on the profiler's listener and blast the app with requests to see if it blows up. If it does, I would suspect something in .NET Core 2.2 (corefx) because this stuff has been working fine for years on .NET FX and earlier versions of .NET Core.
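By a "no-op" telemetry sink I mean something along these lines (a rough sketch only; NullChannel is a made-up name, and this assumes the AspNetCore SDK honors an ITelemetryChannel registered before AddApplicationInsightsTelemetry):

using Microsoft.ApplicationInsights.Channel;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

// A channel that drops every item, so the test isolates in-process
// behavior (ETW serialization) from actual ingestion.
public class NullChannel : ITelemetryChannel
{
    public bool? DeveloperMode { get; set; }
    public string EndpointAddress { get; set; }
    public void Send(ITelemetry item) { } // discard
    public void Flush() { }
    public void Dispose() { }
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Register the channel before AddApplicationInsightsTelemetry so the
        // SDK picks it up instead of its default channel.
        services.AddSingleton<ITelemetryChannel, NullChannel>();
        services.AddApplicationInsightsTelemetry("11111111-2222-3333-4444-555555555555");
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseMvc();
    }
}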
One more thing that may or may not be relevant: I have had a couple of anecdotal reports that the RichPayloadEventSource code can get enabled even though "no-one is listening". This also shows up (surprisingly) in the Snapshot Debugger because it tries to play nice with the Profiler and avoid taking snapshots when a detailed profile is being collected. Snapshot Collector says "not taking a snapshot because profiler is running" even though (through other telemetry) we know the profiler was NOT running. It's as if the ETW provider gets stuck on. Very recently, I had problems with the unit test in that area.
I took a look at the largest of the 3 dump files you shared and I see the same strings you saw. The way they're laid out in memory doesn't match the serialization format of a raw ETW event. These are just the field names -- there's no intervening data (and they happen to fit nicely in LFH blocks, so they're all gathered together). I do see other native heap blocks that might be the remaining pieces of the serialized payloads (the iKey, operationId and rootId).
So, I cooked up a repro running locally (not in Azure). In fact, it's a .NET Core 2.2 console app, x86 and all it does is use App Insights' Start/StopOperation in a tight loop (with a null telemetry channel). As soon as the ETW provider is enabled, then memory starts climbing. When the ETW provider is disabled (stop listening), memory stops climbing, but it is not freed. I was able to reproduce this behavior without even using Application Insights. So, this is looking like a .NET Core bug -- possibly a regression from 2.1. I will narrow down the repro and report it in the appropriate dotnet GitHub repo.
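A sketch of what such a console repro could look like (illustrative only, not the exact code from this comment; NullTelemetryChannel is a made-up discarding channel):

using System;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Null channel: drop everything so no data is actually sent anywhere.
class NullTelemetryChannel : ITelemetryChannel
{
    public bool? DeveloperMode { get; set; }
    public string EndpointAddress { get; set; }
    public void Send(ITelemetry item) { } // discard
    public void Flush() { }
    public void Dispose() { }
}

class Program
{
    static void Main()
    {
        var config = new TelemetryConfiguration
        {
            InstrumentationKey = "00000000-0000-0000-0000-000000000000",
            TelemetryChannel = new NullTelemetryChannel()
        };
        var client = new TelemetryClient(config);

        // Tight loop of operations. While an ETW session (or EventListener) has
        // the Operations keyword enabled, each Start/Stop pair emits ETW events
        // and native memory climbs; with the provider disabled, it stays flat.
        while (true)
        {
            var operation = client.StartOperation<RequestTelemetry>("FakeRequest");
            client.StopOperation(operation);
        }
    }
}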
Oh, and the reason you're seeing this in Azure is because of the Application Insights Profiler. It turns on the ETW provider for Application Insights during detailed profiling. That runs for 2 minutes at a time, so native memory will be leaked for every request or dependency that's processed in that 2 minute window.
Good find. I'll chase it down.
In your repro, have you looked at the unmanaged memory? Do you have the same "strings" as in my dumps? Now I wonder how we could trace what triggers these Start/Stop calls in production.
PS: I've already created an issue here; it seems it should be closed: https://github.com/aspnet/AspNetCore/issues/8648
My two cents to add: you've got a really cool repro. But I still don't understand one thing: we have AI ON in all our environments, but we observe the leak only in the production env. Any idea how we could try to repro the leak in other environments? Any setting or so?
In my repro, yes I looked at unmanaged memory and I see the strings from my event payloads -- the same pattern we see with the real Application Insights RichPayloadEventSource.
In your other environments do you have the same kind of load as in production? The amount of memory allocated depends on the number of Application Insights operations (Requests or Dependencies) logged. The more requests you have, the larger the leak.
These events get turned on by the Application Insights Profiler. Indeed, the best workaround for now for production would be to disable just the Profiler, leaving other Application Insights features on.
I'm still unable to repro this. I made an API app, added AI, deployed it to Azure, and ran http-ping. No leak :(
How did you "add AI"? Did you enable the Profiler? How long did you keep the app running before you created a memory dump? Note that the profiler runs 'randomly' for 2 minutes at a time about once an hour. Memory will start growing only during those 2 minute profile sessions and only if the site is active -- processing requests. If you need to control when the profiler runs, you can do "on-demand profiling" from within the Performance section of Application Insights -- although it's still not "instantaneous".
How did you "add AI"?
public static IWebHostBuilder CreateWebHostBuilder(string[] args) =>
    WebHost.CreateDefaultBuilder(args)
        .UseStartup<Startup>()
        .UseApplicationInsights("11111111-2222-3333-4444-555555555555");
Did you enable the Profiler?
I did enable it in the code and in the portal
How long did you keep the app running before you created a memory dump?
In my test I didn't; I just ran something like the command below for ~12 hours. It should produce about 4-5 requests per second, so for 2 minutes it is about 1200 requests.
http-ping.exe -t -i 0 https://yahorsieu.azurewebsites.net/api/values
And then I was looking at how memory grows. I didn't notice any big difference in private bytes behavior with or without the profiler enabled.
Just in case, the project is here: https://YahorSiOrg@dev.azure.com/YahorSiOrg/yahorsi/_git/yahorsi
and it can be pinged by curl-ing https://yahorsieu.azurewebsites.net/api/values
I have a repro in Antares App Service using a 5 minute load test with 250 users. It's a lot of steps now that I've written it up, but I think it's pretty simple. The repro doesn't use Application Insights Profiler directly; it starts/stops ETW sessions using the same APIs that Profiler uses.
set PID=[pid-of-w3wp.exe]
set VSD="%SystemDrive%\Program Files\Microsoft Visual Studio 15.0\Team Tools\DiagnosticsHub\Collector\VSDiagnostics.exe"
%VSD% start 42 /scratchLocation:"%LOCAL_EXPANDED%\Temp" /loadAgent:F5091AA9-80DC-49FF-A7CF-BD1103FE149D;ServiceProfilerAgent.dll
%VSD% update 42 /attach:%PID%
%VSD% postString 42 "AddProvider:a62adddb-6b4b-519d-7ba1-f983d81623e0:0x400:4" /agent:F5091AA9-80DC-49FF-A7CF-BD1103FE149D
Explanation: this starts a new profiling session in the sandbox, enabling only the Application Insights "data" event source with the "Operations" keyword at Informational level (that's the command in quotes on the 'postString' line). The session is now running. Run your load test, then stop the session:
%VSD% stop 42 /output:%LOCAL_EXPANDED%\Temp\my-load-test.diagsession
!heap -a
to show all the heap segments. (The leaked strings will probably be in the first heap -- the OS primary heap -- and you can use the memory window to look in the last 2 or 3 segments; there's a good chance you'll find them.)
Just adding a link to the coreclr bug: https://github.com/dotnet/coreclr/issues/23562
Sorry for disrupting :) I'm still trying to understand the problem :) In our environment, we see that memory grows evenly, without spikes. If the leak were caused by the AI profiler, then we should see a spike every hour for 2 minutes. So I wonder: could there be other ETW profilers that are always running?
You might be hitting #1108
Do you do anything in your startup code that might send telemetry even before you call "AddApplicationInsights"? You have to get into a situation where the RichPayloadEventSource is created before you call AddApplicationInsights or access TelemetryConfiguration.Active.
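For example (purely hypothetical code, not taken from your app), anything like this in Program.Main would send telemetry through TelemetryConfiguration.Active -- and touch RichPayloadEventSource -- before the host's Application Insights setup ever runs:

public static void Main(string[] args)
{
    // Hypothetical early telemetry: the parameterless TelemetryClient constructor
    // uses TelemetryConfiguration.Active, so this executes before
    // UseApplicationInsights()/AddApplicationInsights has configured anything.
    var earlyClient = new Microsoft.ApplicationInsights.TelemetryClient();
    earlyClient.TrackTrace("starting up");

    BuildWebHost(args).Run();
}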
I'm not sure; below is our host builder code. How are these 2 issues related? You mentioned that the leak occurs only when there is an ETW profiler attached and collecting events. That should be happening only for 2 minutes each hour, which should generate spikes in private bytes during those windows. But we do not see such spikes; we see that memory grows evenly. That is the mystery and the thing I still do not understand.
public static IWebHost BuildWebHost(string[] args)
{
    var host = new WebHostBuilder()
        .UseKestrel()
        .UseContentRoot(Directory.GetCurrentDirectory())
        .ConfigureServices(services => services.AddAutofac())
        .ConfigureAppConfiguration((hostingContext, config) =>
        {
            var hostingEnvironment = hostingContext.HostingEnvironment;
            config
                .AddJsonFile("appsettings.json", true, true)
                .AddJsonFile($"appsettings.{hostingEnvironment.EnvironmentName}.json", true, true)
                .AddEnvironmentVariables();
        })
        .ConfigureLogging((hostingContext, logging) =>
        {
            logging.AddConfiguration(hostingContext.Configuration.GetSection("Logging"));
            logging.AddDebug();
        })
        .UseIISIntegration()
        .UseApplicationInsights()
        .UseStartup<Startup>()
        .UseNLog() // NLog: set up NLog for dependency injection
        .Build();

    return host;
}
That looks pretty standard. Yes, if the App Insights Profiler were responsible for enabling RichPayloadEventSource, then you'd only see growth during the 2-minute collection intervals. If you've disabled the profiler and you're still seeing steady growth, then something else has enabled the RichPayloadEventSource. If it's not #1108, then the next thing I'd turn to is looking for other ETW sessions on the machine -- unfortunately, that's impossible to check from "within the sandbox". Contact me by email at {same-as-my-GitHub-name}@microsoft.com and I'll try to help.
@yahorsi we have the same climbing graphs as you; our app is very lean and doesn't do much.
We have just turned off the profiler in App Insights; we'll see what effect it has within the next 24 hours.
@pharring @yahorsi I am seeing the same memory climb over time as discussed above.
Analyzing a 2 GB debug dump of the process showed the .NET managed memory at ~200 MB; the rest is "native memory".
I have just turned OFF the App Insights Profiler; we'll see what effect it has on our memory in the next 24 hours.
It's been about 4 hours and the trend on the graph seems to be fixed; it's leveled out (and gone down a bit). The red line indicates when the profiler was turned off (and the app was restarted):
This shows me it's the App Insights Profiler being ON that caused the leak.
12 hours on, looks fixed with the profiler off (since it uses ETW):
@Plasma You are correct - if you turn profiler off, the memory issue goes away. There is a bug in .net core that causes a memory leak when ETW event sources are turned on. Profiler enables event sources in .net core, so turning off profiler prevents the problem. There will be a fix released soon by the .net core team.
How do we fix it? I noticed that if I use an app where the profiler was on for some time, memory leaks.
But if I create a fresh new app, memory seems stable as long as I don't touch the 'Application Insights' tab on the web app page. Can we recreate the app on prod and keep it that way, so the memory leak doesn't come back for us :)?
Also, I noticed our app crashes with out-of-memory at around 1.8 GB of memory used, as if it were a 32-bit app. But the fresh app that I used to reproduce from scratch goes up to the S3 web app limit, which is 7 GB. It also restarts more gracefully, without exceptions. Any hints on this? Do we have another problem that causes this?
@AshasTob On the App Insights area, turn off the profiler.
@Plasma, I did so. Memory behaves much better, but still not optimally. It still grows, though very slowly. I have the app deployed twice to different App Services with only one difference: the first has .UseApplicationInsights(key) and the second does not. See the memory charts below.
This is the app that has 110 requests per second and no Application Insights.
Same app: first with the profiler ON, and then with the profiler OFF. I am concerned that memory still grows, slowly. (The last spike is smaller, as I restarted the app to turn off the profiler in Azure.)
This is my repo, just in case: https://github.com/AshasTob/MemoryLeakRepro
At a cursory glance, I suspect that’s general .net memory use climbing due to some internal caches, etc.
Or it could be something in your application code.
I suspect your main leak is fixed due to the profiler being off, and now you're noticing a smaller, potential memory leak that is more related to your code.
If you see it still trending up after 24-48 hours, via the app service diagnostics area you can choose “memory dump” (note: this will freeze your site!) and then you can analyse it to see what’s holding onto memory.
The profiler leak above was confusing because it never showed up as .NET (managed) memory.
Welcome back issue 1102 :) we missed you.
My account was flagged by mistake by GitHub's spam detection system, so the issue seems to have been hidden, and now it's back.
BTW, I have another issue with IntelliCode and most likely Application Insights, but now in VS (latest). What's the best channel to report it?
@yahorsi Thanks for the explanation. For the IntelliCode issue, I suggest using the Feedback button from the Visual Studio IDE to report it.
Sorry, it is IntelliTrace :) Anyway, I'll first try the VS channel.
This issue is stale because it has been open 300 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Hi Guys!
We have .NET Core 2.2 app(s) running in Azure with AppInsights enabled. In all our apps we see that private bytes grow significantly, up until the moment the app stops responding.
We've made several memory dumps, and from the dumps it looks like there are NO problems with the managed heap. But the UNmanaged heap grows.
After !heap -stat -h for the heap that grows, we found a suspect block size of 6e (110 bytes) that uses 50% of the heap. And after !heap -flt s we got a list of entries on the heap. When we look at the memory for every entry, we see a few strings: Operation IKey Id Name RootID
in the order I described them above. It looks like some serialized objects. So, the question is the following: could it be AppInsights using the unmanaged heap and leaking for some reason? Or maybe AppInsights writes to some unmanaged object like a pipe/socket or anything else that might allocate unmanaged memory.
Thanks