Closed: yahorsi closed this issue 3 years ago.
The suspect is C:\Dev\ApplicationInsights-dotnet\src\Microsoft.ApplicationInsights\Extensibility\Implementation\RichPayloadEventSource.TelemetryHandler.cs
Here are the fields, in the exact order I see them in the dump memory:
var writeMethod = writeGenericMethod.MakeGenericMethod(new { IKey = (string)null, Id = (string)null, Name = (string)null, RootId = (string)null }.GetType());
The only one I can't see is Operation at the beginning. What does this code do, and what happens when it is called many times?
@yahorsi Thanks for the report.
That code path (MakeGenericMethod) is for .NET 4.5 only. You said your app is .NET Core 2.2. So, if that code path is being hit, you've somehow pulled in the .NET 4.5 flavor of Application Insights.
However, there is another version of RichPayloadEventSource that will be used on other platforms (.NET 4.6+ and .NET Core) that doesn't use reflection but will also serialize telemetry in the way you described. That's more likely to be the culprit.
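Roughly, both flavors boil down to a self-describing EventSource write where the anonymous type's property names (IKey, Id, Name, RootId) end up in the serialized payload. A minimal sketch follows -- the source name, event name, keyword value, and method shape are illustrative, not the SDK's actual code:

using System.Diagnostics.Tracing;

// Minimal sketch only: not the real RichPayloadEventSource code.
[EventSource(Name = "Demo-RichPayload")]
sealed class DemoRichPayloadEventSource : EventSource
{
    public static readonly DemoRichPayloadEventSource Log = new DemoRichPayloadEventSource();

    [NonEvent]
    public void OperationStart(string iKey, string id, string name, string rootId)
    {
        if (!IsEnabled()) return; // nothing is serialized unless a listener/session is attached

        var options = new EventSourceOptions
        {
            Keywords = (EventKeywords)0x400,   // "Operations"-style keyword (value assumed here)
            Opcode = EventOpcode.Start,
            Level = EventLevel.Informational
        };

        // Write<T> serializes the anonymous type's property names along with the values,
        // which is why IKey/Id/Name/RootId show up as strings in the payload.
        Write("Operation", options, new { IKey = iKey, Id = id, Name = name, RootId = rootId });
    }
}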
As for why they're piling up in memory, I don't know. These are ETW events (an EventSource) and they should be emitted only when someone is listening (an EventListener) with the right keywords. In this case, these are "Operations" -- requests or dependencies -- things that have duration; they are recorded as pairs of ETW events with opcodes Start and Stop. They are usually used for performance measurement and profiling.
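For illustration, an in-process listener that enables only that kind of keyword could look like this (a sketch; the 0x400/Informational values mirror the provider arguments shown later in this thread, and the source name refers to the demo sketch above, not the SDK's real definitions):

using System;
using System.Diagnostics.Tracing;

// Sketch: enable only the "Operations"-style keyword on the demo source above.
sealed class OperationsListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Demo-RichPayload")
            EnableEvents(source, EventLevel.Informational, (EventKeywords)0x400);
    }

    protected override void OnEventWritten(EventWrittenEventArgs data)
    {
        // Start/Stop pairs for one operation arrive as two separate events.
        Console.WriteLine($"{data.EventName} opcode={data.Opcode}");
    }
}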
The Application Insights Profiler does use these events, and it will enable ETW listeners for the Operations keyword. So that's the most likely reason we're going down this code path. The serialization of these objects to native memory happens inside the EventSource code (in corefx) -- inside WriteEvent.
(I'll have to continue this later... I'm already running late for something else. But I do think there might be something there... We need a good repro of the memory growth. If you can get one outside of Azure, that would also help eliminate a large complicating factor.)
I wonder if ETW/PerfView guru @vancem could give us some advice on how to investigate this further. Currently the issue cannot be reproduced locally; it only reproduces on some of our Azure environments.
We have 3 dumps of the same process (x86, dotnet.exe) taken at different points in time.
The dumps can be found here: https://drive.google.com/file/d/1otuVIB_ZX0CTptkFENOYkPv1jk4IwR5e/view?usp=sharing
@pharring are you on this?
@yahorsi perhaps you can tell what environment and settings you are running with. Even the best of us need some help narrowing down scope before reading dumps =)
By environment I mean: is Application Insights Profiler enabled, for instance? Any custom configuration for Application Insights? Windows version?
.NET Core 2.2.3 running as an Azure App Service. Azure uses the x86 .NET Core runtime, so it is an x86 process running, I believe, on x64 Windows. The Windows version is hard to say; the dump might say more, but again, tomorrow :) Private bytes grow to 1.7 GB and then the app gets restarted (or crashes), not sure how that works.
Windows version?
The one Azure uses :) We do not install or manage it.
By environment I mean: is Application Insights Profiler enabled, for instance?
It should be enabled now. I'll make more screenshots tomorrow morning.
What version of Application Insights? The charts look like the CPU charts here: https://github.com/Microsoft/ApplicationInsights-aspnetcore/issues/690
I believe that was a memory leak too.
CC: @cijothomas
I believe that was a memory leak too
It is definitely a leak. The question is where :) I did my best to catch it in the managed world and failed. As far as I can tell, the managed heap looks OK and is not increasing nearly as much as private bytes grow. As I understand it, we have something that causes the unmanaged heap to grow.
But I might be mistaken.
@yahorsi so what version of the AI SDK do you have? As I said, we fixed a leak in HttpClient in the past -- see the associated issue. And I believe that issue was leaking HTTP connections, which is also unmanaged memory.
From the dump: Windows 10 Version 14393 MP (2 procs) Free x86 compatible; Product: Server, suite: TerminalServer DataCenter SingleUserTS; 10.0.14393.2430 (rs1_release_inmarket_aim.180806-1810)
Microsoft.ApplicationInsights.dll version 2.08.1.22898
Microsoft.ApplicationInsights.AspNetCore.dll version 2.05.1.0
Sorry to leave you hanging. I've been heads down all day on other things. Thanks for all the extra info, @yahorsi.
The extra wrinkle here is how ETW events are captured from within the App Service sandbox. We (App Insights Profiler) have to go through a special service to enable/disable ETW providers. As far as I know, though, that shouldn't cause events to pile up in the sending process. EventSource's WriteEvent code passes pinned memory to the equivalent Win32 API.
I will run some experiments locally to see if I can reproduce the problem. An ASP.NET Core 2.2 app with App Insights 2.8.1 (base) and 2.5.1 (AspNetCore). I'll use a "no-op" telemetry sink, turn on the profiler's listener and blast the app with requests to see if it blows up. If it does, I would suspect something in .NET Core 2.2 (corefx) because this stuff has been working fine for years on .NET FX and earlier versions of .NET Core.
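By a "no-op" telemetry sink I mean something along these lines (a rough sketch only; NullChannel is a made-up name, and this assumes the AspNetCore SDK honors an ITelemetryChannel registered before AddApplicationInsightsTelemetry):

using Microsoft.ApplicationInsights.Channel;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

// A channel that drops every item, so the test isolates in-process
// behavior (ETW serialization) from actual ingestion.
public class NullChannel : ITelemetryChannel
{
    public bool? DeveloperMode { get; set; }
    public string EndpointAddress { get; set; }
    public void Send(ITelemetry item) { } // discard
    public void Flush() { }
    public void Dispose() { }
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Register the channel before AddApplicationInsightsTelemetry so the
        // SDK picks it up instead of its default channel.
        services.AddSingleton<ITelemetryChannel, NullChannel>();
        services.AddApplicationInsightsTelemetry("11111111-2222-3333-4444-555555555555");
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseMvc();
    }
}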
One more thing that may or may not be relevant: I have had a couple of anecdotal reports that the RichPayloadEventSource code can get enabled even though "no-one is listening". This also shows up (surprisingly) in the Snapshot Debugger because it tries to play nice with the Profiler and avoid taking snapshots when a detailed profile is being collected. Snapshot Collector says "not taking a snapshot because profiler is running" even though (through other telemetry) we know the profiler was NOT running. It's as if the ETW provider gets stuck on. Very recently, I had problems with the unit test in that area.
I took a look at the largest of the 3 dump files you shared and I see the same strings you saw. The way they're laid out in memory doesn't match the serialization format of a raw ETW event. These are just the field names -- there's no intervening data (and they happen to fit nicely in LFH blocks, so they're all gathered together). I do see other native heap blocks that might be the remaining pieces of the serialized payloads (the iKey, operationId and rootId).
So, I cooked up a repro running locally (not in Azure). In fact, it's a .NET Core 2.2 console app, x86 and all it does is use App Insights' Start/StopOperation in a tight loop (with a null telemetry channel). As soon as the ETW provider is enabled, then memory starts climbing. When the ETW provider is disabled (stop listening), memory stops climbing, but it is not freed. I was able to reproduce this behavior without even using Application Insights. So, this is looking like a .NET Core bug -- possibly a regression from 2.1. I will narrow down the repro and report it in the appropriate dotnet GitHub repo.
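A sketch of what such a console repro could look like (illustrative only, not the exact code from this comment; NullTelemetryChannel is a made-up discarding channel):

using System;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Null channel: drop everything so no data is actually sent anywhere.
class NullTelemetryChannel : ITelemetryChannel
{
    public bool? DeveloperMode { get; set; }
    public string EndpointAddress { get; set; }
    public void Send(ITelemetry item) { } // discard
    public void Flush() { }
    public void Dispose() { }
}

class Program
{
    static void Main()
    {
        var config = new TelemetryConfiguration
        {
            InstrumentationKey = "00000000-0000-0000-0000-000000000000",
            TelemetryChannel = new NullTelemetryChannel()
        };
        var client = new TelemetryClient(config);

        // Tight loop of operations. While an ETW session (or EventListener) has
        // the Operations keyword enabled, each Start/Stop pair emits ETW events
        // and native memory climbs; with the provider disabled, it stays flat.
        while (true)
        {
            var operation = client.StartOperation<RequestTelemetry>("FakeRequest");
            client.StopOperation(operation);
        }
    }
}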
Oh, and the reason you're seeing this in Azure is because of the Application Insights Profiler. It turns on the ETW provider for Application Insights during detailed profiling. That runs for 2 minutes at a time, so native memory will be leaked for every request or dependency that's processed in that 2 minute window.
Good find. I'll chase it down.
In your repro, have you looked at the unmanaged memory? Do you have the same "strings" as in my dumps? Now I wonder how we could trace what triggers these Start/Stop calls in production.
PS: I've already created an issue here; it seems it should be closed: https://github.com/aspnet/AspNetCore/issues/8648
My two cents to add: you've got a really cool repro. But I still don't understand one thing: we have AI ON in all our environments, but we observe the leak only in the production env. Any idea how we could try to repro the leak in other environments? Any setting or so?
In my repro, yes I looked at unmanaged memory and I see the strings from my event payloads -- the same pattern we see with the real Application Insights RichPayloadEventSource.
In your other environments do you have the same kind of load as in production? The amount of memory allocated depends on the number of Application Insights operations (Requests or Dependencies) logged. The more requests you have, the larger the leak.
These events get turned on by the Application Insights Profiler. Indeed, the best workaround for now for production would be to disable just the Profiler, leaving other Application Insights features on.
I'm still unable to repro this. I made an API app, added AI, deployed it to Azure, and ran http-ping. No leak :(
How did you "add AI"? Did you enable the Profiler? How long did you keep the app running before you created a memory dump? Note that the profiler runs 'randomly' for 2 minutes at a time about once an hour. Memory will start growing only during those 2 minute profile sessions and only if the site is active -- processing requests. If you need to control when the profiler runs, you can do "on-demand profiling" from within the Performance section of Application Insights -- although it's still not "instantaneous".
How did you "add AI"?
public static IWebHostBuilder CreateWebHostBuilder(string[] args) =>
    WebHost.CreateDefaultBuilder(args)
        .UseStartup<Startup>()
        .UseApplicationInsights("11111111-2222-3333-4444-555555555555");
Did you enable the Profiler?
I did enable it in the code and in the portal
How long did you keep the app running before you created a memory dump?
In my test I didn't; I just ran something like the command below for ~12 hours. It should produce about 4-5 requests per second, so for 2 minutes it is about 1200 requests.
http-ping.exe -t -i 0 https://yahorsieu.azurewebsites.net/api/values
And then I was looking at how memory grows. I didn't notice any big difference in private bytes behavior with or without the profiler enabled.
Just in case, the project is here: https://YahorSiOrg@dev.azure.com/YahorSiOrg/yahorsi/_git/yahorsi
and it can be pinged by curl-ing https://yahorsieu.azurewebsites.net/api/values
I have a repro in Antares App Service using a 5 minute load test with 250 users. It's a lot of steps now that I've written it up, but I think it's pretty simple. The repro doesn't use Application Insights Profiler directly; it starts/stops ETW sessions using the same APIs that Profiler uses.
set PID=[pid-of-w3wp.exe]
set VSD="%SystemDrive%\Program Files\Microsoft Visual Studio 15.0\Team Tools\DiagnosticsHub\Collector\VSDiagnostics.exe"
%VSD% start 42 /scratchLocation:"%LOCAL_EXPANDED%\Temp" /loadAgent:F5091AA9-80DC-49FF-A7CF-BD1103FE149D;ServiceProfilerAgent.dll
%VSD% update 42 /attach:%PID%
%VSD% postString 42 "AddProvider:a62adddb-6b4b-519d-7ba1-f983d81623e0:0x400:4" /agent:F5091AA9-80DC-49FF-A7CF-BD1103FE149D
Explanation: this starts a new profiling session in the sandbox, enabling only the Application Insights "data" event source with the "Operations" keyword at Informational level (that's the command in quotes on the 'postString' line). The session is now running. Run your load test, then stop the session:
%VSD% stop 42 /output:%LOCAL_EXPANDED%\Temp\my-load-test.diagsession
!heap -a
to show all the heap segments. (The leaked strings will probably be in the first heap -- the OS primary heap -- and you can use the memory window to look in the last 2 or 3 segments; there's a good chance you'll find them.)
Just adding a link to the coreclr bug: https://github.com/dotnet/coreclr/issues/23562
Sorry for disrupting :) I'm still trying to understand the problem :) In our environment, we see that memory grows evenly, without spikes. If the leak were caused by the AI profiler, then we should see a spike every hour for 2 minutes. So I wonder: could there be other ETW profilers that are always running?
You might be hitting #1108
Do you do anything in your startup code that might send telemetry even before you call "AddApplicationInsights"? You have to get into a situation where the RichPayloadEventSource is created before you call AddApplicationInsights or access TelemetryConfiguration.Active.
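For example (purely hypothetical code, not taken from your app), anything like this in Program.Main would send telemetry through TelemetryConfiguration.Active -- and touch RichPayloadEventSource -- before the host's Application Insights setup ever runs:

public static void Main(string[] args)
{
    // Hypothetical early telemetry: the parameterless TelemetryClient constructor
    // uses TelemetryConfiguration.Active, so this executes before
    // UseApplicationInsights()/AddApplicationInsights has configured anything.
    var earlyClient = new Microsoft.ApplicationInsights.TelemetryClient();
    earlyClient.TrackTrace("starting up");

    BuildWebHost(args).Run();
}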
I'm not sure; below is our host builder code. How are these 2 issues related? You mentioned that the leak occurs only when there is an ETW profiler attached and collecting events. That should be happening only for 2 minutes each hour, which should generate spikes in private bytes during those windows. But we do not see such spikes; we see that memory grows evenly. That is the mystery and the thing I still do not understand.
public static IWebHost BuildWebHost(string[] args)
{
    var host = new WebHostBuilder()
        .UseKestrel()
        .UseContentRoot(Directory.GetCurrentDirectory())
        .ConfigureServices(services => services.AddAutofac())
        .ConfigureAppConfiguration((hostingContext, config) =>
        {
            var hostingEnvironment = hostingContext.HostingEnvironment;
            config
                .AddJsonFile("appsettings.json", true, true)
                .AddJsonFile($"appsettings.{hostingEnvironment.EnvironmentName}.json", true, true)
                .AddEnvironmentVariables();
        })
        .ConfigureLogging((hostingContext, logging) =>
        {
            logging.AddConfiguration(hostingContext.Configuration.GetSection("Logging"));
            logging.AddDebug();
        })
        .UseIISIntegration()
        .UseApplicationInsights()
        .UseStartup<Startup>()
        .UseNLog() // NLog: set up NLog for dependency injection
        .Build();

    return host;
}
That looks pretty standard. Yes, if the App Insights Profiler were responsible for enabling RichPayloadEventSource, then you'd only see growth during the 2-minute collection intervals. If you've disabled the profiler and you're still seeing steady growth, then something else has enabled the RichPayloadEventSource. If it's not #1108, then the next thing I'd turn to is looking for other ETW sessions on the machine -- unfortunately, that's impossible to check from "within the sandbox". Contact me by email at {same-as-my-GitHub-name}@microsoft.com and I'll try to help.
@yahorsi we have the same climbing graphs as you; our app is very lean and doesn't do much.
We have just turned off the profiler in App Insights; we'll see what effect it has within the next 24 hours.
@pharring @yahorsi I am seeing the same memory climb over time as discussed above.
Analyzing a 2 GB debug dump of the process showed the .NET managed memory at ~200 MB; the rest is "native memory".
I have just turned OFF the App Insights Profiler; we'll see what effect it has on our memory in the next 24 hours.
It's been about 4 hours and the trend on the graph seems to be fixed; it's leveled out (and gone down a bit). The red line indicates when the profiler was turned off (and the app was restarted):
This shows me it's the App Insights Profiler being ON that caused the leak.
12 hours on, looks fixed with the profiler off (since it uses ETW):
@Plasma You are correct - if you turn profiler off, the memory issue goes away. There is a bug in .net core that causes a memory leak when ETW event sources are turned on. Profiler enables event sources in .net core, so turning off profiler prevents the problem. There will be a fix released soon by the .net core team.
How do we fix it? I noticed that if I use an app where the profiler was on for some time, memory leaks.
But if I create a fresh new app, memory seems stable as long as I don't touch the 'Application Insights' tab on the web app page. Can we recreate the app on prod and keep it that way, so the memory leak doesn't come back for us :)?
Also, I noticed our app crashes with out-of-memory at around 1.8 GB of memory used, as if it were a 32-bit app. But the fresh app that I used to reproduce from scratch goes up to the S3 web app limit, which is 7 GB. It also restarts more gracefully, without exceptions. Any hints on this? Do we have another problem that causes this?
@AshasTob On the App Insights area, turn off the profiler.
@Plasma, I did so. Memory behaves much better, but still not optimally. It still grows, though very slowly. I have the app deployed twice to different App Services with only one difference: the first has .UseApplicationInsights(key) and the second does not. See the memory charts below.
This is the app that has 110 requests per second and no Application Insights.
Same app: first with the profiler ON, and then with the profiler OFF. I am concerned that memory still grows, slowly. (The last spike is smaller, as I restarted the app to turn off the profiler in Azure.)
This is my repo, just in case: https://github.com/AshasTob/MemoryLeakRepro
At a cursory glance, I suspect that’s general .net memory use climbing due to some internal caches, etc.
Or it could be something in your application code.
I suspect your main leak is fixed due to the profiler being off, and now you're noticing a smaller, potential memory leak that is more related to your code.
If you see it still trending up after 24-48 hours, via the app service diagnostics area you can choose “memory dump” (note: this will freeze your site!) and then you can analyse it to see what’s holding onto memory.
The profiler leak above was confusing because it never showed up as .NET (managed) memory.
Welcome back issue 1102 :) we missed you.
My account was flagged by mistake by GitHub's spam detection system, so the issue seems to have been hidden, and now it's back.
BTW, I have another issue with IntelliCode and most likely Application Insights, but now in VS (latest). What's the best channel to report it?
@yahorsi Thanks for the explanation. For the IntelliCode issue, I suggest using the Feedback button from the Visual Studio IDE to report it.
Sorry, it is IntelliTrace :) Anyway, I'll first try the VS channel.
This issue is stale because it has been open 300 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Hi Guys!
We have .NET Core 2.2 app(s) running in Azure with AppInsights enabled. In all our apps we see that private bytes grow significantly, up until the moment the app stops responding.
We've made several memory dumps, and from the dumps it looks like there are NO problems with the managed heap. But the UNmanaged heap grows.
After !heap -stat -h for the heap that grows, we found a suspect block size of 6e (110 bytes) that uses 50% of the heap. And after !heap -flt s we got a list of entries on the heap. When we look at the memory for every entry, we see a few strings: Operation IKey Id Name RootID
in the order I described them above. It looks like some serialized objects. So, the question is the following: could it be AppInsights using the unmanaged heap and leaking for some reason? Or maybe AppInsights writes to some unmanaged object like a pipe/socket or anything else that might allocate unmanaged memory.
Thanks