Configuration Guidelines or Suggestions for profiling in high-density environments

esbenbach commented 2 years ago

We have a series of services running in Service Fabric.

To get the most out of the environment, each Fabric Node (VM) is running a "large number" of services, each possibly running multiple replicas.

In production all of these services and processes use the same App Insights instance.

When we deploy new versions, it's rather common that a series of processes are started at the exact same time, causing the profiler to consume a significant portion of memory in a "synchronized" manner (i.e. they all profile at the same time on every sampling interval).

Our application is mostly memory-bound. Each process by itself does not use much memory, but because of the density we constantly see memory use of 80% or more.

We can move the trigger threshold, but it's not really helping, because when the threshold is crossed, ALL the services start profiling, causing even more memory use.

Are there any recommendations for handling such a high density scenario? Ideally we would like to have a per process memory consumption trigger or have it triggered by the SF governance/health subsystem.

Secondly, it would be really awesome if the sampling were randomized, especially for the initial session, as it is a bit unfortunate to have it all running at the same time during deployments.

The obvious answer is to have enough memory to handle the issue, but that leaves us with a "gap" of unused memory most of the time.

I imagine that there are similar issues with CPU usage being the main driver.

xiaomi7732 commented 2 years ago

Hi @esbenbach Thanks for posting the question. Help us understand the environment a bit more:

Is your app running on Windows?
Are you using the Microsoft.ApplicationInsights.Profiler.AspNetCore NuGet package for profiling, or are you using the Service Fabric Profiler extension?

esbenbach commented 2 years ago

The app is running on Windows, but my question is of a general nature: to understand what we can/can't do (and what we should do).

Currently we instrument the aspnet core services using the asp profiler nuget package, but to get the non-aspnet core services included we are also using the SF Profiler Extension.

xiaomi7732 commented 2 years ago

Hi @esbenbach Thanks for the reply. Generally, it is recommended to avoid profiling multiple instances at the same time.

For the NuGet package, there are several customizations that could help:

InitialDelay: Delay before starting the very first session after the application starts up.
- Setting it to a random number of minutes can keep the first profiling sessions from all starting at the same time.
IsDisabled: Kill switch to turn off Profiler.
- Set it to true under conditions of your choosing to reduce the number of replicas that have the profiler enabled.

For example, in the initialization code:

services.AddServiceProfiler(opt => {
    // Write a helper that generates a random TimeSpan in minutes
    opt.InitialDelay = GetRandomTimeSpan(2, 5);

    // Pick some condition that disables the profiler on some instances but keeps it on for the others, e.g. a hash of cloud_RoleInstance.
    if ( someCondition ) {
        opt.IsDisabled = true;
    }
});
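
One way the random-delay helper and the disable condition could be filled in is sketched below. This is only an illustration, not part of the Profiler package: the class name ProfilerStagger, both method names, and the "1 in N" selection scheme are hypothetical, and the instance name is assumed to be any stable per-process identifier that you already have available (for example a role instance name or replica ID). It also assumes .NET 6 or later for Random.Shared.

```csharp
using System;

// Hypothetical helpers, not part of the Application Insights Profiler package.
public static class ProfilerStagger
{
    // Returns a random delay of at least minMinutes and less than maxMinutes,
    // so processes started together do not all begin their first session at once.
    public static TimeSpan GetRandomTimeSpan(int minMinutes, int maxMinutes)
    {
        if (minMinutes < 0 || maxMinutes <= minMinutes)
        {
            throw new ArgumentOutOfRangeException(nameof(maxMinutes), "Expected 0 <= minMinutes < maxMinutes.");
        }

        double minutes = minMinutes + Random.Shared.NextDouble() * (maxMinutes - minMinutes);
        return TimeSpan.FromMinutes(minutes);
    }

    // Returns true when the profiler should be disabled for this instance.
    // Roughly 1 in `oneIn` instance names keeps the profiler enabled.
    // string.GetHashCode() is randomized per process on .NET Core and later,
    // so a simple FNV-1a hash is used to get the same answer across restarts.
    public static bool ShouldDisableProfiler(string instanceName, int oneIn)
    {
        if (instanceName is null) throw new ArgumentNullException(nameof(instanceName));
        if (oneIn < 1) throw new ArgumentOutOfRangeException(nameof(oneIn), "Expected oneIn >= 1.");

        uint hash = 2166136261;
        unchecked
        {
            foreach (char c in instanceName)
            {
                hash ^= c;
                hash *= 16777619;
            }
        }

        return hash % (uint)oneIn != 0;
    }
}
```

Inside the AddServiceProfiler callback this could then be used as opt.InitialDelay = ProfilerStagger.GetRandomTimeSpan(2, 5) and opt.IsDisabled = ProfilerStagger.ShouldDisableProfiler(instanceName, 4). Hashing a stable name rather than picking at random means the same instances stay enabled across restarts, which may or may not be what you want.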

Refer to Configurations.md for more info. Let us know.

esbenbach commented 2 years ago

Okay that seems reasonable. So far we have mostly been using the portal settings. We can switch to use a code based configuration, thus granting us the needed control.

Given that we are running on 1 App Insights instance for multiple regular .NET 6 services (i.e. not ASP.NET), and we can't configure the SF extension with a random delay, is there anything to do but disable the profiler for these and enable it "on demand"?

Also, feedback for the docs would be to write down stuff like: don't profile multiple processes at the same time. Thinking about the statement, it makes perfect sense, but it's not something that comes to mind immediately.

esbenbach commented 2 years ago

Since we are running everything inside SF, should we just stop using the NuGet package altogether? And just let the SF profiler handle things? And will it take care to not launch the profiler multiple times even though we have multiple services running when a memory threshold is crossed for instance?

xiaomi7732 commented 2 years ago

Good questions. The answer is a bit complex:

As of today, using the SF profiler won't help much with over-profiling, because of the app architecture: multiple applications/instances report to the same Application Insights resource. Portal settings apply at the Application Insights resource level, so all apps that report to the same resource use the same settings. Triggers, which are based on performance counters, are also global for Windows applications.

The only benefit is that you won't have multiple profilers on the same VM.

It is best if you could break the telemetry down into multiple Application Insights resources.

If you can't change your application's architecture, the NuGet package + configuration is something that could work today.

Stepping back a bit: when picking which profiler to use, please also consider how useful the trace file is. The SF profiler generates ETL and profiles the whole machine (multiple processes); you can download the ETL file for deep analysis. The NuGet package generates a more lightweight netperf/nettrace file, which contains only your app's process info. The analysis could be quicker and more effective under some circumstances.

All that said, we have an effort going on to provide control at the role level or even the instance level. It is made possible by a backend change we made recently. I don't have a good ETA on that; let me know if you are interested in finding out more about it.

esbenbach commented 2 years ago

In terms of profiling "machine level" vs "process level", I am pretty sure our answer is "it depends".

But I am inclined to believe that process level profiling would be preferable as we can possibly see time spent in reaching out to (and waiting for) external dependencies should that be an issue in itself.

Re-arranging the App Insights structure would probably make sense now that we have Azure Monitor and workspace based logging. However we started out with separate App Insights instances, and it was a nightmare to find anything useful ;)

I would like to be kept in the loop about role/instance level configuration. Given that the profiler can already be added for AspNetCore services and we are using .NET for the other services, we would partially be able to solve our issue by having a profiler NuGet package that works with any .NET Generic Host implementation (but only partially, because the memory/CPU threshold on machine level still makes absolutely no sense from a single-process view).

Feel free to close the issue if you want, I think I have the answers I needed.

xiaomi7732 commented 2 years ago

Thanks for the feedback. We will post more info in this thread about the effort that I believe will address the trigger issue you mentioned.

esbenbach commented 1 year ago

@xiaomi7732 Any updates on the role/instance level configuration? Or are we still on the path of having to split into multiple insights instances?

Seems like a rather large overhead (managing multiple App Insights resources) just to be able to configure the triggers on a per process/replica basis.

Basically, if we were to take this to production and have our processes individually configured for triggering, we would currently have 99 App Insights instances. Taking it out to the primary replica level (which I would never even consider), we would end up with somewhere around 160 instances.

Realistically we could probably come up with some scheme where we group processes of a specific type and then group by that instead of logical/domain based roles. Regardless, it seems "wrong" :) (yes that's vague, sorry)

xiaomi7732 commented 1 year ago

Hey @esbenbach, sorry, there's not much progress on this feature. I understand that managing that many instances is unrealistic. When time permits, we are probably going to deliver the feature agent by agent. Which Profiler did you decide to use? The EventPipe one (this NuGet package) or the SF profiler?

esbenbach commented 1 year ago

We ended up using the "SF Profiler".

Mostly because we would otherwise need to take dependencies on ASP.NET Core in our non-ASP.NET processes.

I may add that the IaaSDiagnosticsExtension profiler thing is largely undocumented, but I'm guessing it works for stuff that's NOT Service Fabric related.

And lastly, since "everything else" is migrating to the Azure Monitor Agent, it would be just dandy if we could put this profiler setup into the same category (though given that this is a profiler and not a simple data collection thing, that may be more difficult to achieve).

xiaomi7732 commented 1 year ago

Thanks for the info. That makes sense. At this point, I wish I had a timeline for you, but I don't. I will update you once I hear anything.

microsoft / ApplicationInsights-Profiler-AspNetCore

Configuration Guidelines or Suggestions for profiling in high-density environments #180