microsoft / ApplicationInsights-Profiler-AspNetCore

Application Insights Profiler sample and documentation
MIT License
66 stars 22 forks source link

No profiler upload on net6 in AKS #172

Closed CodedBeard closed 2 years ago

CodedBeard commented 2 years ago

We've been trying to get the profiler to work for our aspnet core app running in AKS. While forcing a profiling session from the portal appears to correctly start the tracing, the results are never uploaded to the portal, and no errors are recorded anywhere that we can find.

If we exec into the container, we can see there is a TraceUpload.zip file being created in the container, but there are no logs to suggest why the upload keeps timing out.

Versions Used:

SDK: Net6
Container: mcr.microsoft.com/dotnet/aspnet:6.0.0-bullseye-slim-amd64
Nugets

<PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.19.0" />
<PackageReference Include="Microsoft.ApplicationInsights.Kubernetes" Version="2.0.3" />
<PackageReference Include="Microsoft.ApplicationInsights.Profiler.AspNetCore" Version="2.4.0-beta1" />

We also tried using the 2.3.1 version of the package, but that had the same issue.

xiaomi7732 commented 2 years ago

@CodedBeard, thank you for filing the issue. I am sorry to hear that Profiler's not working for you.

The packages you referenced should work with .NET 6. That said, there are various reasons to not having any profiling trace. I'd like to know more for troubleshooting.

Just for some quick check:

Firstly, during the profiling session, is there traffic to your application? No traffic no profile :-) Secondly, do you have a writable folder on your container for uploader to work? We assume /tmp to be writable but there had been exceptions and we need special settings for those read-only environments.

If those don't hit, we will need logs to tell us what is going on.

Could you please turn on logs for category of Microsoft.ApplicationInsights.Profiler and see what is missing?

Just in case, one quick way to do it is, in appsettings.json, you add/append:

{
...
  "Logging": {
    "LogLevel": {
      "Default": "...",  // You might have some other settings
      "Microsoft.ApplicationInsights.Profiler": "Debug" // Turn on debug logs for Profiler
    }
  }
...
}

Refer to Configurations for more details on configuration. I would suggest you start with Debug and if necessary, use Trace.

You should see information like Profiler started, activity captured, and when the session is ready for uploading, how is uploader located and more. See if there's error or warnings.

If you want, post the logs and I can help take a look as well.

CodedBeard commented 2 years ago

Thanks for the tips @xiaomi7732.

Yes, there is definetly traffic hitting the system, as we are in the process of testing a large new release :)

I've got to go to a conference today, but will give the other suggestions a try when im back.

xiaomi7732 commented 2 years ago

Please let us know.

CodedBeard commented 2 years ago

@xiaomi7732 I've confirmed that /tmp is writtable within the containers.

After enabling the Debug logging level, we've found the following exception being thrown:

{"EventId":0,"LogLevel":"Debug","Category":"Microsoft.ApplicationInsights.Profiler.Core.Orchestration.RemoteProfilerSettingsService","Message":"Unexpected error contacting service profiler service endpoint for settings. Details: System.ArgumentNullException: Value cannot be null. (Parameter \u0027command\u0027)\n   at Microsoft.ServiceProfiler.DataContract.Settings.SettingsParser.ParseAgentSettings(String command, AgentSettings baseSettings)\n   at Microsoft.ServiceProfiler.DataContract.Settings.SettingsParser.ParseManyAgentSettings(String[] commands, AgentSettings baseSettings)\n   at Microsoft.ApplicationInsights.Profiler.Core.Contracts.ProfilerSettings.SetFromSettingsContract(SettingsContract settingsContract)\n   at Microsoft.ApplicationInsights.Profiler.Core.Orchestration.RemoteProfilerSettingsService.FetchRemoteSettingsAsync(CancellationToken cancellationToken)","State":{"Message":"Unexpected error contacting service profiler service endpoint for settings. Details: System.ArgumentNullException: Value cannot be null. (Parameter \u0027command\u0027)\n   at Microsoft.ServiceProfiler.DataContract.Settings.SettingsParser.ParseAgentSettings(String command, AgentSettings baseSettings)\n   at Microsoft.ServiceProfiler.DataContract.Settings.SettingsParser.ParseManyAgentSettings(String[] commands, AgentSettings baseSettings)\n   at Microsoft.ApplicationInsights.Profiler.Core.Contracts.ProfilerSettings.SetFromSettingsContract(SettingsContract settingsContract)\n   at Microsoft.ApplicationInsights.Profiler.Core.Orchestration.RemoteProfilerSettingsService.FetchRemoteSettingsAsync(CancellationToken cancellationToken)","0":"System.ArgumentNullException: Value cannot be null. (Parameter \u0027command\u0027)\n   at Microsoft.ServiceProfiler.DataContract.Settings.SettingsParser.ParseAgentSettings(String command, AgentSettings baseSettings)\n   at Microsoft.ServiceProfiler.DataContract.Settings.SettingsParser.ParseManyAgentSettings(String[] commands, AgentSettings baseSettings)\n   at Microsoft.ApplicationInsights.Profiler.Core.Contracts.ProfilerSettings.SetFromSettingsContract(SettingsContract settingsContract)\n   at Microsoft.ApplicationInsights.Profiler.Core.Orchestration.RemoteProfilerSettingsService.FetchRemoteSettingsAsync(CancellationToken cancellationToken)","{OriginalFormat}":"Unexpected error contacting service profiler service endpoint for settings. Details: {0}"}}
xiaomi7732 commented 2 years ago

Hey @CodedBeard Thanks for the info.

Based on what provided, I have 2 points that I want to share with you:

  1. The call stack points to a bug in our code base that null is not well checked before calling parser
    1. While we are here, do you mind share your iKey with us (mail me: saars@microsoft.com, or whatever way you are comfortable with)? It is interesting to see what the profiler settings you have in our backend to understand why we will encounter null case.
  2. However, that is not the reason for profiler not working because the success to that method is optional.
    1. While you have the settings, can you share the console log for the first few minutes (2 to 3 minutes) on starting the pods? I'd like to dig more into why profiler is not happening.

Details

CodedBeard commented 2 years ago

Thanks @xiaomi7732.

I've sent you a mail with the Ikey and the log buffer from the pod.

One thing I've noticed, is it appears that error has now stopped, and the RandomSchedulingPolicy seems to have started firing and uploading results. But attempts to start a profiling session from the portal are still resulting in a timeout error.

CodedBeard commented 2 years ago

I've just tried another Profile Now run, and this time it worked. A little confused as we've not changed anything further, so it seems like something was going wrong on the service side related to the original exception.

xiaomi7732 commented 2 years ago

Hey @CodedBeard Thanks for the updating and the logs.

Unfortunately there's no timestamp on the log so I can't tell when. But here's what I observed:

  1. The exception we mentioned in the last conversation was resolved about half way of the logs (line 3397) - combing it with our backend data, I assume you or someone owns your application insights resource at that moment somehow touched the settings of the profiler on the portal, that's when the buggy code path got bypassed.

  2. I was thinking the remote settings shouldn't impact the execution of the profiler, however, I might oversighted the fact that bad settings (by exception) might trick the agent to think that OnDemand policy or other policy got disabled. We will need to investigate more about it.

At the same time, I am leaning toward changing the Profiler settings, thus generate the settings instead of using default, is a work around, just in case anyone encountered the same:

image

I'd keep this issue open until we reach the bottom of the investigation.

CodedBeard commented 2 years ago

@xiaomi7732 Ah, now you mention it, I did try lowering the memory threshold to below what I knew the pod was using at some point to see if that would cause it to work, but as far as I recall, it was a good hour or so before the first trace fired which is why I completely forgot about it.

That was certainly the first and only time I changed the Trigger settings (and I'm the only one who has access to do so), so I think the hypothesis that doing so fixed the settings error makes sense.

xiaomi7732 commented 2 years ago

@CodedBeard Thanks for the quick turn-around. I think we have enough information at this point to find out the root cause. Thank you so much for providing the information.

CodedBeard commented 2 years ago

No problem @xiaomi7732. Just to add some further confirmation, I've now replicated this in another branch of the same project.

  1. Cherry picked the same changes into another branch that uses a seperate environment and app insights resource
  2. deploy changes and observer the same exception being logged
  3. Attempt to use Profile now and receieve a timeout error.
  4. Change the memory threshold to 40% in the portal
  5. The exception immediatley stops in the logs, and this time the memory profiler kicked in straight away
  6. Profile now still seems to be timing out right now, but maybe thats due to the profiler kicking off for the memory threshold.
xiaomi7732 commented 2 years ago

Sorry, one thing that I forgot to check with you:

I understand without the workaround, it appears that Profile Now is timing out; However - can you see other profiling sessions show up in the portal listed as Recent profiling sessions? And if there's any, mind telling us what are the values for Triggers?

That info's gonna to help us narrow down the scope - whether it is globally not working or is it just one or two specific policy.

CodedBeard commented 2 years ago

For the branch I just applied it to, no, there are no entries yet.

The original branch fired twice on the RandomSchedulingPolicy with memory values of 92% and 89% before the Profile now first succedded.

xiaomi7732 commented 2 years ago

Thank you for the confirm!

xiaomi7732 commented 2 years ago

Hey @CodedBeard, while @JacobBovee is actively working on fixing the bug, we are looking forward to hear back from you about the investigation on the other branch, if you are willing to share. And we will let you know when there's a newer package that you could try on for the bug fix.

CodedBeard commented 2 years ago

Hi @xiaomi7732, sorry for the delay, I've been super busy at work.

Due to being focused on other projects i hadn't checked the other branch in a while, but i've just looked this morning and can see the profiler has since been firing correctly based on cpu/memory usage.

xiaomi7732 commented 2 years ago

@CodedBeard, No worries. I am glad to hear back from you :-) Actually, we released 2.4.0 with the bug fix and you don't need the workaround any further. Feel free to give it a shoot: https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/

CodedBeard commented 2 years ago

@xiaomi7732 Great, I'll be deploying the updated package to prod this evening, will let you know if it all works.

CodedBeard commented 2 years ago

@xiaomi7732 Have now deployed 2.4 package to production and everything worked first time without having to change the settings, so looks like this can be closed now thanks.

xiaomi7732 commented 2 years ago

Thank you @CodedBeard. I am glad to hear that. Thanks for the prompt replies. With your help, Profiler is better now :-D