microsoft / ApplicationInsights-Profiler-AspNetCore

Application Insights Profiler sample and documentation
MIT License

"Process X not running compatible .NET Core runtime." #117

Closed andersekdahl closed 3 years ago

andersekdahl commented 3 years ago

We're using the profiler (2.2.0-beta2) on Azure Container Instances with Linux (mcr.microsoft.com/dotnet/core/aspnet:3.1-buster-slim) and we receive a couple of errors in our logs from the profiler. Since it's a beta we assumed that was expected, but yesterday we got the following error in our logs:

Microsoft.Diagnostics.NETCore.Client.ServerNotAvailableException: Process 18 not running compatible .NET Core runtime.

Microsoft.Diagnostics.NETCore.Client.IpcClient.GetTransport(Int32 processId):206
Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessage(Int32 processId, IpcMessage message)
Microsoft.Diagnostics.NETCore.Client.EventPipeSession.Stop():26
ServiceProfiler.EventPipe.Client.TraceControls.DiagnosticsClientTraceControl.Disable():81

We're definitely running .NET Core and only running .NET Core. This error happened at the same time as the container crashed. Did this error cause our crash?

xiaomi7732 commented 3 years ago

Hi @andersekdahl thanks for filing the issue. The callstack is helpful. Please give me some time to review the code to see if that could be the cause of the crash.

Having a local repro of the issue would help us track it down. Does the crash happen all the time? Do you know the steps to reproduce the crash? What does the memory usage look like before the crash?

In the meantime, since there have been some bug fixes around the call stack you posted, would you mind trying a newer profiler, 2.2.0-beta4? The relevant fixed issues are:

109

110

I'll let you know once the related code is checked.

andersekdahl commented 3 years ago

It doesn't crash all the time; there are usually a couple of days between occurrences of this error. We have never had it happen locally, only sporadically on ACI. Unfortunately I can't see any direct pattern in when it occurs; memory and CPU vary but are normal when it has happened.

I'll be sure to try the new beta as soon as I can, probably during the next couple of days.

xiaomi7732 commented 3 years ago

Hi @andersekdahl, thanks for the quick turnaround. I finished reviewing the code. As I understand it, things happened the other way around:

  1. The container was crashing for some reason.
  2. The file system was unmounted (step 5 explains why this matters).
  3. Your application was shutting down because of the previous steps.
  4. The profiler, which happened to be running, was shutting down. That is the code path that calls 'Disable()'; otherwise it would be DisableAsync().
  5. The profiler tried to close the EventPipe stream. EventPipe is provided by the .NET Core runtime through an IPC server that is based on the file system, so the exception is thrown if the IPC pipe was already gone because of step 2. Since the IPC pipe can't be found on the file system, the library treats the process as if it were running an unsupported version of .NET Core, hence the error message.
  6. There happens to be an improvement in 2.2.0-beta4 in that code path: in the same situation, you will get a message stating 'Unexpected exception stopping session' instead of one saying the .NET Core version is incorrect.
  7. The profiler caught the exception and did not re-throw it, which is why we can find the error message in the Application Insights logs.
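
As a rough illustration of step 5 (a sketch, not the library's actual code): on Linux the runtime exposes its diagnostics IPC endpoint as a Unix domain socket named dotnet-diagnostic-{pid}-{key}-socket in the temp directory. If that file is gone, a client has no way to tell "runtime too old" apart from "socket already removed". The class and method names below are hypothetical and exist only for this example.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the profiler or of
// Microsoft.Diagnostics.NETCore.Client.
public static class DiagnosticsSocketProbe
{
    // Returns true when at least one diagnostics socket file for the
    // given process id exists in tempDir.
    public static bool HasDiagnosticsSocket(int processId, string tempDir)
    {
        if (!Directory.Exists(tempDir))
        {
            // Temp directory already gone (for example, file system unmounted).
            return false;
        }

        string pattern = $"dotnet-diagnostic-{processId}-*-socket";
        return Directory.EnumerateFiles(tempDir, pattern).Any();
    }

    public static void Main()
    {
        int pid = Process.GetCurrentProcess().Id;
        string tempDir = Path.GetTempPath();

        Console.WriteLine(
            HasDiagnosticsSocket(pid, tempDir)
                ? $"Diagnostics socket found for process {pid}."
                : $"No diagnostics socket for process {pid} in {tempDir}.");
    }
}
```

Running a check like this from inside the container shortly before shutdown would show whether the socket disappears before the profiler tries to stop its session.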

There must be something else causing the container to crash, either in other parts of the profiler or outside the profiler. We need to find out what that is.

Is there anything on the container instance side that we can review? Logs? Signals for crashing? Maybe try filing a support ticket with the Container Instances team for help diagnosing what is causing the crash?

Let us know. Thanks.

xiaomi7732 commented 3 years ago

@andersekdahl Just a casual check, did you find anything new?

andersekdahl commented 3 years ago

Haven't been able to deploy the update for various reasons. We're still seeing the error every now and then, but it always happens at the same time as a crash. So I don't think this error is what caused the crash I reported.

dstj commented 3 years ago

@xiaomi7732 Just FYI, I just had the same error twice today with version v2.2.0.

It appears to happen when the web server is shutting down.

2021-06-10T19:54:49.368262+00:00  Stopping all processes with SIGTERM
...
2021-06-10T19:54:55.491523+00:00 Unhandled exception. Microsoft.Diagnostics.NETCore.Client.ServerNotAvailableException: Process 3 not running compatible .NET Core runtime.
at Microsoft.Diagnostics.NETCore.Client.IpcClient.GetTransport(Int32 processId)
at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessage(Int32 processId, IpcMessage message)
at Microsoft.Diagnostics.NETCore.Client.EventPipeSession.Stop()
at Microsoft.ApplicationInsights.Profiler.Core.TraceControls.DiagnosticsClientTraceControl.StopProfilerSession(Boolean disposeEventSessionImmediately)
at Microsoft.ApplicationInsights.Profiler.Core.TraceControls.DiagnosticsClientTraceControl.Disable()
at Microsoft.ApplicationInsights.Profiler.Core.TraceControls.DiagnosticsClientTraceControl.Dispose()
at Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope.DisposeAsync()
--- End of stack trace from previous location ---
at Microsoft.Extensions.Hosting.Internal.Host.DisposeAsync()
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Pym.Apps.WebServer.Program.Main(String[] args) in /src/Pym.Apps.WebServer/Program.cs:line 15
at Pym.Apps.WebServer.Program.<Main>(String[] args)

xiaomi7732 commented 3 years ago

@dstj Thanks for the report and the callstack. I'll take a quick look.

xiaomi7732 commented 3 years ago

This looks like a case where the application is shutting down. I'll work on a fix that swallows ServerNotAvailableException during application shutdown.
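
For readers who hit this before upgrading, the general pattern can be sketched as follows. This is an assumption about the shape of such a fix, not the code that shipped in the profiler; ShutdownSafeStop and TryStop are hypothetical names. The exception type is a generic parameter so the sketch compiles without any package reference.

```csharp
using System;

// Hypothetical helper illustrating "swallow a specific exception only
// while the application is shutting down".
public static class ShutdownSafeStop
{
    // Runs stopSession. If it throws TException while isShuttingDown()
    // returns true, the exception is swallowed and false is returned.
    // In every other case the exception propagates to the caller.
    public static bool TryStop<TException>(Action stopSession, Func<bool> isShuttingDown)
        where TException : Exception
    {
        try
        {
            stopSession();
            return true;
        }
        catch (TException) when (isShuttingDown())
        {
            // The IPC endpoint is probably already gone; nothing left to stop.
            return false;
        }
    }
}
```

With the Microsoft.Diagnostics.NETCore.Client package referenced, a call might look like `ShutdownSafeStop.TryStop<ServerNotAvailableException>(() => session.Stop(), () => lifetime.ApplicationStopping.IsCancellationRequested)`, where `session` is an EventPipeSession and `lifetime` is an IHostApplicationLifetime. The exception filter keeps the failure visible when it happens outside shutdown, which is when it would indicate a real problem.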

xiaomi7732 commented 3 years ago

The fix is released in 2.3.0-beta2.