microsoft / ApplicationInsights-Profiler-AspNetCore

Application Insights Profiler sample and documentation
MIT License

Trace file is left behind when the process crashes before profiling is stopped #204

Open xiaomi7732 opened 7 months ago

xiaomi7732 commented 7 months ago

Describe the bug: as the title says - when the process is terminated unexpectedly, the temporary profiling file is left behind.

To Reproduce

  1. Start profiler.
  2. Kill the process in the middle.
  3. Check the tracing file.

Expected: the temp file is deleted.

Actual: the temp file is still there.

xiaomi7732 commented 7 months ago

Since the process was killed, there's little we can do during that session. However, what we can do is clean the trace files up in the next session, should the process be started again.

The feature is built into the new beta release; no user action is needed to leverage it:

https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.6.0-beta2
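For anyone who wants to try it, the beta can be pulled in with the usual NuGet tooling, for example:

    dotnet add package Microsoft.ApplicationInsights.Profiler.AspNetCore --version 2.6.0-beta2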

At the same time, it is also possible to customize the behavior:

{
    "ServiceProfiler": {
        "TraceScavenger": {
            "InitialDelay": "00:00:00",  // No initial delay.
            "Interval": "01:00:00",      // Run the trace scavenger every hour.
            "GracePeriod": "00:10:00"    // Do not delete a file that was accessed within the last 10 minutes.
        }
    }
}
esbenbach commented 7 months ago

Nice improvement, looking forward to it...

xiaomi7732 commented 7 months ago

@esbenbach Thanks for the comment. Will you be able to try it with the beta version, or will you have to wait for the stable release? We are running tests, but it would greatly increase our confidence in the feature if we had your confirmation that it is working.

esbenbach commented 7 months ago

Not sure we are going to test it with the beta. Regardless, we are not going to release before sometime in January. I'll look into using the beta package for that release, though.

One thing I'm almost certain about is that this is only half a solution: there is still something causing the profiler to get stuck in the first place, generating huge files.

But one thing at a time

xiaomi7732 commented 5 months ago

2.6.0 is released. Please check out the stable package: https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.6.0
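For reference, the stable version can be added with the usual project file reference, e.g.:

    <PackageReference Include="Microsoft.ApplicationInsights.Profiler.AspNetCore" Version="2.6.0" />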

esbenbach commented 2 weeks ago

@xiaomi7732 We finally had time to run this in production a bit.

The file is now removed, but it is still generated in the first place. So the issue is NOT solved, as a crash needs to occur before the file is cleaned up (and stops growing).

Secondly, all subsequent profiler writes (from different processes on the same VM) fail with "not enough disk space", even AFTER the file is removed and there is enough disk space.

xiaomi7732 commented 1 week ago

@esbenbach Thanks for confirming. There are several things we need to touch base on:

  1. The file is generated - yes, that's by design: it acts as a cache holding all events that happened while the application was running; I am not sure holding all those events in memory would be a good idea. But the deal is, after the run, if the app does NOT crash, it should clean up after itself. And if it crashes, the file scavenger will do the cleanup upon the next start. Am I missing something?

  2. A crash needs to occur before the file is cleaned up - that's not expected. The file scavenger should keep running. What happens if you set it to run more frequently, for example (the environment-variable equivalent is sketched after this list):

    {
        "ServiceProfiler": {
            "TraceScavenger": {
                "InitialDelay": "00:00:00",  // No initial delay.
                "Interval": "00:10:00",      // Run the trace scavenger every 10 minutes.
                "GracePeriod": "00:05:00"    // Do not delete a file that was accessed within the last 5 minutes.
            }
        }
    }

    On the other hand, if there's no crash, the files should be removed automatically.

  3. If there's enough space, it would be weird for it to report back as not enough space. Do you mind posting a call stack from when the issue happens?

  4. There are no other reports of this issue, which leads me to think there's some special setup / environment that I may have missed. Can you think of any?
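Re item 2: if it is easier in your setup to pass settings as environment variables, the same trace scavenger values can also be supplied through the standard double-underscore configuration key mapping; a sketch using the values from the snippet above:

    ServiceProfiler__TraceScavenger__InitialDelay=00:00:00
    ServiceProfiler__TraceScavenger__Interval=00:10:00
    ServiceProfiler__TraceScavenger__GracePeriod=00:05:00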

esbenbach commented 1 week ago
  1. Let me rephrase: a file of several GBs is still generated, meaning something still caused it to create a 70+ GB file, which is insane in the first place.

  2. The crash part may very well be me misinterpreting things - but the PR mentions killing the process (step 2), and I am not 100% certain that's what happens...

  3. System.IO.IOException:
    at System.IO.RandomAccess+<>c.<WriteAtOffsetAsync>b__21_0 (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Threading.AsyncOverSyncWithIoCancellation+<InvokeAsync>d__7`1.MoveNext (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.ThrowForFailedGetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1+StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.IO.Strategies.BufferedFileStreamStrategy+<WriteAsyncSlowPath>d__48.MoveNext (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.ThrowForFailedGetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1+StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.IO.Stream+<<CopyToAsync>g__Core|27_0>d.MoveNext (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at Microsoft.ApplicationInsights.Profiler.Core.TraceControls.DiagnosticsClientTraceControl+<StartWriteAsync>d__20.MoveNext (Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2403.1501, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at Microsoft.ApplicationInsights.Profiler.Core.TraceControls.DiagnosticsClientTraceControl+<DisableAsync>d__16.MoveNext (Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2403.1501, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at Microsoft.ApplicationInsights.Profiler.Core.ServiceProfilerProvider+<StopServiceProfilerAsync>d__3.MoveNext (Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2403.1501, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1+ConfiguredTaskAwaiter.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at Microsoft.ApplicationInsights.Profiler.Core.Orchestration.OrchestratorEventPipe+<StopProfilingAsync>d__3.MoveNext (Microsoft.ServiceProfiler.EventPipe.Orchestration, Version=2.6.2403.1501, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
    at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
    at Microsoft.ServiceProfiler.Orchestration.SchedulingPolicy+<StartPolicyAsync>d__1.MoveNext (Microsoft.ServiceProfiler.Orchestration, Version=2.6.2403.1501, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
  4. Yes, as mentioned in the Azure support case that ended up with this PR (and as described here: https://github.com/microsoft/ApplicationInsights-Profiler-AspNetCore/issues/180), we are running this on SF with the SF profiler and the ASP.NET profiler side by side (since they serve different data and purposes). And as mentioned in that other thread, there is really no recommended way to work with that setup. I suspect this is the ACTUAL cause of the file going crazy in the first place. The ASP.NET profiler is mostly disabled by the settings in Application Insights (only the Sampling trigger is turned on).

xiaomi7732 commented 3 days ago

Thanks for the response. The call stack is also interesting. Based on those, I have spotted a few things that differ from expectations.

  1. Why the file is multiple GBs - that is way larger than the usual trace file:

    • The netperf trace is relatively lightweight; a trace file is usually around 50 MB.
    • We have a 200 MB circular buffer size by default - I assume you didn't override the value to make it larger; could you please confirm?
    • I want to see the trace files to understand why. Could you please send me your Application Insights instrumentation key over email (saars@microsoft.com)? With that, and your permission, I should be able to check one of the netperf files to see more details about it.
  2. From the call stack, I see an IO exception; that is not necessarily an out-of-space exception unless the error code is checked, right? Did you do something to confirm it? (A quick sketch for checking this is at the end of this list.)

    • If it is out of space, and you still see free space, is it possible that %UserProfile%\temp is mounted somewhere with less space than the other folders?
    • Another possibility is that there's no permission to write the trace file to the target folder (from the buffer) - that happens on a read-only container, but it could also happen anywhere writing is not allowed.
    • An easy way to rule those out is to set the local cache folder to a different place, for example: ServiceProfiler__LocalCacheFolder = "c:\TraceCache".
    • The default is %UserProfile%\temp\ on Windows and /tmp/ on Linux.
    • See here for more user configurations.
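Regarding the error code check in item 2: a quick way to confirm whether the IOException really is a disk-full error is to inspect the Win32 code packed into HResult. A minimal sketch (the helper name is just for illustration and is not part of the profiler):

    using System.IO;

    static class IoErrorProbe
    {
        // Returns true when the IOException carries a Windows disk-full error code.
        public static bool IsDiskFull(IOException ex)
        {
            // On Windows, Win32-based HResults look like 0x8007xxxx; the low 16 bits hold the Win32 error code.
            int win32Code = ex.HResult & 0xFFFF;
            return win32Code == 112    // ERROR_DISK_FULL: "There is not enough space on the disk."
                || win32Code == 39;    // ERROR_HANDLE_DISK_FULL: "The disk is full."
        }
    }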
esbenbach commented 1 day ago
  1. No override that I know of (i.e. if there is one, it's NOT done on purpose, as I can't find any sort of setting for it). I'll contact you by email for the instrumentation key.
  2. The IO exception message text says this (I did NOT check the exception code): There is not enough space on the disk. : 'C:\SvcFab_App\Infosoft.S4.ConfigurationService.FabricType_App117\temp\ASPNETCORE_83e1964b-b195-4fb4-bc34-2cc49d5ae882.tmp'

There is nothing "easy" about adjusting the cache path in an SF cluster with both the "SDK" and the "SF" profiler running at the same time :) But it points to the same place...

There are NO containers in play here, by the way, only SF services... (which in this case host the Kestrel process).