microsoft / ApplicationInsights-Profiler-AspNetCore

Application Insights Profiler sample and documentation

Random exceptions during application run #182

Closed: sloncho closed this issue 2 years ago

sloncho commented 2 years ago

.NET 6 ASP.NET Core app, with all packages, framework, and runtime at the latest versions at the time of writing: Microsoft.ApplicationInsights.Profiler.AspNetCore 2.4.0 and Microsoft.ApplicationInsights.AspNetCore 2.21.0.
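For reference, the profiler is wired up in the usual way; a minimal sketch of the registration (assuming the standard AddApplicationInsightsTelemetry/AddServiceProfiler extensions from this repo's README, with a hypothetical /healthz endpoint standing in for our health check):

// Program.cs (.NET 6 minimal hosting) - a sketch, not the full production setup.
var builder = WebApplication.CreateBuilder(args);

// Application Insights SDK (Microsoft.ApplicationInsights.AspNetCore 2.21.0).
builder.Services.AddApplicationInsightsTelemetry();

// Profiler (Microsoft.ApplicationInsights.Profiler.AspNetCore 2.4.0).
builder.Services.AddServiceProfiler();

var app = builder.Build();

// Health check: just returns OK, no other processing.
app.MapGet("/healthz", () => Results.Ok());

app.Run();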

While running the app in kubernetes, we see the following exception:

System.IndexOutOfRangeException: Index was outside the bounds of the array.

at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.GetByIndex(Int32 index)

at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.Get(Double value)

at Microsoft.ApplicationInsights.Profiler.Core.Sampling.SampleActivityContainer.AddSample(SampleActivity activity)

at Microsoft.ApplicationInsights.Profiler.Core.EventListeners.TraceSessionListener.AppendSampleActivity(SampleActivity activity)

at Microsoft.ApplicationInsights.Profiler.Core.EventListeners.TraceSessionListener.OnRichPayloadEventWritten(EventWrittenEventArgs eventData)

at Microsoft.ApplicationInsights.Profiler.Core.EventListeners.TraceSessionListener.OnEventWritten(EventWrittenEventArgs eventData)

We were not able to reproduce this running the container locally.

xiaomi7732 commented 2 years ago

Thanks for filing the issue. That is bad and weird. @sloncho, could you please help us understand how frequently you are seeing the error in your prod environment? Also, what's the scale of your application: what's the RPS, what's the memory status, and how many replicas do you have for your pods? This is the first report of this, so please help us understand more about your system.

sloncho commented 2 years ago

@xiaomi7732 Leaving it at only 1 health-check call per minute (it just returns OK, no processing at all, and no other requests), we see 5-10 exceptions a day. The scale/memory/CPU data is as follows:

resources:
  limits:
    cpu: 156m
    memory: 320Mi
  requests:
    cpu: 125m
    memory: 256Mi

Minimum replicas: 2
Maximum replicas: 4
HPA: CPU average utilization 75%
Memory working set: average 175 MB, limit 320 MB

This is not a CPU-intensive service.

xiaomi7732 commented 2 years ago

@sloncho That is super weird. I need the value of the duration to understand why it is thrown. Is it possible for you to run the same code with a newer package (which I am preparing) and turn on debug logging? That will help with troubleshooting.

sloncho commented 2 years ago

Sorry for the delay. We found that increasing the profiler startup delay to 1 minute helps: the exception rate drops significantly, to about once every few days. Verifying that is what took us a while; we wanted to make sure.

We can try that custom package with debug output to help investigate the issue.
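For reference, the startup delay we increased is the profiler's initial delay in appsettings.json, roughly like this (a sketch; we assume the InitialDelay setting name as documented in Configurations.md):

{
  "ServiceProfiler": {
    "InitialDelay": "00:01:00"
  }
}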

xiaomi7732 commented 2 years ago

Hey @sloncho, thanks for getting back to us. I have updated the code for troubleshooting this issue, and it has been released. Do you mind trying the new package (linked at the bottom) and posting the logs? Please turn on Debug logging for the category Microsoft.ApplicationInsights.Profiler.Core.Sampling.

For example:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning",
      "Microsoft.ApplicationInsights.Profiler.Core.Sampling": "Debug"  //<----- Add this line to turn on Debug
    }
  }
}
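If editing appsettings.json is not convenient, the equivalent filter can also be added in code through the standard logging builder (a sketch using plain Microsoft.Extensions.Logging, nothing profiler-specific):

// Program.cs - equivalent of the appsettings.json entry above.
builder.Logging.AddFilter(
    "Microsoft.ApplicationInsights.Profiler.Core.Sampling",
    LogLevel.Debug);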

References

Package (2.5.0-beta1)

https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.5.0-beta1

How to configure logging:

https://github.com/microsoft/ApplicationInsights-Profiler-AspNetCore/blob/main/Configurations.md#sets-the-logging-level-for-profiler

Related code

namespace Microsoft.ApplicationInsights.Profiler.Core.Sampling;
...
ValueBucketer<SampleActivityBucket, SampleActivity> bucketer = _operations.GetOrAdd(activity.OperationName, (operationName) => new ValueBucketer<SampleActivityBucket, SampleActivity>());
double activityDurationInMs = activity.Duration.TotalMilliseconds;
_logger.LogDebug("Activity duration(ms): {duration}", activityDurationInMs); // this is the value that I want to see.
SampleActivityBucket bucket = bucketer.Get(activityDurationInMs);
...
sloncho commented 2 years ago

Hey @xiaomi7732, please find the complete log attached: profiler_logs.zip

xiaomi7732 commented 2 years ago

Thanks @sloncho. I checked the logs, and I believe I see the data I want. I'll need some time to dig into it, and I'll get back to you later. Thank you for the help!

xiaomi7732 commented 2 years ago

Hey @sloncho, I want to give you an update. I can occasionally reproduce the issue under concurrent execution. I am wondering why we haven't had any reports of this before, and I can't reproduce it consistently yet, but at least I have a grasp of it now. I'll keep looking into it.

Here's the code, using the telemetry gathered above, that I used to reproduce the issue:

// Durations taken from the debug logs gathered above.
double[] durations = new double[] { 6.6419, 54.3629, 1.9237, 106.432, 1.5464 };

ILogger<SampleActivityContainer> logger = new NullLogger<SampleActivityContainer>();
SampleActivityContainer sampleActivityContainer = new SampleActivityContainer(logger);

// Adding samples concurrently occasionally triggers the IndexOutOfRangeException.
Parallel.ForEach(durations, new ParallelOptions(), (duration, state, index) =>
{
    sampleActivityContainer.AddSample(new SampleActivity() { OperationName = nameof(ShouldWorkWithBasicBucketing), Duration = TimeSpan.FromMilliseconds(duration) });
});

Here's a call stack I captured when it repros:

ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.ShouldWorkWithBasicBucketing:
    Outcome: Failed
    Error Message:
    System.AggregateException : One or more errors occurred. (Index was outside the bounds of the array.) (Index was outside the bounds of the array.)
---- System.IndexOutOfRangeException : Index was outside the bounds of the array.
---- System.IndexOutOfRangeException : Index was outside the bounds of the array.
    Stack Trace:
       at System.Threading.Tasks.TaskReplicator.Run[TState](ReplicatableUserAction`1 action, ParallelOptions options, Boolean stopOnFirstFailure)
   at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
   at System.Threading.Tasks.Parallel.ThrowSingleCancellationExceptionOrOtherException(ICollection exceptions, CancellationToken cancelToken, Exception otherException)
   at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](TSource[] array, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, ParallelOptions parallelOptions, Action`3 body)
   at ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.ShouldWorkWithBasicBucketing() in d:\Repos\ServiceProfiler3\tests\ServiceProfiler.EventPipe.Client.Tests\SampleActivityContainerTests.cs:line 22
----- Inner Stack Trace #1 (System.IndexOutOfRangeException) -----
   at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.GetByIndex(Int32 index) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 43
   at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.Get(Double value) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 58
   at Microsoft.ApplicationInsights.Profiler.Core.Sampling.SampleActivityContainer.AddSample(SampleActivity activity) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.EventPipe\ServiceProfiler.EventPipe.Client\EventListeners\Sampling\SampleActivityContainer.cs:line 32
   at ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.<>c__DisplayClass0_0.<ShouldWorkWithBasicBucketing>b__0(Double duration, ParallelLoopState state, Int64 index) in d:\Repos\ServiceProfiler3\tests\ServiceProfiler.EventPipe.Client.Tests\SampleActivityContainerTests.cs:line 24
   at System.Threading.Tasks.Parallel.<>c__DisplayClass32_0`2.<ForEachWorker>b__2(Int32 i, ParallelLoopState state)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica.Execute()
----- Inner Stack Trace #2 (System.IndexOutOfRangeException) -----
   at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.GetByIndex(Int32 index) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 43
   at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.Get(Double value) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 58
   at Microsoft.ApplicationInsights.Profiler.Core.Sampling.SampleActivityContainer.AddSample(SampleActivity activity) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.EventPipe\ServiceProfiler.EventPipe.Client\EventListeners\Sampling\SampleActivityContainer.cs:line 32
   at ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.<>c__DisplayClass0_0.<ShouldWorkWithBasicBucketing>b__0(Double duration, ParallelLoopState state, Int64 index) in d:\Repos\ServiceProfiler3\tests\ServiceProfiler.EventPipe.Client.Tests\SampleActivityContainerTests.cs:line 24
   at System.Threading.Tasks.Parallel.<>c__DisplayClass32_0`2.<ForEachWorker>b__2(Int32 i, ParallelLoopState state)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica.Execute()
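To make the failure mode concrete, here is a deliberately simplified, hypothetical bucketer, not the real ValueBucketer (whose source is not public), showing how a check-then-grow design over a shared array can throw IndexOutOfRangeException when Get is called concurrently:

// HYPOTHETICAL sketch only. Two threads can both decide to grow from the old,
// small array; if the thread that needed the larger array has its array
// overwritten by a thread that grew to a smaller size, the later indexed read
// lands outside the bounds of the array.
using System;
using System.Threading.Tasks;

class NaiveBucketer
{
    // Buckets cover durations in 10 ms steps and grow on demand.
    private int[] _buckets = new int[1];

    public int Get(double valueMs)
    {
        int index = (int)(valueMs / 10);

        if (index >= _buckets.Length)
        {
            // Non-atomic check-then-grow-then-swap: a concurrent caller that
            // also saw the old length can swap in a shorter replacement array
            // after this thread swaps in its longer one.
            int[] grown = new int[index + 1];
            Array.Copy(_buckets, grown, _buckets.Length);
            _buckets = grown;
        }

        // Re-reads _buckets, which may by now be the shorter array.
        return GetByIndex(index);
    }

    private int GetByIndex(int index) => _buckets[index];
}

class Program
{
    static void Main()
    {
        double[] durations = { 6.6419, 54.3629, 1.9237, 106.432, 1.5464 };
        var bucketer = new NaiveBucketer();

        // Occasionally throws IndexOutOfRangeException, like the test above.
        Parallel.ForEach(durations, d => bucketer.Get(d));
    }
}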
xiaomi7732 commented 2 years ago

It will take a while to get a fix out. I'll keep you posted. This issue will stay open until the bug is addressed.

xiaomi7732 commented 2 years ago

Hi @sloncho, a new package with the fix is released. Do you mind giving it a shot and getting back to us? https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.5.0-beta2
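Since it is a prerelease version, it has to be referenced with an explicit version, for example:

dotnet add package Microsoft.ApplicationInsights.Profiler.AspNetCore --version 2.5.0-beta2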

sloncho commented 2 years ago

It appears there is now a much more severe problem. Please see the attached logs: profiler_data.zip

By the way, it is strange that the source code is not provided; we could try to look into it too.

xiaomi7732 commented 2 years ago

I'll take a look.

xiaomi7732 commented 2 years ago

Posting an update: from the data, it looks like a different issue, on the IPC side. See the call stack at the bottom.

Now, the investigation:

It will also help to turn on tracing for uploading; could you please try the following?

Specifically, I am wondering what the payload coming over the pipe is:

...
public async Task<T> ReadAsync<T>(TimeSpan timeout = default, CancellationToken cancellationToken = default)
{
    VerifyModeIsSpecified();
    string payload = await ReadMessageAsync(timeout, cancellationToken).ConfigureAwait(false);
    // What is the payload returned in this line that can't be deserialized?
    _logger.LogTrace("Reading payload from named pipe: {payload}", payload); 
    if (_payloadSerializer.TryDeserialize<T>(payload, out T result))
    {
        return result;
    }
    throw new UnsupportedPayloadTypeException("Can't deserialize message over the named pipe.");
}
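That LogTrace call only emits when the category is at Trace level. Following the same pattern as the Debug example earlier, and assuming the category name matches the Microsoft.ApplicationInsights.Profiler.Core.IPC namespace seen in the call stack below, a setting like this should surface it:

{
  "Logging": {
    "LogLevel": {
      "Microsoft.ApplicationInsights.Profiler.Core.IPC": "Trace"
    }
  }
}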

Also, put the uploader in Development mode to see logs from the uploader side:

{
    "ServiceProfiler": {
        "UploaderEnvironment": "Development"
    }
}

Reference

Exception callstack

[{"id":"46889998","outerId":"0","type":"Microsoft.ApplicationInsights.Profiler.Core.IPC.UnsupportedPayloadTypeException","message":"Can't deserialize message over the named pipe.","

parsedStack":[{"level":0,   "method":"Microsoft.ApplicationInsights.Profiler.Core.IPC.DuplexNamedPipeService+<ReadAsync>d__14`1.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":5,                 "method":"Microsoft.ApplicationInsights.Profiler.Core.ServiceProfilerProvider+<>c__DisplayClass4_0+<<PostStopProcessAsync>b__0>d.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":9,                 "method":"Microsoft.ApplicationInsights.Profiler.Core.ServiceProfilerProvider+<PostStopProcessAsync>d__4.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":13,                "method":"Microsoft.ApplicationInsights.Profiler.Core.ServiceProfilerProvider+<StopServiceProfilerAsync>d__3.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":17,                "method":"Microsoft.ApplicationInsights.Profiler.Core.Orchestration.OrchestratorEventPipe+<StopProfilingAsync>d__3.MoveNext","line":0,"assembly":"Microsoft.ServiceProfiler.EventPipe.Orchestration, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":21,                "method":"Microsoft.ServiceProfiler.Orchestration.SchedulingPolicy+<StartPolicyAsync>d__1.MoveNext","line":0,"assembly":"Microsoft.ServiceProfiler.Orchestration, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"}],"severityLevel":"Error"}]
xiaomi7732 commented 2 years ago

I also created a k8s cluster to run an example here:

So far, profiler 2.5.0-beta2 works for me. I am keeping it running in the cluster. Let me know if you can see anything that I missed, specifically in k8s.yaml.

xiaomi7732 commented 2 years ago

Update: roughly 28 iterations/profiler sessions in the last 24 hours or so, with readiness and liveness probes configured, and I am not seeing failures.
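For comparison, the probe setup in the test deployment follows the standard Kubernetes pattern; a trimmed, hypothetical sketch (endpoint path, port, and timings are illustrative, not copied from the example's k8s.yaml):

# Hypothetical container snippet from a Deployment spec.
containers:
  - name: profiler-sample
    image: <your-image>
    env:
      - name: APPLICATIONINSIGHTS_CONNECTION_STRING
        valueFrom:
          secretKeyRef:
            name: appinsights
            key: connection-string
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 60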

xiaomi7732 commented 2 years ago

Hi @sloncho, any updates from your side? Are you still seeing the errors?

sloncho commented 2 years ago

@xiaomi7732 We haven't seen any exceptions since. I guess that fixes it. Any timeline for the official package?

xiaomi7732 commented 2 years ago

Thanks for confirming. I'll post here when the stable release is out.

xiaomi7732 commented 2 years ago

Hi @sloncho, please find the new package here: https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.5.0

sloncho commented 2 years ago

@xiaomi7732 The final 2.5.0 caused all our integration tests to break. This problem does not exist with the 2.5.0-beta2 package. We use WebApplicationFactory to create test instances of the API in order to run tests against it. They fail with:

System.InvalidOperationException : No process is associated with this object.
   at System.Diagnostics.Process.EnsureState(State state)
   at System.Diagnostics.Process.EnsureState(State state)
   at System.Diagnostics.Process.get_Id()
   at Microsoft.ServiceProfiler.Utilities.CurrentProcessProvider.get_ProcessId()
   at Microsoft.Extensions.DependencyInjection.ServiceCollectionBuilder.<>c.<Build>b__0_4(IServiceProvider p)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitDisposeCache(ServiceCallSite transientCallSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitIEnumerable(IEnumerableCallSite enumerableCallSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.Resolve(ServiceCallSite callSite, ServiceProviderEngineScope scope)
   at Microsoft.Extensions.DependencyInjection.ServiceProvider.CreateServiceAccessor(Type serviceType)
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at Microsoft.Extensions.DependencyInjection.ServiceProvider.GetService(Type serviceType, ServiceProviderEngineScope serviceProviderEngineScope)
   at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetService[T](IServiceProvider provider)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.Start(IHost host)
   at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.CreateHost(IHostBuilder builder)
   at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.ConfigureHostBuilder(IHostBuilder hostBuilder)
  at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.EnsureServer()
   at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.CreateDefaultClient(DelegatingHandler[] handlers)

This happens when tests are run in parallel; if run one by one, the problem does not exist. We do not have any global/static state, and we create a new instance of WebApplicationFactory for each test.
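As a temporary workaround we are looking at disabling the profiler for the in-memory test host only; a sketch, assuming the ServiceProfiler:IsDisabled switch documented in Configurations.md is the right knob:

// Sketch: turn the profiler off only for WebApplicationFactory test hosts.
// "ServiceProfiler:IsDisabled" is assumed from Configurations.md; double-check
// the exact setting name there.
using System.Collections.Generic;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Mvc.Testing;
using Microsoft.Extensions.Configuration;

public class ApiFactory : WebApplicationFactory<Program>
{
    protected override void ConfigureWebHost(IWebHostBuilder builder)
    {
        builder.ConfigureAppConfiguration((_, config) =>
        {
            config.AddInMemoryCollection(new Dictionary<string, string>
            {
                ["ServiceProfiler:IsDisabled"] = "true"
            });
        });
    }
}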

xiaomi7732 commented 2 years ago

Thanks for the report. It seems like a different issue and a duplicate of #186. Closing this one.