Thanks for filing the issue. That is bad and weird. @sloncho, could you please help us understand how frequently you are seeing the error in your prod environment? Also, what's the scale of your application? What's the RPS? What's the memory status? How many replicas do you have for your pods? This is the first report of this, so please help us understand more about your system.
@xiaomi7732 Leaving it at only 1 health-check call per minute (it just returns OK, no processing at all, and no other requests), we see 5-10 exceptions a day. The scale/memory/CPU data is as follows:
resources:
  limits:
    cpu: 156m
    memory: 320Mi
  requests:
    cpu: 125m
    memory: 256Mi
Minimum replicas: 2
Maximum replicas: 4
HPA: CPU average utilization 75%
Memory working set: average 175 MB, limit 320 MB
This is not a CPU-intensive service.
@sloncho That is super weird. I need the value of the duration to understand why it is thrown. Is it possible for you to run the same code with a newer package (that I am preparing) and turn on debug logging? That will help troubleshoot.
Sorry for the delay. We found out that increasing the profiler startup delay to 1 minute helps; the exception rate drops significantly, to once every few days. Verifying that is what caused our delayed reply.
We can try that custom package with debug output to help investigate the issue.
Hey @sloncho, thanks for getting back to us. I have updated the code for troubleshooting this issue and it has been released. Do you mind trying the new package (I will put the link at the bottom) and posting the logs?
Please turn on the Debug log for the category Microsoft.ApplicationInsights.Profiler.Core.Sampling.
For example:
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning",
      "Microsoft.ApplicationInsights.Profiler.Core.Sampling": "Debug" // <----- Add this line to turn on Debug
    }
  }
}
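If editing appsettings.json is not convenient in your environment, the same filter can, I believe, also be applied in code through Microsoft.Extensions.Logging; here is a minimal sketch assuming a .NET 6 minimal-hosting Program.cs (adjust to your actual host setup):

// Sketch only: assumes a Microsoft.NET.Sdk.Web project with implicit usings.
var builder = WebApplication.CreateBuilder(args);

// Equivalent of the appsettings.json entry above: raise this category to Debug.
builder.Logging.AddFilter("Microsoft.ApplicationInsights.Profiler.Core.Sampling", LogLevel.Debug);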
https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.5.0-beta1
namespace Microsoft.ApplicationInsights.Profiler.Core.Sampling;
...
ValueBucketer<SampleActivityBucket, SampleActivity> bucketer = _operations.GetOrAdd(activity.OperationName, (operationName) => new ValueBucketer<SampleActivityBucket, SampleActivity>());
double activityDurationInMs = activity.Duration.TotalMilliseconds;
_logger.LogDebug("Activity duration(ms): {duration}", activityDurationInMs); // this is the value that I want to see.
SampleActivityBucket bucket = bucketer.Get(activityDurationInMs);
...
Hey @xiaomi7732, please find the complete log attached: profiler_logs.zip
Thanks @sloncho. I checked the logs, and I believe I see the data that I want. I'll need some time to dig into the data a bit, and I'll get back to you later. Thank you for the help!
Hey @sloncho, I want to give you an update. I can occasionally reproduce the issue under concurrent execution. I am wondering why we haven't had any reports of this before. I also can't reproduce it consistently yet, but at least I think I have a grasp of it. I'll keep looking into it.
Here's the code I used with the telemetry gathered to reproduce the issue:
double[] durations = new double[] { 6.6419, 54.3629, 1.9237, 106.432, 1.5464 };
ILogger<SampleActivityContainer> logger = new NullLogger<SampleActivityContainer>();
SampleActivityContainer sampleActivityContainer = new SampleActivityContainer(logger);
Parallel.ForEach(durations, new ParallelOptions(), (duration, state, index) =>
{
    sampleActivityContainer.AddSample(new SampleActivity() { OperationName = nameof(ShouldWorkWithBasicBucketing), Duration = TimeSpan.FromMilliseconds(duration) });
});
Here's a call stack I captured when it repros:
ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.ShouldWorkWithBasicBucketing:
Outcome: Failed
Error Message:
System.AggregateException : One or more errors occurred. (Index was outside the bounds of the array.) (Index was outside the bounds of the array.)
---- System.IndexOutOfRangeException : Index was outside the bounds of the array.
---- System.IndexOutOfRangeException : Index was outside the bounds of the array.
Stack Trace:
at System.Threading.Tasks.TaskReplicator.Run[TState](ReplicatableUserAction`1 action, ParallelOptions options, Boolean stopOnFirstFailure)
at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
at System.Threading.Tasks.Parallel.ThrowSingleCancellationExceptionOrOtherException(ICollection exceptions, CancellationToken cancelToken, Exception otherException)
at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](TSource[] array, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, ParallelOptions parallelOptions, Action`3 body)
at ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.ShouldWorkWithBasicBucketing() in d:\Repos\ServiceProfiler3\tests\ServiceProfiler.EventPipe.Client.Tests\SampleActivityContainerTests.cs:line 22
----- Inner Stack Trace #1 (System.IndexOutOfRangeException) -----
at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.GetByIndex(Int32 index) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 43
at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.Get(Double value) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 58
at Microsoft.ApplicationInsights.Profiler.Core.Sampling.SampleActivityContainer.AddSample(SampleActivity activity) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.EventPipe\ServiceProfiler.EventPipe.Client\EventListeners\Sampling\SampleActivityContainer.cs:line 32
at ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.<>c__DisplayClass0_0.<ShouldWorkWithBasicBucketing>b__0(Double duration, ParallelLoopState state, Int64 index) in d:\Repos\ServiceProfiler3\tests\ServiceProfiler.EventPipe.Client.Tests\SampleActivityContainerTests.cs:line 24
at System.Threading.Tasks.Parallel.<>c__DisplayClass32_0`2.<ForEachWorker>b__2(Int32 i, ParallelLoopState state)
at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica.Execute()
----- Inner Stack Trace #2 (System.IndexOutOfRangeException) -----
at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.GetByIndex(Int32 index) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 43
at Microsoft.ServiceProfiler.Collectors.ValueBucketer`2.Get(Double value) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.DiagnosticsAgent\Collectors\ValueBucketer.cs:line 58
at Microsoft.ApplicationInsights.Profiler.Core.Sampling.SampleActivityContainer.AddSample(SampleActivity activity) in d:\Repos\ServiceProfiler3\src\ServiceProfiler.EventPipe\ServiceProfiler.EventPipe.Client\EventListeners\Sampling\SampleActivityContainer.cs:line 32
at ServiceProfiler.EventPipe.Client.Tests.SampleActivityContainerTests.<>c__DisplayClass0_0.<ShouldWorkWithBasicBucketing>b__0(Double duration, ParallelLoopState state, Int64 index) in d:\Repos\ServiceProfiler3\tests\ServiceProfiler.EventPipe.Client.Tests\SampleActivityContainerTests.cs:line 24
at System.Threading.Tasks.Parallel.<>c__DisplayClass32_0`2.<ForEachWorker>b__2(Int32 i, ParallelLoopState state)
at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`1.<ForWorker>b__1(RangeWorker& currentWorker, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica.Execute()
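To illustrate the kind of race involved: when a bucket array grows on demand without synchronization, two threads can decide to resize it to different sizes, and the smaller resize can land after the other thread's bounds check, so that thread's write indexes past the end. This is only a simplified sketch with hypothetical names, not the actual ValueBucketer implementation:

using System;

// Hypothetical, simplified bucketer; NOT the real ValueBucketer code.
class NaiveBucketer
{
    private int[] _buckets = new int[4];

    public void Add(double durationMs)
    {
        int index = (int)Math.Log2(durationMs + 1); // map a duration to a bucket index
        if (index >= _buckets.Length)
        {
            // Unsynchronized grow: a concurrent, smaller resize can replace the
            // array after this thread's bounds check has already passed.
            Array.Resize(ref _buckets, index + 1);
        }
        _buckets[index]++; // can throw IndexOutOfRangeException under concurrency
    }
}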
It will take a while for a fix. I'll keep you posted. This issue will be kept open until the bug is addressed.
Hi @sloncho, a new package with this fix is released, do you mind giving it a shot and getting back to us? https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.5.0-beta2
It appears there is now a much more severe problem. Please see the attached logs: profiler_data.zip
By the way, it is strange that the source code is not provided; we could try to look into it too.
I'll take a look.
Posting an update: from the data, it looks like a different issue, on the IPC side. See the call stack at the bottom.
Now, the investigation:
I tried the package (2.5.0-beta2) locally, also inside a container, and it didn't reproduce.
What is the base image of the app?
Did it happen every time a profiler session finished, or only every once in a while?
Are there any special characters in the requests that might cause a serialization issue? Do you mind sharing the request telemetry during the profiling session? Besides /health/read and /health/live, is there anything else by chance?
It will also help to turn on tracing for uploading. Could you please try these:
Enable trace logging for the category Microsoft.ApplicationInsights.Profiler.Core:
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning",
      "Microsoft.ApplicationInsights.Profiler.Core": "Trace" // <----- Add this line to turn on Trace
    }
  }
}
Specifically, I am wondering what the payload coming over the named pipe is:
...
public async Task<T> ReadAsync<T>(TimeSpan timeout = default, CancellationToken cancellationToken = default)
{
    VerifyModeIsSpecified();
    string payload = await ReadMessageAsync(timeout, cancellationToken).ConfigureAwait(false);

    // What is the payload returned in this line that can't be deserialized?
    _logger.LogTrace("Reading payload from named pipe: {payload}", payload);

    if (_payloadSerializer.TryDeserialize<T>(payload, out T result))
    {
        return result;
    }

    throw new UnsupportedPayloadTypeException("Can't deserialize message over the named pipe.");
}
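For context, the TryDeserialize call above is a standard try-pattern guard. Here is a minimal sketch of what such a guard typically looks like (a hypothetical helper using System.Text.Json, not the actual _payloadSerializer), just to show how a malformed or truncated payload ends up at the throw above:

// Hypothetical sketch of a try-deserialize guard; not the real serializer.
static bool TryDeserialize<T>(string payload, out T? result)
{
    try
    {
        result = System.Text.Json.JsonSerializer.Deserialize<T>(payload);
        return result is not null;
    }
    catch (System.Text.Json.JsonException)
    {
        // A malformed or truncated payload lands here, and the caller then
        // surfaces UnsupportedPayloadTypeException.
        result = default;
        return false;
    }
}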
Also, put the uploader in Development mode to see logs from the uploader side:
{
  "ServiceProfiler": {
    "UploaderEnvironment": "Development"
  }
}
Exception callstack
[{"id":"46889998","outerId":"0","type":"Microsoft.ApplicationInsights.Profiler.Core.IPC.UnsupportedPayloadTypeException","message":"Can't deserialize message over the named pipe.","
parsedStack":[{"level":0, "method":"Microsoft.ApplicationInsights.Profiler.Core.IPC.DuplexNamedPipeService+<ReadAsync>d__14`1.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":5, "method":"Microsoft.ApplicationInsights.Profiler.Core.ServiceProfilerProvider+<>c__DisplayClass4_0+<<PostStopProcessAsync>b__0>d.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":9, "method":"Microsoft.ApplicationInsights.Profiler.Core.ServiceProfilerProvider+<PostStopProcessAsync>d__4.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":13, "method":"Microsoft.ApplicationInsights.Profiler.Core.ServiceProfilerProvider+<StopServiceProfilerAsync>d__3.MoveNext","line":0,"assembly":"Microsoft.ApplicationInsights.Profiler.Core, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":17, "method":"Microsoft.ApplicationInsights.Profiler.Core.Orchestration.OrchestratorEventPipe+<StopProfilingAsync>d__3.MoveNext","line":0,"assembly":"Microsoft.ServiceProfiler.EventPipe.Orchestration, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"},
{"level":21, "method":"Microsoft.ServiceProfiler.Orchestration.SchedulingPolicy+<StartPolicyAsync>d__1.MoveNext","line":0,"assembly":"Microsoft.ServiceProfiler.Orchestration, Version=2.6.2210.2401, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"}],"severityLevel":"Error"}]
I also created a k8s cluster to run an example here:
So far, profiler 2.5.0-beta2 works for me. I am keeping it running in the cluster. Let me know if you see anything that I missed, specifically in k8s.yaml.
Update: roughly 28 iterations/profiler sessions in the last 24 hours with readiness and liveness probes configured, and I am not seeing failures.
Hi @sloncho, any updates from your side? Are you still seeing the errors?
@xiaomi7732 We haven't seen any exceptions since. I guess that fixes it. Any timeline for the official package?
Thanks for confirming. I'll post here when the stable release is out.
Hi @sloncho, please find the new package here: https://www.nuget.org/packages/Microsoft.ApplicationInsights.Profiler.AspNetCore/2.5.0
@xiaomi7732 The final 2.5.0 release causes all our integration tests to break. This problem does not exist with the 2.5.0-beta2 package. We use WebApplicationFactory to create test instances of the API in order to run tests against it. They fail with:
System.InvalidOperationException : No process is associated with this object.
at System.Diagnostics.Process.EnsureState(State state)
at System.Diagnostics.Process.EnsureState(State state)
at System.Diagnostics.Process.get_Id()
at Microsoft.ServiceProfiler.Utilities.CurrentProcessProvider.get_ProcessId()
at Microsoft.Extensions.DependencyInjection.ServiceCollectionBuilder.<>c.<Build>b__0_4(IServiceProvider p)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitDisposeCache(ServiceCallSite transientCallSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitConstructor(ConstructorCallSite constructorCallSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitIEnumerable(IEnumerableCallSite enumerableCallSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.VisitRootCache(ServiceCallSite callSite, RuntimeResolverContext context)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteVisitor`2.VisitCallSite(ServiceCallSite callSite, TArgument argument)
at Microsoft.Extensions.DependencyInjection.ServiceLookup.CallSiteRuntimeResolver.Resolve(ServiceCallSite callSite, ServiceProviderEngineScope scope)
at Microsoft.Extensions.DependencyInjection.ServiceProvider.CreateServiceAccessor(Type serviceType)
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at Microsoft.Extensions.DependencyInjection.ServiceProvider.GetService(Type serviceType, ServiceProviderEngineScope serviceProviderEngineScope)
at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetService[T](IServiceProvider provider)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.Start(IHost host)
at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.CreateHost(IHostBuilder builder)
at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.ConfigureHostBuilder(IHostBuilder hostBuilder)
at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.EnsureServer()
at Microsoft.AspNetCore.Mvc.Testing.WebApplicationFactory`1.CreateDefaultClient(DelegatingHandler[] handlers)
This happens when tests are run in parallel; when run one by one, the problem does not occur. We do not have any global/static state, and we create a new instance of WebApplicationFactory for each test.
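For reference, this is roughly the shape of such a test; a minimal sketch assuming xUnit, with illustrative names (the Program entry point and the /health/live route stand in for our actual API):

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

// Illustrative only; not our actual test code.
public class HealthCheckTests
{
    [Fact]
    public async Task HealthLive_ReturnsOk()
    {
        // A fresh factory per test, no shared/static state; the
        // "No process is associated with this object" failure shows up
        // only when several tests like this run in parallel.
        using var factory = new WebApplicationFactory<Program>();
        using HttpClient client = factory.CreateClient();

        HttpResponseMessage response = await client.GetAsync("/health/live");

        response.EnsureSuccessStatusCode();
    }
}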
Thanks for the report. It seems like a different issue. Duplicate of #186. Closing this one.
.NET 6 ASP.NET app, all packages/framework/runtime latest at the time of writing: Microsoft.ApplicationInsights.Profiler.AspNetCore 2.4.0, Microsoft.ApplicationInsights.AspNetCore 2.21.0
While running the app in Kubernetes, we see the following exception:
We were not able to reproduce this running the container locally.