Closed raffaeler closed 2 years ago
Is this on Windows or Linux that you are seeing this? Those are two very different codepaths.
I am seeing this on Windows x64, but I am interested in understanding the problem for other platforms as well. Thanks
There may be a bug here, or it may be expected. Here's a quick description of when it would be expected:
ClrMD uses clr's stackwalker to attempt to unwind the stack. By design, if the CLR stackwalker encounters a non-managed instruction pointer in the eip
or rip
register it assumes that we have normally exited managed code and put a clr!Frame
on the stack. It will attempt to look for that clr!Frame
object on the stack to help unwind and start the unwind from that object (note that clr!DebuggerU2MCatchHandlerFrame
is a clr!Frame
object).
It's possible that we can be "unlucky" and called into native code briefly before setting up a frame and that will confuse CLR's stackwalker. It happens regularly but we mostly assume that to be rare for most cases. However if you are seeing this a lot then there might be a bug here, or we might be mistaken in thinking it's rare. In order to find out we'd need to probably implement your suggestion in #760, to save off snapshots to see what the real stack is in WinDbg.
If there's not a bug here, I'm not sure there's much we can do to fix this. ClrMD isn't designed to function as a sampling profiler, it's meant to be a runtime inspection tool. What we are reporting to you is the stack as the CLR runtime sees it (again...that's assuming no bug in ClrMD), and what it sees at the time you took the snapshot is that [DebuggerU2MCatchHandlerFrame]
was the top of the managed "stack".
I will try to reproduce this myself, and if I can't I'll see if we can implement #760 so we can get you to collect more information here. (Also FYI I just got back from a vacation so I'm a bit behind on work...it might be a few days before I get to this.)
Got it. Take your time and let me know if you need help to repro the issue. Let's say I try this operation 20 times, I can see the real stack just one or two times.
I suspect it is a bug because, even if I was working on a snapshot, enumerating two times the threads and the stack frames, I saw different results: the first time I could see the stack, the following times I could only see DebuggerU2MCatchHandlerFrame
.
The repro should be pretty simple. In any case tomorrow morning I will present a session in an online conference about diagnostics and CLRMD is one of the pillars :) As soon as I publish the demo I can point you to my code but I suspect that it won't be that useful.
Thanks
Additional notes:
I got dumps with [DebuggerU2MCatchHandlerFrame] on the stack.
Atleast ClrMD & Windbg (using SOS) shows the same output.
Hello, I'm facing the same issue on Linux.
We have a troubleshooting tool that uses CLRmd to show current "parallel stacks" of the application (connecting with dataTarget = DataTarget.AttachToProcess(pid, suspend = true);
). In my reproduction the stacks are always correct for the threads that are waiting (Thread.Sleep
, mre.WaitOne
, etc). However, for the threads that are actively consuming CPU, there's only one frame like below
OS Thread ID: 965 Managed thread id: 7 Alive: True Aborted: False State: 16912928 IsWorkerThread: True IsIOCompletionThread: False
Kind: Runtime Method: FrameName: DebuggerU2MCatchHandlerFrame IP: 610053002E0000 MethodByIP:
It's very stable; even after 100 iterations of creating/destroying dataTarget
I didn't catch a working stackwalk for the interesting thread.
When I take a full memory snapshot with dotnet-dump collect --type Full
, the stacktrace is shown correctly
> clrstack
OS Thread Id: 0x965 (21)
Child SP IP Call Site
00007F9E9E9A2440 00007F9EA33F0FF7 Ghost.TestApps.BurnCpuComponent.Process(Ghost.TestApps.BurnCpuParam, System.Threading.CancellationToken) [C:\dev\repos\troubleshooting\app-hosting\ghost\Ghost\Ghost.TestApps\BurnCpu.cs @ 45]
00007F9E9E9A2490 00007F9EA33EE91A Ghost.Core.ComponentMiddleware`4+<>c__DisplayClass6_0+<<Invoke>g__ProcessRequest|0>d[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].MoveNext() [C:\dev\repos\troubleshooting\app-hosting\ghost\Ghost\Ghost.Core\ComponentMiddleware.cs @ 75]
00007F9E9E9A2800 00007F9EA33ED002 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Ghost.Core.ComponentMiddleware`4+<>c__DisplayClass6_0+<<Invoke>g__ProcessRequest|0>d[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]], Ghost.Core.netcore]](<<Invoke>g__ProcessRequest|0>d<System.__Canon,System.__Canon,System.__Canon,System.__Canon> ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
00007F9E9E9A2860 00007F9EA33ECEAF System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.ValueTuple`2[[System.__Canon, System.Private.CoreLib],[Ghost.Contract.InvocationResult, Ghost.Contract.netcore]], System.Private.CoreLib]].Start[[Ghost.Core.ComponentMiddleware`4+<>c__DisplayClass6_0+<<Invoke>g__ProcessRequest|0>d[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]], Ghost.Core.netcore]](<<Invoke>g__ProcessRequest|0>d<System.__Canon,System.__Canon,System.__Canon,System.__Canon> ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs @ 36]
00007F9E9E9A28A0 00007F9EA33ECA05 Ghost.Core.ComponentMiddleware`4+<>c__DisplayClass6_0[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].<Invoke>g__ProcessRequest|0()
00007F9E9E9A2930 00007F9E9EE39C21 System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib]].InnerInvoke() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Future.cs @ 503]
00007F9E9E9A2950 00007F9EA2C1BE4E System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 268]
00007F9E9E9A2990 00007F9EA2C225B0 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2337]
00007F9E9E9A2A10 00007F9E9EC97F78 System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2271]
00007F9E9E9A2A30 00007F9E9EC97F17 System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2262]
00007F9E9E9A2A40 00007F9EA1FA4FD3 System.Threading.ThreadPoolWorkQueue.Dispatch()
00007F9E9E9A2AB0 00007F9E9EC93BC5 System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs @ 63]
00007F9E9E9A2B80 00007F9E9EC78432 System.Threading.Thread.StartCallback() [/_/src/coreclr/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 105]
00007F9E9E9A2CF0 00007f9f18916a27 [DebuggerU2MCatchHandlerFrame: 00007f9e9e9a2cf0]
It also contains DebuggerU2MCatchHandlerFrame
but with the different IP. When I open this snapshot with CLRmd it shows correct result (same as clrstack
).
I think I've found a bug that explains this behavior on Linux!
In LinuxLiveDataReader.cs there's a ptrace(PTRACE_ATTACH, pid, 0, 0)
call to attach to the process. However, the ptrace manual says we should attach to every thread id separately. When I call PTRACE_ATTACH
for every tid the stacks are read correctly.
Also DataTarget.CreateSnapshotAndAttach(pid)
works fine on Linux.
When I call
PTRACE_ATTACH
for every tid the stacks are read correctly.
That's great! Can you submit a PR for this? Or even just a draft showing the code you changed? No worries if not, but it would speed things along.
Sure, I'll make a PR tomorrow!
Interestingly it means the original issue observed on Windows is unrelated. I ran my code on Windows couple of times but could not reproduce.
It seems my fix was not enough in the end: I reproduced the issue again and this time both on Windows and Linux.
First, I attach the repro code. Depending on the parameters, the program does 3 things:
while(true)
loopIn every environment I launch the program in test mode, then I launch another instance to show the stack traces and to inspect the dump I create with dotnet-dump collect -p <pid> --type Full
. I expect that the stack traces reported from the dump and from the live session are the same. I also expect that stacktraces are reported correctly for every test case in the app.
That's the results I get:
ManualResetEvent.Wait Fibonacci numbers DateTime.UtcNow
Linux clrmd live correct correct after PR #991 empty stack
Linux clrmd dump correct correct empty stack
Linux clrstack correct correct empty stack / correct native part with -f in lldb
Linux dumpstack correct correct correct in repro but wrong in real app
Windows clrmd live correct correct works occasionally
Windows clrmd dump correct correct works occasionally
Windows clrstack correct correct works occasionally
The most interesting test case is the loop querying DateTime.UtcNow
. I attach the full output for different cases.
On Linux clrstack
always returns a single DebuggerU2MCatchHandlerFrame
. In lldb clrstack -f
skips the managed frames: i.e. native frames are correct, but there are empty lines where the managed frames are supposed to be. The dumpstack
command always returns correct stack in repro but in the real app (http server doing DateTime.UtcNow
loop on request) the managed part of the stack seems to contain the method that should not be there, so I guess the stackwalk is confused at some point.
On Windows running the test 10 times I'd get 2 times when it works and 8 times when it does not for both dump and live session. Weirdly yesterday the dump seemed to be correct more often, but today it's on par with the live session. Probably just luck.
I'd appreciate if you try the same tests on your machine to compare the results. I ran the tests on WSL Ubuntu with dotnet 5.0.9 and Windows 10 dotnet 5.0.17 (tried different dotnet versions too btw doesn't seem to change anything).
Ahh thank you for this repro. I actually think this is working "as expected" but what is expected here may be surprising to you. This is due to how the CLR diagnostics layer works. Let me first state a few facts about how the runtime and CLR Diagnostics (libmscordaccore.so) works, then I'll dig into this repro.
DebuggerU2MCatchHandlerFrame
, HelperMethodFrame
, etc.With all of that said, what you are seeing here is that "DateTime.get_UtcNow" actually uses a PInvoke or internal call into libcoreclr.so. If you happen to pause the thread after we have called into native code but before we've erected a clr!Frame, then we will "skip" the managed part of the stack (by design) because we don't know it exists! UtcNow might also be implemented in such a way that we intentionally don't create a clr!Frame object, because the call is supposed to be very quick...so we don't need to make that code interruptible.
So, this all is by design I think. Unless I missed something. If you want to get the full callstack, always, without missing any managed frames then you have a couple of options:
I'm going to close the issue, but feel free to reply back if you have more questions or think I missed something here. Thanks!
I often can't get the managed stack for the code being executed either by attaching a process or snapshotting it. The threads are enumerated correctly by enumerating
ClrRuntime.Threads
. For each one of them I then callEnumerateStackTrace
obtaining the stack frames represented by aClrStackFrame
list. In very few cases I am able to see the real stack trace being executed at the snapshot/attach moment but, on those threads, in most of the cases I can only see a frame calledDebuggerU2MCatchHandlerFrame
which I believed was only a finalizer-related thing.What is this object, and how can I 'decode' to get back the real stack?
I did my tests by running some heavy work in an asp.net core application, printing the managed thread id of the thread executing the cpy code for several tenths of seconds.