goldshtn / msos

Command-line environment a-la WinDbg for executing SOS commands without having SOS available.

Why exception after !Clrstack -a? #61

Closed stej closed 7 years ago

stej commented 7 years ago

What is the reason that !Clrstack -a ends with this exception?

90> !clrstack -a
Exception during command execution -- Win32Exception: 'Unknown error (0x81250006)'

   at msos.HR.Verify(Int32 hr) in d:\temp\code\msos\msos\HR.cs:line 29
   at msos.ClrThreadExtensions.FrameArgumentsAndLocalsRetriever.ProcessStackWalk(UInt32 osThreadId) in d:\temp\code\msos\msos\ClrThreadExtensions.cs:line 211
   at msos.ClrThreadExtensions.FrameArgumentsAndLocalsRetriever..ctor(ClrThread thread, IList`1 stackTrace, CommandExecutionContext context) in d:\temp\code\msos\msos\ClrThreadExtensions.cs:line 200
   at msos.ClrThreadExtensions.WriteStackTraceToContext(ClrThread thread, IList`1 stackTrace, CommandExecutionContext context, Boolean displayArgum
.........

I debugged the code and the exception is thrown at this line:

HR.Verify(task.CreateStackWalk(0xf /*all flags*/, out tmp));

It's because task.CreateStackWalk returns a negative number.
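
For reference, HR.Verify just converts a failing HRESULT into the Win32Exception shown above; a minimal sketch of such a check (not necessarily identical to the actual code in msos/HR.cs) would be:

using System.ComponentModel;  // Win32Exception

static class HR
{
    // A failing HRESULT has the high bit set, so it is negative when viewed as Int32.
    // Wrapping it in Win32Exception is what produces the
    // "Unknown error (0x81250006)" message seen above.
    public static void Verify(int hr)
    {
        if (hr < 0)
            throw new Win32Exception(hr);
    }
}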

Besides that, !Clrstack shows only the table header (SP IP Function), and !mk shows what is probably the correct native stack, but the managed frames are treated as unmanaged.

goldshtn commented 7 years ago

Can you share the dump file that causes this issue? This looks like a DAC issue (that's where the CreateStackWalk method is implemented), and I've seen issues like this in older DACs.

stej commented 7 years ago

That dump is over 3 GB and also contains customer data, so sharing it is not possible right now. It's a dump from an Azure VM. However, I didn't have any problem when using SOS and netext on this dump, so this might be a problem in CLRMD itself?

goldshtn commented 7 years ago

Yeah, but the CreateStackWalk call is into the DAC, which is the same thing SOS uses. Also, netext uses CLRMD internally.
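
One way to check whether CLRMD's own stack enumeration works on your dump, independent of msos, is a small standalone program along these lines (hypothetical dump path; ClrMD API names as of the 1.x releases, so they may differ slightly in other versions):

using System;
using Microsoft.Diagnostics.Runtime;

class CheckStacks
{
    static void Main()
    {
        // Hypothetical dump path -- substitute the real file.
        using (var target = DataTarget.LoadCrashDump(@"d:\temp\some_dump.dmp"))
        {
            ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();
            foreach (ClrThread thread in runtime.Threads)
            {
                Console.WriteLine("Thread {0:x}:", thread.OSThreadId);
                foreach (ClrStackFrame frame in thread.StackTrace)
                    Console.WriteLine("  {0:x16} {1:x16} {2}",
                        frame.StackPointer, frame.InstructionPointer, frame.DisplayString);
            }
        }
    }
}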

Can you share some information on the environment? CLR version, bitness, anything else really?

stej commented 7 years ago

I tried to make a sample app and take a dump of it on the server. Everything is OK and I can run !ClrStack -a on that dump.

So at least I ran the report on the failing dump (with some custom modules removed): msosreport-github.txt

It's an Azure VM, CLR version 4.0.30319.34209. Output from msos:

Opened dump file 'ConsoleApp1.exe_161221_121530.dmp', architecture Amd64, 1 CLR versions detected.
#0 Flavor: Desktop, Version: v4.0.30319.34209
Symbol path: srv*c:\symbols*https://msdl.microsoft.com/download/symbols
Using Data Access DLL at: c:\symbols\mscordacwks_amd64_amd64_4.0.30319.34209.dll\5348a1ef9a0000\mscordacwks_amd64_amd64_4.0.30319.34209.dll

stej commented 7 years ago

I see the report failed, and the reason is again related to threads. But this time it's something different.

Code (report.cs):

var threadObjects = from obj in heap.EnumerateObjectAddresses()
                    let type = heap.GetObjectType(obj)
                    where type != null && type.Name == "System.Threading.Thread"
                    select obj;
var threadType = heap.GetTypeByName("System.Threading.Thread");
var nameField = threadType.GetFieldByName("m_Name");
var managedIdField = threadType.GetFieldByName("m_ManagedThreadId");

foreach (var threadObject in threadObjects)
{
    string name = (string)nameField.GetValue(threadObject);
    int id = (int)managedIdField.GetValue(threadObject);
    result.Add(id, name);
}

I put an action breakpoint on the result.Add(id, name) and when the exception occurred, there was this info in the output:

threadObject 1085618146504, name null, id 40
threadObject 1085618147344, name "threadObject Save Events", id 41
threadObject 1085618188576, name null, id 42
threadObject 1085623820104, name null, id 43
threadObject 1085628350968, name null, id 35
threadObject 1085639611512, name null, id 66
threadObject 1085641989688, name null, id 53
threadObject 1085641989904, name null, id 20
threadObject 1085642318952, name null, id 72
threadObject 1085642319112, name null, id 76
threadObject 1085645888432, name null, id 51
threadObject 1085723892912, name null, id 25
threadObject 1085726880064, name null, id 38
threadObject 1085759656376, name null, id 88
threadObject 1085771456232, name null, id 87
threadObject 1085772660152, name null, id 60
threadObject 1085775961888, name null, id 20

So the managed thread id 20 appears twice, for two different thread objects 1085641989904 and 1085775961888. Is that somehow related to the !ClrStack problem? I would say no.

I'm sorry I can't give you the dump because there is some customer data inside.

goldshtn commented 7 years ago

The issue above is a genuine bug that needs to be fixed. It's a bit surprising, but it looks like we can have two Thread objects on the heap with the same managed thread id. I would be surprised if both were running at the same time; what probably happened is that the first one died and then the id got reused. Can you please open a separate issue for this bug?
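
As a sketch of a defensive fix (assuming result is a Dictionary<int, string>, as the Add(id, name) call suggests), the loop could keep the first Thread object seen for each id instead of letting the duplicate id throw:

foreach (var threadObject in threadObjects)
{
    string name = (string)nameField.GetValue(threadObject);
    int id = (int)managedIdField.GetValue(threadObject);

    // A Thread object can stay on the heap after its thread has died, and the
    // managed id can then be reused, so the same id may show up more than once.
    if (!result.ContainsKey(id))
        result.Add(id, name);
}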

goldshtn commented 7 years ago

I'm closing this and have opened #64, which tracks the specific bug with repeated thread ids.

stej commented 7 years ago

Some more info on the original question, although I guess it's still not enough for you :/ I enabled managed & native debugging and ran !ClrStack -a again.

[Screenshot: 2017-03-23_11h45_50]

So it's true that the problem is with the DAC. I'm still in a situation where I can't share the dump, as it contains sensitive data.

goldshtn commented 7 years ago

Can you show the contents of that HRException? Specifically, what the HRESULT was?

stej commented 7 years ago

I'm trying to get more details about the HRException, but I haven't found more than what's displayed. When I click on "Copy Details", I get only "Exception thrown at 0x00007FFC9CC87788 in msos.exe: Microsoft C++ exception: HRException at memory location 0x000000E9910FDF58. occurred".

The HResult that is passed to HR.Verify is -2128281594.
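
For reference, that's the same bit pattern as the HRESULT shown in the error message, just viewed as a signed Int32:

int hr = unchecked((int)0x81250006);
Console.WriteLine(hr);             // -2128281594
Console.WriteLine("0x{0:X8}", hr); // 0x81250006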

stej commented 7 years ago

Would it still be helpful to make a process dump from that machine that doesn't contain user data? I already tried some time ago, but wasn't able to reproduce the problem with the DAC.

goldshtn commented 7 years ago

It would be helpful to see a dump file where the DAC fails to initialize the stack walk. If you have a reliable repro that you would be willing to share, we should probably report it to Microsoft, even.

stej commented 7 years ago

OK, I got a minimal dump file.

Repro command

msos -z d:\temp\msos\test_dump.dmp -c "!clrstack -a; q"

Output

d:\temp\msos>msos -z d:\temp\msos\test_dump.dmp -c "!clrstack -a; q"
Opened dump file 'd:\temp\msos\test_dump.dmp', architecture Amd64, 1 CLR versions detected.
#0 Flavor: Desktop, Version: v4.6.1637.00
_NT_SYMBOL_PATH is not set, using default
Symbol path: srv*C:\Users\j.stefan\AppData\Local\Temp\Symbols*http://msdl.microsoft.com/download/symbols
Using Data Access DLL at: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll
#> !clrstack -a; q
Exception during command execution -- Win32Exception: 'Unknown error (0x81250006)'

   at msos.HR.Verify(Int32 hr)
   at msos.ClrThreadExtensions.FrameArgumentsAndLocalsRetriever.ProcessStackWalk(UInt32 osThreadId)
   at msos.ClrThreadExtensions.FrameArgumentsAndLocalsRetriever..ctor(ClrThread thread, IList`1 stackTrace, CommandExecutionContext context)
   at msos.ClrThreadExtensions.WriteStackTraceToContext(ClrThread thread, IList`1 stackTrace, CommandExecutionContext context, Boolean displayArgumentsAndLocals)
   at msos.ClrThreadExtensions.WriteCurrentStackTraceToContext(ClrThread thread, CommandExecutionContext context, Boolean displayArgumentsAndLocals)
   at msos.CLRStack.Execute(CommandExecutionContext context)
   at msos.CommandExecutionContext.ExecuteOneCommand(String command)
Proceed at your own risk, or restart the debugging session.

I'll send the dump to goldshtn at gmail.com (found on the GitHub page). Could you perhaps reopen this issue?