microsoft / DTrace-on-Windows

Code for the cross platform, single source, OpenDTrace implementation
Other
473 stars 42 forks source link

dtrace for windows hangs forever after traced program finishes #4

Closed nico-abram closed 3 years ago

nico-abram commented 3 years ago

This is a duplicate of this feedback hub issue

I ran into this problem when trying to run cargo-flamegraph, this is the corresponding issue on it's repo with some context.

The following command showcases the problem:

dtrace "-x" "ustackframes=100" "-n" 'profile-997 /pid == $target/ { @[ustack(100)] = count(); }' "-o" "cargo-flamegraph.stacks" "-c" "bcdedit.exe"

Note: this command is meant for powershell Command for cmd.exe:

dtrace.exe -x ustackframes=100 -n "profile-997 /pid == $target/ { @[ustack(100)] = count(); }" -o cargo-flamegraph.stacks -c bcdedit.exe

version: Published 1/23/2020 on https://www.microsoft.com/en-us/download/details.aspx?id=100441 . dtrace -V prints dtrace: Sun D 1.13.1 windows version: 20H2 19042.844

nico-abram commented 3 years ago

The code in the repository seems to be outdated, since the last published version is from 1/23/2020 and the last commit from 30/08/2019. Despite that, I tried to locally compile the code in this repo and reproduce the issue. At least on my machine, I can reproduce it.

I know pretty much nothing about both dtrace the tool and the dtrace codebase, so I'm likely making mistakes here, but I can try to debug it.

It's stuck in this loop forever https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/cmd/dtrace.c#L2126-L2166 since dtrace_work never returns DTRACE_WORKSTATUS_DONE (Always DTRACE_WORKSTATUS_OKAY)

dtrace_work calls dtrace_status to get its return value. From what I've seen, it's always returning DTRACE_STATUS_OKAY from this if: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/lib/libdtrace/common/dt_work.c#L168-L169

These status values are defined here: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/lib/libdtrace/common/dtrace.h#L217-L221

I think the one that should be returning there is the EXITED case, and the only return for that in that function is this one: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/lib/libdtrace/common/dt_work.c#L161-L166

So my guess is .dtst_exiting is not being set correctly somewhere else.

It seems that's the only place in /lib/ or /cmd/ that dtst_exiting is used. Assuming /sys/ is kernel code, I'm not sure there's much I can do to look into this further

Note: I had trouble getting debug symbols to work for functions in the dll project. Looking at the VS debugging modules window showed that it was actually loading the System dtrace.dll . I manually copied the lib/dtrace.dll file over to the /cmd/ build output folder and only then did symbols start working (Since it was loading the correct dll)

nico-abram commented 3 years ago

From some simple grepping, it looks like these are the only 4 places that set dtst_exiting: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/uts/common/dtrace/dtrace.c#L18040-L18041 https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/dev/dtrace/dtrace_ioctl.c#L792-L793 https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/compat/win32/dtrace.c#L19894-L19895 https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/compat/win32/dtrace.c#L18662-L18663

All four seem to rely on dts_activity being DTRACE_ACTIVITY_DRAINING

I only see dts_acitivty being set to that value in two places: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/uts/common/dtrace/dtrace.c#L15222 https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/compat/win32/dtrace.c#L15788 Both of these are in functions called dtrace_state_stop. At a glance, I would guess that function is either not being called when it should, or this early return that both functions have is being triggered https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/compat/win32/dtrace.c#L15779-L15781

nico-abram commented 3 years ago

That dtrace_state_stop function seems to only be called here: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/sys/compat/win32/dtrace.c#L19942 (There is another call here but it's in code that does not get compiled for windows because of #ifdefs)

That's in dtrace_ioctl in what I think is the dtrace kernel driver (In /sys/), and within a DTRACEIOC_STOP case in a switch

The only place where I see DTRACEIOC_STOP used (Outside of cases and #defines) is here, in dtrace_stop in the dtrace command line application code: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/lib/libdtrace/common/dt_work.c#L251-L252

That dtrace_stop function seems to be called in 2 places: The loop we're stuck in mentioned in the earlier comment and in dtrace_status

Looking at the if in the loop in dtrace.c ((done || g_intr || (g_psc != 0 && g_pslive == 0))), g_intr or (g_psc != 0 && g_pslive == 0) seem interesting. Looking around, I think g_psc is the number of processes that dtrace was tracing initially, and g_pslive the number of those that are still alive (So basically, if we were tracing specific processes such that g_psc != 0, and they are all dead i.e g_pslive == 0 then we stop tracing). Debugging dtrace.exe, it looks like even after the target process exits g_pslive remains 1, and g_psc is correctly 1. I think that might be the bug, g_pslive is not being decremented. It's strange because dtrace: pid 23200 has exited gets printed to the console, so some part of dtrace clearly knows the process exited

Grepping for that message, I get to this line: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/cmd/dtrace.c#L876

I don't see g_pslive-- anywhere in the windows code for the proc handler, but I do see it in other ifdefs. So I think that might be the bug (Or at least related to it)

nico-abram commented 3 years ago

Replacing these lines: https://github.com/microsoft/DTrace-on-Windows/blob/b51963db66f1112e9876fd0417aa53f1ccaea9a6/cmd/dtrace.c#L872-L877

With these:


    if (GetExitCodeProcess(h, &status) && (0 != status)) {
        notice("pid %d exited with status %d\n",
               pid, status);
    } else {
        notice("pid %d has exited\n", pid);
    }
    g_pslive--;

Seems to fix the problem, but I haven't tested it much