microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

How to debug complete subsystem hang? #2859

Closed: mqudsi closed this issue 5 years ago

mqudsi commented 6 years ago

I've experienced this a number of times, but pretty sporadically, and I'm not sure how to go about getting information that would help debug this.

Sometimes after performing an action in the WSL environment, I end up with a completely deadlocked lxss where all existing (launched) WSL processes immediately become unresponsive and any new commands (even just bash) hang indefinitely.

I just experienced this on 16299.
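
For anyone trying to confirm they are in the same state, the symptom described above (even a trivial new command never returns) can be checked with a timeout. This is only a rough Python sketch: the `probe` helper and the 10-second limit are illustrative choices, and it assumes `wsl.exe true` normally starts the default distribution and exits immediately.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome as 'ok', 'hung', or 'error'."""
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time; treat it as a hang.
        return "hung"
    except OSError:
        # The command could not be started (for example, not found).
        return "error"
    return "ok" if result.returncode == 0 else "error"


if __name__ == "__main__":
    # Assumption: a healthy WSL install runs `true` and returns at once.
    print(probe(["wsl.exe", "true"]))
```

A "hung" result only says the command did not return in time; it does not tell you where the subsystem is stuck.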

sunilmut commented 6 years ago

@mqudsi - You will need WinDbg/kd for that. Once the system hangs, you will have to break into the debugger and check whether there is a deadlock between lxcore processes (try the !stacks debugger extension).
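
What one looks for in those stacks is a cycle of waits: one thread blocked on something a second thread holds, while the second is blocked on something the first holds. A minimal Python sketch of that check, assuming the wait relationships have already been read by hand out of the debugger output (the thread labels and the `find_deadlock` helper are hypothetical, not part of WinDbg or lxcore):

```python
def find_deadlock(waits_for):
    """Return the threads that form a wait cycle, or None if there is none.

    waits_for maps each blocked thread to the thread that owns the
    resource it is waiting on. The labels are made up for illustration.
    """
    for start in waits_for:
        path = []
        seen = set()
        node = start
        # Follow the chain of waits until it ends or repeats a thread.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain came back to a thread already visited: a cycle.
            return path[path.index(node):]
    return None


if __name__ == "__main__":
    # Hypothetical example: bash waits on init, init waits on bash.
    print(find_deadlock({"bash": "init", "init": "bash"}))
```

A chain that ends at a thread which is not waiting on anything is just slow progress; only a chain that loops back on itself is a deadlock.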

poizan42 commented 6 years ago

@sunilmut What are the "lxcore processes"? Right now on build 17074 I can see the following: The launcher process (wsl.exe), optionally distribution launcher (e.g. ubuntu.exe), the wsl host wslhost.exe and the LxssManager service running inside one of the svchost.exe instances. And then of course the wsl processes themselves, but you have blocked those from being opened with anything but PROCESS_QUERY_LIMITED_INFORMATION :( (what's the point in that anyways?)

benhillis commented 6 years ago

@poizan42 - He means ELF processes (/bin/bash, etc).

poizan42 commented 6 years ago

@benhillis, but you just get access denied if you try to open those in WinDbg. I actually tried using Process Hacker to launch WinDbg with System integrity and all 31 privileges activated, and it is still blocked, so seemingly they can't be opened from user mode at all for debugging.

benhillis commented 6 years ago

@poizan42 - If there's a deadlock, it's going to be in kernel mode, not user mode, so attaching a user mode debugger isn't going to be useful. The easiest way for us to debug subsystem hangs is by looking at a memory dump. It's likely this is #2849, for which we have a fix inbound.

mqudsi commented 6 years ago

FWIW, it's unlikely this was the same issue, as it was the shutdown sequence for neovim that caused the hang in this particular case, which shouldn't have referenced anything outside the WSL environment.

benhillis commented 6 years ago

@mqudsi - In that case if you could collect a memory dump and forward it along to secure@microsoft.com it would be greatly appreciated.

poizan42 commented 6 years ago

@benhillis That makes sense, but since you need to have kernel mode debugging enabled already, it won't help much if you are randomly encountering a hang, unless you can reproduce it or happen to be running with kernel mode debugging enabled, which didn't sound like it was the situation for @mqudsi.

Actually, Process Hacker can use its kernel driver to show the kernel mode stack, which might be the best thing available in this case.

mkarpoff commented 6 years ago

@mqudsi Did you find a fix for your freezing problem? I've started getting it today on build 17074.

sunilmut commented 6 years ago

@poizan42 - Yes, that's mostly correct. If you are encountering a hang in launching bash and it feels like a deadlock, then there are two options:

  1. Generate a full memory dump manually by following the steps here and send the dump over to secure@microsoft.com. Make sure that the dump type is set to full memory dump. A minidump will not be of much use here.
  2. If you are feeling adventurous and happen to have a Windows kernel debugger hooked up to the system, then you can break into the debugger and go from there. If there are any tools out there that give you the kernel mode stack for all the processes running on the system, then, yes, you can use those as well.
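
Before relying on option 1, it may be worth checking which dump type the machine is currently configured to write. The following is a hedged Python sketch, not an official tool. It assumes the setting lives in the `CrashDumpEnabled` value under the `CrashControl` registry key, that the numeric meanings listed are the documented ones, and that "full memory dump" above corresponds to what Windows calls a complete memory dump. The helper names are illustrative.

```python
import sys

CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# Assumed meanings of the CrashDumpEnabled registry value.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}


def describe_dump_setting(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, "unknown ({})".format(value))


def is_full_dump(value):
    """True only for the complete memory dump setting."""
    return value == 1


def read_dump_setting():
    """Read CrashDumpEnabled from the registry (Windows only)."""
    import winreg  # imported here so the helpers above work anywhere

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__":
    if sys.platform == "win32":
        current = read_dump_setting()
        print(describe_dump_setting(current))
        if not is_full_dump(current):
            print("Not a complete dump; change the setting before capturing.")
    else:
        print("This check only applies on Windows.")
```

The script only reads the setting. Changing it is still done through the system's startup and recovery options or the linked steps.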

fpqc commented 6 years ago

@sunilmut Do you guys at MS run with live kernel debuggers, or do you generally generate and then debug crashes?

If you do run with live kernel debugging machines attached, I'm wondering if you guys literally frankenstein together two PCs or if you have special hardware.

mqudsi commented 6 years ago

@fpqc It used to be so hard, but these days a bidirectional USB 3.0 A-to-A cable is all it takes.

fpqc commented 6 years ago

@mqudsi Neat! Is local kd suitable for debugging something like WSL? If not, what about offline debugging with the LiveKD tool? It looks like these features were added recently: https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/performing-local-kernel-debugging

mqudsi commented 6 years ago

They're actually old features, but the article was recently updated. So long as your PC hasn't crashed (BSoD/GSoD) or totally hung, local kd is fine.

benhillis commented 6 years ago

A typical setup is a dev box and a test machine or test VM with a kernel debugger attached. Personally I use one physical machine and a couple VMs with different memory and virtual processor counts.