microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

BSOD in LXCORE.sys (arising from oh-my-zsh in tmux) #1094

Closed fpqc closed 7 years ago

fpqc commented 8 years ago

I tried messing around with removing and re-adding the scripts in oh-my-zsh and tmux (see #1085) to see exactly which part of the script was causing it to hang.

I found one permutation (removing completion.zsh and compfix.zsh from ~/.oh-my-zsh/lib) that seemed like it was working fine in tmux.

Then, as I was about to save a configuration file in nano, I got a BSOD in Windows.

I can't find the memdump file (it's not in C:\Windows\Minidump or C:\Windows\MEMORY.DMP), but I'd be happy to upload it as soon as you tell me where I can find it.

@russalex I know you would probably want to be notified about BSODs, so I'm pinging you now.

EDIT: Strangely, the system seems to have died so quickly that it could not even create a memdump. My settings all say that a memdump should have been generated, but it is nowhere to be found. Given how much input I was entering (holding down the enter key to stress-test each configuration to see if it would hang, followed by random bits of text entry), I may have accidentally fuzzed some kind of bug in lxcore (maybe related to whatever is making oh-my-zsh die horribly when completion.zsh is not removed?). I can't tell you without a dump, and I don't see a dump on my machine.

russalex commented 8 years ago

@fpqc Thanks for reporting.

So far we have not had much luck reproducing this one. Any way you could email me your repro steps? Please just email InsiderSupport@microsoft.com and put in the subject that it is for russalex.

fpqc commented 8 years ago

@russalex I haven't been able to reproduce it either. I managed to trigger it after messing around multiple times with the "libraries" (zsh scripts) in oh-my-zsh, slowly adding them back until the hang returned. I had just rebooted, started up zsh, and added in one more file (the total is 18, and I had just re-added everything but completion.zsh and compfix.zsh). All seemed to be working fine: I tested holding down the enter key for about 60 seconds and tested the mouse in tmux, scrolling up and down (I was using wsltty, where the magic actually happens in wslbridge). I then went into .zshrc to re-enable the theme (since at one point I had thought the theme was causing the hangs), and when I pressed ctrl+x, enter to save, I got a BSOD.

It hasn't happened since my report. I should have some kind of memdump, right? @bitcrazed said on twitter that it should automatically have been moved to a place from where it will be uploaded. I'm happy to upload it, but since I can't reproduce it myself, I don't think I can help you reproduce it beyond like.. tarring up my lxss and letting you guys have a go at it, lol!

Sorry I can't be of more help, but I tried for like an hour to get it to crash on me again, running through the same steps.

therealkenc commented 8 years ago

Don't know if it is the same BSOD but I caught one. Happened for me a few times now under high disk activity and memory pressure (linking) and apt-get, which might or might not be helpful. Nothing to do with tmux or zsh, natch. Dump file is here.


fpqc commented 8 years ago

@russalex I turned off the easy and obvious telemetry info on Win10 install (I didn't edit the registry or mess with the group policy or try to block it aggressively). Could this have prevented a proper memdump?

I'm happy to help however I can with memdumps if I do run into any wsl-related bsods, so if I may have inadvertently disabled them, I'd like to reenable them to be able to provide them (assuming that they weren't already uploaded automatically or moved to be ready to upload as @bitcrazed said on twitter).

However, I think it might maybe possibly be related to KenC's BSOD. He said he got it during some heavier file i/o, and my crash happened just as I was saving a file with nano (just as I told nano to save changes and overwrite). Unfortunately I don't remember the stop code.

russalex commented 8 years ago

@therealkenc, thanks for the memory dump. I have it now. You can lock it down now.

Quick note for everyone on the board: memory dumps can contain sensitive information. The best way to get those to us is over email. Either email secure@microsoft.com or InsiderSupport@microsoft.com. In either case add my alias (russalex) and the issue number to the subject of the email.

@fpqc Do you remember if the BSOD screen stated that it created a dump? If you disabled error reporting then it's possible that the dump will not be uploaded automatically. As for the dump file location, by default it is under %SystemRoot%\MEMORY.DMP (where you looked) but it can be changed under System Properties -> Startup and Recovery Settings (best instructions I have found online are on Kaspersky's site).
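For anyone who would rather check the Startup and Recovery settings from a script, here is a minimal Python sketch. It assumes the settings are stored under the HKLM\SYSTEM\CurrentControlSet\Control\CrashControl registry key, with the CrashDumpEnabled, DumpFile and MinidumpDir values; that key and the value-to-name mapping are not stated in this thread and reflect my understanding of the documented layout. The helper names are made up for illustration.

```python
import os
import sys

# Assumed mapping of CrashDumpEnabled values to the names shown in the
# Startup and Recovery dialog (not taken from this thread).
DUMP_TYPES = {
    0: "None",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump (minidump)",
    7: "Automatic memory dump",
}


def describe_dump_type(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, f"Unknown ({value})")


def read_crash_control():
    """Read the crash dump settings from the registry (Windows only)."""
    import winreg  # imported lazily so the module still loads elsewhere

    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        enabled, _ = winreg.QueryValueEx(key, "CrashDumpEnabled")
        dump_file, _ = winreg.QueryValueEx(key, "DumpFile")
        minidump_dir, _ = winreg.QueryValueEx(key, "MinidumpDir")
    return {
        "type": describe_dump_type(enabled),
        # The stored paths usually contain %SystemRoot%, so expand them.
        "dump_file": os.path.expandvars(dump_file),
        "minidump_dir": os.path.expandvars(minidump_dir),
    }


if __name__ == "__main__" and sys.platform == "win32":
    for name, value in read_crash_control().items():
        print(f"{name}: {value}")
```

If the reported type is "None", no dump would be written at all, which is one thing worth ruling out before hunting for the file.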

fpqc commented 8 years ago

@russalex I didn't see the dumpfile there, but when I checked that setting, it was set to save it there but the entry was set to "automatic memory dump". Do you want "complete memdumps" or do you only need a kernel memdump, or what?

Also, since it was set to automatic instead of complete, or kernel, is it being moved elsewhere or was it just not dumped for some reason?

Also, iirc, isn't there an (undocumented) option to dump the memory so it encrypts with the MS Kernel team's public key and the system's private key? I remember hearing/reading that those kinds of encrypted dumps could be created for crashing trustlets in certain cases.

fpqc commented 8 years ago

btw @Russalex. Mr. Ionescu is tweeting all about /usr/bin/winrun and the buildout of the COM interface with createprocess support, as well as setting winrun as the associated loader for portable executables in binfmt_misc.

I tried to get it to work, but idk, I'm not a reverse engineer, and it bugs out trying to read /dev/lxss with access denied (Mr. Ionescu mentioned that winrun is currently just a symlink to init). Just wondering what's the state of play on that…

russalex commented 8 years ago

Changing to generate a complete dump will make the dump files huge (if you have 8GB of memory it will generate an 8GB file). Windows will attempt to upload the file if you're set up that way and you fall within our random user sample. Generally, we don't need that level of information so I would stick with Automatic unless someone on the team asks you to swap to a full dump when reproducing a bug. As for the keys, I'll ask around. From the one conversation I've had so far the key you are referring to is for trustlets only, but I'll keep pinging.

Now, onto winrun: official word that there is nothing to see here. That said, given interoperability is our number one ask on User Voice it should come as no surprise that we're working in that area. We should, hopefully, announce something soon, but it will look a little different than what you're seeing now.

fpqc commented 8 years ago

Waiting on the edge of my seat! A little disappointed in the dearth of blogposts, since the first four were so good.

A good topic I want to know about is how you are handling console I/O, that is, mainly ttys, once you make the first context switch from a Linux instance into the tty driver.

In real Linux, you'd basically have the tty listening on the keyboard and displaying through vga, when the kernel makes that decision.

However NT to my knowledge does not have exactly this same low-level kernel-mode notion of a tty. NT has console-type windows applications that run in conhost.

We also know that instances of bash.exe are attached to ttys by opening, I think, some kind of rpc session with the LXSSManager service, which is somehow making raw-mode tty devices available over COM.

Anyway, that's the blog post I would want to read.

russalex commented 8 years ago

The blogs will be starting back up soon. Seth is showing up tomorrow to film two more. Still need to write them up so it'll take a little more time, but they are relatively close.

Good idea on the tty blog request. I'll start that ball rolling.

fpqc commented 8 years ago

@russalex Given that you are moving ahead with some kind of LX->Win32 interop unofficially/whatever, I think it might be wise to consider the fact that the first thing idiots like me are going to do is try to run cmd.exe inside tmux, and if that works the way I hope, probably see how many times I can go back and forth launching bash.exe inside cmd.exe inside bash.exe inside cmd.exe ad infinitum, so if you guys are still thinking about how you are going to do it, that seems like a good place to look for things potentially dying horribly. (Xilun's cbwin had a whole lot of trouble with this (at least in the earlier builds, don't know if he figured out a way to fix it)). Something to look out for will probably be if running

bash.exe

winrun cmd

then running

bash.exe

creates a new tty device or a new instance of conhost.

The obvious corner case here I think is where you start the sequence of winrun cmd.exe and bash.exe inside of a pty.

Just something to probably look at. The reason I bring up the corner case of a "base console" window being a pty is that it would break the one obvious idea I had, which would be creating one conhost window per Linux-side tty, then forcing the Windows-side console application's stdout to be directed into the conhost window associated with the tty in which it was winrun, and likewise forcing the bash.exe to print to the tty associated with the console where it was launched (assuming it was already a console attached to a bash.exe process).

But then again, maybe my idea for how you might try to do it doesn't make sense, since I don't know how the ttys in WSL really work!

hdave commented 7 years ago

+1 for this. I am getting a BSOD almost every day. FWIW I am typically running my base under administrative privileges in order to allow vagrant to do its thing. I do use tmux, but I have not observed a common thread for these crashes other than that the BSOD reports lxcore.sys.

russalex commented 7 years ago

@hdave, any chance you could send us a memory dump of your crash? From the contributing instructions:

Do not open Github issues for Windows crashes (BSODs) or security issues. Please direct all Windows crashes and security issues to secure@microsoft.com. Issues with security vulnerabilities may be edited to hide the vulnerability details.

You will find memory dumps at %WINDIR%\MEMORY.DMP or in %SystemRoot%\Minidump

Please include your Windows build number.
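A rough Python sketch for checking the two locations mentioned above. The function name and the fallback to C:\Windows are my own choices for illustration; if the dump path was changed in Startup and Recovery, the files will be somewhere else.

```python
import os
from pathlib import Path


def find_dumps(system_root):
    """Return dump files under the default locations of a Windows directory."""
    root = Path(system_root)
    found = []

    # Full, kernel or automatic dump: a single file directly under the root.
    full_dump = root / "MEMORY.DMP"
    if full_dump.is_file():
        found.append(full_dump)

    # Small dumps: one .dmp file per crash inside the Minidump folder.
    minidump_dir = root / "Minidump"
    if minidump_dir.is_dir():
        found.extend(sorted(minidump_dir.glob("*.dmp")))

    return found


if __name__ == "__main__":
    # %SystemRoot% and %WINDIR% normally point at the same directory.
    root = os.environ.get("SystemRoot", r"C:\Windows")
    for path in find_dumps(root):
        print(path, path.stat().st_size, "bytes")
```

Reading these folders may need an elevated prompt, and an empty result only means nothing is in the default locations.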

saxonww commented 7 years ago

I've had this happen twice in the past two days, just editing files in vim under tmux. I am using mintty/wsltty instead of the regular console.

I just sent something to secure@microsoft.com but managed not to put your name @russalex, sorry.

mqudsi commented 7 years ago

@russalex I just sent an email to secure@microsoft.com with an lxcore.sys BSOD (fish shell + git, under cmd) but it was a rather large attachment and I'm unsure if it reached. Am I supposed to receive some sort of automated confirmation?

fpqc commented 7 years ago

@mqudsi Question: Did this happen on 14986? One of the fixes listed was the resolution of a pty-related bugcheck (this Github issue).

saxonww commented 7 years ago

@mqudsi I got an email receipt for my submission several hours after I sent it.

russalex commented 7 years ago

@mqudsi Thanks for sending in the memory dump. Not certain how the security team sends out their email receipts for these. I suspect it will take some time. Let me know if you do not hear anything by Monday.

mqudsi commented 7 years ago

@fpqc 14393.479, actually. It's supposed to be a stable one ;)

@russalex @saxonww thanks. Any idea if there's a limit on the attachment size? It was a 2.2GiB dump RAR-compressed to around 500MiB.

fpqc commented 7 years ago

@mqudsi 14393's WSL is actually wayyy more unstable than the insider one. It's basically a snapshot of the project as of last summer, at a basically random time. To my knowledge, no extraordinary attempts were made to try to fix bugs in WSL for 14393.

pgrm commented 7 years ago

Hey, after the last update - Cumulative Update for Windows 10 Version Next for x64-based Systems (14986) (KB3206309) - my windows crashed over the weekend several times. Sometimes immediately after launching bash. The latest insider preview update I have is: Windows 10 Insider Preview 14986 (rs_prerelease).

I also have oh-my-zsh, but I don't recall experiencing such terrible issues before. Where can I send my memory dump? - If you're interested of course.

fpqc commented 7 years ago

@pgrm Did you catch what the message on the bugcheck was? There is a known bug in the console code atm, but that is not this. It's in the Windows console itself, not WSL as such.

Also, if you email your minidump file to secure@microsoft.com attn: Ben Hillis or Russ Alexander the team will look at it.

pgrm commented 7 years ago

@fpqc the stop code was the same - System Service Exception, and next to "What failed" it said LXCORE.SYS. I couldn't see anything else.

This is what's written in the Event Logs:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>...</System>
  <EventData>
    <Data Name="BugcheckCode">59</Data> 
    <Data Name="BugcheckParameter1">0xc0000005</Data> 
    <Data Name="BugcheckParameter2">0xfffff80c5e9b785b</Data> 
    <Data Name="BugcheckParameter3">0xffffaf8030a3be20</Data> 
    <Data Name="BugcheckParameter4">0x0</Data> 
    <Data Name="SleepInProgress">0</Data> 
    <Data Name="PowerButtonTimestamp">0</Data> 
    <Data Name="BootAppStatus">0</Data> 
    <Data Name="Checkpoint">0</Data> 
    <Data Name="ConnectedStandbyInProgress">false</Data> 
    <Data Name="SystemSleepTransitionsToOn">3</Data> 
    <Data Name="CsEntryScenarioInstanceId">0</Data> 
  </EventData>
</Event>
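For reference, the BugcheckCode in an event like this is stored in decimal: 59 is 0x3B, which is the SYSTEM_SERVICE_EXCEPTION stop code, and a first parameter of 0xc0000005 is STATUS_ACCESS_VIOLATION. A small Python sketch that pulls those fields out of the event XML is below. The lookup tables only cover the two values seen here, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event log XML, as shown in the event itself.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Deliberately tiny tables: only the values that appear in this report.
BUGCHECK_NAMES = {0x3B: "SYSTEM_SERVICE_EXCEPTION"}
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def decode_bugcheck(event_xml):
    """Extract and name the bugcheck code and first parameter of an event."""
    root = ET.fromstring(event_xml)
    data = {
        item.get("Name"): (item.text or "").strip()
        for item in root.findall("./e:EventData/e:Data", NS)
    }
    # Base 0 accepts both plain decimal ("59") and hex ("0xc0000005").
    code = int(data["BugcheckCode"], 0)
    param1 = int(data["BugcheckParameter1"], 0)
    return {
        "code_hex": f"0x{code:02X}",
        "code_name": BUGCHECK_NAMES.get(code, "unknown"),
        "param1_name": NTSTATUS_NAMES.get(param1, "unknown"),
    }


SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="BugcheckCode">59</Data>
    <Data Name="BugcheckParameter1">0xc0000005</Data>
  </EventData>
</Event>"""

if __name__ == "__main__":
    print(decode_bugcheck(SAMPLE))
```

This only confirms that the logged code matches the stop code on the blue screen; it says nothing about the faulting driver, which still needs the dump file.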

fpqc commented 7 years ago

If you can find a file in C:\Windows\minidump, please send it to MS at that email addr and attn line.

gurpreetatwal commented 7 years ago

@pgrm I'm having similar issues on the same build, I'm running stterm with zsh, oh-my-zsh and tmux

My system seems to crash whenever my stterm window is redrawn very quickly, either by zooming or by using tmux's "zoom feature"

Just sent a couple of minidumps to secure@microsoft.com

benhillis commented 7 years ago

@gurpreetatwal - I received your dumps, thank you for sending those. I verified this was a known issue that is fixed in build 15025 and later.

gurpreetatwal commented 7 years ago

@benhillis awesome! thanks for the quick reply :D

If you don't mind me asking, what was the cause of the issue?

benhillis commented 7 years ago

No problem. At a high level we were missing some synchronization around signals being sent and threads terminating.
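To illustrate the general class of bug (this is a toy model in Python, not lxcore code, and every name in it is hypothetical): if a signal sender can check that a thread is alive and then queue a signal while the thread is being torn down in between, the sender ends up touching state that is already gone. Doing the check and the queueing under the same lock that termination takes closes that window.

```python
import threading


class ToyThread:
    """Hypothetical stand-in for a thread object that can receive signals."""

    def __init__(self):
        self._lock = threading.Lock()
        self._exited = False
        self.pending = []

    def send_signal(self, signo):
        """Queue a signal unless the thread has already terminated."""
        # The liveness check and the append share one critical section,
        # so terminate() cannot run between them.
        with self._lock:
            if self._exited:
                return False
            self.pending.append(signo)
            return True

    def terminate(self):
        """Mark the thread as gone and drop anything still queued."""
        with self._lock:
            self._exited = True
            self.pending.clear()


if __name__ == "__main__":
    thread = ToyThread()
    print(thread.send_signal(15))  # accepted while the thread is alive
    thread.terminate()
    print(thread.send_signal(9))   # rejected after termination
```

Without the lock, a sender could pass the liveness check just before terminate() runs and then append to a queue nobody will ever drain; in a kernel the equivalent mistake can mean writing to freed memory.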

bitcrazed commented 7 years ago

Thanks for reporting. Closing for now since this is known and resolved.