microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

Problems generating corefiles with WSL2 #11997

Open paul-haskell opened 2 weeks ago

paul-haskell commented 2 weeks ago

Discussed in https://github.com/microsoft/WSL/discussions/11992

Originally posted by **paul-haskell** September 3, 2024

Hi all -- I am trying to generate a corefile under WSL2.

  1. I disabled apport (sudo service apport stop)
  2. I set the kernel.core_pattern appropriately (sudo sysctl kernel.core_pattern=core.%e.%p)
  3. I set the corefilesize limit to 'unlimited'
  4. I verified the current directory is writable by all
  5. I ran a few simple programs that should throw a core (abort(), integer divide-by-0)

...but I never get a core. I do reliably get a corefile on a native Ubuntu machine. Does anyone have any ideas to try? Thanks!

(I am running WSL version 2.2.4.0 with default Ubuntu i.e. 22.04.3 LTS. I am running Windows version 10.0.22631.4037.)

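For anyone repeating steps 2 to 4, here is a minimal sketch that prints the three settings in one go. The function name `core_preflight` and its optional file argument are made up for illustration; the argument exists only so the function can be tried without reading /proc.

```shell
# Print the settings that decide whether the kernel writes a core file.
# core_preflight and its optional argument are illustrative names, not
# part of WSL or of any standard tool.
core_preflight() {
    pattern_file=${1:-/proc/sys/kernel/core_pattern}

    if [ -r "$pattern_file" ]; then
        echo "core_pattern: $(cat "$pattern_file")"
    else
        echo "core_pattern: unknown (cannot read $pattern_file)"
    fi

    # Soft limit on core size for this shell; 0 means no core is written.
    echo "core size limit: $(ulimit -c)"

    # A relative core_pattern is resolved against the crashing process's cwd.
    if [ -w . ]; then
        echo "cwd writable: yes"
    else
        echo "cwd writable: no"
    fi
}
```

Run `core_preflight` in the same shell, and in the same directory, that you will launch the crashing program from, since the limit and the working directory are inherited per process.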
github-actions[bot] commented 2 weeks ago

Logs are required for review from WSL team

If this is a feature request, please reply with '/feature'. If this is a question, reply with '/question'. Otherwise please attach logs by following the instructions below; your issue will not be reviewed unless they are added. These logs will help us understand what is going on on your machine.

How to collect WSL logs

Download and execute [collect-wsl-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-wsl-logs.ps1) in an **administrative powershell prompt**:

```
Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1
```

The script will output the path of the log file once done.

If this is a networking issue, please use [collect-networking-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-networking-logs.ps1), following the instructions [here](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues)

Once completed please upload the output files to this Github issue.

[Click here for more info on logging](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#8-collect-wsl-logs-recommended-method)

If you choose to email these logs instead of attaching to the bug, please send them to wsl-gh-logs@microsoft.com with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

View similar issues

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

paul-haskell commented 2 weeks ago

WslLogs-2024-09-05_14-19-48.zip

paul-haskell commented 2 weeks ago

I already tried the fixes in #1754.

github-actions[bot] commented 2 weeks ago

Diagnostic information

```
Detected appx version: 2.2.4.0
```

zcobol commented 2 weeks ago

@paul-haskell is systemd-coredump installed? Your coredumps might be in the journal. Is there any output when you run coredumpctl list?

Also, if gdb is attached to a process, running generate-core-file does create a core dump, for example for process 316 here:

zcobol@toto:~$ file core.316
core.316: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '-bash', real uid: 1002, effective uid: 1002, real gid: 1002, effective gid: 1002, execfn: '/bin/bash', platform: 'x86_64'

The kernel.core_pattern was not modified. This is the default:

zcobol@toto:~$ sysctl kernel.core_pattern
kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h
paul-haskell commented 2 weeks ago

Hi there,

systemd-coredump is not installed. When I try to run 'coredumpctl' I get the message:

Command 'coredumpctl' not found, but can be installed with:

sudo apt install systemd-coredump

paul-haskell commented 1 week ago

When I run "sysctl -a | grep core_pattern" on my WSL instance, I get: /mnt/wslg/dumps/core.%e

/mnt/wslg/dumps is empty, even after I run my core-making program. The directory's file permissions are drwxrwxrwx

elsaco commented 1 week ago

In wsl-2.3.17 the value of core_pattern is different:

elsaco@eleven:~/test$ sysctl kernel.core_pattern
kernel.core_pattern = |/wsl-capture-crash %t %E %p %s

Running strings on /init shows:

elsaco@eleven:~/test$ strings /init | grep crash
<3>WSL (%d) ERROR: %s:%u: Received error while trying to capture crash dump: %u
<6>WSL (%d): Capturing crash for pid: %s, executable: %s, signal: %s, port: %u
<3>WSL (%d) ERROR: %s:%u: Error while trying read crash dump from stdin, %u
/wsl-capture-crash
wsl-capture-crash
|/wsl-capture-crash %t %E %p %s
crash-dump

so it looks like it is hardcoded into WSL's own init.

With a simple divide-by-zero test, it does produce a crash dump:
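The three core_pattern values seen in this thread behave differently, which is easy to miss. Per core(5), a leading `|` means the kernel pipes the dump to a helper program's stdin instead of writing a file; otherwise the value is a file name template, relative to the crashing process's working directory unless it starts with `/`. A small sketch of that distinction (the function name `classify_core_pattern` is just for illustration):

```shell
# Classify a core_pattern string along the lines core(5) describes.
# classify_core_pattern is an illustrative helper, not an existing tool.
classify_core_pattern() {
    case $1 in
        '')    echo "empty" ;;
        '|'*)  echo "pipe" ;;      # dump is piped to a helper program's stdin
        /*)    echo "absolute" ;;  # file name template with an absolute path
        *)     echo "relative" ;;  # template resolved in the crashing process's cwd
    esac
}
```

For example, `classify_core_pattern "$(cat /proc/sys/kernel/core_pattern)"` reports `pipe` for the `|/wsl-capture-crash ...` value above, so no core file should be expected in the current directory in that case.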

elsaco@eleven:~/test$ ./zero
Floating point exception (core dumped)

and the trace shows in the dmesg output:

[20108.624055] traps: zero[12868] trap divide error ip:563cda3a4184 sp:7ffe113bb0a0 error:0 in zero[563cda3a4000+1000]
[20108.624065] potentially unexpected fatal signal 8.
[20108.624066] CPU: 0 PID: 12868 Comm: zero Not tainted 5.15.153.1-microsoft-standard-WSL2 #1
[20108.624068] RIP: 0033:0x563cda3a4184
[20108.624071] Code: 00 75 07 b8 ff ff ff ff eb 07 8b 45 fc 99 f7 7d f8 5d c3 f3 0f 1e fa 55 48 89 e5 48 83 ec 10 b8 0a 00 00 00 b9 00 00 00 00 99 <f7> f9 89 45 f4 b8 00 00 00 00 b9 00 00 00 00 99 f7 f9 89 45 f8 8b
[20108.624072] RSP: 002b:00007ffe113bb0a0 EFLAGS: 00010206
[20108.624073] RAX: 000000000000000a RBX: 00007ffe113bb1d8 RCX: 0000000000000000
[20108.624074] RDX: 0000000000000000 RSI: 00007ffe113bb1d8 RDI: 0000000000000001
[20108.624075] RBP: 00007ffe113bb0b0 R08: 0000000000000000 R09: 00007fa6ffd20380
[20108.624076] R10: 00007ffe113badd0 R11: 0000000000000203 R12: 0000000000000001
[20108.624076] R13: 0000000000000000 R14: 0000563cda3a6dc0 R15: 00007fa6ffd53000
[20108.624077] FS:  00007fa6ffafe740 GS:  0000000000000000
[20108.624514] WSL (12869): Capturing crash for pid: 10759, executable: !home!elsaco!test!zero
[20108.624516] , signal: 8, port: 50005

and in journalctl:

Sep 08 21:00:46 eleven kernel: traps: zero[12868] trap divide error ip:563cda3a4184 sp:7ffe113bb0a0 error:0 in zero[563>
Sep 08 21:00:46 eleven kernel: potentially unexpected fatal signal 8.
Sep 08 21:00:46 eleven kernel: CPU: 0 PID: 12868 Comm: zero Not tainted 5.15.153.1-microsoft-standard-WSL2 #1
Sep 08 21:00:46 eleven kernel: RIP: 0033:0x563cda3a4184
Sep 08 21:00:46 eleven kernel: Code: 00 75 07 b8 ff ff ff ff eb 07 8b 45 fc 99 f7 7d f8 5d c3 f3 0f 1e fa 55 48 89 e5 4>
Sep 08 21:00:46 eleven kernel: RSP: 002b:00007ffe113bb0a0 EFLAGS: 00010206
Sep 08 21:00:46 eleven kernel: RAX: 000000000000000a RBX: 00007ffe113bb1d8 RCX: 0000000000000000
Sep 08 21:00:46 eleven kernel: RDX: 0000000000000000 RSI: 00007ffe113bb1d8 RDI: 0000000000000001
Sep 08 21:00:46 eleven kernel: RBP: 00007ffe113bb0b0 R08: 0000000000000000 R09: 00007fa6ffd20380
Sep 08 21:00:46 eleven kernel: R10: 00007ffe113badd0 R11: 0000000000000203 R12: 0000000000000001
Sep 08 21:00:46 eleven kernel: R13: 0000000000000000 R14: 0000563cda3a6dc0 R15: 00007fa6ffd53000
Sep 08 21:00:46 eleven kernel: FS:  00007fa6ffafe740 GS:  0000000000000000
Sep 08 21:00:46 eleven unknown: WSL (12869): Capturing crash for pid: 10759, executable: !home!elsaco!test!zero
Sep 08 21:00:46 eleven unknown: , signal: 8, port: 50005

However, I can't figure out this entry: WSL: Capturing crash for pid:. Where does wsl-capture-crash store the actual core file?

zcobol commented 6 days ago

In wsl-2.3.17, core dumps are stored in the \AppData\Local\Temp\wsl-crashes folder under your Windows home directory. You'll notice entries like this when running dmesg:

WSL (573): Capturing crash for pid: 366, executable: !home!zcobol!test!zero, signal:8, port: 50005

WSL captures the crash and writes the dump to the wsl-crashes folder.

Sample file:

PS C:\Users\valli>\AppData\Local\Temp\wsl-crashes\wsl-crash-1726372480-366-_home_zcobol_test_zero-8.dmp
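Judging only from the sample names in this thread, the dump file name appears to follow `wsl-crash-<timestamp>-<pid>-<executable path with separators replaced by _>-<signal>.dmp`. That layout is inferred from the samples, not documented here, so treat this parser as a sketch under that assumption. `parse_wsl_crash_name` is an illustrative name, and it expects a bare file name or a Linux-style path.

```shell
# Split a wsl-crash dump name into "timestamp pid executable signal".
# The field layout is an assumption inferred from the sample file names.
parse_wsl_crash_name() {
    name=${1##*/}              # drop any leading directory part
    name=${name#wsl-crash-}
    name=${name%.dmp}
    ts=${name%%-*};  name=${name#*-}
    pid=${name%%-*}; name=${name#*-}
    sig=${name##*-}            # last field, so hyphens in the path survive
    exe=${name%-*}
    echo "$ts $pid $exe $sig"
}
```

With the sample above, `parse_wsl_crash_name wsl-crash-1726372480-366-_home_zcobol_test_zero-8.dmp` yields pid 366 and signal 8, which lines up with the dmesg entry.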

Run sysctl kernel.core_pattern; if you haven't changed the settings, it should look like this:

zcobol@texas:~$ sysctl kernel.core_pattern
kernel.core_pattern = |/wsl-capture-crash %t %E %p %s

Using systemd-coredump didn't work because it would kill init:

systemd-coredump[544]: Due to PID 1 having crashed coredump collection will now be turned off

paul-haskell commented 5 days ago

I checked my system: I do not have a \AppData\Local\Temp\wsl-crashes directory. (I do have \AppData\Local\Temp) My dmesg output does not show any "Capturing crash" messages. My "sysctl kernel.core_pattern" shows "/mnt/wslg/dumps/core.%e". And I do not have any files in /mnt/wslg/dumps, though I do have that directory.

OneBlue commented 4 days ago

What @zcobol and @elsaco said is right. We indeed added logic to capture coredumps in 2.3.17. The default path is %tmp%\wsl-crashes.

You can override the crash dump folder via:

[wsl2]
crashDumpFolder=C:\\path\\to\\folder

And you can completely disable this behavior via:

[wsl2]
maxCrashDumpCount=-1

This will completely prevent WSL from touching core_pattern, which should allow you to set your own custom path.

Let me know if this helps you collect coredumps!
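To see which of these `[wsl2]` keys are currently set, a small reader like the following may help. It is a sketch: `wslconfig_get` is an illustrative name, section matching is case-sensitive, and the file location is an assumption (`.wslconfig` lives in the Windows user profile, which from inside WSL is typically something like /mnt/c/Users/<name>/.wslconfig when C: is mounted at /mnt/c).

```shell
# Print the value of one key from the [wsl2] section of a .wslconfig file.
# $1 = path to the file, $2 = key name. Illustrative helper, not a WSL tool.
wslconfig_get() {
    awk -v key="$2" '
        { sub(/\r$/, "") }    # tolerate Windows line endings
        /^[ \t]*\[/ { in_wsl2 = ($0 ~ /^[ \t]*\[wsl2\][ \t]*$/); next }
        in_wsl2 && (eq = index($0, "=")) > 0 {
            k = substr($0, 1, eq - 1); v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) print v
        }
    ' "$1"
}
```

Empty output from `wslconfig_get /path/to/.wslconfig crashDumpFolder` means the key is not set, so the default folder applies.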

OneBlue commented 4 days ago

@paul-haskell: You most likely have an older build installed. Try running: wsl --update --pre-release to get the latest.

paul-haskell commented 4 days ago

@OneBlue, thanks for your message -- I am a lot closer after upgrading to WSL 2.3.17.

First, I ran with the default kernel.core_pattern of "|/wsl-capture-crash %t %E %p %s". When I ran my program that calls abort(), I did not have a .../AppData/Local/Temp/wsl-crashes directory created.

Next, I tried:

  1. ulimit -c unlimited
  2. sudo sysctl kernel.core_pattern=/mnt/c/Users/phaskell/AppData/Temp/core.%e
  3. (ran my program that calls abort() ) and I got a corefile! But it was empty i.e. 0 bytes. Same result when I repeated the tests.

Any ideas why my corefiles are empty?
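When comparing runs, it can help to tell "no file", "0-byte file" and "real core" apart mechanically. The sketch below checks the size and then the four-byte ELF magic that a Linux core file starts with; `core_file_status` is an illustrative name.

```shell
# Report whether a path is missing, empty, an ELF file, or something else.
# core_file_status is an illustrative helper name.
core_file_status() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -s "$1" ]; then
        echo "empty"
    elif [ "$(od -An -tx1 -N4 "$1" | tr -d ' \n')" = "7f454c46" ]; then
        echo "elf"             # starts with 0x7f 'E' 'L' 'F'
    else
        echo "not-elf"
    fi
}
```

`core_file_status core.makeCore` would print `empty` for the 0-byte files described above (the file name here is only an example).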

OneBlue commented 11 hours ago

@paul-haskell: Can you collect /logs of this happening (for both scenarios) ?

paul-haskell commented 9 hours ago

Here are the requested logs for the second scenario, i.e. set kernel.core_pattern=core.%e. Thanks for looking. (I will upload the other logs shortly.) WslLogs-2024-09-20_14-33-52.zip

github-actions[bot] commented 9 hours ago

Diagnostic information

```
Detected appx version: 2.3.17.0
```

paul-haskell commented 9 hours ago

Here are the logs for the first scenario (kernel.core_pattern=|/wsl-capture-crash %t %E %p %s ) WslLogs-2024-09-20_14-40-21.zip

github-actions[bot] commented 9 hours ago

Diagnostic information

```
Detected appx version: 2.3.17.0
```

OneBlue commented 8 hours ago

Thank you @paul-haskell. Looking at the logs, I see that a crash dump is generated:

Microsoft.Windows.Lxss.Manager  LinuxCrash  09-20-2024 14:40:57.101 "   "   "FullPath:  C:\Users\phaskell\AppData\Local\temp\wsl-crashes\wsl-crash-1726868457-485-_mnt_c_phaskell_CS221_Private_ClassDays_Day17_makeCore-6.dmp
Pid:    485
Signal:     6
process:    !mnt!c!phaskell!CS221!Private!ClassDays!Day17!makeCore
wslVersion:     2.3.17.0"               4996    14140   5       00000000-0000-0000-0000-000000000000        

Can you check the contents of C:\Users\phaskell\AppData\Local\temp\wsl-crashes\?

paul-haskell commented 7 hours ago

I do see a core in .../Local/Temp/wsl-crashes and it is nonempty. So "case 1" works! Thank you. Any idea why "case 2" i.e. overridden kernel.core_pattern only creates empty corefiles? (The reason I care is because I am teaching a class on system programming, and I want to make it easy for students on Windows and Mac platforms to be able to debug with corefiles. If I can get the corefiles in the current directory via some configuration script, it will make the students' lives easy.)
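Until "case 2" is understood, one possible stopgap for getting a dump into the current directory is to copy the newest file out of the wsl-crashes folder after a crash. This is only a sketch: the function names are made up, the source folder path is whatever your Windows temp folder maps to under /mnt (an assumption about your mounts), and this thread does not confirm that gdb can load the .dmp file directly.

```shell
# Print the newest .dmp file in a directory (assumes no newlines in names).
latest_dump() {
    ls -t "$1"/*.dmp 2>/dev/null | head -n 1
}

# Copy the newest dump from $1 into $2 (default: current directory)
# and print the destination path. Illustrative helpers, not WSL tools.
copy_latest_dump() {
    src=$(latest_dump "$1")
    [ -n "$src" ] || { echo "no .dmp files in $1" >&2; return 1; }
    cp -- "$src" "${2:-.}/" && echo "${2:-.}/${src##*/}"
}
```

A student could then run something like `copy_latest_dump /mnt/c/Users/<name>/AppData/Local/Temp/wsl-crashes` right after the crash, with `<name>` replaced by their Windows user name.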

OneBlue commented 7 hours ago

@paul-haskell: Does disabling systemd and restarting the distro help with case 2?

paul-haskell commented 7 hours ago

I did a quick check, and I have 159 services managed by systemd. systemd manages all the startup services with Ubuntu, right? Can I really stop all of them?

(I tried stopping apport.service and setting kernel.core_pattern=core.%e but I still get empty corefiles.)

OneBlue commented 6 hours ago

Can you try setting

[boot]
systemd=false

in /etc/wsl.conf?

paul-haskell commented 6 hours ago

Ok, I did that test: In /etc/wsl.conf I set systemd=false, and I restarted my Ubuntu.

The system boots really quickly now. Unfortunately my corefiles are still empty. I'll attach another log to the case.

paul-haskell commented 6 hours ago

WslLogs-2024-09-20_16-56-24.zip

Here are the logs with systemd=false in wsl.conf and with kernel.core_pattern=core.%e (and with empty corefiles)

github-actions[bot] commented 6 hours ago

Diagnostic information

```
Detected appx version: 2.3.17.0
```