microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

Kernel Panic on GPU Process Exiting (dxgdevice_destroy, dxgprocess_adapter_destroy, dxgprocess_release) #10558

Open FremyCompany opened 11 months ago

FremyCompany commented 11 months ago

Windows Version

Microsoft Windows [Version 10.0.22623.1325]

WSL Version

2.0.1.0

Are you using WSL 1 or WSL 2?

WSL 2

Kernel Version

5.15.123.1-microsoft-standard-WSL2

Distro Version

Ubuntu-20.04

Other Software

NVIDIA RTX 3090 (driver version: 31.0.15.3742)
Torch 2.1.0.dev20230821

Repro Steps

One program I wrote consistently crashes the Linux subsystem as it exits, after the last line of user code has run successfully. A diagnostic dump is linked under Diagnostic Logs, and two instances of the panic output are included below. Oddly, seemingly identical programs run with different arguments never trigger the issue. I hope this report helps uncover an interesting bug!

'417D5D8A-B284-4A60-A558-2890EEED33F7' has encountered a fatal error.  The guest operating system reported that it failed with the following error codes: ErrorCode0: 0x0, ErrorCode1: 0x0, ErrorCode2: 0x0, ErrorCode3: 0x0, ErrorCode4: 0x0.  PreOSId: 0.  If the problem persists, contact Product Support for the guest operating system.  (Virtual machine ID 417D5D8A-B284-4A60-A558-2890EEED33F7)

Guest message:
[   94.412856] CPU: 10 PID: 829 Comm: python Not tainted 5.15.123.1-microsoft-standard-WSL2 #1
[   94.413216] RIP: 0010:hmgrtable_free_handle+0x91/0xa0
[   94.413443] Code: 3b 00 0f b7 51 08 66 81 e2 00 bf 80 ca 80 66 89 51 08 83 40 1c 01 c7 41 04 00 00 00 ff 8b 50 14 89 11 48 c1 e2 04 48 03 50 08 <44> 89 4a 04 44 89 48 14 c3 cc cc cc cc 66 90 0f 1f 44 00 00 41 56
[   94.414232] RSP: 0018:ffffb779860ebc28 EFLAGS: 00010286
[   94.414455] RAX: ffff9b5200f28530 RBX: ffff9b5203bea600 RCX: ffffb77988f36000
[   94.414778] RDX: ffffb78978f36000 RSI: 0000000000000003 RDI: 0000000000000001
[   94.415123] RBP: ffff9b5200f28500 R08: 0000000000000003 R09: 0000000000000000
[   94.415471] R10: 0000000000000001 R11: 0000000000000008 R12: ffff9b5200f28530
[   94.415826] R13: ffff9b5203bea6d0 R14: ffff9b5203bea638 R15: ffff9b5203bea660
[   94.416153] FS:  0000000000000000(0000) GS:ffff9b6107a80000(0000) knlGS:0000000000000000
[   94.416482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   94.416780] CR2: ffffb78978f36004 CR3: 000000090e210005 CR4: 00000000003706a0
[   94.417134] Call Trace:
[   94.417246]  <TASK>
[   94.417361]  ? __die_body.cold+0x1a/0x1f
[   94.417527]  ? page_fault_oops+0xae/0x250
[   94.417694]  ? exc_page_fault+0x86/0x100
[   94.417859]  ? asm_exc_page_fault+0x22/0x30
[   94.418022]  ? hmgrtable_free_handle+0x91/0xa0
[   94.418252]  dxgdevice_destroy+0x2bc/0x340
[   94.418419]  dxgprocess_adapter_destroy+0x7f/0x140
[   94.418639]  dxgprocess_destroy+0x4a/0x130
[   94.418841]  dxgprocess_release+0x67/0xa0
[   94.419004]  dxgk_release+0x5d/0x90
[   94.419185]  __fput+0x82/0x240
[   94.419349]  task_work_run+0x5c/0x90
[   94.419512]  do_exit+0x331/0xa50
[   94.419677]  do_group_exit+0x33/0xa0
[   94.419842]  get_signal+0x13f/0x8c0
[   94.419994]  arch_do_signal_or_restart+0xf1/0x750
[   94.420221]  ? pick_next_task_fair+0x194/0x3a0
[   94.420450]  ? __x64_sys_futex+0x73/0x1d0
[   94.420618]  ? __x64_sys_futex+0x73/0x1d0
[   94.420788]  exit_to_user_mode_prepare+0xcd/0x120
[   94.421008]  syscall_exit_to_user_mode+0x1d/0x40
[   94.421231]  do_syscall_64+0x48/0xc0
[   94.421395]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[   94.421626] RIP: 0033:0x7fa85c95873d
[   94.421791] Code: Unable to access opcode bytes at RIP 0x7fa85c958713.
[   94.422076] RSP: 002b:00007fa78342c968 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[   94.422406] RAX: fffffffffffffe00 RBX: 0000000017039388 RCX: 00007fa85c95873d
[   94.422744] RDX: 0000000000000047 RSI: 0000000000000089 RDI: 0000000017039388
[   94.423100] RBP: 0000000000000047 R08: 0000000000000000 R09: 00007fa7ffffffff
[   94.423439] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa85c958720
[   94.423776] R13: 00007fa85cba2520 R14: 0000000017039380 R15: 00007fa78342c988
[   94.424118]  </TASK>
[   94.424228] Modules linked in:
[   94.424398] CR2: ffffb78978f36004
[   94.424564] ---[ end trace 516bd117c94e527f ]---
[   94.424784] RIP: 0010:hmgrtable_free_handle+0x91/0xa0
[   94.425012] Code: 3b 00 0f b7 51 08 66 81 e2 00 bf 80 ca 80 66 89 51 08 83 40 1c 01 c7 41 04 00 00 00 ff 8b 50 14 89 11 48 c1 e2 04 48 03 50 08 <44> 89 4a 04 44 89 48 14 c3 cc cc cc cc 66 90 0f 1f 44 00 00 41 56
[   94.425766] RSP: 0018:ffffb779860ebc28 EFLAGS: 00010286
[   94.425998] RAX: ffff9b5200f28530 RBX: ffff9b5203bea600 RCX: ffffb77988f36000
[   94.426331] RDX: ffffb78978f36000 RSI: 0000000000000003 RDI: 0000000000000001
[   94.426662] RBP: ffff9b5200f28500 R08: 0000000000000003 R09: 0000000000000000
[   94.426982] R10: 0000000000000001 R11: 0000000000000008 R12: ffff9b5200f28530
[   94.427309] R13: ffff9b5203bea6d0 R14: ffff9b5203bea638 R15: ffff9b5203bea660
[   94.427637] FS:  0000000000000000(0000) GS:ffff9b6107a80000(0000) knlGS:0000000000000000
[   94.427962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   94.428241] CR2: ffffb78978f36004 CR3: 000000090e210005 CR4: 00000000003706a0
[   94.428563] Kernel panic - not syncing: Fatal exception
[   94.448896] Kernel Offset: 0x2a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

'CBE50D14-6A2A-4DB3-8399-BAA1E458EAAA' has encountered a fatal error.  The guest operating system reported that it failed with the following error codes: ErrorCode0: 0x0, ErrorCode1: 0x0, ErrorCode2: 0x0, ErrorCode3: 0x0, ErrorCode4: 0x0.  PreOSId: 0.  If the problem persists, contact Product Support for the guest operating system.  (Virtual machine ID CBE50D14-6A2A-4DB3-8399-BAA1E458EAAA)

Guest message:
[14820.507951] PGD 100000067 P4D 100000067 PUD 0 
[14820.508204] Oops: 0002 [#1] SMP NOPTI
[14820.508374] CPU: 7 PID: 1145 Comm: cuda-EvtHandlr Not tainted 5.15.123.1-microsoft-standard-WSL2 #1
[14820.508794] RIP: 0010:hmgrtable_free_handle+0x91/0xa0
[14820.509581] Code: 3b 00 0f b7 51 08 66 81 e2 00 bf 80 ca 80 66 89 51 08 83 40 1c 01 c7 41 04 00 00 00 ff 8b 50 14 89 11 48 c1 e2 04 48 03 50 08 <44> 89 4a 04 44 89 48 14 c3 cc cc cc cc 66 90 0f 1f 44 00 00 41 56
[14820.510618] RSP: 0018:ffffbb5905c37c28 EFLAGS: 00010286
[14820.510860] RAX: ffff9fea80e7b430 RBX: ffff9ff987dd3e00 RCX: ffffbb5905cd5000
[14820.511259] RDX: ffffbb68f5cd5000 RSI: 0000000000000003 RDI: 0000000000000001
[14820.511563] RBP: ffff9fea80e7b400 R08: 0000000000000003 R09: 0000000000000000
[14820.511875] R10: 0000000000000001 R11: 0000000000000003 R12: ffff9fea80e7b430
[14820.512202] R13: ffff9ff987dd3ed0 R14: ffff9ff987dd3e38 R15: ffff9ff987dd3e60
[14820.512513] FS:  0000000000000000(0000) GS:ffff9ff9879c0000(0000) knlGS:0000000000000000
[14820.512829] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14820.513091] CR2: ffffbb68f5cd5004 CR3: 00000001274dc002 CR4: 00000000003706a0
[14820.513410] Call Trace:
[14820.513789]  <TASK>
[14820.514035]  ? __die_body.cold+0x1a/0x1f
[14820.514360]  ? page_fault_oops+0xae/0x250
[14820.514807]  ? exc_page_fault+0x86/0x100
[14820.515051]  ? asm_exc_page_fault+0x22/0x30
[14820.515226]  ? hmgrtable_free_handle+0x91/0xa0
[14820.515439]  dxgdevice_destroy+0x2bc/0x340
[14820.515640]  dxgprocess_adapter_destroy+0x7f/0x140
[14820.515855]  dxgprocess_destroy+0x4a/0x130
[14820.516080]  dxgprocess_release+0x67/0xa0
[14820.516240]  dxgk_release+0x5d/0x90
[14820.516408]  __fput+0x82/0x240
[14820.516830]  task_work_run+0x5c/0x90
[14820.517166]  do_exit+0x331/0xa50
[14820.517406]  do_group_exit+0x33/0xa0
[14820.517564]  get_signal+0x13f/0x8c0
[14820.517760]  arch_do_signal_or_restart+0xf1/0x750
[14820.518217]  ? ktime_get_ts64+0x49/0xf0
[14820.518435]  exit_to_user_mode_prepare+0xcd/0x120
[14820.518811]  syscall_exit_to_user_mode+0x1d/0x40
[14820.519212]  do_syscall_64+0x48/0xc0
[14820.519388]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[14820.519634] RIP: 0033:0x7f045cbe599f
[14820.519864] Code: Unable to access opcode bytes at RIP 0x7f045cbe5975.
[14820.520144] RSP: 002b:00007f03722a8d80 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
[14820.520461] RAX: fffffffffffffdfc RBX: 0000000000000000 RCX: 00007f045cbe599f
[14820.520766] RDX: 0000000000000064 RSI: 0000000000000004 RDI: 000000002e65e410
[14820.521076] RBP: 00007f03722a8e40 R08: 0000000000000000 R09: 00000000000039e4
[14820.521384] R10: 0000000000000004 R11: 0000000000000293 R12: 0000000000000004
[14820.521692] R13: 0000000000000004 R14: 000000002db6e330 R15: 000000002e65e410
[14820.522003]  </TASK>
[14820.522122] Modules linked in:
[14820.522352] CR2: ffffbb68f5cd5004
[14820.522606] ---[ end trace 4d7c6585d48d6da3 ]---
[14820.522814] RIP: 0010:hmgrtable_free_handle+0x91/0xa0
[14820.523029] Code: 3b 00 0f b7 51 08 66 81 e2 00 bf 80 ca 80 66 89 51 08 83 40 1c 01 c7 41 04 00 00 00 ff 8b 50 14 89 11 48 c1 e2 04 48 03 50 08 <44> 89 4a 04 44 89 48 14 c3 cc cc cc cc 66 90 0f 1f 44 00 00 41 56
[14820.523823] RSP: 0018:ffffbb5905c37c28 EFLAGS: 00010286
[14820.524032] RAX: ffff9fea80e7b430 RBX: ffff9ff987dd3e00 RCX: ffffbb5905cd5000
[14820.524345] RDX: ffffbb68f5cd5000 RSI: 0000000000000003 RDI: 0000000000000001
[14820.524658] RBP: ffff9fea80e7b400 R08: 0000000000000003 R09: 0000000000000000
[14820.524962] R10: 0000000000000001 R11: 0000000000000003 R12: ffff9fea80e7b430
[14820.525265] R13: ffff9ff987dd3ed0 R14: ffff9ff987dd3e38 R15: ffff9ff987dd3e60
[14820.525579] FS:  0000000000000000(0000) GS:ffff9ff9879c0000(0000) knlGS:0000000000000000
[14820.525890] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14820.526148] CR2: ffffbb68f5cd5004 CR3: 00000001274dc002 CR4: 00000000003706a0
[14820.526486] Kernel panic - not syncing: Fatal exception
[14820.546859] Kernel Offset: 0x3a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
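
For anyone comparing the two dumps, here is a minimal Python sketch that pulls the register values out of the oops text and checks how the faulting address (CR2) relates to RDX and RCX. The register lines are copied from the logs above; the names `DUMPS`, `parse_registers`, and `summarize` are hypothetical helpers made up for this illustration, not part of any WSL or kernel tooling.

```python
import re

# Register lines copied verbatim from the two guest dumps above.
DUMPS = {
    "PID 829 (python)": """
RAX: ffff9b5200f28530 RBX: ffff9b5203bea600 RCX: ffffb77988f36000
RDX: ffffb78978f36000 RSI: 0000000000000003 RDI: 0000000000000001
CR2: ffffb78978f36004 CR3: 000000090e210005 CR4: 00000000003706a0
""",
    "PID 1145 (cuda-EvtHandlr)": """
RAX: ffff9fea80e7b430 RBX: ffff9ff987dd3e00 RCX: ffffbb5905cd5000
RDX: ffffbb68f5cd5000 RSI: 0000000000000003 RDI: 0000000000000001
CR2: ffffbb68f5cd5004 CR3: 00000001274dc002 CR4: 00000000003706a0
""",
}

# Matches "NAME: <16 hex digits>" pairs as printed in an x86-64 oops.
REG = re.compile(r"\b([A-Z0-9]{2,3}): ([0-9a-f]{16})\b")


def parse_registers(text):
    """Return a dict mapping register name to its integer value."""
    return {name: int(value, 16) for name, value in REG.findall(text)}


def summarize(text):
    """Compute the two address differences of interest for one dump."""
    regs = parse_registers(text)
    return {
        "cr2_minus_rdx": regs["CR2"] - regs["RDX"],
        "rdx_minus_rcx": regs["RDX"] - regs["RCX"],
    }


if __name__ == "__main__":
    for label, text in DUMPS.items():
        s = summarize(text)
        print(f"{label}: CR2-RDX={s['cr2_minus_rdx']:#x}, "
              f"RDX-RCX={s['rdx_minus_rcx']:#x}")
```

Run as written, both dumps print CR2-RDX=0x4 and RDX-RCX=0xff0000000. The first value matches the marked instruction bytes `<44> 89 4a 04`, which decode to a 32-bit store to `[rdx+4]`, and the `Oops: 0002` code in the second dump (a kernel-mode write to a non-present page). The identical RDX-RCX gap across two separate boots may point to the same bad index or pointer arithmetic each time, but that reading is a guess and not something the logs confirm.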

Expected Behavior

The program should exit without causing a kernel panic.

Actual Behavior

The WSL VM crashes with a kernel panic when the program exits.

Diagnostic Logs

https://1drv.ms/u/s!Av6HldQ1OB8gsZJQquEOfFsRb3fKQg?e=hIl4RC

FremyCompany commented 11 months ago

Might be related to #10186

OneBlue commented 11 months ago

@iourit: can you help with this?