kernel panics when single-stepping [SOLVED: KPTI #PF for kernel IRQ]

jovanbulck / sgx-step

A practical attack framework for precise enclave execution control

GNU General Public License v3.0


kernel panics when single-stepping [SOLVED: KPTI #PF for kernel IRQ] #45

Closed tonitick closed 2 years ago

tonitick commented 2 years ago

Hi, I am trying to run the single-step bench and sometimes encounter a kernel bug, especially when stepping over thousands of instructions. Here is an example from the kernel log:

[ 132.182650] BUG: unable to handle kernel paging request at 000055bb86c8b000
[ 132.182657] IP: 0x55bb86c8b000
[ 132.182658] PGD 80000007b65d0067 P4D 80000007b65d0067 PUD 7ad9f5067 PMD 7f8e31067 PTE 7bed47025
[ 132.182661] Oops: 0011 [#1] SMP PTI
[ 132.182663] Modules linked in: sgx_step(OE) msr thunderbolt rfcomm cmac snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic bnep intel_wmi_thunderbolt wmi_bmof arc4 intel_rapl iwlmvm x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel mac80211 pcbc aesni_intel rtsx_pci_ms aes_x86_64 crypto_simd iwlwifi glue_helper memstick cryptd intel_cstate intel_rapl_perf btusb btrtl cfg80211 btbcm btintel joydev input_leds bluetooth ecdh_generic snd_hda_intel ir_rc6_decoder snd_hda_codec snd_hda_core snd_hwdep rc_rc6_mce snd_pcm ir_lirc_codec snd_seq_midi lirc_dev snd_seq_midi_event i915 snd_rawmidi ite_cir rc_core drm_kms_helper snd_seq video drm snd_seq_device snd_timer i2c_algo_bit fb_sys_fops syscopyarea acpi_pad sysfillrect mei_me snd sysimgblt wmi
[ 132.182690] mei mac_hid soundcore intel_pch_thermal sch_fq_codel binfmt_misc kvm_intel kvm isgx(OE) parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid rtsx_pci_sdmmc ahci e1000e rtsx_pci libahci
[ 132.182699] CPU: 1 PID: 3739 Comm: app Tainted: G OE 4.15.18+ #3
[ 132.182700] Hardware name: Intel Corporation NUC7i7BNH/NUC7i7BNB, BIOS BNKBL357.86A.0062.2018.0222.1644 02/22/2018
[ 132.182701] RIP: 0010:0x55bb86c8b000
[ 132.182702] RSP: 0000:ffffaac644e87ee8 EFLAGS: 00010002
[ 132.182703] RAX: 0000000000000008 RBX: 0000000000000008 RCX: 0000000000000000
[ 132.182704] RDX: ffff932c01c80000 RSI: 0000000000000008 RDI: ffffaac644e87f58
[ 132.182704] RBP: ffffaac644e87f28 R08: 0000000000000000 R09: 0000000000000000
[ 132.182705] R10: 0000000000000000 R11: 0000000000000000 R12: ffffaac644e87f58
[ 132.182706] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 132.182707] FS: 00007f34f50e4b80(0000) GS:ffff932c01c80000(0000) knlGS:0000000000000000
[ 132.182708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 132.182709] CR2: 000055bb86c8b000 CR3: 00000007acf9a001 CR4: 00000000000606e0
[ 132.182709] Call Trace:
[ 132.182713] ? exit_to_usermode_loop+0x4f/0xd0
[ 132.182715] prepare_exit_to_usermode+0x83/0x90
[ 132.182718] retint_user+0x8/0x8
[ 132.182719] RIP: 0033:0x55bb86c8a2fd
[ 132.182720] RSP: 002b:00007ffd60331b60 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff02
[ 132.182721] RAX: 0000000000000003 RBX: 00007f34f3a76000 RCX: 000055bb86c8a2fd
[ 132.182721] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 132.182722] RBP: 00007ffd60332050 R08: 0000000000000000 R09: 0000000000000000
[ 132.182723] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 132.182723] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 132.182724] Code: Bad RIP value.
[ 132.182726] RIP: 0x55bb86c8b000 RSP: ffffaac644e87ee8
[ 132.182726] CR2: 000055bb86c8b000
[ 132.182728] ---[ end trace cad0a7670dc9a000 ]---
[ 132.182829] mm/pgtable-generic.c:40: bad pmd 00000000b3c05ac0(00000007b2884047)

Some info that may help to reproduce the bug:
commands: cd app/bench && NUM=10000 STRLEN=1 make run
kernel version: Ubuntu-4.15.0-135.139 (git://kernel.ubuntu.com/ubuntu/ubuntu-bionic.git)
cpu model: Intel(R) Core(TM) i7-7567U
kernel parameters: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nox2apic iomem=relaxed no_timer_check nosmep nosmap clearcpuid=514 isolcpus=1 nmi_watchdog=0"

Could you help check what the potential causes of this might be? Thanks so much.

jovanbulck commented 2 years ago

Thanks for the report, not sure what goes wrong here exactly.

It seems the kernel says the RIP at 0x55bb86c8b000 is invalid. I'd have to see the program binary to understand why that is an invalid instruction pointer and how you end up there.

In principle, things sometimes go wrong when the kernel and libsgxstep both want to access/program the APIC timer and the kernel interrupts the libsgxstep interrupt handler. This used to be a frequent cause of kernel crashes, but has been much improved since, see #23.

It could be related to this (but I do not see any #GP), or it could be something completely different. Maybe some page-table entries are corrupted somehow(?)

I'd have to investigate closer to reproduce and pinpoint this, but I won't have time for this any time soon I'm afraid -- hope the crashes are not too frequent and it is still usable for you!

tonitick commented 2 years ago

Hi, thanks so much for the suggestions!

Actually, I followed all the system configurations in the README (using the same kernel version and microcode version). Although I still see crashes, it is more stable now (it can step over 100000 instructions, compared to 1000 before). Here is some kernel log (e.g., from running $ NUM=100000 STRLEN=1 make run):

[ 4381.638366] BUG: unable to handle page fault for address: 0000561106405000
[ 4381.638369] #PF: supervisor instruction fetch in kernel mode
[ 4381.638370] #PF: error_code(0x0011) - permissions violation
[ 4381.638371] PGD 800000074b031067 P4D 800000074b031067 PUD 6be469067 PMD 769b53067 PTE 6c72b3025
[ 4381.638373] Oops: 0011 [#1] SMP PTI
[ 4381.638375] CPU: 1 PID: 6866 Comm: app Tainted: G OE 5.4.0-109-generic #123~18.04.1-Ubuntu
[ 4381.638376] Hardware name: Intel Corporation NUC7i7BNH/NUC7i7BNB, BIOS BNKBL357.86A.0062.2018.0222.1644 02/22/2018
[ 4381.638378] RIP: 0010:0x561106405000
[ 4381.638379] Code: 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 84 00 00 00 00 00 <48> 89 05 15 60 00 00 48 89 15 16 60 00 00 0f 31 89 05 fe 5f 00 00
[ 4381.638380] RSP: 0000:ffffba3383527ee8 EFLAGS: 00010002
[ 4381.638381] RAX: 0000000000000008 RBX: 0000000000000008 RCX: 0000000000000000
[ 4381.638382] RDX: ffff8deda1c80000 RSI: 0000000000000008 RDI: ffffba3383527f58
[ 4381.638383] RBP: ffffba3383527f28 R08: 0000000000000000 R09: 0000000000000000
[ 4381.638383] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba3383527f58
[ 4381.638384] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4381.638385] FS: 00007f61a4073b80(0000) GS:ffff8deda1c80000(0000) knlGS:0000000000000000
[ 4381.638386] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4381.638386] CR2: 0000561106405000 CR3: 00000006c45d8005 CR4: 00000000000606e0
[ 4381.638387] Call Trace:
[ 4381.638391] ? exit_to_usermode_loop+0x59/0x130
[ 4381.638393] prepare_exit_to_usermode+0x91/0xa0
[ 4381.638396] retint_user+0x8/0x8
[ 4381.638397] RIP: 0033:0x5611064042fd
[ 4381.638398] Code: 3d 1c 3a 00 00 b8 00 00 00 00 e8 fe bd ff ff e8 b0 fd ff ff 89 c6 48 8d 3d 21 3a 00 00 b8 00 00 00 00 e8 e6 bd ff ff 90 c9 c3 <48> 89 1d a4 6d 00 00 48 8d 05 95 6d 00 00 48 8b 00 48 85 c0 74 02
[ 4381.638399] RSP: 002b:00007ffe286619a0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff02
[ 4381.638400] RAX: 0000000000000003 RBX: 00007f61a2a8c000 RCX: 00005611064042fd
[ 4381.638400] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 4381.638401] RBP: 00007ffe28661e60 R08: 0000000000000000 R09: 0000000000000000
[ 4381.638401] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 4381.638402] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4381.638403] Modules linked in: sgx_step(OE) msr thunderbolt rfcomm intel_rapl_msr cmac bnep mei_hdcp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul iwlmvm ghash_clmulni_intel mac80211 libarc4 aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate iwlwifi rtsx_pci_ms wmi_bmof intel_wmi_thunderbolt memstick cfg80211 btusb btrtl btbcm btintel bluetooth input_leds snd_hda_intel joydev snd_intel_dspcfg ecdh_generic ecc snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi ir_rc6_decoder snd_seq_midi_event snd_rawmidi rc_rc6_mce ite_cir snd_seq i915 rc_core snd_seq_device drm_kms_helper snd_timer drm acpi_pad mac_hid intel_xhci_usb_role_switch roles snd i2c_algo_bit mei_me fb_sys_fops mei syscopyarea sysfillrect sysimgblt intel_pch_thermal soundcore binfmt_misc kvm_intel kvm sch_fq_codel isgx(OE) parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic
[ 4381.638426] usbhid hid rtsx_pci_sdmmc e1000e rtsx_pci ahci libahci wmi video
[ 4381.638430] CR2: 0000561106405000
[ 4381.638432] ---[ end trace a37d60e79aa28f2e ]---
[ 4381.638433] RIP: 0010:0x561106405000
[ 4381.638434] Code: 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 84 00 00 00 00 00 <48> 89 05 15 60 00 00 48 89 15 16 60 00 00 0f 31 89 05 fe 5f 00 00
[ 4381.638435] RSP: 0000:ffffba3383527ee8 EFLAGS: 00010002
[ 4381.638435] RAX: 0000000000000008 RBX: 0000000000000008 RCX: 0000000000000000
[ 4381.638436] RDX: ffff8deda1c80000 RSI: 0000000000000008 RDI: ffffba3383527f58
[ 4381.638437] RBP: ffffba3383527f28 R08: 0000000000000000 R09: 0000000000000000
[ 4381.638437] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba3383527f58
[ 4381.638438] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4381.638439] FS: 00007f61a4073b80(0000) GS:ffff8deda1c80000(0000) knlGS:0000000000000000
[ 4381.638440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4381.638440] CR2: 0000561106405000 CR3: 00000006c45d8005 CR4: 00000000000606e0

I observe a #PF here (not the #GP you mentioned). Do you have any idea whether it is caused by the same issue as in #23? I am not sure, because I use the latest commit, which should already include the fix from #23. Thanks!

jovanbulck commented 2 years ago

Hi tonitick,

Thanks for the additional information. This indeed seems like a bug.

(I am aware that SGX-Step can sometimes cause unpredictable crashes :/ In my experience the best thing to do at that point is to hard-reboot the system and apply all the recommended config options for stabilization -- while it can certainly be very annoying, hopefully crashes are not too frequent and it remains usable.)

Still not sure what goes on here exactly. Abstractly speaking, I think the root of the problem is that SGX-Step performs kernel tasks and manipulates kernel resources like page tables and timers from user space. Linux is not at all expecting that and panics when it happens to interfere at the wrong time.

That being said, the log you provided may help to pinpoint this issue and hopefully find a fix. Especially this line seems interesting:

#PF: supervisor instruction fetch in kernel mode

This may indicate a misconfiguration of the page table or IDT entries set up by libsgxstep. I'd be interested in pinning this down further. Can you maybe provide the compiled application + enclave binaries corresponding to the fault you get in dmesg, so I can look up which instructions correspond to the program counter values in the log? (maybe disable ASLR for that: echo 0 | sudo tee /proc/sys/kernel/randomize_va_space)

jovanbulck commented 2 years ago

FWIW: some further pointers to hopefully help narrowing this down:

this bug printout is generated by Linux here
the faulting PTE 6c72b3025 corresponds to:

+-------------------------------------------------------------------------------------------+
| XD | PK | IGN | RSVD | PHYS ADRS      | IGN | G | PAT | D | A | PCD | PWT | U/S | R/W | P | 
| 0  | x  | x   | 0    | 0x0006c72b3000 | x   | x | x   | 0 | 1 | x   | x   | 1   | 0   | 1 | 
+-------------------------------------------------------------------------------------------+

So it seems to be a user-space PTE that the kernel wants to execute. My first thought: did you make sure to disable SMEP with nosmep as described in the README?
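The table above can be cross-checked mechanically. The following is only a minimal sketch, assuming the standard x86-64 4-level paging PTE layout from the Intel SDM; the helper name `decode_pte` is made up for illustration and is not part of libsgxstep.

```python
# Hypothetical helper: decode the flag bits of a 4-level paging PTE.
# Bit positions follow the x86-64 PTE layout in the Intel SDM (Vol. 3A, ch. 4).
PTE_FLAGS = {
    "P": 0, "R/W": 1, "U/S": 2, "PWT": 3, "PCD": 4,
    "A": 5, "D": 6, "PAT": 7, "G": 8, "XD": 63,
}
PHYS_MASK = 0x000F_FFFF_FFFF_F000  # physical frame address, bits 51:12


def decode_pte(value: int) -> dict:
    """Return each flag bit plus the physical frame address of a PTE value."""
    fields = {name: (value >> bit) & 1 for name, bit in PTE_FLAGS.items()}
    fields["phys"] = value & PHYS_MASK
    return fields


# PTE value taken from the dmesg output earlier in this thread.
pte = decode_pte(0x6C72B3025)
for name, val in pte.items():
    print(f"{name:>4}: {hex(val) if name == 'phys' else val}")
```

With U/S=1 and XD=0 in the leaf entry, the PTE itself looks like an ordinary executable user page, so any restriction would have to come from somewhere else in the translation.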

tonitick commented 2 years ago

Hi, thanks so much for the reply!

So it seems to be a user-space PTE that the kernel wants to execute. My first thought: did you make sure to disable SMEP with nosmep as described in the README?

Yes. Here is my grub parameter (I also added nokaslr for ease of debugging):
linux /boot/vmlinuz-5.4.0-109-generic root=UUID=1d767d45-6f5f-4dee-8c1d-52b9275ab842 ro quiet splash nox2apic iomem=relaxed no_timer_check nosmep nosmap clearcpuid=514 isolcpus=1 nmi_watchdog=0 nokaslr $vt_handoff

The binaries can be found using the links below.
app binary: https://drive.google.com/file/d/1ylBV3r-BZ3YGqrvNuGCzn0GKUc5SMbkN/view?usp=sharing
enclave binary: https://drive.google.com/file/d/1ulcq58oYP2pOwD1mlF8I-qGlZjcEROxC/view?usp=sharing
They are generated by cd app/bench && NUM=500000 STRLEN=1 make run in commit b69f6b1a92280d304a2d107040986c63d5f8db26.

The corresponding kernel logs:
[ 44.813371] BUG: unable to handle page fault for address: 000055555555a000
[ 44.813374] #PF: supervisor instruction fetch in kernel mode
[ 44.813375] #PF: error_code(0x0011) - permissions violation
[ 44.813375] PGD 8000000837dbb067 P4D 8000000837dbb067 PUD 85a1fa067 PMD 81b3f5067 PTE 7fbe55025
[ 44.813378] Oops: 0011 [#1] SMP PTI
[ 44.813380] CPU: 1 PID: 2773 Comm: app Tainted: G OE 5.4.0-109-generic #123~18.04.1-Ubuntu
[ 44.813380] Hardware name: Intel Corporation NUC7i7BNH/NUC7i7BNB, BIOS BNKBL357.86A.0062.2018.0222.1644 02/22/2018
[ 44.813383] RIP: 0010:0x55555555a000
[ 44.813384] Code: 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 84 00 00 00 00 00 <48> 89 05 15 60 00 00 48 89 15 16 60 00 00 0f 31 89 05 fe 5f 00 00
[ 44.813385] RSP: 0000:ffffc900014c3ee8 EFLAGS: 00010002
[ 44.813386] RAX: 0000000000000008 RBX: 0000000000000008 RCX: 0000000000000000
[ 44.813387] RDX: ffff888861c80000 RSI: 0000000000000008 RDI: ffffc900014c3f58
[ 44.813388] RBP: ffffc900014c3f28 R08: 0000000000000000 R09: 0000000000000000
[ 44.813388] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900014c3f58
[ 44.813389] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 44.813390] FS: 00007ffff7f99b80(0000) GS:ffff888861c80000(0000) knlGS:0000000000000000
[ 44.813391] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 44.813391] CR2: 000055555555a000 CR3: 0000000855c8c003 CR4: 00000000000606e0
[ 44.813392] Call Trace:
[ 44.813396] ? exit_to_usermode_loop+0x59/0x130
[ 44.813398] prepare_exit_to_usermode+0x91/0xa0
[ 44.813400] retint_user+0x8/0x8
[ 44.813401] RIP: 0033:0x5555555592fd
[ 44.813402] Code: 3d 1c 3a 00 00 b8 00 00 00 00 e8 fe bd ff ff e8 b0 fd ff ff 89 c6 48 8d 3d 21 3a 00 00 b8 00 00 00 00 e8 e6 bd ff ff 90 c9 c3 <48> 89 1d a4 6d 00 00 48 8d 05 95 6d 00 00 48 8b 00 48 85 c0 74 02
[ 44.813403] RSP: 002b:00007fffffffdba0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff02
[ 44.813404] RAX: 0000000000000003 RBX: 00007ffff66ed000 RCX: 00005555555592fd
[ 44.813405] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 44.813405] RBP: 00007fffffffe070 R08: 0000000000000000 R09: 0000000000000000
[ 44.813406] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 44.813406] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 44.813408] Modules linked in: sgx_step(OE) msr thunderbolt rfcomm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl_msr bnep mei_hdcp snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_seq_midi crypto_simd snd_seq_midi_event cryptd glue_helper rapl snd_rawmidi intel_cstate iwlmvm mac80211 snd_seq libarc4 i915 snd_seq_device btusb rtsx_pci_ms snd_timer iwlwifi input_leds intel_wmi_thunderbolt joydev wmi_bmof btrtl btbcm drm_kms_helper snd memstick btintel mei_me soundcore cfg80211 bluetooth mei drm i2c_algo_bit ecdh_generic fb_sys_fops ecc syscopyarea sysfillrect intel_xhci_usb_role_switch sysimgblt intel_pch_thermal roles ir_rc6_decoder rc_rc6_mce ite_cir rc_core acpi_pad mac_hid binfmt_misc kvm_intel kvm sch_fq_codel isgx(OE) parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic
[ 44.813430] usbhid hid rtsx_pci_sdmmc e1000e rtsx_pci ahci libahci wmi video
[ 44.813434] CR2: 000055555555a000
[ 44.813436] ---[ end trace 81abd0123e0e853f ]---
[ 44.813437] RIP: 0010:0x55555555a000
[ 44.813438] Code: 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 84 00 00 00 00 00 <48> 89 05 15 60 00 00 48 89 15 16 60 00 00 0f 31 89 05 fe 5f 00 00
[ 44.813438] RSP: 0000:ffffc900014c3ee8 EFLAGS: 00010002
[ 44.813439] RAX: 0000000000000008 RBX: 0000000000000008 RCX: 0000000000000000
[ 44.813440] RDX: ffff888861c80000 RSI: 0000000000000008 RDI: ffffc900014c3f58
[ 44.813440] RBP: ffffc900014c3f28 R08: 0000000000000000 R09: 0000000000000000
[ 44.813441] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900014c3f58
[ 44.813442] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 44.813442] FS: 00007ffff7f99b80(0000) GS:ffff888861c80000(0000) knlGS:0000000000000000
[ 44.813443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 44.813444] CR2: 000055555555a000 CR3: 0000000855c8c003 CR4: 00000000000606e0

Some other information:
kernel version: 5.4.0-109-generic (as shown in the kernel log)
sgx driver version: sgx_driver_2.11 (the submodule in commit b69f6b1a92280d304a2d107040986c63d5f8db26)
sgx sdk version: linux-sgx @ 33f4499 (the submodule in commit b69f6b1a92280d304a2d107040986c63d5f8db26)

Please kindly let me know if there is any other information that may help. Thanks!

jovanbulck commented 2 years ago

Thanks for following up with additional info. My first thought: this could be an exception on the ss_irq_handler page -- could you also post the output of ./app and especially the first line [idt.c] locking IRQ handler pages X/Y? It would help to understand if X/Y correspond to the faulting page in dmesg -- please make sure to check dmesg again to see if the address changed (note: I think passing nokaslr only disables ASLR for the kernel, not for user space).

Page fault error code

AFAICS, it cannot be a non-present exception. From Figure 4-12 (Page-Fault Error Code) in the Intel SDM, page-fault error code 0x11 means:

The fault was caused by a page-level protection violation.
The fault was caused by an instruction fetch.

I/D flag (bit 4). This flag is 1 if (1) the access causing the page-fault exception was an instruction fetch; and (2) either (a) CR4.SMEP = 1; or (b) both (i) CR4.PAE = 1 (either PAE paging, 4-level paging, or 5-level paging is in use); and (ii) IA32_EFER.NXE = 1. Otherwise, the flag is 0. This flag describes the access causing the page-fault exception, not the access rights specified by paging.
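For reference, the error code can also be decoded bit by bit. This is a small sketch covering only bits 0 to 4 of the page-fault error code as described in the Intel SDM; `decode_pf_error` is a hypothetical name used here for illustration.

```python
# Hypothetical helper: decode bits 0-4 of an x86 page-fault error code.
# Each entry is (bit, name, meaning if set, meaning if clear).
PF_BITS = [
    (0, "P", "page-level protection violation", "non-present page"),
    (1, "W/R", "write access", "read access"),
    (2, "U/S", "user-mode access", "supervisor-mode access"),
    (3, "RSVD", "reserved bit set in a paging-structure entry", "no reserved-bit violation"),
    (4, "I/D", "instruction fetch", "not an instruction fetch"),
]


def decode_pf_error(code: int) -> dict:
    """Map each flag name to a human-readable description for this error code."""
    return {
        name: (if_set if (code >> bit) & 1 else if_clear)
        for bit, name, if_set, if_clear in PF_BITS
    }


decoded = decode_pf_error(0x11)  # error_code(0x0011) from dmesg
for name, meaning in decoded.items():
    print(f"{name:>4}: {meaning}")
```

The result matches what the kernel prints: a supervisor-mode instruction fetch that hit a permissions violation on a present page.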

From dmesg, CR4=00000000000606e0:

bit 5 PAE = 1
bit 20 SMEP = 0
bit 21 SMAP = 0

--> so then it seems the address being fetched somehow does not have execute rights?!
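The CR4 bits listed above can be verified with a few lines of code. This is just a sketch that checks the three bits discussed here (PAE, SMEP, SMAP); the function name `cr4_flags` is an assumption for illustration.

```python
# Hypothetical helper: extract the CR4 bits relevant to this discussion.
CR4_BITS = {"PAE": 5, "SMEP": 20, "SMAP": 21}


def cr4_flags(cr4: int) -> dict:
    """Return True/False for each CR4 bit of interest."""
    return {name: bool((cr4 >> bit) & 1) for name, bit in CR4_BITS.items()}


flags = cr4_flags(0x606E0)  # CR4 value from the dmesg output
print(flags)
```

So nosmep/nosmap did take effect on this boot, which rules out SMEP as the direct cause of the fault.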

Instruction-fetch fault

If CR4.SMEP = 0, access rights depend on the paging mode and the value of IA32_EFER.NXE: — For 32-bit paging or if IA32_EFER.NXE = 0, instructions may be fetched from any user-mode address. — For other paging modes with IA32_EFER.NXE = 1, instructions may be fetched from any user-mode address with a translation for which the XD flag is 0 in every paging-structure entry controlling the translation; instructions may not be fetched from any user-mode address with a translation for which the XD flag is 1 in any paging-structure entry controlling the translation.

--> so then XD=1 on one of the page-table levels somehow?
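The SDM rule quoted above can be written down as a tiny model. This is only an illustrative sketch of the documented rule for a supervisor-mode instruction fetch from a user-mode address, not a statement about any particular CPU implementation. The function name is made up, and IA32_EFER.NXE = 1 is an inference: the I/D flag was set in the error code while SMEP = 0, which per the I/D description requires NXE = 1.

```python
# Hypothetical model of the SDM rule for supervisor-mode instruction fetches
# from user-mode addresses (4-level paging assumed).
def supervisor_fetch_from_user_page_allowed(smep: bool, nxe: bool, xd_bits) -> bool:
    """xd_bits: XD flag of every paging-structure entry in the translation."""
    if smep:
        # With CR4.SMEP = 1, no fetch from user-mode addresses is permitted.
        return False
    if not nxe:
        # Without IA32_EFER.NXE, XD is not enforced.
        return True
    # With NXE = 1, a single XD = 1 anywhere in the walk forbids the fetch.
    return not any(xd_bits)


# SMEP = 0 (from CR4); NXE = 1 is inferred, see the note above.
print(supervisor_fetch_from_user_page_allowed(False, True, [0, 0, 0, 0]))
print(supervisor_fetch_from_user_page_allowed(False, True, [1, 0, 0, 0]))
```

Under this model, disabling SMEP alone is not sufficient: one XD bit at any level of the walk is enough to produce the fault seen in dmesg.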

Page table walk

From dmesg above: PGD 8000000837dbb067 P4D 8000000837dbb067 PUD 85a1fa067 PMD 81b3f5067 PTE 7fbe55025; which corresponds to (using libsgxstep print_mapping()):

  |-> pgd
       |- base phys:  0x0
       |- index:      0
       |- value:      0x8000000837dbb067
       |    |- present:    1
       |    |- accessed:   1
       |    |- writeable:  1
       |    |- executable: 0
       |
       |-> pud
            |- base phys:  0x837dbb000
            |- index:      0
            |- value:      0x85a1fa067
            |    |- present:    1
            |    |- page size:  0
            |    |- accessed:   1
            |    |- writeable:  1
            |    |- executable: 1
            |
            |-> pmd
                 |- base phys:  0x85a1fa000
                 |- index:      0
                 |- value:      0x81b3f5067
                 |    |- present:    1
                 |    |- page size:  0
                 |    |- accessed:   1
                 |    |- writable:   1
                 |    |- executable: 1
                 |
                 |-> pte
                     |- base phys:  0x81b3f5000
                     |- index:      0
                     |- value:      0x7fbe55025
                     |    |- present:    1
                     |    |- accessed:   1
                     |    |- writable:   0
                     |    |- executable: 1
                     |    |- dirty:      0
                     |

--> so it seems somehow XD=1 on the top-level PGD entry for the faulting address?!
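The same check can be done directly on the dmesg line, without going through print_mapping(). A minimal sketch, assuming the "PGD ... P4D ... PUD ... PMD ... PTE ..." format shown in the logs above; `parse_mapping` and `levels_with_xd` are hypothetical helper names.

```python
import re

XD_BIT = 1 << 63  # execute-disable flag in every paging-structure entry


def parse_mapping(line: str) -> dict:
    """Parse the page-table dump line that the kernel prints on an oops."""
    pairs = re.findall(r"\b(PGD|P4D|PUD|PMD|PTE) ([0-9a-f]+)", line)
    return {level: int(value, 16) for level, value in pairs}


def levels_with_xd(mapping: dict) -> list:
    """Return the paging levels whose entry has the XD bit set."""
    return [level for level, value in mapping.items() if value & XD_BIT]


line = "PGD 8000000837dbb067 P4D 8000000837dbb067 PUD 85a1fa067 PMD 81b3f5067 PTE 7fbe55025"
print(levels_with_xd(parse_mapping(line)))
```

PGD and P4D show the same value because the P4D level is folded into the PGD when only four paging levels are in use, so this is really a single entry with XD set.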

Conclusion: I have no idea why PGD.XD=1, but this is clearly a violation so something is going wrong somewhere. I'd be interested to better understand which address is causing this: can you check if this is the ss_irq_handler address as per the ./app output?

jovanbulck commented 2 years ago

So trying to further understand the dmesg output, it seems the faulting address indeed corresponds to the user-space ss_irq_handler (from the marked <48> onwards):

[ 44.813383] RIP: 0010:0x55555555a000
[ 44.813384] Code: 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 84 00 00 00 00 00 <48> 89 05 15 60 00 00 48 89 15 16 60 00 00 0f 31 89 05 fe 5f 00 00

in the app objdump:

0000000000006000 <__ss_irq_handler>:
    6000:       48 89 05 15 60 00 00    mov    %rax,0x6015(%rip)        # c01c <__ss_irq_rax>
    6007:       48 89 15 16 60 00 00    mov    %rdx,0x6016(%rip)        # c024 <__ss_irq_rdx>
    600e:       0f 31                   rdtsc  
    6010:       89 05 fe 5f 00 00       mov    %eax,0x5ffe(%rip)        # c014 <nemesis_tsc_aex>
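This byte comparison can be scripted. The sketch below assumes the usual oops convention that the byte in angle brackets on the "Code:" line is the one at RIP; `bytes_from_marker` is a hypothetical helper name.

```python
def bytes_from_marker(code_line: str) -> bytes:
    """Return the bytes of a dmesg 'Code:' line starting at the <xx> marker."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    tokens[start] = tokens[start].strip("<>")
    return bytes(int(tok, 16) for tok in tokens[start:])


# Tail of the 'Code:' line for RIP 0010:0x55555555a000 (leading bytes shortened).
dmesg_code = ("66 0f 1f 84 00 00 00 00 00 <48> 89 05 15 60 00 00 "
              "48 89 15 16 60 00 00 0f 31 89 05 fe 5f 00 00")

# First four instructions of __ss_irq_handler from the app objdump.
objdump_bytes = bytes.fromhex(" ".join([
    "48 89 05 15 60 00 00",  # mov %rax,0x6015(%rip)
    "48 89 15 16 60 00 00",  # mov %rdx,0x6016(%rip)
    "0f 31",                 # rdtsc
    "89 05 fe 5f 00 00",     # mov %eax,0x5ffe(%rip)
]))

print(bytes_from_marker(dmesg_code) == objdump_bytes)
```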

Also, not sure why there is also the user-space segment being printed further on (I assume this is where exit_to_user wants to jump to). This corresponds to aep_trampoline in the app objdump:

 [ 44.813401] RIP: 0033:0x5555555592fd
[ 44.813402] Code: 3d 1c 3a 00 00 b8 00 00 00 00 e8 fe bd ff ff e8 b0 fd ff ff 89 c6 48 8d 3d 21 3a 00 00 b8 00 00 00 00 e8 e6 bd ff ff 90 c9 c3 <48> 89 1d a4 6d 00 00 48 8d 05 95 6d 00 00 48 8b 00 48 85 c0 74 02

00000000000052fd <sgx_step_aep_trampoline>:
    52fd:       48 89 1d a4 6d 00 00    mov    %rbx,0x6da4(%rip)        # c0a8 <sgx_step_tcs>
    5304:       48 8d 05 95 6d 00 00    lea    0x6d95(%rip),%rax        # c0a0 <sgx_step_aep_cb>
    530b:       48 8b 00                mov    (%rax),%rax
    530e:       48 85 c0                test   %rax,%rax
    5311:       74 02                   je     5315 <sgx_step_aep_trampoline+0x18>
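As a sanity check that both addresses belong to the same binary, the load base implied by each RIP/symbol pair can be computed. A minimal sketch, assuming the app is a position-independent executable whose objdump addresses are offsets from its load base:

```python
def load_base(runtime_addr: int, objdump_addr: int) -> int:
    """Load base implied by a runtime address and its objdump symbol address."""
    return runtime_addr - objdump_addr


# Kernel-mode RIP vs. __ss_irq_handler, user-mode RIP vs. sgx_step_aep_trampoline.
base_irq = load_base(0x55555555A000, 0x6000)
base_aep = load_base(0x5555555592FD, 0x52FD)

print(hex(base_irq), hex(base_aep))
```

Both pairs give the same base, which is consistent with the faulting kernel-mode RIP being __ss_irq_handler and the interrupted user-mode RIP being sgx_step_aep_trampoline in the same app binary.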

jovanbulck commented 2 years ago

Update: I could reproduce this issue minimally on a separate branch in the commit linked above.

The problem seems to be that the PGD/P4D entry covering the user memory range has the XD bit set: this does not cause a page fault for user-mode accesses, but it does fault in kernel mode, even when CR4.SMEP/SMAP are cleared.

I am trying to find out if this x86 behavior is clearly documented, so as to write a proper patch. Interestingly, the Intel SDM only includes this sentence for supervisor-mode accesses:

instructions may not be fetched from any user-mode address with a translation for which the XD flag is 1 in any paging-structure entry controlling the translation.

On my machine the MWE gives:

$ ./app 
[idt.c] locking IRQ handler pages 0x557ba4d5e000/0x557ba4d64000
[main.c] dummy_fun at 0x557ba4d5e05b returned c0de
[pt.c] mapping->fun at 0x557ba4d5e05b returned c0de
[pt.c] /dev/sgx-step opened!
Killed

$ dmesg | tail -n 70
[ 7591.783861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7591.783869] CR2: 0000558a07c8c05b CR3: 0000000208d30003 CR4: 00000000000706f0
[ 7625.095838] [sgx-step] kernel module unloaded
[ 7625.134307] [sgx-step] listening on /dev/sgx-step
[ 7638.934609] cr4 before 706e0
[ 7638.934618] cr4 masked 706e0
[ 7638.934620] fun at ffffffffc1075040 with mapping:
[ 7638.934623] PGD 2d7015067 P4D 2d7015067 PUD 2d7017067 PMD 1049cc067 PTE 106923061
[ 7638.934632] returned badc0de
[ 7638.934635] fun at 557ba4d5e05b with mapping:
[ 7638.934637] PGD 80000001da3bb067 P4D 80000001da3bb067 PUD 23b7c9067 PMD 22f781067 PTE 1fc6e7025
[ 7638.934647] BUG: unable to handle page fault for address: 0000557ba4d5e05b
[ 7638.934651] #PF: supervisor instruction fetch in kernel mode
[ 7638.934655] #PF: error_code(0x0011) - permissions violation
[ 7638.934658] PGD 80000001da3bb067 P4D 80000001da3bb067 PUD 23b7c9067 PMD 22f781067 PTE 1fc6e7025
[ 7638.934666] Oops: 0011 [#8] SMP PTI
[ 7638.934671] CPU: 3 PID: 22467 Comm: app Tainted: G      D    OE     5.13.0-40-generic #45~20.04.1-Ubuntu
[ 7638.934677] Hardware name: Purism Librem 13 v2/Librem 13 v2, BIOS 4.9-Purism-2 11/13/2019
[ 7638.934679] RIP: 0010:0x557ba4d5e05b
[ 7638.934684] Code: Unable to access opcode bytes at RIP 0x557ba4d5e031.
[ 7638.934686] RSP: 0018:ffffa81b0387fd38 EFLAGS: 00010246
[ 7638.934690] RAX: 0000000000000000 RBX: 0000557ba4d5e05b RCX: 0000000000000027
[ 7638.934694] RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff918b6fda0988
[ 7638.934696] RBP: ffffa81b0387fd48 R08: ffff918b6fda0980 R09: ffffa81b0387fb00
[ 7638.934699] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000000706e0
[ 7638.934703] R13: 0000000000000040 R14: ffff91880dd93600 R15: ffffffffc1075690
[ 7638.934706] FS:  00007f37b34d6740(0000) GS:ffff918b6fd80000(0000) knlGS:0000000000000000
[ 7638.934710] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7638.934713] CR2: 0000557ba4d5e05b CR3: 000000022ffa6006 CR4: 00000000000706e0
[ 7638.934717] Call Trace:
[ 7638.934720]  <TASK>
[ 7638.934724]  ? do_fun+0x29/0x3a [sgx_step]
[ 7638.934732]  sgx_step_get_pt_mapping+0x5a/0x2e0 [sgx_step]
[ 7638.934736]  step_ioctl+0xb6/0x180 [sgx_step]
[ 7638.934741]  ? vfs_write+0x1c3/0x250
[ 7638.934747]  ? vfs_write+0x1c3/0x250
[ 7638.934751]  ? exit_to_user_mode_prepare+0x3d/0x1c0
[ 7638.934756]  ? ksys_write+0x67/0xe0
[ 7638.934759]  __x64_sys_ioctl+0x91/0xc0
[ 7638.934764]  do_syscall_64+0x61/0xb0
[ 7638.934767]  ? __x64_sys_write+0x1a/0x20
[ 7638.934770]  ? do_syscall_64+0x6e/0xb0
[ 7638.934772]  ? irqentry_exit+0x19/0x30
[ 7638.934775]  ? exc_page_fault+0x8f/0x170
[ 7638.934778]  ? asm_exc_page_fault+0x8/0x30
[ 7638.934784]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7638.934789] RIP: 0033:0x7f37b35ed3db
[ 7638.934793] Code: 0f 1e fa 48 8b 05 b5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 7a 0d 00 f7 d8 64 89 01 48
[ 7638.934797] RSP: 002b:00007ffe72f77758 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 7638.934801] RAX: ffffffffffffffda RBX: 0000557ba4d5e5c0 RCX: 00007f37b35ed3db
[ 7638.934804] RDX: 0000557ba6bce6b0 RSI: 00000000c0404c01 RDI: 0000000000000003
[ 7638.934806] RBP: 00007ffe72f77780 R08: 0000000000000000 R09: 0000000000000034
[ 7638.934809] R10: 0000000000000000 R11: 0000000000000202 R12: 0000557ba4d5b000
[ 7638.934811] R13: 00007ffe72f778e0 R14: 0000000000000000 R15: 0000000000000000
[ 7638.934816]  </TASK>
[ 7638.934818] Modules linked in: sgx_step(OE) rfcomm xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc ccm aufs cmac algif_hash overlay algif_skcipher af_alg bnep binfmt_misc nls_iso8859_1 snd_soc_skl snd_hda_codec_hdmi snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_codec_realtek snd_soc_acpi_intel_match snd_hda_codec_generic snd_soc_acpi ledtrig_audio snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq ath9k intel_rapl_msr ath9k_common uvcvideo intel_tcc_cooling snd_seq_device x86_pkg_temp_thermal intel_powerclamp videobuf2_vmalloc ath9k_hw ath3k snd_timer videobuf2_memops coretemp videobuf2_v4l2 btusb videobuf2_common btrtl kvm_intel ath btbcm videodev
[ 7638.934905]  btintel kvm mc bluetooth rapl mac80211 intel_cstate snd joydev input_leds ecdh_generic ee1004 serio_raw ecc processor_thermal_device processor_thermal_rfim cfg80211 soundcore processor_thermal_mbox libarc4 processor_thermal_rapl intel_rapl_common intel_xhci_usb_role_switch topstar_laptop sparse_keymap int340x_thermal_zone intel_pch_thermal intel_soc_dts_iosf mac_hid sch_fq_codel ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp parport ip_tables x_tables autofs4 dm_crypt crct10dif_pclmul crc32_pclmul i915 ghash_clmulni_intel dwc3 ulpi udc_core i2c_algo_bit drm_kms_helper aesni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops crypto_simd cec cryptd rc_core psmouse drm ahci i2c_i801 i2c_smbus libahci xhci_pci video dwc3_pci xhci_pci_renesas pinctrl_sunrisepoint [last unloaded: sgx_step]
[ 7638.934983] CR2: 0000557ba4d5e05b
[ 7638.934987] ---[ end trace 3e1a52891a11ba4b ]---
[ 7638.934989] RIP: 0010:0x55c84b5e0000
[ 7638.934992] Code: Unable to access opcode bytes at RIP 0x55c84b5dffd6.
[ 7638.934994] RSP: 0018:ffffa81b0399fd60 EFLAGS: 00010246
[ 7638.934999] RAX: 000055c84b5e0000 RBX: ffffa81b0399fd88 RCX: 0000000000000000
[ 7638.935002] RDX: 0000000000000000 RSI: ffff918b6fd20980 RDI: ffff918b6fd20980
[ 7638.935004] RBP: ffffa81b0399fd78 R08: ffff918b6fd20980 R09: ffffa81b0399fb48
[ 7638.935007] R10: 0000000000000001 R11: 0000000000000001 R12: 000055c84babf6b0
[ 7638.935009] R13: 0000000000000040 R14: ffff918908f56500 R15: ffffffffc1075778
[ 7638.935012] FS:  00007f37b34d6740(0000) GS:ffff918b6fd80000(0000) knlGS:0000000000000000
[ 7638.935015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7638.935018] CR2: 0000557ba4d5e05b CR3: 000000022ffa6006 CR4: 00000000000706e0

jovanbulck commented 2 years ago

I confirmed that this MWE works perfectly when rebooting with the Linux kernel options nosmep nosmap noexec=off.

@tonitick: you could try the above kernel options as a quick workaround that hopefully solves your problem. Do let me know whether it improves or solves the crashes you observed.
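To confirm after a reboot that the options were actually picked up, the running kernel's command line can be checked. This is a small sketch: the option list comes from this comment, the sample string is the grub line posted earlier in the thread, and the helper names are made up for illustration.

```python
REQUIRED = ("nosmep", "nosmap", "noexec=off")


def missing_options(cmdline: str, required=REQUIRED) -> list:
    """Return the required kernel options that are absent from a command line."""
    present = set(cmdline.split())
    return [opt for opt in required if opt not in present]


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Kernel parameters as posted earlier in this thread (before adding noexec=off).
sample = ("quiet splash nox2apic iomem=relaxed no_timer_check nosmep nosmap "
          "clearcpuid=514 isolcpus=1 nmi_watchdog=0 nokaslr")
print(missing_options(sample))
```

On the target machine, `missing_options(read_cmdline())` should return an empty list once the new options are active.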

It should also be possible to write a patch that clears the XD bits in the PUD/PGD user-space entries at runtime, but I'd first like to properly understand and confirm this x86+Linux behavior to hopefully write a proper patch :)

For what it's worth, the reason why you only sometimes see this crash is that, in my understanding, it only gets triggered when the APIC timer interrupt fires while the kernel is executing. That is not the intention and normally doesn't happen, but it can ofc sometimes happen that the kernel interrupts the sgx-step application just before the APIC timer fires (cf. the tricky bug in #23 that has since been fixed).

Reference output on my machine:

$ ./app 
[idt.c] locking IRQ handler pages 0x56385c3c4000/0x56385c3ca000
[main.c] dummy_fun at 0x56385c3c405b returned c0de
[pt.c] mapping->fun at 0x56385c3c405b returned c0de
[pt.c] /dev/sgx-step opened!
Mapping [address = 56385c3c405b -> 15e58905b ]
  |-> pgd
       |- base phys:  0x10f52a000
       |- index:      172
       |- value:      0x10e525067
       |    |- present:    1
       |    |- accessed:   1
       |    |- writeable:  1
       |    |- executable: 1
       |
       |-> pud
            |- base phys:  0x10e525000
            |- index:      225
            |- value:      0x106e56067
            |    |- present:    1
            |    |- page size:  0
            |    |- accessed:   1
            |    |- writeable:  1
            |    |- executable: 1
            |
            |-> pmd
                 |- base phys:  0x106e56000
                 |- index:      225
                 |- value:      0x1095fa067
                 |    |- present:    1
                 |    |- page size:  0
                 |    |- accessed:   1
                 |    |- writable:   1
                 |    |- executable: 1
                 |
                 |-> pte
                     |- base phys:  0x1095fa000
                     |- index:      452
                     |- value:      0x15e589025
                     |    |- present:    1
                     |    |- accessed:   1
                     |    |- writable:   0
                     |    |- executable: 1
                     |    |- dirty:      0
                     |
                     |-> PAGE
                           |- virt address:      0x56385c3c405b
                           |- index:             0x5b
                           |- base phys address: 0x15e589000
                           |- phys address:      0x15e58905b
jo@librem:~/Documents/sgx-step/app/idt$ dmesg | tail
[   74.360743] [sgx-step] listening on /dev/sgx-step
[   79.355691] process 'sgx-step/app/idt/app' started with executable stack
[   79.358030] cr4 before 706e0
[   79.358039] cr4 masked 706e0
[   79.358043] fun at ffffffffc0f1b040 with mapping:
[   79.358048] PGD 185015067 P4D 185015067 PUD 185017067 PMD 1135d5067 PTE 10dbc9061
[   79.358068] returned badc0de
[   79.358072] fun at 56385c3c405b with mapping:
[   79.358076] PGD 10e525067 P4D 10e525067 PUD 106e56067 PMD 1095fa067 PTE 15e589025
[   79.358090] returned c0de
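
The indices and flag bits in the dump above follow from the standard x86-64 4-level paging layout (9 index bits per level, 12-bit page offset, XD in bit 63). As a minimal sketch, the following Python reproduces them from the virtual address and PTE value shown; the helper names are illustrative and are not part of libsgxstep.

```python
# Illustrative helpers (not libsgxstep API): decode an x86-64 4-level
# virtual address and a page-table entry as printed in the dump above.

PHYS_MASK = 0x000FFFFFFFFFF000  # bits 12..51 hold the physical frame address

def split_vaddr(vaddr):
    """Return the PGD/PUD/PMD/PTE indices and the page offset."""
    return {
        "pgd": (vaddr >> 39) & 0x1FF,
        "pud": (vaddr >> 30) & 0x1FF,
        "pmd": (vaddr >> 21) & 0x1FF,
        "pte": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }

def decode_entry(entry):
    """Decode the flag bits that the dump prints for each level."""
    return {
        "present": entry & 1,
        "writable": (entry >> 1) & 1,
        "user": (entry >> 2) & 1,
        "accessed": (entry >> 5) & 1,
        "dirty": (entry >> 6) & 1,
        "executable": 0 if (entry >> 63) & 1 else 1,  # XD/NX is bit 63
        "phys": entry & PHYS_MASK,
    }

idx = split_vaddr(0x56385C3C405B)   # address of the user-space handler
pte = decode_entry(0x15E589025)     # PTE value from the dump
print(idx)
print(hex(pte["phys"] + idx["offset"]))
```

The computed indices (172, 225, 225, 452) and the physical address 0x15e58905b agree with the values libsgxstep printed.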

jovanbulck commented 2 years ago

Interestingly, digging further with git blame on the Linux kernel source, I could pinpoint this behavior to this specific Linux commit which explicitly mentions that user-space PGD/P4D.NX bits are set when PTI is enabled:

With PAGE_TABLE_ISOLATION the user portion of the kernel page tables is poisoned with the NX bit so if the entry code exits with the kernel page tables selected in CR3, userspace crashes.

This leads to an interesting SGX-Step bug!

I think I now get what's going on here:

x86 always disallows execution when the XD bit is set at any paging level (i.e., for both user and kernel mode, as logically expected and contrary to my earlier hypothesis).
Thus, the "normal" user-space page tables have no XD set in PGD/P4D.
However, when PTI is enabled, the kernel maintains its own separate page table with its own separate view of user space. In this separate kernel page table, it sets PGD/P4D.XD to prevent user space from (accidentally) executing with this kernel page table (as doing so would re-enable Meltdown). This is fine from Linux's point of view, as the kernel never intends to execute user memory, whereas we do with the libsgxstep IRQ handlers(!)
Now, in the (unlikely!) scenario described above and in #23, our APIC timer interrupt may sometimes arrive while in kernel mode, i.e., while the kernel page table with XD is active. In this case, the processor generates a #PF, which the kernel doesn't expect, so it panics.
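
The rule from the first point can be sketched in a few lines of Python: a translation is executable only if XD (bit 63) is clear at every level of the walk. The entry values are the ones from the dmesg output earlier in this thread. The second walk sets XD on the PGD entry by hand to mimic, as an assumption, how the PTI kernel page table would look. The function names are illustrative only.

```python
XD_BIT = 1 << 63  # execute-disable (NX) bit in an x86-64 page-table entry

def is_executable(walk):
    """A translation is executable only if no level has XD set."""
    return all((entry & XD_BIT) == 0 for entry in walk)

def clear_xd(entry):
    """Return the entry with XD cleared (roughly what a runtime patch would write back)."""
    return entry & ~XD_BIT

# PGD, PUD, PMD, PTE values from the dmesg line for the user-space handler.
user_walk = [0x10E525067, 0x106E56067, 0x1095FA067, 0x15E589025]

# Hypothetical kernel (PTI) view: same walk, but XD "poison" set on the PGD.
kernel_walk = [user_walk[0] | XD_BIT] + user_walk[1:]

print(is_executable(user_walk))                            # True
print(is_executable(kernel_walk))                          # False
print(is_executable([clear_xd(e) for e in kernel_walk]))   # True
```

A single poisoned top-level entry is enough to make every page below it non-executable, which is why the handler faults only when the interrupt arrives with the kernel page table active.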

The MWE can be explained because libsgxstep works with /dev/sgx-step to do the page-table walk in kernel space, and hence retrieves the kernel page table with XD set rather than the user-space page table that will actually be used. This explains why my MWE crashes in kernel space but works in user space: they use different page tables(!)

So, I should look into properly disabling this XD "poison" bit from /dev/sgx-step (i.e., without resorting to custom Linux kernel compilation).

In conclusion, until a proper patch is available, I'm quite confident these panics should go away when rebooting the kernel with noexec=off, as I wrote above, or even when disabling PTI entirely with pti=off. Let me know if this works for you.
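
To double-check after rebooting that one of these parameters actually took effect, the running kernel's command line can be inspected via /proc/cmdline. The snippet below is only a sketch: the function name is made up for illustration, and it checks just the two parameters named in this thread.

```python
def workaround_active(cmdline):
    """Return True if the kernel command line contains noexec=off or pti=off."""
    params = cmdline.split()
    return "noexec=off" in params or "pti=off" in params

if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running Linux kernel was booted with.
    try:
        with open("/proc/cmdline") as f:
            print(workaround_active(f.read()))
    except OSError:
        print("could not read /proc/cmdline")
```

Matching whole whitespace-separated tokens avoids false positives from parameters that merely contain one of these strings.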

jovanbulck commented 2 years ago

Closing this for now, as the bug has been identified and can be prevented by passing the noexec=off kernel boot parameter, as now also documented in the README.

I might still consider implementing a kernel fix later on that clears the XD "poison" bits in the PGD/P4D entries of the kernel KPTI page table, but I think for now the noexec=off solution is cleaner and perhaps more future-proof, especially since SGX-Step already relies on certain kernel boot parameters being set.

@tonitick Thanks again for reporting! This was an interesting bug and I'm glad I could pinpoint and fix it, which should hopefully make single-stepping much more stable now :)

tonitick commented 2 years ago

Hi @jovanbulck, thanks so much for the help. It works perfectly with my code. Sorry for the late reply; I was dealing with other issues in my code and just figured out that they are not related to this issue. Thanks again, I learned a lot from your post!