Closed kitakar5525 closed 5 years ago
@qzed
have you looked at intel_context_pin?
Yes, I tried like this first:
- pin_ret = execlists_context_pin(dev_priv->engine[RCS], ipts_ctx);
+ pin_ret = intel_context_pin(ipts_ctx, dev_priv->engine[RCS0]);
if (IS_ERR(pin_ret)) {
DRM_DEBUG("lr context pinning failed : %ld\n", PTR_ERR(pin_ret));
goto err_ctx;
but it is not working, resulting in a black screen right after boot and no logs available regarding the boot using journalctl -k -b -1
Also, I had been misunderstanding execlists_context_pin()
is not working with ipts. I tested and the function is working instead of __execlists_context_pin()
. I updated ipts patch.
Just tested your regular 5.2 patch
I assume you applied all the patches under patch-5.2/0000-jakeday/
and not applied patches in another directory.
suspend seems to be broken for me
Strange... I did not modify other than cameras.patch and ipts.patch in 0000-jakeday
directory. Is suspend working on distribution's Linux 5.2.y kernel?
EDIT: Suspend is working on my SB1 (applying one another patch to ipts because I don't want to remove ipts module before suspend like this systemd/system-sleep/sleep on this repository)
After applying https://github.com/kitakar5525/ipts-fix-crash/blob/master/ipts-fix-crash-caused-by-calling-ipts_send_feedback-.patch touch doesn't work on my SB2
Hmmm...🤔 then, we should change the code to be able to toggle the changes by module parameter or something.
Is this the right patch or do I need to apply anything else?
Yes, it's also a right patch, no need to apply any other patches.
@kitakar5525
to_intel_context()
is not available anymore. So, I usedintel_context_pin_lock()
instead. I actually don't knowintel_context_pin_lock()
is appropriate for replacement of that function.
https://github.com/torvalds/linux/commit/c4d52feb2c46ddcdde4058cf03f8b9eb996bb09b replaces to_intel_context
with intel_context_lookup
, though I'm not sure if that is always a valid option. The patch series breaking IPTS introduce some sort of new locking system, so I guess we need to use intel_context_pin_lock
at least somewhere.
Yes, I tried like this first [...] but it is not working resulting in a black screen right after boot and no logs available regarding the boot using
journalctl -k -b -1
.
Got it.
I assume you applied all the patches under patch-5.2/0000-jakeday/ and not applied patches in another directory.
Correct.
Is suspend working on distribution's Linux 5.2.y kernel?
Works fine without IPTS. Thanks to the link to your modification I've closed in on the problem a bit: It's not actually resume that's causing the issue, it's the unloading (so should be intel_ipts_cleanup
). Unloading the module also causes the system to freeze, although in this case I think Gnome or GDM is somehow able to recover the system, at least when I'm on Wayland (although Wayland is not an option in GDM after that any more). When I'm on Xorg, it seems to work at first but eventually it freezes and the CPU gets hot and before that I get a lot of [drm:i915_guc_ipts_reacquire_doorbell [i915]] *ERROR* Not able to reacquire IPTS doorbell
complete warnings (including call traces) for Unexpected send: action=0x10
in intel_guc_send_nop
(drivers/gpu/drm/i915/intel_guc.c:385
).
Suspend/resume works properly with the suspend/resume functions removed (and systemd-sleep changed accordingly), so it really just is the unloading.
Hmmm... then, we should change the code to be able to toggle the changes by module parameter or something.
That would be an option, I'd prefer if we could find a way to automatically detect it, but that might not be possible.
I just tried a couple of things, but no success so far:
intel_context_pin_lock
in destropy_ipts_context [intel_ipts.c]
to intel_context_lookup
. No change at all.intel_context_pin_lock
in create_ipts_context [intel_ipts.c]
to intel_context_lookup
. Does not boot.intel_context_pin_unlock
in create_ipts_context [intel_ipts.c]
(just before the mutex_unlock
). No changes.intel_ipts
module:intel_ipts
module will lead to GPU hang on SB1. (I'm not sure if it was possible even before, though).
sync && sync && sync
sudo modprobe -r intel_ipts sudo rmmod -f mei_me sudo rmmod -f mei_hdcp sudo rmmod -f mei
sudo modprobe mei sudo modprobe mei_hdcp sudo modprobe mei_me
---
#### About suspend and IPTS:
under NOT reverting suspend/resume mechanism condition:
- on suspend, removing modules like [systemd/system-sleep/sleep](https://github.com/jakeday/linux-surface/blob/5.1.15-1/root/lib/systemd/system-sleep/sleep) sometimes work, sometimes don't work on older kernels. (I've not tested on 5.2.y)
---
under reverting suspend/resume mechanism condition:
- occasionally IPTS will break resumimg from suspend (I feel it will break about 1 time per 30 attempts). This will occur on older kernels, too. It just occurred yesterday on 5.2.1.
<details><summary>Short `journal -k -b -1` log</summary>
<p>
Jul 20 17:44:42 archlinux kernel: general protection fault: 0000 [#1] PREEMPT SMP PTI Jul 20 17:44:42 archlinux kernel: CPU: 0 PID: 11381 Comm: ipts_event_thre Tainted: G C OE 5.2.1-arch1-1-surface #1 Jul 20 17:44:42 archlinux kernel: Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 91.2706.768 04/18/2019 Jul 20 17:44:42 archlinux kernel: RIP: 0010:kmem_cache_alloc_trace+0x87/0x1d0 Jul 20 17:44:42 archlinux kernel: Code: 4c 48 8b 70 08 48 39 f2 75 e7 4c 8b 20 4d 85 e4 0f 84 02 01 00 00 41 8b 58 20 49 8b 38 48 8d 8a 00 02 00 00 4c 89 e0 4c 01 e3 <48> 33 1b 49 33 98 30 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 ae Jul 20 17:44:42 archlinux kernel: RSP: 0018:ffffadb10397b898 EFLAGS: 00010282 Jul 20 17:44:42 archlinux kernel: RAX: d112ea30f9428f43 RBX: d112ea30f9428f43 RCX: 000000000151d400 Jul 20 17:44:42 archlinux kernel: RDX: 000000000151d200 RSI: 000000000151d200 RDI: 000000000002f060 Jul 20 17:44:42 archlinux kernel: RBP: 0000000000000cc0 R08: ffff8a165ec03b00 R09: 0000000000000000 Jul 20 17:44:42 archlinux kernel: R10: 0000000000001000 R11: 0000000000000400 R12: d112ea30f9428f43 Jul 20 17:44:42 archlinux kernel: R13: 0000000000000010 R14: ffff8a165ec03b00 R15: ffffffffc0267c3d Jul 20 17:44:42 archlinux kernel: FS: 0000000000000000(0000) GS:ffff8a165f200000(0000) knlGS:0000000000000000 Jul 20 17:44:42 archlinux kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 20 17:44:42 archlinux kernel: CR2: 0000564083d5ee5c CR3: 000000045b0e4006 CR4: 00000000003606f0 Jul 20 17:44:42 archlinux kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 20 17:44:42 archlinux kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jul 20 17:44:42 archlinux kernel: Call Trace: Jul 20 17:44:42 archlinux kernel: i915_gem_object_get_pages_gtt+0xbd/0x5c0 [i915] Jul 20 17:44:42 archlinux kernel: ? percpu_counter_add_batch+0x1a/0xb0 Jul 20 17:44:42 archlinux kernel: i915_gem_object_get_pages+0x54/0x60 [i915] Jul 20 17:44:42 archlinux kernel: i915_vma_do_pin+0x297/0x4b0 [i915] Jul 20 17:44:42 archlinux kernel: intel_ipts_map_buffer+0x26c/0x2b0 [i915] Jul 20 17:44:42 archlinux kernel: ipts_map_buffer+0x4d/0x70 [intel_ipts] Jul 20 17:44:42 archlinux kernel: setup_kernel+0x847/0xe80 [intel_ipts] Jul 20 17:44:42 archlinux kernel: ipts_init_kernels+0x69/0xa0 [intel_ipts] Jul 20 17:44:42 archlinux kernel: ipts_allocate_raw_data_resource+0x11/0x20 [intel_ipts] Jul 20 17:44:42 archlinux kernel: ipts_handle_resp+0x2b7/0x410 [intel_ipts] Jul 20 17:44:42 archlinux kernel: ipts_mei_cl_event_thread+0x5d/0x90 [intel_ipts] Jul 20 17:44:42 archlinux kernel: kthread+0xfd/0x130 Jul 20 17:44:42 archlinux kernel: ? init_work_func+0x30/0x30 [intel_ipts] Jul 20 17:44:42 archlinux kernel: ? kthread_park+0x90/0x90 Jul 20 17:44:42 archlinux kernel: ret_from_fork+0x35/0x40 Jul 20 17:44:42 archlinux kernel: Modules linked in: mwifiex_pcie mwifiex cfg80211 rfcomm cmac overlay input_leds bnep btusb btrtl btbcm btintel bluetooth ecdh_generic ecc xt_CHECKSUM iptable_mangle xt_tcpudp xt_MASQUERADE xt_comment iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc iptable_filter usbhid uinput vmnet(OE) hid_sensor_als hid_sensor_rotation hid_sensor_accel_3d hid_sensor_gyro_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio intel_rapl intel_pmc_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_soc_skl snd_soc_hdac_hda msr snd_hda_ext_core snd_soc_skl_ipc kvm mousedev joydev snd_soc_sst_ipc hid_sensor_hub snd_soc_sst_dsp snd_soc_acpi_intel_match nls_iso8859_1 snd_soc_acpi nls_cp437 vfat snd_hda_codec_hdmi irqbypass fat snd_soc_core snd_compress crct10dif_pclmul fuse snd_hda_codec_realtek squashfs ac97_bus snd_pcm_dmaengine snd_hda_codec_generic ledtrig_audio crc32_pclmul loop snd_hda_intel Jul 20 17:44:42 archlinux kernel: ghash_clmulni_intel hid_multitouch hid_generic snd_hda_codec intel_ipts mei_hdcp ipu3_imgu(C) snd_hda_core ipu3_cio2 aesni_intel snd_hwdep v4l2_fwnode videobuf2_dma_sg videobuf2_memops snd_pcm aes_x86_64 crypto_simd videobuf2_v4l2 cryptd glue_helper intel_cstate videobuf2_common snd_timer intel_uncore mei_me snd videodev intel_rapl_perf pcspkr rfkill soundcore mei idma64 tpm_crb media intel_xhci_usb_role_switch roles intel_lpss_pci i2c_hid intel_pch_thermal intel_lpss hid surfacepro3_button battery soc_button_array ac evdev mac_hid tpm_tis tpm_tis_core tpm rng_core pcc_cpufreq vmmon(OE) vmw_vmci vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg scsi_mod crypto_user binder_linux(OE) ashmem_linux(OE) acpi_call(OE) ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci crc32c_intel xhci_hcd i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm intel_agp intel_gtt agpgart [last unloaded: cfg80211] Jul 20 17:44:42 archlinux kernel: ---[ end trace dc31499ee79cde45 ]---
</p>
</details>
Longer `journalctl -k -b -1` with html color output (rename it to .html and open in browser)
[2019-07-20-ipts-crash-on-resume-from-s2idle-journalctl-k-b_-1.html.txt](https://github.com/jakeday/linux-surface/files/3414388/2019-07-20-ipts-crash-on-resume-from-s2idle-journalctl-k-b_-1.html.txt)
Or, plain text log
[2019-07-20-ipts-crash-on-resume-from-s2idle-journalctl-k-b_-1.txt](https://github.com/jakeday/linux-surface/files/3414392/2019-07-20-ipts-crash-on-resume-from-s2idle-journalctl-k-b_-1.txt)
It seems it will hard to find the cause for me.
- I'm aware a long time ago removing only intel_ipts module will lead to GPU hang on SB1. (I'm not sure if it was possible even before, though).
Interesting, I didn't know that. I could remove the module before without any problems though.
under reverting suspend/resume mechanism condition:
- occasionally IPTS will break resumimg from suspend (I feel it will break about 1 time per 30 attempts). This will occur on older kernels, too. It just occurred yesterday on 5.2.1.
Right, might also be a good idea to look at that at some point and see if we can find a root cause for this, but that might be a bit hard.
Your log is interesting, i'm not sure where kmem_cache_alloc_trace
fits in there. On my device I get a different message from time to time (always the same but sometimes it's there, sometimes not):
``` Jul 21 13:56:03 xnb kernel: i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, no progress on rcs0 Jul 21 13:56:03 xnb kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace. Jul 21 13:56:03 xnb kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel Jul 21 13:56:03 xnb kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue. Jul 21 13:56:03 xnb kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it. Jul 21 13:56:03 xnb kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error Jul 21 13:56:03 xnb kernel: i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0 Jul 21 13:56:11 xnb kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 Jul 21 13:56:11 xnb kernel: [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x3 failed with error -110 0x3 Jul 21 13:56:11 xnb kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0 Jul 21 13:56:11 xnb kernel: i915 0000:00:02.0: GPU reset not supported ```
``` [ 86.924778] WARN_ON(intel_guc_send(guc, data, (sizeof(data) / sizeof((data)[0]) + (sizeof(struct { int:(-!!(__builtin_types_compatible_p(typeof((data)), typeof(&(data)[0])))); }))))) [ 86.924808] WARNING: CPU: 6 PID: 165 at drivers/gpu/drm/i915/intel_guc_submission.c:643 inject_preempt_context+0x214/0x240 [i915] [ 86.924808] Modules linked in: rfcomm bnep btusb btrtl btbcm btintel bluetooth ecdh_generic ecc 8021q garp mrp stp llc hid_sensor_als hid_sensor_gyro_3d hid_sensor_rotation hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_hub intel_ishtp_hid battery intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel 8250_dw kvm irqbypass iTCO_wdt iTCO_vendor_support snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device nls_iso8859_1 crct10dif_pclmul crc32_pclmul gpio_keys ghash_clmulni_intel aesni_intel snd_hda_codec_hdmi aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_rapl_perf snd_hda_codec_realtek snd_hda_codec_generic pcspkr ledtrig_audio mwlwifi snd_hda_intel mac80211 joydev snd_hda_codec input_leds mwifiex_pcie cdc_ether snd_hda_core idma64 virt_dma snd_hwdep mwifiex ipu3_cio2 usbnet snd_pcm v4l2_fwnode r8152 videobuf2_dma_sg snd_timer mii videobuf2_memops cfg80211 nvidiafb [ 86.924819] processor_thermal_device snd videobuf2_v4l2 i2c_i801 vgastate fb_ddc soundcore mei_me videobuf2_common mei videodev intel_lpss_pci media intel_lpss intel_ish_ipc intel_pch_thermal intel_ishtp intel_soc_dts_iosf mac_hid pinctrl_sunrisepoint pinctrl_intel surface_acpi surfacebook2_dgpu_hps int3403_thermal surfacepro3_button int3400_thermal int340x_thermal_zone acpi_thermal_rel soc_button_array acpi_pad dptf_power sch_fq_codel crypto_user acpi_call(OE) ip_tables x_tables autofs4 uas usb_storage hid_multitouch hid_generic usbhid hid btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm [last unloaded: intel_ipts] [ 86.924828] CPU: 6 PID: 165 Comm: kworker/u17:0 Tainted: G W OE 5.2.1-surface-dev #181 [ 86.924829] Hardware name: Microsoft Corporation Surface Book 2/Surface Book 2, BIOS 389.2706.768 04/18/2019 [ 86.924846] Workqueue: i915-guc_preempt inject_preempt_context [i915] [ 86.924862] RIP: 0010:inject_preempt_context+0x214/0x240 [i915] [ 86.924863] Code: 00 00 00 01 00 00 00 48 89 72 0c c7 42 14 00 00 00 00 e9 b0 fe ff ff 48 c7 c6 78 47 28 c0 48 c7 c7 ce 23 2a c0 e8 18 86 8e ef <0f> 0b 49 8d bd d8 02 00 00 49 0f ba b5 90 03 00 00 01 f0 49 0f ba [ 86.924864] RSP: 0018:ffff97ba40f73e08 EFLAGS: 00010286 [ 86.924865] RAX: 0000000000000000 RBX: ffff8b00254030c0 RCX: 0000000000000000 [ 86.924865] RDX: 00000000000000aa RSI: ffffffffb1554d8a RDI: 0000000000000246 [ 86.924866] RBP: ffff97ba40f73e60 R08: ffffffffb1554ce0 R09: 00000000000000aa [ 86.924866] R10: ffffffffb15550c0 R11: ffffffffb1554cdf R12: ffff97ba41364112 [ 86.924867] R13: ffff8b002184c000 R14: ffff8b0021860c78 R15: ffff8b0025403100 [ 86.924867] FS: 0000000000000000(0000) GS:ffff8b0027380000(0000) knlGS:0000000000000000 [ 86.924868] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 86.924868] CR2: 00007fb2a6a0fb10 CR3: 000000018620a004 CR4: 00000000003606e0 [ 86.924869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 86.924869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 86.924870] Call Trace: [ 86.924872] process_one_work+0x1db/0x3c0 [ 86.924874] worker_thread+0x4d/0x400 [ 86.924875] kthread+0x104/0x140 [ 86.924876] ? process_one_work+0x3c0/0x3c0 [ 86.924877] ? kthread_park+0x90/0x90 [ 86.924879] ret_from_fork+0x35/0x40 [ 86.924880] ---[ end trace 6232d5de11da28ac ]--- [ 86.936846] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x2 failed with error -110 0x2 ```
When I then reload under Xorg (Wayland is unavailable after that):
``` [ 426.852528] ------------[ cut here ]------------ [ 426.852530] Unexpected send: action=0x10 [ 426.852601] WARNING: CPU: 5 PID: 2792 at drivers/gpu/drm/i915/intel_guc.c:385 intel_guc_send_nop+0x17/0x20 [i915] [ 426.852602] Modules linked in: intel_ipts rfcomm bnep btusb btrtl btbcm btintel bluetooth ecdh_generic ecc hid_sensor_gyro_3d hid_sensor_rotation hid_sensor_accel_3d hid_sensor_als hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_hub intel_ishtp_hid 8021q garp mrp stp llc battery intel_rapl x86_pkg_temp_thermal intel_powerclamp 8250_dw coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul nls_iso8859_1 iTCO_wdt ghash_clmulni_intel iTCO_vendor_support snd_usb_audio gpio_keys snd_usbmidi_lib snd_rawmidi snd_seq_device snd_hda_codec_hdmi aesni_intel snd_hda_codec_realtek aes_x86_64 snd_hda_codec_generic crypto_simd cryptd glue_helper mwlwifi intel_cstate ledtrig_audio mac80211 intel_rapl_perf snd_hda_intel snd_hda_codec mwifiex_pcie snd_hda_core mwifiex snd_hwdep cdc_ether ipu3_cio2 input_leds usbnet v4l2_fwnode idma64 joydev virt_dma snd_pcm videobuf2_dma_sg videobuf2_memops snd_timer r8152 videobuf2_v4l2 snd cfg80211 mii [ 426.852627] nvidiafb videobuf2_common pcspkr soundcore vgastate videodev fb_ddc intel_ish_ipc mei_me media processor_thermal_device intel_lpss_pci i2c_i801 mei intel_lpss intel_pch_thermal intel_ishtp intel_soc_dts_iosf mac_hid pinctrl_sunrisepoint surfacebook2_dgpu_hps pinctrl_intel surfacepro3_button surface_acpi int3400_thermal soc_button_array int3403_thermal acpi_thermal_rel dptf_power int340x_thermal_zone acpi_pad sch_fq_codel crypto_user acpi_call(OE) ip_tables x_tables autofs4 uas usb_storage hid_multitouch hid_generic usbhid hid btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm [last unloaded: intel_ipts] [ 426.852650] CPU: 5 PID: 2792 Comm: ipts_event_thre Tainted: G W OE 5.2.1-surface-dev #181 [ 426.852650] Hardware name: Microsoft Corporation Surface Book 2/Surface Book 2, BIOS 389.2706.768 04/18/2019 [ 426.852700] RIP: 0010:intel_guc_send_nop+0x17/0x20 [i915] [ 426.852701] Code: 8b 87 10 fc ff ff 4c 89 c7 48 89 e5 e8 c2 a0 a5 e7 5d c3 0f 1f 44 00 00 55 8b 36 48 c7 c7 26 52 4a c0 48 89 e5 e8 85 82 ce e6 <0f> 0b b8 ed ff ff ff 5d c3 0f 1f 44 00 00 55 48 c7 c7 f0 72 48 c0 [ 426.852702] RSP: 0018:ffff97ac82f4fc18 EFLAGS: 00010286 [ 426.852703] RAX: 0000000000000000 RBX: ffffffffc04c1540 RCX: 0000000000000006 [ 426.852704] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff89a3e7357440 [ 426.852704] RBP: ffff97ac82f4fc18 R08: 00000000000004e3 R09: 0000000000000004 [ 426.852705] R10: 0000000000000000 R11: 0000000000000001 R12: ffff89a377290018 [ 426.852706] R13: ffff89a377290018 R14: ffff89a377290018 R15: ffff97ac82b0f9a0 [ 426.852707] FS: 0000000000000000(0000) GS:ffff89a3e7340000(0000) knlGS:0000000000000000 [ 426.852707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 426.852708] CR2: 00007fd8e45774a6 CR3: 00000001f7c90002 CR4: 00000000003606e0 [ 426.852709] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 426.852709] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 426.852710] Call Trace: [ 426.852758] i915_guc_ipts_reacquire_doorbell+0x54/0x80 [i915] [ 426.852789] intel_ipts_reacquire_db.isra.0+0x29/0x50 [i915] [ 426.852819] intel_ipts_get_wq_info+0x84/0xc0 [i915] [ 426.852823] ipts_open_gpu+0x8b/0xb0 [intel_ipts] [ 426.852825] ? notify_gfx_status+0x30/0x30 [intel_ipts] [ 426.852826] ? ipts_set_output_buffer+0x30/0x30 [intel_ipts] [ 426.852854] ? intel_ipts_reacquire_db.isra.0+0x50/0x50 [i915] [ 426.852881] ? intel_ipts_unmap_buffer+0x50/0x50 [i915] [ 426.852907] ? ipts_object_free+0x70/0x70 [i915] [ 426.852910] ipts_init_kernels+0x30/0xa0 [intel_ipts] [ 426.852912] ipts_allocate_raw_data_resource+0x12/0x20 [intel_ipts] [ 426.852914] ipts_handle_resp+0x309/0x470 [intel_ipts] [ 426.852917] ? __switch_to+0x32c/0x3f0 [ 426.852920] ? __switch_to_asm+0x40/0x70 [ 426.852922] ? __switch_to_asm+0x34/0x70 [ 426.852930] ? mei_io_cb_free.part.0+0x44/0x50 [mei] [ 426.852932] ? kfree+0x143/0x160 [ 426.852936] ? mei_io_cb_free.part.0+0x44/0x50 [mei] [ 426.852939] ? mei_io_cb_free+0x13/0x20 [mei] [ 426.852943] ? __mei_cl_recv+0x73/0x310 [mei] [ 426.852946] ? wait_woken+0x80/0x80 [ 426.852949] ipts_mei_cl_event_thread+0x50/0x80 [intel_ipts] [ 426.852952] ? __kthread_parkme+0x4c/0x70 [ 426.852953] kthread+0x104/0x140 [ 426.852954] ? init_work_func+0x30/0x30 [intel_ipts] [ 426.852956] ? kthread_park+0x90/0x90 [ 426.852958] ret_from_fork+0x35/0x40 [ 426.852959] ---[ end trace 57c154850807796b ]--- [ 426.852990] [drm:i915_guc_ipts_reacquire_doorbell [i915]] *ERROR* Not able to reacquire IPTS doorbell [ 426.863920] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: touch enabled 4 ```
And there's one more thing that I just noticed: If I do not touch the screen at all (neither before nor after unloading the module), it takes a lot longer to hang (i.e. it will still be usable for some time). As soon as I touch the screen it will crash almost immediately.
Had a look at the changes in i915, and the following now makes sense:
- Changed
intel_context_pin_lock
increate_ipts_context [intel_ipts.c]
tointel_context_lookup
. Does not boot.
intel_context_pin_lock
creates a context if it isn't set up via intel_context_instance
. intel_context_lookup
does not. Given that we want to initialize IPTS here, we have to choose a function that creates a context, so either intel_context_pin_lock
, intel_context_pin
, or intel_context_instance
(which is not exposed).
Also
- Changed
intel_context_pin_lock
indestropy_ipts_context [intel_ipts.c]
tointel_context_lookup
. No change at all.
makes sense because that function does not get called at all when unloading the intel_ipts
module. That's part of the in-gpu-driver stuff that's fixed and loaded with the i915 driver and not the MEI driver (module). So the problem can't reside in that function.
@kitakar5525 I think I have a solution. Basically it seems that putting a ipts_send_sensor_quiesce_io_cmd
into ipts_stop
fixes the issue. I've been able to unload/reload the module a couple of times now, no issues so far.
Here's the details, took me a bit to figure all this out (at least that's how I think it works): The actual IPTS device has two interfaces, one via the management engine (ME) and one directly with memory. The whole IPTS system uses compute kernels (something like OpenCL) to process the raw data (probably line/data smoothing, ..?). So when the mei driver (ipts-mei.c
) gets loaded, it sets up the kernels, buffers etc. (basically the processing pipeline) and tells the device (via ME interface) where the buffers are, and where the doorbell for submission is. The doorbell is basically like a signal that can be set by the IPTS device that indicates when a new work-item is ready to be written (or read, I guess). So the data-flow basically is:
Since the issue can be triggered by touching the screen, it has to be something in that pipeline. We can only check the last part (6), as everything else is basically out of kernel (two devices communicate with each other via the set up memory).
The memory that has been set up by the ME driver is also being cleaned up by it, so removing the driver removes the communication window for the devices. It looks like that is the problem, i.e. the IPTS controller still writes and wants to submit work items even though the memory it writes them to has been freed. The ipts_send_sensor_quiesce_io_cmd
seems like the straightforward fix for this, as it looks like it silences the communication.
There may be a better option, e.g. shutting down the device completely instead of silencing it. One alternative may be the ipts_send_sensor_clear_mem_window_cmd
which seems to release the communication window on the controller side. That (or a combination of the two) may even be the better way of doing things, I'll test that later (sending ipts_send_sensor_quiesce_io_cmd
was a bit easier to implement, so I tested that first).
Also I have absolutely no idea why this is now an issue on 5.2 and works fine on previous kernels.
One alternative may be the
ipts_send_sensor_clear_mem_window_cmd
which seems to release the communication window on the controller side. That (or a combination of the two) may even be the better way of doing things, I'll test that later
Seems to work fine, at least for unloading/reloading. For suspend I do have to unload the module, but I guess that's the expected behavior with a3a3ed3, right? If I revert the commit, suspend/resume seems to work fine without unloading the module, any idea why this was introduced?
I've also discovered a new issue: When I lock/unlock the screen (i.e. turn backlight off and on), touch stops working. This can be fixed by reloading the intel_ipts
module. At this point I guess that the ipts_send_sensor_quiesce_io_cmd
sent when disabling backlight is not reverted when backlight is turned on again, so the controller just doesn't know it should be sending data again.
Also what I've just noticed is that the ME driver gets removed during suspend, even without the sleep script. Looks like this is normal behavior for ME devices though.
At this point I guess that the ipts_send_sensor_quiesce_io_cmd sent when disabling backlight is not reverted when backlight is turned on again, so the controller just doesn't know it should be sending data again.
Seems to be that that's the issue. I only see the notify gfx status : 0
messages in the log. I'll try to track down why later.
putting a ipts_send_sensor_quiesce_io_cmd into ipts_stop fixes the issue.
Awesome work!
Yes, on SB1, I can confirm that change fixed:
```diff From a6f44e8b5bb6c3f0f4ab2c015e5ad51d6eb87f38 Mon Sep 17 00:00:00 2001 From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com> Date: Wed, 24 Jul 2019 01:46:17 +0900 Subject: [PATCH] Add ipts_send_sensor_quiesce_io_cmd() into ipts_stop() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This commit fixes GPU hang on removing intel_ipts module. See: ・ Porting patches to Linux 5.2 (ipts) · Issue #544 · jakeday/linux-surface https://github.com/jakeday/linux-surface/issues/544#issuecomment-514258402 --- drivers/misc/ipts/ipts-msg-handler.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/misc/ipts/ipts-msg-handler.c b/drivers/misc/ipts/ipts-msg-handler.c index 66d832e74..87bc1d2b7 100644 --- a/drivers/misc/ipts/ipts-msg-handler.c +++ b/drivers/misc/ipts/ipts-msg-handler.c @@ -160,6 +160,7 @@ void ipts_stop(ipts_info_t *ipts) old_state == IPTS_STA_HID_STARTED) { ipts_free_default_resource(ipts); ipts_free_raw_data_resource(ipts); + ipts_send_sensor_quiesce_io_cmd(ipts); return; } -- 2.22.0 ```
When I lock/unlock the screen (i.e. turn backlight off and on), touch stops working.
🤔 I don't have the issue on SB1 on either 4.19/5.2, Wayland/X. I also apply another patch to ipts, however, the result was the same if I apply this patch or not.
Sorry, wrong button...
Here are all the changes I made after basically straightforward applying the 5.1 patch:
Make it compile:
```patch
From a19c73d0bcc5d9ab309acb894fca329ad5f48fa1 Mon Sep 17 00:00:00 2001
From: Maximilian Luz
Again, the ipts_send_sensor_clear_mem_window_cmd
isn't strictly necessary, but since we deallocate the buffers I think it's a good idea to also tell the IPTS controller that it shouldn't use them any more.
For suspend I do have to unload the module, but I guess that's the expected behavior with a3a3ed3, right? If I revert the commit, suspend/resume seems to work fine without unloading the module, any idea why this was introduced?
Before that commit, there were many people reporting
References
and after that commit, there are people (including me) reporting suspend is broken unless they remove the intel_ipts module before suspend. I'm not sure if this is expected behavior/regression for some people or not.
If we can solve these issues with your changes, maybe we can revert the commit.
I see, thanks! I don't think the changes will do anything to fix the situation without a3a3ed3. Although I'd like to find out why those issues were happening, I think there's probably a better fix than unloading everything (like maybe running ipts_send_sensor_quiesce_io_cmd
before suspend). I'll revert the commit for my personal kernel and see if I encounter any issues.
after that commit, there are people (including me) reporting suspend is broken unless they remove the intel_ipts module before suspend. I'm not sure if this is expected behavior/regression for some people or not.
As far as I can tell, this is expected behavior. With the changes in that commit, we need to unload the module because the in-gpu-driver part gets also unloaded. Basically, if we don't remove the module first, the in-gpu-driver basis for the module gets pulled from right under the module, while it is still using that.
I've also discovered a new issue: When I lock/unlock the screen (i.e. turn backlight off and on), touch stops working. This can be fixed by reloading the
intel_ipts
module. At this point I guess that theipts_send_sensor_quiesce_io_cmd
sent when disabling backlight is not reverted when backlight is turned on again, so the controller just doesn't know it should be sending data again.
I've messed up porting the patches to 5.2 and put the intel_ipts_notify_backlight_status(true)
in the wrong function. I feel a bit stupid now... With that fixed, everything seems to work.
So I've prepared the 5.2 patches and tested on 5.2.2, and there seems to be a warning when calling i915_guc_ipts_submission_disable
. When deallocating the doorbell, intel_guc_send_nop
gets called (which will emit the warning) instead of the correct communication function. Also happens on 5.2.1, not sure how I missed that previously.
This doesn't seem to cause any direct issues, but I'm worried that the doorbell does not get released properly (this all causes destroy_doorbell
to fail).
I'll try to figure out why that happens, one thing that will fix this though is reverting a3a3ed3, so I'm thinking that might be the way to go here.
I added debugfs entries to call these functions manually:
Patches
When I call intel_ipts_cleanup() alone, the whole system will freeze.
To stop IPTS:
sudo su -c "echo 1 > /sys/kernel/debug/ipts/ipts_stop"
sleep 3
sudo su -c "echo 1 > /sys/kernel/debug/dri/*/i915_intel_ipts_cleanup"
It seems IPTS stopped without freezing the system.
To start IPTS again:
sudo su -c "echo 1 > /sys/kernel/debug/dri/*/i915_intel_ipts_init"
sleep 3
sudo su -c "echo 1 > /sys/kernel/debug/ipts/ipts_start"
and IPTS started working again.
It seems that we should add ipts_stop() into intel_ipts_cleanup() and ipts_start() into intel_ipts_init() for suspend/resume to work with a3a3ed3. However, I don't know how to call these functions from i915 module.
I think we could do that via the intel_ipts_callback_t
type, basically register those functions as callback functions to be executed in intel_ipts_cleanup
and intel_ipts_init
if present.
With a3a3ed3, this could be a valid alternative to having to unload/reload the module. I don't quite like it because I don't quite like a3a3ed3 in the first place, but I guess its better than the two alternatives: Manual unloading/reloading or reverting a3a3ed3 to potentially break things for some people to get to the bottom of the issue.
I also found the reason for the warning: guc_disable_communication
sets guc->send = intel_guc_send_nop
. That function is called by intel_runtime_suspend
, which is the runtime-suspend function for the i915_pci_driver
.
On the other hand, intel_ipts_suspend
(and thus intel_ipts_cleanup
) gets called by i915_pm_suspend
and i915_pm_freeze
, which are the suspend/freeze functions.
As far as I know, the runtime-suspend function gets called first, so we should probably also suspend IPTS in that function. Edit: that change seems to work, pushed everything to https://github.com/qzed/linux-surface/tree/v5.2.
When reverting a3a3ed3, I do get some errors when suspending, specifically
ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: mei_cldev_send() error 0x4:-19
ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: mei_cldev_send() error 0x7:-19
ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: error in reading m2h msg
I think we can safely ignore them: The first two come from ipts_send_sensor_quiesce_io_cmd
and ipts_send_sensor_clear_mem_window_cmd
. During suspend, mei_cldev_send
will not work because the MEI device has already been disabled by the kernel before ipts_mei_cl_remove
and via that ipts_stop
are going to be called. In that case, calling the two functions is not necessary and we can let them fail.
The last line has always been present and is just the way the event loop is being stopped.
Would be nice to (at some point) hide those error messages, but for now we can just ignore them.
I think a3e3d3 was an attempt to solve #270 , but based on later comments it looks like it didn't work, just was never reverted.
By the way, I noticed two things:
```diff From 8452a78e5b1eed6e1956c5a3fffde6334c2c3e22 Mon Sep 17 00:00:00 2001 From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com> Date: Thu, 25 Jul 2019 05:42:44 +0900 Subject: [PATCH] i915/intel_ipts.c: Do not use execlists funtions --- drivers/gpu/drm/i915/intel_ipts.c | 17 ----------------- drivers/gpu/drm/i915/intel_lrc.c | 8 +++++--- drivers/gpu/drm/i915/intel_lrc.h | 5 ----- 3 files changed, 5 insertions(+), 25 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_ipts.c b/drivers/gpu/drm/i915/intel_ipts.c index 79d224ce3..f63ea8f36 100644 --- a/drivers/gpu/drm/i915/intel_ipts.c +++ b/drivers/gpu/drm/i915/intel_ipts.c @@ -200,18 +200,6 @@ static int create_ipts_context(void) goto err_unlock; } - ret = execlists_context_deferred_alloc(ce, ce->engine); - if (ret) { - DRM_DEBUG("lr context allocation failed : %d\n", ret); - goto err_ctx; - } - - ret = execlists_context_pin(ce); - if (ret) { - DRM_DEBUG("lr context pinning failed : %d\n", ret); - goto err_ctx; - } - /* Release the mutex */ mutex_unlock(&intel_ipts.dev->struct_mutex); @@ -222,10 +210,6 @@ static int create_ipts_context(void) return 0; -err_ctx: - if (ipts_ctx) - i915_gem_context_put(ipts_ctx); - err_unlock: mutex_unlock(&intel_ipts.dev->struct_mutex); @@ -250,7 +234,6 @@ static void destroy_ipts_context(void) return; } - execlists_context_unpin(ce); intel_context_unpin(ce); i915_gem_context_put(ipts_ctx); diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 10b5dd1a4..1599e5e22 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -166,6 +166,8 @@ #define ACTIVE_PRIORITY (I915_PRIORITY_NOSEMAPHORE) +static int execlists_context_deferred_alloc(struct intel_context *ce, + struct intel_engine_cs *engine); static void execlists_init_reg_state(u32 *reg_state, struct intel_context *ce, struct intel_engine_cs *engine, @@ -1181,7 +1183,7 @@ static void __context_unpin(struct i915_vma *vma) __i915_vma_unpin(vma); } -void execlists_context_unpin(struct intel_context *ce) +static void execlists_context_unpin(struct intel_context *ce) { struct intel_engine_cs *engine; @@ -1283,7 +1285,7 @@ __execlists_context_pin(struct intel_context *ce, return ret; } -int execlists_context_pin(struct intel_context *ce) +static int execlists_context_pin(struct intel_context *ce) { return __execlists_context_pin(ce, ce->engine); } @@ -2882,7 +2884,7 @@ static struct i915_timeline *get_timeline(struct i915_gem_context *ctx) return i915_timeline_create(ctx->i915, NULL); } -int execlists_context_deferred_alloc(struct intel_context *ce, +static int execlists_context_deferred_alloc(struct intel_context *ce, struct intel_engine_cs *engine) { struct drm_i915_gem_object *ctx_obj; diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index bfef20e4f..9091bac21 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -115,11 +115,6 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine, const char *prefix), unsigned int max); -int execlists_context_pin(struct intel_context *ce); -void execlists_context_unpin(struct intel_context *ce); -int execlists_context_deferred_alloc(struct intel_context *ce, - struct intel_engine_cs *engine); - u32 gen8_make_rpcs(struct drm_i915_private *i915, struct intel_sseu *ctx_sseu); -- 2.22.0 ```
```diff From 7ddbc149f105e41708862d0c76c8073389448528 Mon Sep 17 00:00:00 2001 From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com> Date: Thu, 25 Jul 2019 18:44:44 +0900 Subject: [PATCH] i915/intel_ipts.c: Do not use execlists funtions for 4.19 --- drivers/gpu/drm/i915/intel_ipts.c | 18 ------------------ drivers/gpu/drm/i915/intel_lrc.c | 9 ++++++--- drivers/gpu/drm/i915/intel_lrc.h | 8 -------- 3 files changed, 6 insertions(+), 29 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_ipts.c b/drivers/gpu/drm/i915/intel_ipts.c index 9899c91e4..5986d1499 100644 --- a/drivers/gpu/drm/i915/intel_ipts.c +++ b/drivers/gpu/drm/i915/intel_ipts.c @@ -175,7 +175,6 @@ static int create_ipts_context(void) struct i915_gem_context *ipts_ctx = NULL; struct drm_i915_private *dev_priv = to_i915(intel_ipts.dev); struct intel_context *ce = NULL; - struct intel_context *pin_ret; int ret = 0; /* Initialize the context right away.*/ @@ -201,18 +200,6 @@ static int create_ipts_context(void) goto err_unlock; } - ret = execlists_context_deferred_alloc(ipts_ctx, dev_priv->engine[RCS], ce); - if (ret) { - DRM_DEBUG("lr context allocation failed : %d\n", ret); - goto err_ctx; - } - - pin_ret = execlists_context_pin(dev_priv->engine[RCS], ipts_ctx); - if (IS_ERR(pin_ret)) { - DRM_DEBUG("lr context pinning failed : %ld\n", PTR_ERR(pin_ret)); - goto err_ctx; - } - /* Release the mutex */ mutex_unlock(&intel_ipts.dev->struct_mutex); @@ -223,10 +210,6 @@ static int create_ipts_context(void) return 0; -err_ctx: - if (ipts_ctx) - i915_gem_context_put(ipts_ctx); - err_unlock: mutex_unlock(&intel_ipts.dev->struct_mutex); @@ -251,7 +234,6 @@ static void destroy_ipts_context(void) return; } - execlists_context_unpin(ce); i915_gem_context_put(ipts_ctx); mutex_unlock(&intel_ipts.dev->struct_mutex); diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index f669087d6..f58de13b8 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -164,6 +164,9 @@ #define WA_TAIL_DWORDS 2 #define WA_TAIL_BYTES (sizeof(u32) * WA_TAIL_DWORDS) +static int execlists_context_deferred_alloc(struct i915_gem_context *ctx, + struct intel_engine_cs *engine, + struct intel_context *ce); static void execlists_init_reg_state(u32 *reg_state, struct i915_gem_context *ctx, struct intel_engine_cs *engine, @@ -1289,7 +1292,7 @@ static void execlists_context_destroy(struct intel_context *ce) i915_gem_object_put(ce->state->obj); } -void execlists_context_unpin(struct intel_context *ce) +static void execlists_context_unpin(struct intel_context *ce) { intel_ring_unpin(ce->ring); @@ -1376,7 +1379,7 @@ static const struct intel_context_ops execlists_context_ops = { .destroy = execlists_context_destroy, }; -struct intel_context * +static struct intel_context * execlists_context_pin(struct intel_engine_cs *engine, struct i915_gem_context *ctx) { @@ -2734,7 +2737,7 @@ populate_lr_context(struct i915_gem_context *ctx, return ret; } -int execlists_context_deferred_alloc(struct i915_gem_context *ctx, +static int execlists_context_deferred_alloc(struct i915_gem_context *ctx, struct intel_engine_cs *engine, struct intel_context *ce) { diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index 32159231a..4dfb78e3e 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -106,12 +106,4 @@ void intel_lr_context_resume(struct drm_i915_private *dev_priv); void intel_execlists_set_default_submission(struct intel_engine_cs *engine); -struct intel_context * -execlists_context_pin(struct intel_engine_cs *engine, - struct i915_gem_context *ctx); -void execlists_context_unpin(struct intel_context *ce); -int execlists_context_deferred_alloc(struct i915_gem_context *ctx, - struct intel_engine_cs *engine, - struct intel_context *ce); - #endif /* _INTEL_LRC_H_ */ -- 2.22.0 ```
At least IPTS is working regardless of the existence of execlists functions. Is this change safe?
EDIT:2019-08-02 Working now. See https://github.com/jakeday/linux-surface/issues/544#issuecomment-517626511
Not working on 4.19. ipts_stop or module removing is working.
- WARN_ON(i915_vma_unbind(vma)) ``` kern :warn : [ 118.249400] ------------[ cut here ]------------ kern :warn : [ 118.249403] WARN_ON(i915_vma_unbind(vma)) kern :warn : [ 118.249492] WARNING: CPU: 3 PID: 102 at drivers/gpu/drm/i915/i915_vma.c:827 i915_vma_destroy+0x15d/0x170 [i915] kern :warn : [ 118.249493] Modules linked in: rfcomm cmac bnep btusb btrtl btbcm btintel bluetooth ecdh_generic overlay input_leds xt_CHECKSUM iptable_mangle xt_tcpudp ipt_MASQUERADE xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc iptable_filter usbhid uinput vmnet(OE) joydev mousedev snd_soc_skl snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_acpi_intel_match hid_sensor_gyro_3d snd_soc_acpi hid_sensor_accel_3d snd_hda_codec_hdmi hid_sensor_rotation hid_sensor_als hid_sensor_trigger industrialio_triggered_buffer kfifo_buf msr hid_sensor_iio_common industrialio snd_soc_core hid_sensor_hub intel_rapl snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal snd_compress intel_powerclamp ac97_bus coretemp mwifiex_pcie nls_iso8859_1 kern :warn : [ 118.249528] snd_pcm_dmaengine nls_cp437 mwifiex snd_hda_intel vfat fat snd_hda_codec kvm_intel squashfs hid_multitouch fuse snd_hda_core loop crct10dif_pclmul hid_generic crc32_pclmul snd_hwdep ipu3_imgu ghash_clmulni_intel intel_ipts(OE) pcbc ipu3_cio2 cfg80211 snd_pcm v4l2_fwnode videobuf2_dma_sg videobuf2_memops videobuf2_v4l2 videobuf2_common snd_timer aesni_intel videodev aes_x86_64 crypto_simd cryptd glue_helper i2c_hid intel_cstate mei_me snd intel_xhci_usb_role_switch intel_uncore idma64 tpm_crb rfkill pcspkr surfacepro3_button mei roles soundcore intel_pch_thermal hid media intel_rapl_perf intel_lpss_pci intel_lpss battery soc_button_array ac tpm_tis tpm_tis_core evdev mac_hid tpm rng_core pcc_cpufreq vmmon(OE) vmw_vmci vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg scsi_mod crypto_user kern :warn : [ 118.249574] binder_linux(OE) ashmem_linux(OE) acpi_call(OE) ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto xhci_pci crc32c_intel xhci_hcd i915(OE) kvmgt vfio_mdev mdev vfio_iommu_type1 vfio kvm irqbypass i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm intel_agp intel_gtt agpgart kern :warn : [ 118.249600] CPU: 3 PID: 102 Comm: kworker/u8:2 Tainted: G OE 4.19.59-1-lts419-surface #1 kern :warn : [ 118.249602] Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 91.2706.768 04/18/2019 kern :warn : [ 118.249633] Workqueue: i915 contexts_free_worker [i915] kern :warn : [ 118.249667] RIP: 0010:i915_vma_destroy+0x15d/0x170 [i915] kern :warn : [ 118.249669] Code: f3 dd 49 8b bd 88 06 00 00 5b 48 89 ee 5d 41 5c 41 5d e9 d6 4b f3 dd 48 c7 c6 40 d1 3d c0 48 c7 c7 10 d1 3d c0 e8 52 da d7 dd <0f> 0b e9 f6 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 kern :warn : [ 118.249670] RSP: 0018:ffffc190c1f0fdc0 EFLAGS: 00010282 kern :warn : [ 118.249672] RAX: 0000000000000000 RBX: ffff9b920c3f5d40 RCX: 0000000000000006 kern :warn : [ 118.249673] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff9b921f396590 kern :warn : [ 118.249674] RBP: ffff9b920c3f6640 R08: 00000000000003fa R09: 000000000000001d kern :warn : [ 118.249675] R10: ffffffff9f86c6c0 R11: ffffffff9f86c2df R12: ffff9b92191be278 kern :warn : [ 118.249676] R13: ffff9b92191be000 R14: ffff9b92191be188 R15: ffffc190c1f0fdf0 kern :warn : [ 118.249677] FS: 0000000000000000(0000) GS:ffff9b921f380000(0000) knlGS:0000000000000000 kern :warn : [ 118.249678] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kern :warn : [ 118.249679] CR2: 0000281b1e333000 CR3: 000000023620a004 CR4: 00000000003606e0 kern :warn : [ 118.249681] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 kern :warn : [ 118.249682] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 kern :warn : [ 118.249682] Call Trace: kern :warn : [ 118.249716] i915_ppgtt_release+0x91/0x150 [i915] kern :warn : [ 118.249745] i915_gem_context_free+0xbb/0xc0 [i915] kern :warn : [ 118.249773] contexts_free_worker+0x38/0x50 [i915] kern :warn : [ 118.249778] process_one_work+0x1da/0x3b0 kern :warn : [ 118.249782] worker_thread+0x4d/0x3f0 kern :warn : [ 118.249784] kthread+0xfb/0x130 kern :warn : [ 118.249787] ? process_one_work+0x3b0/0x3b0 kern :warn : [ 118.249789] ? kthread_park+0x90/0x90 kern :warn : [ 118.249793] ret_from_fork+0x35/0x40 kern :warn : [ 118.249795] ---[ end trace 27f0774c88c688b0 ]--- ``` - WARN_ON(i915_gem_object_has_pinned_pages(obj)) ``` kern :warn : [ 118.418470] ------------[ cut here ]------------ kern :warn : [ 118.418471] WARN_ON(i915_gem_object_has_pinned_pages(obj)) kern :warn : [ 118.418510] WARNING: CPU: 3 PID: 102 at drivers/gpu/drm/i915/i915_gem.c:4883 __i915_gem_free_objects+0x2ae/0x2c0 [i915] kern :warn : [ 118.418510] Modules linked in: rfcomm cmac bnep btusb btrtl btbcm btintel bluetooth ecdh_generic overlay input_leds xt_CHECKSUM iptable_mangle xt_tcpudp ipt_MASQUERADE xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc iptable_filter usbhid uinput vmnet(OE) joydev mousedev snd_soc_skl snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_acpi_intel_match hid_sensor_gyro_3d snd_soc_acpi hid_sensor_accel_3d snd_hda_codec_hdmi hid_sensor_rotation hid_sensor_als hid_sensor_trigger industrialio_triggered_buffer kfifo_buf msr hid_sensor_iio_common industrialio snd_soc_core hid_sensor_hub intel_rapl snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal snd_compress intel_powerclamp ac97_bus coretemp mwifiex_pcie nls_iso8859_1 kern :warn : [ 118.418542] snd_pcm_dmaengine nls_cp437 mwifiex snd_hda_intel vfat fat snd_hda_codec kvm_intel squashfs hid_multitouch fuse snd_hda_core loop crct10dif_pclmul hid_generic crc32_pclmul snd_hwdep ipu3_imgu ghash_clmulni_intel intel_ipts(OE) pcbc ipu3_cio2 cfg80211 snd_pcm v4l2_fwnode videobuf2_dma_sg videobuf2_memops videobuf2_v4l2 videobuf2_common snd_timer aesni_intel videodev aes_x86_64 crypto_simd cryptd glue_helper i2c_hid intel_cstate mei_me snd intel_xhci_usb_role_switch intel_uncore idma64 tpm_crb rfkill pcspkr surfacepro3_button mei roles soundcore intel_pch_thermal hid media intel_rapl_perf intel_lpss_pci intel_lpss battery soc_button_array ac tpm_tis tpm_tis_core evdev mac_hid tpm rng_core pcc_cpufreq vmmon(OE) vmw_vmci vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg scsi_mod crypto_user kern :warn : [ 118.418576] binder_linux(OE) ashmem_linux(OE) acpi_call(OE) ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto xhci_pci crc32c_intel xhci_hcd i915(OE) kvmgt vfio_mdev mdev vfio_iommu_type1 vfio kvm irqbypass i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm intel_agp intel_gtt agpgart kern :warn : [ 118.418592] CPU: 3 PID: 102 Comm: kworker/u8:2 Tainted: G W OE 4.19.59-1-lts419-surface #1 kern :warn : [ 118.418593] Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 91.2706.768 04/18/2019 kern :warn : [ 118.418620] Workqueue: i915 __i915_gem_free_work [i915] kern :warn : [ 118.418646] RIP: 0010:__i915_gem_free_objects+0x2ae/0x2c0 [i915] kern :warn : [ 118.418647] Code: ff 48 83 c4 18 4c 89 f7 5b 5d 41 5c 41 5d 41 5e 41 5f e9 d5 0f fe ff 48 c7 c6 c8 c3 3e c0 48 c7 c7 94 cb 3d c0 e8 61 ff d8 dd <0f> 0b c7 85 c8 01 00 00 00 00 00 00 e9 81 fe ff ff 90 0f 1f 44 00 kern :warn : [ 118.418648] RSP: 0018:ffffc190c1f0fe00 EFLAGS: 00010286 kern :warn : [ 118.418650] RAX: 0000000000000000 RBX: ffff9b92192fdc60 RCX: 0000000000000006 kern :warn : [ 118.418650] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff9b921f396590 kern :warn : [ 118.418651] RBP: ffff9b920bd4ad00 R08: 0000000000004b83 R09: 000000000000002e kern :warn : [ 118.418652] R10: ffffffff9f86c6c0 R11: ffffffff9f86c2df R12: ffff9b920bd4adf8 kern :warn : [ 118.418653] R13: ffff9b920bd3f300 R14: ffff9b92192f8000 R15: ffff9b92192f8068 kern :warn : [ 118.418655] FS: 0000000000000000(0000) GS:ffff9b921f380000(0000) knlGS:0000000000000000 kern :warn : [ 118.418656] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kern :warn : [ 118.418657] CR2: 00007f3954000010 CR3: 000000023620a004 CR4: 00000000003606e0 kern :warn : [ 118.418658] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 kern :warn : [ 118.418659] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 kern :warn : [ 118.418659] Call Trace: kern :warn : [ 118.418686] __i915_gem_free_work+0x6a/0xa0 [i915] kern :warn : [ 118.418689] process_one_work+0x1da/0x3b0 kern :warn : [ 118.418691] worker_thread+0x4d/0x3f0 kern :warn : [ 118.418694] kthread+0xfb/0x130 kern :warn : [ 118.418696] ? process_one_work+0x3b0/0x3b0 kern :warn : [ 118.418697] ? kthread_park+0x90/0x90 kern :warn : [ 118.418700] ret_from_fork+0x35/0x40 kern :warn : [ 118.418702] ---[ end trace 27f0774c88c68aeb ]--- ``` then GPU hang ``` kern :info : [ 126.960588] [drm] GPU HANG: ecode 9:0:0xfffffffe, in gnome-shell [1661], reason: hang on rcs0, action: reset kern :notice: [ 126.960685] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 kern :err : [ 126.972733] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x3 failed with error -110 0x3 kern :notice: [ 126.972782] i915 0000:00:02.0: Resetting chip for hang on rcs0 kern :info : [ 126.973038] [drm] HuC: Loaded firmware i915/skl_huc_ver01_07_1398.bin (version 1.7) kern :info : [ 126.974716] [drm] GuC: Loaded firmware i915/skl_guc_ver9_33.bin (version 9.33) kern :info : [ 126.974759] i915 0000:00:02.0: GuC firmware version 9.33 kern :info : [ 126.974761] i915 0000:00:02.0: GuC submission enabled kern :info : [ 126.974762] i915 0000:00:02.0: HuC enabled ```
EDIT: Result is the same if I add intel_context_unpin() into destroy_ipts_context() and use intel_context_pin() instead of to_intel_context(). Existence of execlists functions does not affect this problem.
Add intel_context_unpin() into destroy_ipts_context() ```diff From c56454bb4f9b2a2484e7585c142a494fef6bd0d4 Mon Sep 17 00:00:00 2001 From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com> Date: Thu, 25 Jul 2019 18:47:02 +0900 Subject: [PATCH] Add intel_context_unpin() into destroy_ipts_context() --- drivers/gpu/drm/i915/intel_ipts.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/i915/intel_ipts.c b/drivers/gpu/drm/i915/intel_ipts.c index 94b6d1a32..b66be0ef4 100644 --- a/drivers/gpu/drm/i915/intel_ipts.c +++ b/drivers/gpu/drm/i915/intel_ipts.c @@ -234,6 +234,7 @@ static void destroy_ipts_context(void) return; } + intel_context_unpin(ce); i915_gem_context_put(ipts_ctx); mutex_unlock(&intel_ipts.dev->struct_mutex); -- 2.22.0 ``` Use intel_context_pin() instead ```diff From ddeb6d8428253cd4a2bbf5635a3f2949ea3eb03b Mon Sep 17 00:00:00 2001 From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com> Date: Thu, 25 Jul 2019 04:39:05 +0900 Subject: [PATCH] Use intel_context_pin() instead --- drivers/gpu/drm/i915/intel_ipts.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/intel_ipts.c b/drivers/gpu/drm/i915/intel_ipts.c index f7aeb5911..72d41aaa2 100644 --- a/drivers/gpu/drm/i915/intel_ipts.c +++ b/drivers/gpu/drm/i915/intel_ipts.c @@ -193,7 +193,7 @@ static int create_ipts_context(void) goto err_unlock; } - ce = to_intel_context(ipts_ctx, dev_priv->engine[RCS]); + ce = intel_context_pin(ipts_ctx, dev_priv->engine[RCS]); if (IS_ERR(ce)) { DRM_ERROR("Failed to create intel context (error %ld)\n", PTR_ERR(ce)); -- 2.22.0 ```
@tmarkov Thank you for the clarification! Then, we should now consider if the commit a3a3ed3 of suspend/resume part helped other people problems than you.🤔
@tmarkov Thanks for that info, guess we should really consider reverting it. Maybe we could provide a development kernel where we could test things like this parallel to a stable kernel.
@kitakar5525
At least IPTS is working regardless of the existence of execlists functions. Is this change safe?
That's interesting! I have absolutely no idea, but might be worth a look. It is in the original IPTS repo, so I think we should be careful about this. Here's a small description of execlists from the i915 docs:
Execlists are the new method by which, on gen8+ hardware, workloads are submitted for execution (as opposed to the legacy, ringbuffer-based, method).
Regarding the creation of contexts, we have:
- One global default context.
- One local default context for each opened fd.
- One local extra context for each context create ioctl call.
So this might also be something that isn't necessary but a good thing to have (to me this looks like instead of using the default workqueue we use a speical workqueue only for IPTS stuff).
Regarding 4.19. I can't follow exactly what you did there: It does work when removing the module prior to calling intel_ipts_cleanup
(causing ipts_stop
to be run), right? So it only crashes when you call intel_ipts_cleanup
without removing the module? Isn't this the same thing we have on 5.1, like that's removing the in-kernel part before the ME part?
Sorry for a lack of information
I applied patches to add debugfs entry as mentioned above (https://github.com/jakeday/linux-surface/issues/544#issuecomment-514985562)
Under that condition:
Then, this problem; calling ipts_stop && intel_ipts_cleanup will still cause GPU hang
Expected result: calling ipts_stop && intel_ipts_cleanup should not cause GPU hang on 4.19, too (not happening on 5.2)
Actual result: GPU hang occurred.
Updated old comment (https://github.com/jakeday/linux-surface/issues/544#issuecomment-514985562).
Then, this problem; calling ipts_stop && intel_ipts_cleanup will still cause GPU hang
Ah, I see. Have you added ipts_send_sensor_quiesce_io_cmd
to ipts_stop
? If not, does that change anything?
Yes, applied your patch "Fix RCS0 hang" to make ipts_stop
work.
Updated old comment (https://github.com/jakeday/linux-surface/issues/544#issuecomment-514985562).
Okay, so the device should be quiet then. Could be that there is something else that's still in use by the ME driver even when stopped but that gets freed when calling intel_ipts_cleanup
. From what I can tell, the resources should all be freed though.
Edit: I guess that would fit to the WARN_ON
s you describe above.
Edit 2: Maybe caused by the missing unpin implementation? https://github.com/jakeday/linux-surface/blob/3d0abed6c461fd269694b66b9bb6372be230fa20/patches/4.19/0005-ipts.patch#L580-L583
@qzed your commit IPTS: Fix GuC communication warning on suspend · qzed/linux-surface@03b9074 existed on jakeday repository before and it seems that intel_ipts_cleanup will be called when you enter GNOME lock screen. I guess your commit will also cause GUI freeze unless intel_ipts module is removed or ipts_stop is called before entering lock screen. Did you face this issue?
Old closed issue I opened before:
I have not experienced that issue, just tested again on 4.19.60 and 5.2.2, I've put a print in intel_ipts_cleanup
and it doesn't seem to get called when I lock the screen (maybe a configuration difference?). Also as soon as the screen is turned off, intel_ipts_stop
will be called, so any later transition into a power-saving state should be covered.
Can you reproduce this right now?
Unfortunately, I can still reproduce the GNOME lock screen freeze issue on 5.2.1.
From 52f4f06e797fa9f14c9973c2c4421c4ae99e78c3 Mon Sep 17 00:00:00 2001
From: kitakar5525 <34676735+kitakar5525@users.noreply.github.com>
Date: Fri, 26 Jul 2019 22:58:58 +0900
Subject: [PATCH] print debug for intel_runtime_suspend/resume
---
drivers/gpu/drm/i915/intel_ipts.c | 6 ++++++
drivers/misc/ipts/ipts-msg-handler.c | 14 +++++++++++++-
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/intel_ipts.c b/drivers/gpu/drm/i915/intel_ipts.c
index efc9607f6..548beb8df 100644
--- a/drivers/gpu/drm/i915/intel_ipts.c
+++ b/drivers/gpu/drm/i915/intel_ipts.c
@@ -621,6 +621,8 @@ void intel_ipts_cleanup(struct drm_device *dev)
{
intel_ipts_object_t *obj, *n;
+ pr_alert("ipts: starting intel_ipts_cleanup\n");
+
if (intel_ipts.dev == dev) {
list_for_each_entry_safe(obj, n, &intel_ipts.buffers.list, list) {
struct i915_vma *vma, *vn;
@@ -644,14 +646,18 @@ void intel_ipts_cleanup(struct drm_device *dev)
destroy_ipts_context();
cancel_delayed_work(&intel_ipts.reacquire_db_work);
}
+
+ pr_alert("ipts: finished intel_ipts_cleanup\n");
}
int intel_ipts_resume(struct drm_device *dev)
{
+ pr_alert("ipts: intel_ipts_resume\n");
return intel_ipts_init(dev);
}
void intel_ipts_suspend(struct drm_device *dev)
{
+ pr_alert("ipts: intel_ipts_suspend\n");
intel_ipts_cleanup(dev);
}
diff --git a/drivers/misc/ipts/ipts-msg-handler.c b/drivers/misc/ipts/ipts-msg-handler.c
index db5356a1c..0ac67c0a7 100644
--- a/drivers/misc/ipts/ipts-msg-handler.c
+++ b/drivers/misc/ipts/ipts-msg-handler.c
@@ -134,6 +134,8 @@ int ipts_start(ipts_info_t *ipts)
we need to do this when protocol version doesn't match with reported one
how we keep vendor specific data is the first thing to solve */
+ pr_alert("ipts: starting ipts_start\n");
+
ipts_set_state(ipts, IPTS_STA_INIT);
ipts->num_of_parallel_data_buffers = TOUCH_SENSOR_MAX_DATA_BUFFERS;
@@ -141,6 +143,8 @@ int ipts_start(ipts_info_t *ipts)
ret = ipts_handle_cmd(ipts, TOUCH_SENSOR_NOTIFY_DEV_READY_CMD, NULL, 0);
+ pr_alert("ipts: finished ipts_start\n");
+
return ret;
}
@@ -148,22 +152,30 @@ void ipts_stop(ipts_info_t *ipts)
{
ipts_state_t old_state;
+ pr_alert("ipts: starting ipts_stop\n");
+
old_state = ipts_get_state(ipts);
ipts_set_state(ipts, IPTS_STA_STOPPING);
ipts_send_sensor_quiesce_io_cmd(ipts);
ipts_send_sensor_clear_mem_window_cmd(ipts);
- if (old_state < IPTS_STA_RESOURCE_READY)
+ if (old_state < IPTS_STA_RESOURCE_READY) {
+ pr_alert("ipts: finished ipts_stop via old_state < IPTS_STA_RESOURCE_READY\n");
return;
+ }
if (old_state == IPTS_STA_RAW_DATA_STARTED ||
old_state == IPTS_STA_HID_STARTED) {
ipts_free_default_resource(ipts);
ipts_free_raw_data_resource(ipts);
+ pr_alert("ipts: finished ipts_stop via "
+ "old_state == IPTS_STA_RAW_DATA_STARTED || old_state == IPTS_STA_HID_STARTED\n");
return;
}
+
+ pr_alert("ipts: finished ipts_stop\n");
}
int ipts_restart(ipts_info_t *ipts)
--
2.22.0
Steps to reproduce:
Go into GNOME screen lock:
sudo su -c "echo 'screen lock on GNOME' > /dev/kmsg"
dbus-send --type=method_call --dest=org.gnome.ScreenSaver /org/gnome/ScreenSaver org.gnome.ScreenSaver.Lock
We need to wait for about 1min with display off (sorry I've forgotten we need to wait) and the display will never turns on
``` Jul 26 23:15:54 archlinux unknown: screen lock on GNOME Jul 26 23:16:33 archlinux kernel: ipts: intel_ipts_suspend Jul 26 23:16:33 archlinux kernel: ipts: starting intel_ipts_cleanup Jul 26 23:16:33 archlinux kernel: BUG: unable to handle page fault for address: 0000000000001048 Jul 26 23:16:33 archlinux kernel: #PF: supervisor read access in kernel mode Jul 26 23:16:33 archlinux kernel: #PF: error_code(0x0000) - not-present page Jul 26 23:16:33 archlinux kernel: PGD 0 P4D 0 Jul 26 23:16:33 archlinux kernel: Oops: 0000 [#1] PREEMPT SMP PTI Jul 26 23:16:33 archlinux kernel: CPU: 2 PID: 196 Comm: kworker/2:2 Tainted: G C OE 5.2.1-arch1-1-surface #1 Jul 26 23:16:33 archlinux kernel: Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 91.2706.768 04/18/2019 Jul 26 23:16:33 archlinux kernel: Workqueue: pm pm_runtime_work Jul 26 23:16:33 archlinux kernel: RIP: 0010:i915_request_wait+0x2c/0x3f0 [i915] Jul 26 23:16:33 archlinux kernel: Code: 44 00 00 41 57 41 56 41 55 41 89 f5 41 54 55 48 89 d5 53 48 89 fb 48 83 ec 50 65 48 8b 04 25 28 00 00 00 48 89 44 24 48 31 c> Jul 26 23:16:33 archlinux kernel: RSP: 0018:ffffbc0201b73c28 EFLAGS: 00010246 Jul 26 23:16:33 archlinux kernel: RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000 Jul 26 23:16:33 archlinux kernel: RDX: 7fffffffffffffff RSI: 0000000000000003 RDI: 0000000000001000 Jul 26 23:16:33 archlinux kernel: RBP: 7fffffffffffffff R08: 000000000000047f R09: 0000000000000001 Jul 26 23:16:33 archlinux kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffffa2f05966c760 Jul 26 23:16:33 archlinux kernel: R13: 0000000000000003 R14: dead000000000100 R15: ffffa2f15ca15c40 Jul 26 23:16:33 archlinux kernel: FS: 0000000000000000(0000) GS:ffffa2f15f300000(0000) knlGS:0000000000000000 Jul 26 23:16:33 archlinux kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 26 23:16:33 archlinux kernel: CR2: 0000000000001048 CR3: 000000036780a004 CR4: 00000000003606e0 Jul 26 23:16:33 archlinux kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 26 23:16:33 archlinux kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jul 26 23:16:33 archlinux kernel: Call Trace: Jul 26 23:16:33 archlinux kernel: ? wake_up_klogd+0x10/0x70 Jul 26 23:16:33 archlinux kernel: ? wake_up_klogd+0x4f/0x70 Jul 26 23:16:33 archlinux kernel: ? vprintk_emit+0x137/0x230 Jul 26 23:16:33 archlinux kernel: i915_active_wait+0x43/0x180 [i915] Jul 26 23:16:33 archlinux kernel: i915_vma_unbind+0xbd/0x240 [i915] Jul 26 23:16:33 archlinux kernel: i915_vma_destroy+0x52/0x150 [i915] Jul 26 23:16:33 archlinux kernel: intel_ipts_cleanup+0x86/0x1c3 [i915] Jul 26 23:16:33 archlinux kernel: intel_runtime_suspend+0x1a8/0x270 [i915] Jul 26 23:16:33 archlinux kernel: pci_pm_runtime_suspend+0x5b/0x150 Jul 26 23:16:33 archlinux kernel: ? __switch_to_asm+0x34/0x70 Jul 26 23:16:33 archlinux kernel: ? pci_has_legacy_pm_support+0x70/0x70 Jul 26 23:16:33 archlinux kernel: __rpm_callback+0x7b/0x130 Jul 26 23:16:33 archlinux kernel: ? pci_has_legacy_pm_support+0x70/0x70 Jul 26 23:16:33 archlinux kernel: ? pci_has_legacy_pm_support+0x70/0x70 Jul 26 23:16:33 archlinux kernel: rpm_callback+0x2a/0x90 Jul 26 23:16:33 archlinux kernel: ? pci_has_legacy_pm_support+0x70/0x70 Jul 26 23:16:33 archlinux kernel: rpm_suspend+0x136/0x610 Jul 26 23:16:33 archlinux kernel: pm_runtime_work+0x94/0xa0 Jul 26 23:16:33 archlinux kernel: process_one_work+0x1d1/0x3e0 Jul 26 23:16:33 archlinux kernel: worker_thread+0x4a/0x3d0 Jul 26 23:16:33 archlinux kernel: kthread+0xfd/0x130 Jul 26 23:16:33 archlinux kernel: ? process_one_work+0x3e0/0x3e0 Jul 26 23:16:33 archlinux kernel: ? kthread_park+0x90/0x90 Jul 26 23:16:33 archlinux kernel: ret_from_fork+0x35/0x40 Jul 26 23:16:33 archlinux kernel: Modules linked in: intel_ipts(OE) cmac rfcomm overlay input_leds bnep btusb btrtl btbcm btintel bluetooth ecdh_generic ecc xt_CHEC> Jul 26 23:16:33 archlinux kernel: snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel cfg80211 snd_hda_codec aesni_intel ipu3_imgu(C) aes_x86_64 snd_hda_core cry> Jul 26 23:16:33 archlinux kernel: CR2: 0000000000001048 Jul 26 23:16:33 archlinux kernel: ---[ end trace 7897a1ea4eeaed7c ]--- ```
the journal log says intel_ipts_suspend
was called and then crashed the system. ipts_stop
was never called.
By the way, I noticed that intel_ipts_suspend will not be called on suspend. I can confirm that ipts_stop will be called, but I cannot confirm that intel_ipts_suspend will be called on suspend.
Also as soon as the screen is turned off, intel_ipts_stop will be called
You mean ipts_stop
in ipts-msg-handler.c will be called?
On my system, it will never be called on screen off.
We may want to consider adding ipts_stop
/ipts_start
depending on backlight status if we call intel_ipts_cleanup
/intel_ipts_init
https://github.com/jakeday/linux-surface/blob/3d0abed6c461fd269694b66b9bb6372be230fa20/patches/5.1/0005-ipts.patch#L4171-L4179
Ah sorry, yeah ipts_stop
is not called, but ipts_send_sensor_quiesce_io_cmd
via intel_ipts_notify_backlight_status
. If intel_ipts_cleanup
gets called that is not enough.
I've also waited previously. I've waited for a couple of minutes now and I still can't trigger the runtime suspend.
I don't really like adding ipts_stop
/ipts_start
to the backlight status, as this basically just tears everything down or sets everything up again each time the backlight status changes. Honestly, at this point I'd rather revert a3a3ed3 and see if we can fix any issues coming from that directly than trying to fix it without knowing if it's actually necessary.
There's also another alternative that I'd prefer: We could set up a device link between the i915 device and the me device. This should give us the correct suspend/resume order and we could call ipts_stop
in the runtime-suspend method of the ME device. If I'm not mistaken, this should actually also remove the need for unloading the module.
I'll try to implement the device link stuff later.
I occasionally encounter an entire system freeze on calling ipts_stop
or module removing. I described before that suspend will occasionally be broken (https://github.com/jakeday/linux-surface/issues/544#issuecomment-513531566). It turned out that journal log is available in that comment.
EDIT: I looked at the journal log again and it seems that the suspend crash happened rather after ipts_start (?).ipts_stop
will cause this issue.
Current issues on IPTS are (I think all of these issues are present also on older kernel, not a porting issue to 5.2 anymore):
What is resolved / found so far
intel_ipts
module removing (it will call ipts_stop
) will usually work on both 4.19 and 5.2EDIT:2019-08-02
ipts_stop
will occasionally cause an entire system freezeipts_stop
&& intel_ipts_cleanup
will cause GPU hang on 4.19I occasionally encounter an entire system freeze on calling
ipts_stop
or module removing. I described before that suspend will occasionally be broken (#544 (comment)). It turned out thatipts_stop
will cause this issue. journal log is available in that comment.
If we're lucky, that could be fixed by the device link.
I think you're right that the current issues are older ones.
I've added a device-link now, diff below (I'll update the patches in a minute). Seems to work without removing the module before suspend. I don't think it will work for runtime-suspend yet (I think we'd have to implement the runtime_suspend method for that), but could you check if by any chance this fixed it for you?
```diff diff --git a/drivers/gpu/drm/i915/intel_ipts.c b/drivers/gpu/drm/i915/intel_ipts.c index efc9607f6f73..5d9145ac221c 100644 --- a/drivers/gpu/drm/i915/intel_ipts.c +++ b/drivers/gpu/drm/i915/intel_ipts.c @@ -496,32 +496,36 @@ static int intel_ipts_unmap_buffer(uint64_t gfx_handle, uint64_t buf_handle) int intel_ipts_connect(intel_ipts_connect_t *ipts_connect) { + u32 flags = DL_FLAG_PM_RUNTIME | DL_FLAG_AUTOREMOVE_CONSUMER; struct drm_i915_private *dev_priv = to_i915(intel_ipts.dev); - int ret = 0; if (!intel_ipts.initialized) return -EIO; - if (ipts_connect && ipts_connect->if_version <= - SUPPORTED_IPTS_INTERFACE_VERSION) { + if (!ipts_connect) + return -EINVAL; - /* return gpu operations for ipts */ - ipts_connect->ipts_ops.get_wq_info = intel_ipts_get_wq_info; - ipts_connect->ipts_ops.map_buffer = intel_ipts_map_buffer; - ipts_connect->ipts_ops.unmap_buffer = intel_ipts_unmap_buffer; - ipts_connect->gfx_version = INTEL_INFO(dev_priv)->gen; - ipts_connect->gfx_handle = (uint64_t)&intel_ipts; + if (ipts_connect->if_version > SUPPORTED_IPTS_INTERFACE_VERSION) + return -EINVAL; - /* save callback and data */ - intel_ipts.data = ipts_connect->data; - intel_ipts.ipts_clbks = ipts_connect->ipts_cb; + /* set up device-link for PM */ + if (!device_link_add(ipts_connect->client, intel_ipts.dev->dev, flags)) + return -EFAULT; - intel_ipts.connected = true; - } else { - ret = -EINVAL; - } + /* return gpu operations for ipts */ + ipts_connect->ipts_ops.get_wq_info = intel_ipts_get_wq_info; + ipts_connect->ipts_ops.map_buffer = intel_ipts_map_buffer; + ipts_connect->ipts_ops.unmap_buffer = intel_ipts_unmap_buffer; + ipts_connect->gfx_version = INTEL_INFO(dev_priv)->gen; + ipts_connect->gfx_handle = (uint64_t)&intel_ipts; - return ret; + /* save callback and data */ + intel_ipts.data = ipts_connect->data; + intel_ipts.ipts_clbks = ipts_connect->ipts_cb; + + intel_ipts.connected = true; + + return 0; } EXPORT_SYMBOL_GPL(intel_ipts_connect); diff --git a/drivers/misc/ipts/ipts-gfx.c b/drivers/misc/ipts/ipts-gfx.c index 51727770e75d..556d31b94f52 100644 --- a/drivers/misc/ipts/ipts-gfx.c +++ b/drivers/misc/ipts/ipts-gfx.c @@ -46,6 +46,7 @@ static int connect_gfx(ipts_info_t *ipts) int ret = 0; intel_ipts_connect_t ipts_connect; + ipts_connect.client = &ipts->cldev->dev; ipts_connect.if_version = IPTS_INTERFACE_V1; ipts_connect.ipts_cb.workload_complete = gfx_processing_complete; ipts_connect.ipts_cb.notify_gfx_status = notify_gfx_status; diff --git a/include/linux/intel_ipts_if.h b/include/linux/intel_ipts_if.h index f329bbfb8079..bad44fb4f233 100644 --- a/include/linux/intel_ipts_if.h +++ b/include/linux/intel_ipts_if.h @@ -60,6 +60,7 @@ typedef struct intel_ipts_callback { } intel_ipts_callback_t; typedef struct intel_ipts_connect { + struct device *client; /* input : client device for PM setup */ intel_ipts_callback_t ipts_cb; /* input : callback addresses */ void *data; /* input : callback data */ u32 if_version; /* input : interface version */ ```
Still ipts_stop
will not be called. Directly intel_ipts_suspend
will be called and it freezes the system on lock screen.
``` unknown: screen lock on GNOME kernel: audit: type=1131 audit(1564158918.879:73): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" e> kernel: ipts: intel_ipts_suspend kernel: ipts: starting intel_ipts_cleanup kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 0 P4D 0 kernel: Oops: 0000 [#1] PREEMPT SMP PTI kernel: CPU: 3 PID: 333 Comm: kworker/3:4 Tainted: G C OE 5.2.1-arch1-1-surface #1 kernel: Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 91.2706.768 04/18/2019 kernel: Workqueue: pm pm_runtime_work kernel: RIP: 0010:i915_request_wait+0x91/0x3f0 [i915] kernel: Code: 00 00 0f 85 6d 03 00 00 48 83 c4 50 4c 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8b 63 40 e8 a6 b5 00 cb 48 8b 83 b8 01 00 0> kernel: RSP: 0018:ffff9eecc22c3c28 EFLAGS: 00010202 kernel: RAX: 0000000000000000 RBX: ffff9ae199d08000 RCX: 0000000000000000 kernel: RDX: 7fffffffffffffff RSI: 0000000000000003 RDI: ffff9ae199d08000 kernel: RBP: 7fffffffffffffff R08: 000000000000039c R09: 0000000000000001 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000100000033 kernel: R13: 0000000000000003 R14: dead000000000100 R15: ffff9ae196d20000 kernel: FS: 0000000000000000(0000) GS:ffff9ae19f380000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000000 CR3: 000000024c20a005 CR4: 00000000003606e0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 kernel: Call Trace: kernel: ? wake_up_klogd+0x10/0x70 kernel: i915_active_wait+0x43/0x180 [i915] kernel: i915_vma_unbind+0xbd/0x240 [i915] kernel: i915_vma_destroy+0x52/0x150 [i915] kernel: intel_ipts_cleanup+0x86/0x1c3 [i915] kernel: intel_runtime_suspend+0x1a8/0x270 [i915] kernel: pci_pm_runtime_suspend+0x5b/0x150 kernel: ? __switch_to_asm+0x34/0x70 kernel: ? pci_has_legacy_pm_support+0x70/0x70 kernel: __rpm_callback+0x7b/0x130 kernel: ? pci_has_legacy_pm_support+0x70/0x70 kernel: ? pci_has_legacy_pm_support+0x70/0x70 kernel: rpm_callback+0x2a/0x90 kernel: ? pci_has_legacy_pm_support+0x70/0x70 kernel: rpm_suspend+0x136/0x610 kernel: pm_runtime_work+0x94/0xa0 kernel: process_one_work+0x1d1/0x3e0 kernel: worker_thread+0x4a/0x3d0 kernel: kthread+0xfd/0x130 kernel: ? process_one_work+0x3e0/0x3e0 kernel: ? kthread_park+0x90/0x90 kernel: ret_from_fork+0x35/0x40 kernel: Modules linked in: cmac rfcomm overlay bnep btusb btrtl btbcm btintel bluetooth ecdh_generic ecc input_leds xt_CHECKSUM iptable_ma> kernel: snd_hda_codec_generic ledtrig_audio snd_pcm_dmaengine cfg80211 snd_hda_intel aesni_intel snd_hda_codec aes_x86_64 crypto_simd cry> kernel: CR2: 0000000000000000 kernel: ---[ end trace b3ff7f811af2f868 ]--- ```
Right, I suspected that, just wasn't sure if the ME devices get removed on either runtime-suspend or suspend (interestingly they get removed in suspend, it seems). I'll add a runtime-suspend method to call ipts_stop
then.
This should fix it:
diff --git a/drivers/misc/ipts/ipts-mei.c b/drivers/misc/ipts/ipts-mei.c
index 199e49cb8d70..296fdaa3a5b0 100644
--- a/drivers/misc/ipts/ipts-mei.c
+++ b/drivers/misc/ipts/ipts-mei.c
@@ -168,6 +168,28 @@ static void init_work_func(struct work_struct *work)
ipts_start(ipts);
}
+static int ipts_mei_cl_runtime_suspend(struct device *dev)
+{
+ struct mei_cl_device *cldev = to_mei_cl_device(dev);
+ ipts_info_t *ipts = mei_cldev_get_drvdata(cldev);
+
+ ipts_stop(ipts);
+
+ return 0;
+}
+
+static int ipts_mei_cl_runtime_resume(struct device *dev)
+{
+ struct mei_cl_device *cldev = to_mei_cl_device(dev);
+ ipts_info_t *ipts = mei_cldev_get_drvdata(cldev);
+
+ return ipts_start(ipts);
+}
+
+const struct dev_pm_ops ipts_mei_cl_pm = {
+ SET_RUNTIME_PM_OPS(ipts_mei_cl_runtime_suspend, ipts_mei_cl_runtime_resume, NULL)
+};
+
static int ipts_mei_cl_probe(struct mei_cl_device *cldev,
const struct mei_cl_device_id *id)
{
@@ -249,6 +271,9 @@ static struct mei_cl_driver ipts_mei_cl_driver = {
.name = IPTS_DRIVER_NAME,
.probe = ipts_mei_cl_probe,
.remove = ipts_mei_cl_remove,
+ .driver = {
+ .pm = &ipts_mei_cl_pm,
+ },
};
static int ipts_mei_cl_init(void)
Unfortunately, it is not working with almost the same journal log.
By the way, on my system, intel_ipts_cleanup
will not be called with your commit IPTS: Fix GuC communication warning on suspend · qzed/linux-surface@03b9074 EDIT: when going into suspend
So ipts_stop
doesn't get called?
About intel_ipts_cleanup
not getting called: I somehow thought runtime_suspend
would get called before suspend
, but that apparently isn't the case, so to stay true to a3a3ed3 we should add that back in to suspend
/resume
I guess. I wonder how the GuC communication warning happened then.
Edit: Maybe we can remove the runtime_suspend
stuff for intel_ipts.c
now that the device-link is present and go back to the way it was in a3a3ed3?
So ipts_stop doesn't get called?
Yes, not get called. intel_ipts_cleanup got called before ipts_stop (?)
I understand runtime_suspend
stuff will call only intel_ipts_cleanup. Will the device-link call both of ipts_stop
and intel_ipts_cleanup
? If it is true, I should test with runtime_suspend
stuff reverted.
With the device link and the patch above, the runtime_suspend
method on the ME device (which will call ipts_stop
) should get called before the runtime_suspend
method in i915_drv.c
(which will call intel_ipts_cleanup
).
The thing is that suspend with the ME devices is a bit weird, for instance they are completely removed during suspend, so it could be that the runtime_suspend
function for the ME device does not get called at all.
Okay, changing https://github.com/qzed/linux-surface/commit/03b9074ef5affe5cef4783f0e57253f070e600d4 back requires to unload the modules again, even with the device link.
If I add the device link between the MEI bus device and the i915 device, suspend works (with warnings) and I don't need to remove the module any more. The problem is that we don't officialy have access to the mei bus device, which means I have to include "../mei/mei_dev.h
in ipts-gfx.c
.
I'll need to have another look at the MEI bus, but it looks like that's completely doing it's own thing for power management. So power-management functions defined in the IPTS MEI driver won't get called at all. As we've seen by changing the device link consumer to the bus device, that subsystem also doesn't seem to care about child device power-management...
I'll need to have another look at the MEI bus
Done.
but it looks like that's completely doing it's own thing for power management
Yup, not calling device PM ops at all. I don't like it. No idea if there's a reason behind this.
Still no idea why the device link does not work. Normally, I'd expect that the bus can only be suspended when all devices on it have been suspended, but apparently that's not the case for ME.
Let's continue discussing from gitter linux-surface/community here.
My porting jakeday patches to Linux 5.2 here and IPTS patch changes from 5.1 to 5.2 is also available there as README.md
to_intel_context()
is not available anymore. So, I usedintel_context_pin_lock()
instead. I actually don't knowintel_context_pin_lock()
is appropriate for replacement of that function.