ni / linux

Linux kernel source for NI Linux Real-Time

[5.15] revert e1000e changes #78

Closed amstewart closed 2 years ago

amstewart commented 2 years ago

This PR reverts #77.

When booting NILRT 5.15 on devices with an e1000e Ethernet controller (observed on at least a PXIe-8861), the changes in #77 appear to cause an indefinite RCU stall, as shown below.


Incriminating kernel stack trace:

[    4.238970] ------------[ cut here ]------------
[    4.238972] Voluntary context switch within RCU read-side critical section!
[    4.238974] WARNING: CPU: 2 PID: 1030 at kernel/rcu/tree_plugin.h:316 rcu_note_context_switch+0x4fe/0x560
[    4.238979] Modules linked in: g_ether u_ether libcomposite udc_core mousedev radeon drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm agpgart x86_pkg_temp_thermal coretemp aesni_intel tpm_tis crypto_simd tpm_tis_core igb e1000e i2c_i801 drm i2c_algo_bit tpm i2c_smbus leds_nic78bx button video backlight rng_core
[    4.238990] CPU: 2 PID: 1030 Comm: ifconfig Not tainted 5.15.55-rt48 #1
[    4.238992] Hardware name: National Instruments NI PXIe-8861/NI PXIe-8861, BIOS 1.2.2f0 06/24/2019
[    4.238993] RIP: 0010:rcu_note_context_switch+0x4fe/0x560
[    4.238994] Code: 08 48 89 87 c8 02 00 00 4c 89 8f d0 02 00 00 49 89 11 e9 2c fd ff ff 48 c7 c7 40 71 26 9d c6 05 ea 96 33 01 01 e8 62 3c 70 00 <0f> 0b e9 53 fb ff ff e8 7d 95 f2 ff e9 9f fd ff ff c6 43 15 00 48
[    4.238996] RSP: 0018:ffff9f3841d37af8 EFLAGS: 00010082
[    4.238997] RAX: 0000000000000000 RBX: ffff8dcfb5d28e40 RCX: 000000000000093c
[    4.238998] RDX: 00000000ffffe314 RSI: ffffffff9d461860 RDI: 0000000000000001
[    4.238999] RBP: 0000000000000000 R08: 000000000000003f R09: ffff9f3841d37a90
[    4.239000] R10: 0000000000000040 R11: 0000000000000001 R12: 0000000000000000
[    4.239000] R13: ffff8dce02c249c0 R14: ffff8dce00e50000 R15: ffff8dce02c249c0
[    4.239001] FS:  00007fa7df250800(0000) GS:ffff8dcfb5d00000(0000) knlGS:0000000000000000
[    4.239003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.239004] CR2: 00007fdb98be0020 CR3: 0000000102ff4004 CR4: 00000000003706e0
[    4.239004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.239005] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    4.239006] Call Trace:
[    4.239008]  <TASK>
[    4.239008]  ? __schedule+0x77/0x5f0
[    4.239011]  ? clockevents_program_event+0x8d/0xf0
[    4.239014]  ? schedule+0xad/0x110
[    4.239016]  ? schedule_hrtimeout_range_clock+0x9f/0x130
[    4.239017]  ? __hrtimer_init+0xe0/0xe0
[    4.239019]  ? usleep_range_state+0x60/0x90
[    4.239020]  ? e1000e_update_mc_addr_list_generic+0x12a/0x150 [e1000e]
[    4.239026]  ? e1000e_set_rx_mode+0x229/0x5e0 [e1000e]
[    4.239031]  ? __dev_open+0xf4/0x180
[    4.239033]  ? __dev_change_flags+0x1ba/0x240
[    4.239035]  ? dev_change_flags+0x21/0x60
[    4.239037]  ? devinet_ioctl+0x63d/0x810
[    4.239039]  ? inet_ioctl+0x175/0x1b0
[    4.239041]  ? dev_get_by_name_rcu+0xa/0x20
[    4.239043]  ? netdev_name_node_lookup_rcu+0x5e/0x70
[    4.239045]  ? dev_get_by_name_rcu+0xa/0x20
[    4.239047]  ? dev_ioctl+0x293/0x550
[    4.239049]  ? sock_do_ioctl.constprop.0+0x2b/0xd0
[    4.239051]  ? sock_ioctl+0xcc/0x2f0
[    4.239053]  ? handle_mm_fault+0x73/0x1b0
[    4.239056]  ? __x64_sys_ioctl+0x80/0xb0
[    4.239058]  ? do_syscall_64+0x40/0x90
[    4.239060]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[    4.239063]  </TASK>
[    4.239063] ---[ end trace 0000000000000002 ]---
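In the trace above, the sleeping call (`usleep_range_state`) sits directly beneath `e1000e_update_mc_addr_list_generic`, which is reached from `e1000e_set_rx_mode`. As a minimal triage sketch, the script below scans kernel log text for this RCU warning and lists the call-trace frames that belong to a loadable module. The function name `module_frames`, the regular expression, and the abridged `SAMPLE` log are illustrative assumptions, not part of any NILRT tooling.

```python
import re

# Text the kernel prints when a task sleeps inside an RCU read-side critical section.
RCU_WARNING = "Voluntary context switch within RCU read-side critical section!"

# Matches a call-trace frame such as:
# "[    4.239020]  ? e1000e_update_mc_addr_list_generic+0x12a/0x150 [e1000e]"
# The trailing "[module]" group is only present for frames inside a loadable module.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?:\?\s+)?(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def module_frames(log_text):
    """Return (module, symbol) pairs from call traces that follow the RCU warning."""
    frames = []
    in_warning = False  # saw the RCU warning, waiting for its call trace
    in_trace = False    # currently between <TASK> and </TASK>
    for line in log_text.splitlines():
        if RCU_WARNING in line:
            in_warning = True
        elif in_warning and "<TASK>" in line:
            in_trace = True
        elif "</TASK>" in line:
            in_warning = in_trace = False
        elif in_trace:
            match = FRAME_RE.match(line)
            if match and match.group("module"):
                frames.append((match.group("module"), match.group("symbol")))
    return frames


# Abridged copy of the trace from this PR, used only as sample input.
SAMPLE = """\
[    4.238972] Voluntary context switch within RCU read-side critical section!
[    4.239006] Call Trace:
[    4.239008]  <TASK>
[    4.239019]  ? usleep_range_state+0x60/0x90
[    4.239020]  ? e1000e_update_mc_addr_list_generic+0x12a/0x150 [e1000e]
[    4.239026]  ? e1000e_set_rx_mode+0x229/0x5e0 [e1000e]
[    4.239031]  ? __dev_open+0xf4/0x180
[    4.239063]  </TASK>
"""

if __name__ == "__main__":
    for module, symbol in module_frames(SAMPLE):
        print(f"{module}: {symbol}")
```

Frames are listed innermost first, so the first entry is the module function closest to the sleeping call.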

The stall forces the user to recover the target using either NILRT safe mode or a USB provisioning key. Revert the changes until a fix can be developed.

NI AZDO: https://dev.azure.com/ni/DevCentral/_workitems/edit/2135135

amstewart commented 2 years ago

@gratian Do you want me to revert the same changes in the 5.10 mainline? Or can we wait for a fix there, since 5.10 isn't the default anymore?

gratian commented 2 years ago

@amstewart Can you revert just the commit that enables the option in nati_x86_64_defconfig? I think the other change is valid and we can leave it in. I think we can wait on 5.10 until we find a better fix since it's not in use.

amstewart commented 2 years ago

Patch V2