Open Diaoul opened 1 month ago
I can confirm that `ethtool -K eth0 gso off gro off tso off` fixes the issue mentioned.
How can I have this applied at the OS level?
Could you try running an image with the Intel CPU firmware included? Reading through some of those reports, it appears to possibly be a problem with CPU power states, and the Intel blobs are not included in a default image. Disabling those features will disable NIC packet offloading, which could cause other performance issues.
You can download an image with the Intel firmware from the factory here: https://factory.talos.dev/?arch=amd64&cmdline-set=true&extensions=-&extensions=siderolabs%2Fi915-ucode&extensions=siderolabs%2Fintel-ice-firmware&extensions=siderolabs%2Fintel-ucode&platform=metal&target=metal&version=1.7.5
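For reference, the factory link above selects three official extensions. Expressed as an Image Factory schematic, that selection would look roughly like the sketch below. The extension names are taken from the URL; the schematic layout is assumed from the Image Factory's documented format, not from anything stated in this thread.

```yaml
# Sketch of an Image Factory schematic matching the extensions in the URL above.
# Layout assumed from the Image Factory schematic format; verify before use.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/i915-ucode         # Intel GPU microcode
      - siderolabs/intel-ice-firmware # Intel ICE NIC firmware
      - siderolabs/intel-ucode        # Intel CPU microcode
```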
I have tested disabling only TCP Segmentation Offloading (TSO), and it works just as well as disabling all three of them.
I will try this thanks 🙏
It's not working, I still have hardware resets.
kern: err: [2024-06-29T10:52:39.030745602Z]: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <a1>
TDT <8b>
next_to_use <8b>
next_to_clean <96>
buffer_info[next_to_clean]:
time_stamp <1000baba9>
next_to_watch <a1>
jiffies <1000bae09>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
SUBSYSTEM=pci
DEVICE=+pci:0000:00:1f.6
kern: err: [2024-06-29T10:52:41.014749602Z]: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <a1>
TDT <8b>
next_to_use <8b>
next_to_clean <96>
buffer_info[next_to_clean]:
time_stamp <1000baba9>
next_to_watch <a1>
jiffies <1000baff9>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
SUBSYSTEM=pci
DEVICE=+pci:0000:00:1f.6
kern: warning: [2024-06-29T10:52:47.190200602Z]: ------------[ cut here ]------------
kern: warning: [2024-06-29T10:52:47.190731602Z]: NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out 9240 ms
kern: warning: [2024-06-29T10:52:47.191559602Z]: WARNING: CPU: 0 PID: 6150 at net/sched/sch_generic.c:525 dev_watchdog+0x236/0x240
kern: warning: [2024-06-29T10:52:47.192534602Z]: Modules linked in: i915 i2c_algo_bit ttm e1000e i2c_i801 drm_buddy wdat_wdt ahci drm_display_helper i2c_smbus nvme watchdog libahci
kern: warning: [2024-06-29T10:52:47.193970602Z]: CPU: 0 PID: 6150 Comm: kube-apiserver Not tainted 6.6.33-talos #1
kern: warning: [2024-06-29T10:52:47.194788602Z]: Hardware name: Intel(R) Client Systems NUC8i5BEH/NUC8BEB, BIOS BECFL357.86A.0097.2024.0221.1015 02/21/2024
kern: warning: [2024-06-29T10:52:47.196008602Z]: RIP: 0010:dev_watchdog+0x236/0x240
kern: warning: [2024-06-29T10:52:47.196525602Z]: Code: ff ff ff 48 89 df c6 05 72 82 dc 01 01 e8 92 5d f9 ff 45 89 f0 44 89 e9 48 89 de 48 89 c2 48 c7 c7 a0 7b 43 98 e8 fa d1 11 ff <0f> 0b e9 42 ff ff ff 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90
kern: warning: [2024-06-29T10:52:47.198621602Z]: RSP: 0000:ffffc900071a7db8 EFLAGS: 00010286
kern: warning: [2024-06-29T10:52:47.199217602Z]: RAX: 0000000000000000 RBX: ffff888104178000 RCX: 0000000000000027
kern: warning: [2024-06-29T10:52:47.200019602Z]: RDX: ffff88846dc1d748 RSI: 0000000000000001 RDI: ffff88846dc1d740
kern: warning: [2024-06-29T10:52:47.200841602Z]: RBP: ffff888104178488 R08: 0000000000000000 R09: 205d373335343039
kern: warning: [2024-06-29T10:52:47.201642602Z]: R10: 303165282031454e R11: 7274203a2965454e R12: 0000000000000000
kern: warning: [2024-06-29T10:52:47.202477602Z]: R13: 0000000000000000 R14: 0000000000002418 R15: ffffc900071a7e30
kern: warning: [2024-06-29T10:52:47.203276602Z]: FS: 000000c000700098(0000) GS:ffff88846dc00000(0000) knlGS:0000000000000000
kern: warning: [2024-06-29T10:52:47.204183602Z]: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern: warning: [2024-06-29T10:52:47.204831602Z]: CR2: 000000c0001f8000 CR3: 0000000101e52001 CR4: 00000000003706f0
kern: warning: [2024-06-29T10:52:47.205632602Z]: Call Trace:
kern: warning: [2024-06-29T10:52:47.205922602Z]: <TASK>
kern: warning: [2024-06-29T10:52:47.206172602Z]: ? dev_watchdog+0x236/0x240
kern: warning: [2024-06-29T10:52:47.206634602Z]: ? __warn+0x81/0x120
kern: warning: [2024-06-29T10:52:47.207016602Z]: ? dev_watchdog+0x236/0x240
kern: warning: [2024-06-29T10:52:47.207484602Z]: ? report_bug+0x15d/0x180
kern: warning: [2024-06-29T10:52:47.207918602Z]: ? handle_bug+0x3c/0x80
kern: warning: [2024-06-29T10:52:47.208327602Z]: ? exc_invalid_op+0x17/0x70
kern: warning: [2024-06-29T10:52:47.208772602Z]: ? asm_exc_invalid_op+0x1a/0x20
kern: warning: [2024-06-29T10:52:47.209258602Z]: ? dev_watchdog+0x236/0x240
kern: warning: [2024-06-29T10:52:47.209701602Z]: ? dev_watchdog+0x236/0x240
kern: warning: [2024-06-29T10:52:47.210145602Z]: ? __pfx_dev_watchdog+0x10/0x10
kern: warning: [2024-06-29T10:52:47.210639602Z]: call_timer_fn+0x24/0x110
kern: warning: [2024-06-29T10:52:47.211065602Z]: __run_timers+0x218/0x2a0
kern: warning: [2024-06-29T10:52:47.211492602Z]: run_timer_softirq+0x2c/0x70
kern: warning: [2024-06-29T10:52:47.211946602Z]: handle_softirqs+0xe7/0x300
kern: warning: [2024-06-29T10:52:47.212389602Z]: __irq_exit_rcu+0x68/0x90
kern: warning: [2024-06-29T10:52:47.212813602Z]: sysvec_apic_timer_interrupt+0x3e/0x90
kern: warning: [2024-06-29T10:52:47.213383602Z]: asm_sysvec_apic_timer_interrupt+0x1a/0x20
kern: warning: [2024-06-29T10:52:47.213970602Z]: RIP: 0033:0x457975
kern: warning: [2024-06-29T10:52:47.214344602Z]: Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 49 3b 66 10 0f 86 95 00 00 00 55 48 89 e5 48 83 ec 18 48 8b 10 <48> 89 c6 48 89 d0 48 89 c7 48 f7 e1 70 70 48 ba 00 00 00 00 00 00
kern: warning: [2024-06-29T10:52:47.216438602Z]: RSP: 002b:000000c031b9e4f8 EFLAGS: 00000212
kern: warning: [2024-06-29T10:52:47.217036602Z]: RAX: 00000000041dea60 RBX: 0000000000000000 RCX: 0000000000000007
kern: warning: [2024-06-29T10:52:47.217838602Z]: RDX: 0000000000000028 RSI: 0000000000000000 RDI: 0000000000000000
kern: warning: [2024-06-29T10:52:47.218654602Z]: RBP: 000000c031b9e510 R08: 000000c031b9ebb0 R09: 000000000000002a
kern: warning: [2024-06-29T10:52:47.219472602Z]: R10: 00007fd1fd8d3388 R11: 0000000000000000 R12: 000000c031b9e4a8
kern: warning: [2024-06-29T10:52:47.220290602Z]: R13: 0000000000000000 R14: 000000c034385500 R15: 000000000003fbff
kern: warning: [2024-06-29T10:52:47.221113602Z]: </TASK>
kern: warning: [2024-06-29T10:52:47.221378602Z]: ---[ end trace 0000000000000000 ]---
kern: err: [2024-06-29T10:52:47.221939602Z]: e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
SUBSYSTEM=pci
Maybe the best approach would be to check what Ubuntu is doing in this case? I honestly would not know where to look :(
Also, is this normal (and related)?
kern: info: [2024-06-29T11:26:08.077576996Z]: eth0: renamed from tmp7343e
kern: info: [2024-06-29T11:26:09.185780996Z]: eth0: renamed from tmpd0096
kern: info: [2024-06-29T11:26:09.533711996Z]: eth0: renamed from tmp6ff45
kern: info: [2024-06-29T11:26:09.685917996Z]: eth0: renamed from tmp3dacd
kern: info: [2024-06-29T11:26:10.285559996Z]: eth0: renamed from tmp0c900
kern: info: [2024-06-29T11:26:11.781876996Z]: eth0: renamed from tmpa8abf
kern: info: [2024-06-29T11:26:11.862190996Z]: eth0: renamed from tmp274ce
kern: info: [2024-06-29T11:26:12.033282996Z]: eth0: renamed from tmp4e110
kern: info: [2024-06-29T11:26:12.164607996Z]: eth0: renamed from tmp14a46
kern: info: [2024-06-29T11:26:12.633438996Z]: eth0: renamed from tmp852ee
kern: info: [2024-06-29T11:26:20.781471996Z]: eth0: renamed from tmp01a9b
The message about renaming is just how CNI works (unrelated), but e1000e seems to be known to be buggy, e.g. https://bugzilla.kernel.org/show_bug.cgi?id=118721
Unfortunately, Talos doesn't support updating ethtool settings natively.
Thanks for taking the time to look into this.
Yes, it is a known issue with this hardware. Unfortunately, changing hardware is not something I'm looking into right now, so I'd rather work around the issue.
From the post you shared above:
Disabling TSO seems to have fixed the problem for me. (I needed to set it after a fresh boot, before the interface starts bailing out continually.)
For now I rely on a DaemonSet that runs on all nodes with e1000e hardware and applies `ethtool -K eth0 tso off`.
Is there a more Talos-compliant way to do that, possibly earlier in the boot process?
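A DaemonSet workaround of the kind described above might look like the following minimal sketch. Everything here that is not in the thread is an assumption: the resource name, the node label, and the `example.com/ethtool` image are hypothetical placeholders, and the interface name `eno1` is taken from the kernel log rather than from the `eth0` used in the commands. The target namespace must also permit `hostNetwork` and the added `NET_ADMIN` capability.

```yaml
# Minimal sketch, not a tested manifest. Names, label and image are hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: e1000e-tso-off              # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: e1000e-tso-off
  template:
    metadata:
      labels:
        app: e1000e-tso-off
    spec:
      hostNetwork: true             # act on the node's interfaces, not the pod's
      nodeSelector:
        example.com/nic: e1000e     # hypothetical label set on affected nodes
      initContainers:
        - name: disable-tso
          image: example.com/ethtool:latest   # hypothetical image that ships ethtool
          # Interface name assumed from the kernel log (eno1); adjust per node.
          command: ["ethtool", "-K", "eno1", "tso", "off"]
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]    # needed to change offload settings
      containers:
        - name: pause               # keeps the pod alive after the init step
          image: registry.k8s.io/pause:3.9
```

Because the setting is applied only once the pod is scheduled, this runs well after the interface comes up, which is why an earlier, boot-time mechanism is being asked about.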
Reloading the module with the parameter Node=0 (the NUMA node my NIC is on, i.e. `modprobe e1000e Node=0`) appears to have worked around the issue.
How can I achieve this with Talos?
Most probably you could try loading the module manually with the parameters set.
I have tried the following, deployed from a local Omni instance, to no avail:
machine:
  kernel:
    # Kernel modules to load.
    modules:
      - name: e1000e # Module name.
      - parameters: ["Node=0"]
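One thing that stands out in the snippet above is that `parameters` is written as its own list item, so it does not belong to the `e1000e` entry. In the Talos machine configuration, `parameters` is a field of the same module entry as `name`. A corrected sketch follows. Whether the e1000e driver shipped with Talos accepts a `Node` parameter is not confirmed anywhere in this thread, and module parameters may have no effect if the module is already loaded by the time this configuration is applied.

```yaml
# Sketch only: same settings as above, with parameters attached to the module entry.
machine:
  kernel:
    # Kernel modules to load.
    modules:
      - name: e1000e        # Module name.
        parameters:         # Parameters passed to this module on load.
          - Node=0          # Taken from the modprobe example; acceptance by the driver is unverified.
```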
Bug Report
The cluster is crashing and has connectivity issues. Eventually the cluster recovers after a few minutes.
Description
Talos is showing those errors running on Intel NUC 8 hardware (models NUC8i5BEH and NUC8i3BEH). Before Talos, I ran k3s on top of Ubuntu 20.04 then 22.04 for 3 years without any network issues of the sort.
Apparently this is a known issue and "solutions" exist, like disabling TSO, GSO and GRO using ethtool (source). How could I do the same with Talos?
Logs
Environment
Talos version: [talosctl version --nodes <problematic nodes>]
Kubernetes version: [kubectl version --short] v1.30.2
Platform: Bare Metal Intel NUC 8 i5 BEH