Closed NovaPixell closed 1 week ago
I observed the same problem on one of my WAX206 running snapshot from april with Adblock and OpenVPN. Ethernet connection on LAN port went down and reboot was needed. The device was rock solid before... https://forum.openwrt.org/t/transmit-queue-timed-out-on-netgear-wax206-and-fritzbox-7530-nearly-at-same-moment/165972/2
Same here
[106752.700628] ------------[ cut here ]------------
[106752.705331] NETDEV WATCHDOG: eth1 (mtk_soc_eth): transmit queue 1 timed out
[106752.712389] WARNING: CPU: 2 PID: 0 at 0xffffffc0086bf5f4
[106752.717771] Modules linked in: mproto pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet wireguard uas pppox ppp_generic nft_tproxy nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt76_connac_lib mt76 mac80211 libchacha20poly1305 chacha_neon cfg80211 usb_storage ts_kmp ts_fsm ts_bm tcp_bbr slhc poly1305_neon nfnetlink nf_tproxy_ipv6 nf_tproxy_ipv4 nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libchacha crc_ccitt compat crypto_safexcel sch_teql sch_sfq sch_multiq sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_simple act_pedit act_csum libcrc32c sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow cls_basic
[106752.717920] act_skbedit act_mirred act_gact ip_gre gre ip6_udp_tunnel udp_tunnel ip_tunnel sha1_generic seqiv md5 des_generic libdes authencesn authenc leds_gpio xhci_plat_hcd xhci_pci xhci_mtk_hcd xhci_hcd gpio_button_hotplug usbcore usb_common [last unloaded: mproto]
[106752.828546] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.15.114 #0
[106752.834709] Hardware name: TP-Link TL-XDR6086 (DT)
[106752.839568] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[106752.846596] pc : 0xffffffc0086bf5f4
[106752.850155] lr : 0xffffffc0086bf5f4
[106752.853713] sp : ffffffc008b7bdb0
[106752.857097] x29: ffffffc008b7bdb0 x28: 0000000000000140 x27: 00000000ffffffff
[106752.864301] x26: 0000000000000000 x25: 0000000000000002 x24: ffffff8000c204c0
[106752.871502] x23: 0000000000000000 x22: 0000000000000001 x21: ffffffc008a36000
[106752.878703] x20: ffffff8000c20000 x19: 0000000000000001 x18: ffffffc008a4a338
[106752.885905] x17: ffffffc017491000 x16: ffffffc008b78000 x15: 00000000000003e4
[106752.893106] x14: 000000000000014c x13: ffffffc008b7bad8 x12: ffffffc008aa2338
[106752.900308] x11: 712074696d736e61 x10: ffffffc008aa2338 x9 : 0000000000000000
[106752.907509] x8 : ffffffc008a4a2e8 x7 : ffffffc008a4a338 x6 : 0000000000000001
[106752.914711] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[106752.921913] x2 : ffffff801feaf080 x1 : ffffffc017491000 x0 : 000000000000003f
[106752.929115] Call trace:
[106752.931634] 0xffffffc0086bf5f4
[106752.934847] 0xffffffc0080a5bc0
[106752.938058] 0xffffffc0080a6438
[106752.941269] 0xffffffc0080a6d88
[106752.944480] 0xffffffc00801010c
[106752.947691] 0xffffffc00803b560
[106752.950903] 0xffffffc00803b690
[106752.954114] 0xffffffc008089634
[106752.957326] 0xffffffc008421184
[106752.960537] 0xffffffc00801584c
[106752.963749] 0xffffffc00801614c
[106752.966960] 0xffffffc00882f2e0
[106752.970172] 0xffffffc00882fa24
[106752.973383] 0xffffffc008011374
[106752.976595] 0xffffffc008830004
[106752.979806] 0xffffffc0080696c0
[106752.983017] 0xffffffc0080698d4
[106752.986228] 0xffffffc00801ed10
[106752.989440] 0xffffffc0088382cc
[106752.992653] ---[ end trace b60fde47a7177da9 ]---
I observed the same problem on one of my WAX206 running snapshot from april with Adblock and OpenVPN. Ethernet connection on LAN port went down and reboot was needed. The device was rock solid before... https://forum.openwrt.org/t/transmit-queue-timed-out-on-netgear-wax206-and-fritzbox-7530-nearly-at-same-moment/165972/2
Same ethernet driver ports (mtk_soc_eth)?
Same here
[106752.700628] ------------[ cut here ]------------ [106752.705331] NETDEV WATCHDOG: eth1 (mtk_soc_eth): transmit queue 1 timed out [106752.712389] WARNING: CPU: 2 PID: 0 at 0xffffffc0086bf5f4 [106752.717771] Modules linked in: mproto pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet wireguard uas pppox ppp_generic nft_tproxy nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt76_connac_lib mt76 mac80211 libchacha20poly1305 chacha_neon cfg80211 usb_storage ts_kmp ts_fsm ts_bm tcp_bbr slhc poly1305_neon nfnetlink nf_tproxy_ipv6 nf_tproxy_ipv4 nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libchacha crc_ccitt compat crypto_safexcel sch_teql sch_sfq sch_multiq sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_simple act_pedit act_csum libcrc32c sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow cls_basic [106752.717920] act_skbedit act_mirred act_gact ip_gre gre ip6_udp_tunnel udp_tunnel ip_tunnel sha1_generic seqiv md5 des_generic libdes authencesn authenc leds_gpio xhci_plat_hcd xhci_pci xhci_mtk_hcd xhci_hcd gpio_button_hotplug usbcore usb_common [last unloaded: mproto] [106752.828546] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.15.114 #0 [106752.834709] Hardware name: TP-Link TL-XDR6086 (DT) [106752.839568] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [106752.846596] pc : 0xffffffc0086bf5f4 [106752.850155] lr : 0xffffffc0086bf5f4 [106752.853713] sp : ffffffc008b7bdb0 [106752.857097] x29: ffffffc008b7bdb0 x28: 0000000000000140 x27: 00000000ffffffff [106752.864301] x26: 0000000000000000 x25: 0000000000000002 x24: ffffff8000c204c0 [106752.871502] x23: 0000000000000000 x22: 0000000000000001 x21: ffffffc008a36000 [106752.878703] x20: ffffff8000c20000 x19: 0000000000000001 x18: ffffffc008a4a338 [106752.885905] x17: ffffffc017491000 x16: ffffffc008b78000 x15: 00000000000003e4 [106752.893106] x14: 000000000000014c x13: ffffffc008b7bad8 x12: ffffffc008aa2338 [106752.900308] x11: 712074696d736e61 x10: ffffffc008aa2338 x9 : 0000000000000000 [106752.907509] x8 : ffffffc008a4a2e8 x7 : ffffffc008a4a338 x6 : 0000000000000001 [106752.914711] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [106752.921913] x2 : ffffff801feaf080 x1 : ffffffc017491000 x0 : 000000000000003f [106752.929115] Call trace: [106752.931634] 0xffffffc0086bf5f4 [106752.934847] 0xffffffc0080a5bc0 [106752.938058] 0xffffffc0080a6438 [106752.941269] 0xffffffc0080a6d88 [106752.944480] 0xffffffc00801010c [106752.947691] 0xffffffc00803b560 [106752.950903] 0xffffffc00803b690 [106752.954114] 0xffffffc008089634 [106752.957326] 0xffffffc008421184 [106752.960537] 0xffffffc00801584c [106752.963749] 0xffffffc00801614c [106752.966960] 0xffffffc00882f2e0 [106752.970172] 0xffffffc00882fa24 [106752.973383] 0xffffffc008011374 [106752.976595] 0xffffffc008830004 [106752.979806] 0xffffffc0080696c0 [106752.983017] 0xffffffc0080698d4 [106752.986228] 0xffffffc00801ed10 [106752.989440] 0xffffffc0088382cc [106752.992653] ---[ end trace b60fde47a7177da9 ]---
What build are you using of OpenWRT? It's a very serious bug, I hope it gets fixed because it's quite annoying
I'm suing OpenWrt 23.05.0-rc1 r23069-e2701e0f33
The device is TP-Link TL-XDR6086.
I'm suing OpenWrt 23.05.0-rc1 r23069-e2701e0f33
The device is TP-Link TL-XDR6086.
Ok, so seems a conmon issue on this board
I have now turned off:
This could be the cause. Will follow up with updates if the problem occurs again. @cnJamesLucas @NerdyProjects have you tried it yet?
Current those two option is on in my device. This issue occurs very rare, only twice in two monthes. I will try turn the options off when the issue occurs again.
I have and had flow offloading turned off. Interestingly, my device only offers software flow offloading. Just to make clear: WAX206 is a MT7622BV device but uses the same ethernet driver.
For sure, when replugging the device yesterday I could not reproduce the problem, but I doubt it disappeared by taking it offline for half a day :D
What do you get from ethtool -l eth0
? Try making queues of each of types match number of cores and restart irqbalance.
root@OpenWrt:# ethtool -l lan1
Channel parameters for lan1:
Cannot get device channel parameters: Not supported
root@OpenWrt:# ethtool -l eth0
Channel parameters for eth0:
Cannot get device channel parameters: Not supported
Question is if it is a multiqueue device:
/proc/interrupts
for more than 2 interrupts for eth0ls /sys/class/net/*/queues/rx-*/*_cpus
And in multicore case - if different devices get interrupts on different CPUs, CPU0 might be short like with long IRQ work from multiple sources.
141: 12572 0 MT_SYSIRQ 224 Level 1b100000.ethernet
142: 46468 0 MT_SYSIRQ 225 Level 1b100000.ethernet
145: 0 0 dummy 0 Edge PCIe PME
146: 0 3 mt7530 1 Edge mt7530-0:01
147: 0 0 mt7530 2 Edge mt7530-0:02
148: 0 9 mt7530 3 Edge mt7530-0:03
149: 0 0 mt7530 4 Edge mt7530-0:04
150: 385178 0 MTK PCIe MSI 524288 Edge mt7915e
/sys/class/net/eth0/queues:
rx-0 tx-0 tx-1 tx-10 tx-11 tx-12 tx-13 tx-14 tx-15 tx-2 tx-3 tx-4 tx-5 tx-6 tx-7 tx-8 tx-9
/sys/class/net/eth0/queues/rx-0/rps_cpus (contains 0)
Traffic in all situations where the hangup occured was not much, maybe 10 Mbit/s or something passing through the device.
Question is if it is a multiqueue device:
- check
/proc/interrupts
for more than 2 interrupts for eth0ls /sys/class/net/*/queues/rx-*/*_cpus
And in multicore case - if different devices get interrupts on different CPUs, CPU0 might be short like with long IRQ work from multiple sources.
root@routerone:~# ethtool -l eth1
Channel parameters for eth1:
Cannot get device channel parameters: Not supported
root@routerone:~# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
11: 1442638 1493254 1642039 1868045 GICv3 30 Level arch_ti mer
14: 0 0 0 0 GICv3 142 Level wdt_bar k
27: 0 0 0 0 mt-eint 11 Edge gpio-ke ys
37: 0 0 0 0 mt-eint 21 Edge gpio-ke ys
72: 0 0 0 0 mt-eint 56 Edge gpio-ke ys
117: 0 0 0 0 GICv3 148 Level 1032000 0.crypto
118: 0 0 0 0 GICv3 149 Level 1032000 0.crypto
119: 0 0 0 0 GICv3 150 Level 1032000 0.crypto
120: 0 0 0 0 GICv3 151 Level 1032000 0.crypto
121: 12 0 0 0 GICv3 155 Level ttyS0
124: 39382 0 0 0 GICv3 172 Level 1100a00 0.spi
125: 0 0 0 0 GICv3 173 Level 1100b00 0.spi
134: 6661219 0 0 0 GICv3 229 Level 1510000 0.ethernet
135: 6943868 0 0 0 GICv3 230 Level 1510000 0.ethernet
137: 8699068 0 0 0 GICv3 245 Level mt7915e
IPI0: 5739 33936 9055 10049 Rescheduling interrupts
IPI1: 1906678 12502734 7387462 6410318 Function call interrupts
IPI2: 0 0 0 0 CPU stop interrupts
IPI3: 0 0 0 0 CPU stop (for crash dump) interrupts
IPI4: 0 0 0 0 Timer broadcast interrupt s
IPI5: 0 0 0 0 IRQ work interrupts
IPI6: 0 0 0 0 CPU wake-up interrupts
Err: 0
root@routerone:~# ls /sys/class/net/*/queues/rx-*/*_cpus
/sys/class/net/br-lan/queues/rx-0/rps_cpus
/sys/class/net/eth0/queues/rx-0/rps_cpus
/sys/class/net/eth1.835/queues/rx-0/rps_cpus
/sys/class/net/eth1/queues/rx-0/rps_cpus
/sys/class/net/lan1/queues/rx-0/rps_cpus
/sys/class/net/lan2/queues/rx-0/rps_cpus
/sys/class/net/lan3/queues/rx-0/rps_cpus
/sys/class/net/lan4/queues/rx-0/rps_cpus
/sys/class/net/lo/queues/rx-0/rps_cpus
/sys/class/net/phy0-ap0/queues/rx-0/rps_cpus
/sys/class/net/phy1-ap0/queues/rx-0/rps_cpus
/sys/class/net/pppoe-wan/queues/rx-0/rps_cpus
I disabled HW/SW Offloading. On EX5601-T0.
are you running irqbalance to make interrupts go to differen cores?
I did not run irqbalance.
I did not run irqbalance.
Same here
Try running irqbalance, it should be able to direct different irqs to different CPUs. ??s_cpus being 0 means queue follows IRQ.
Try running irqbalance, it should be able to direct different irqs to different CPUs. ??s_cpus being 0 means queue follows IRQ.
I have disabled Hardware/Software Offload, could this be the cause of the bug? In any case, by enabling irqbalance could I have benefits only on performance? Isn't the bug ONLY with the ethernet switch driver?
It is timeout (long interrupt serviceing) where normal PC would have PCIe ethernet card connected, not the wire side.
It is timeout (long interrupt serviceing) where normal PC would have PCIe ethernet card connected, not the wire side.
Ok, but since the problem occurred with Hardware/Software Offload enabled, I don't think with irqbalance it can be solved
In its maximal case soft offload puts slightly modified packets in front of qdisc-s, there is a chance non-offloaded packet stays behind too long.
In its maximal case soft offload puts slightly modified packets in front of qdisc-s, there is a chance non-offloaded packet stays behind too long.
So, I have to disable soft offload?
You can spread load to all cores to make lockup less likely, or get into driver programming and turn watchdog fail into non-fatal log record.
You can spread load to all cores to make lockup less likely, or get into driver programming and turn watchdog fail into non-fatal log record.
What do you mean with "or get into driver programming and turn watchdog fail into non-fatal log record"
Exactly that
!!!! Sun Jul 16 20:31:12 2023 kern.debug kernel: [328426.487591] dev_watchdog+0x330/0x33c
Sun Jul 16 20:31:12 2023 kern.debug kernel: [328426.491326] call_timer_fn.constprop.0+0x20/0x80
I will not believe that the lockup happens because of some CPU load stuff. As I said, there have been not much traffic at my side, and the CPU was running with performance governor all the time. There has to be a driver bug maybe due to silicon misbehaviour or whatever... If the watchdog does not stop the hardware, it might have to reinitialize it? Without being able to reproduce it, I guess it's hard to just patch out the lockup...
The watchdog at say 10kHz tracks qdisc-ed, i.e not offloaded packet reaching the wire while offload pushes packets into netcard queue (ethtool -g/-G) preempting that from happening. That queue is not strictly FIFO, so anything can happen out-of-order, and the packet does not reach wire on deadline time.
I will not believe that the lockup happens because of some CPU load stuff. As I said, there have been not much traffic at my side, and the CPU was running with performance governor all the time. There has to be a driver bug maybe due to silicon misbehaviour or whatever... If the watchdog does not stop the hardware, it might have to reinitialize it? Without being able to reproduce it, I guess it's hard to just patch out the lockup...
I think is a mtk_soc_eth driver issue, related to ethernet ports switch.
You can spread load to all cores to make lockup less likely, or get into driver programming and turn watchdog fail into non-fatal log record.
IMHO reducing possibility of facing the issue makes it harder to reproduce the problem and can bring us to dead end
related to ethernet ports switch
Frames forwarded inside switch do not touch host at all.
I wonder if porting these patches could help with the issue: https://github.com/gl-inet/gl-infra-builder/blob/main/patches-mt798x-7.6.6.1/0403-compat-gpy211-external-flash.patch https://github.com/gl-inet/gl-infra-builder/blob/main/patches-mt798x-7.6.6.1/3003-target-mediatek-mtk-eth-poll-gpy211-link-state.patch https://github.com/gl-inet/gl-infra-builder/blob/main/patches-mt798x-7.6.6.1/3006-kernel-common-eth-Remove-SGMII_FIXED_2G5-from-Maxlin.patch https://github.com/gl-inet/gl-infra-builder/blob/main/patches-mt798x-7.6.6.1/3007-kernel-common-eth-Fix-HSGMII-link-down-issue.patch
One has to sit monitoring mii events and poll mii status and check if updown could be co-related to adapter queue timeouts. But anything improving physical connection stability across multiple platforms is a plus. Related or not to issue.
One has to sit monitoring mii events and poll mii status and check if updown could be co-related to adapter queue timeouts. But anything improving physical connection stability across multiple platforms is a plus. Related or not to issue.
We should understand how to create debug output that analyzes buffers and more
You can graph ethtool -S eth0
without re-compiling and relate that to event.
Probably tc -s qdisc is related, though unknown if as cause or consequence.
Miitool/ethtool probably will not notice short up/down events if adapter+driver does not report them.
Just now, crash again: kern.info kernel: [64406.248738] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 5 timed out
You need to establish syslog over wifi (kiwi syslog is in windows world), preferably TCP to survive short network glitches. Issue will continue on at same intensity if no corrective activity is taken, like changing some sysctl or driver param or code. Syslog collection is just to see events leading to BUG and trying to make sense of them.
You need to establish syslog over wifi (kiwi syslog is in windows world), preferably TCP to survive short network glitches. Issue will continue on at same intensity if no corrective activity is taken, like changing some sysctl or driver param or code. Syslog collection is just to see events leading to BUG and trying to make sense of them.
I can access to syslog without no problems. The mtk switch driver problem is serious and we need to figure out how to solve it. Can someone who knows about OpenWRT project drivers help us out?
So: Is there any specific event chain leading or trailing switch blackouts? Does irqbalance make problem pop up less often? So far you said flowtable offload is a trigger for problem.
crash on gl-mt3000:
Oops#1 Part1
<7>[96200.033503] sha1_generic seqiv md5 md4 echainiv des_generic libdes chacha20poly1305 cbc authencesn authenc arc4 uas usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_mtk_hcd xhci_hcd ohci_platform ohci_hcd fsl_mph_dr_of ehci_platform ehci_fsl ehci_hcd gpio_button_hotplug vfat fat exfat usbcore usb_common mii [last unloaded: ksmbd]
<7>[96200.150156] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.125 #0
<7>[96200.156238] Hardware name: GL.iNet GL-MT3000 (DT)
<7>[96200.160928] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<7>[96200.167877] pc : dev_watchdog+0x330/0x33c
<7>[96200.171879] lr : dev_watchdog+0x330/0x33c
<7>[96200.175877] sp : ffffffc00800bdb0
<7>[96200.179178] x29: ffffffc00800bdb0 x28: 0000000000000140 x27: 00000000ffffffff
<7>[96200.186305] x26: 0000000000000000 x25: 0000000000000001 x24: ffffff8000c7d480
<7>[96200.193430] x23: 0000000000000000 x22: 0000000000000001 x21: ffffffc008ae6000
<7>[96200.200554] x20: ffffff8000c7d000 x19: 0000000000000000 x18: ffffffc008afa338
<7>[96200.207679] x17: ffffffc0173f6000 x16: ffffffc008008000 x15: 0000000000000588
<7>[96200.214803] x14: 00000000000001d8 x13: ffffffc00800bad8 x12: ffffffc008b52338
<7>[96200.221928] x11: 712074696d736e61 x10: ffffffc008b52338 x9 : 0000000000000000
<7>[96200.229052] x8 : ffffffc008afa2e8 x7 : ffffffc008afa338 x6 : 0000000000000001
<7>[96200.236176] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
<7>[96200.243301] x2 : ffffff801fec4080 x1 : ffffffc0173f6000 x0 : 000000000000003f
<7>[96200.250427] Call trace:
<7>[96200.252864] dev_watchdog+0x330/0x33c
<7>[96200.256517] call_timer_fn.constprop.0+0x20/0x80
<7>[96200.261124] __run_timers.part.0+0x208/0x284
<7>[96200.265383] run_timer_softirq+0x38/0x70
<7>[96200.269295] _stext+0x10c/0x28c
<7>[96200.272426] __irq_exit_rcu+0xdc/0xfc
<7>[96200.276081] irq_exit+0xc/0x1c
<7>[96200.279126] handle_domain_irq+0x60/0x8c
<7>[96200.283041] gic_handle_irq+0x50/0x120
<7>[96200.286782] call_on_irq_stack+0x20/0x34
<7>[96200.290695] do_interrupt_handler+0x4c/0x54
<7>[96200.294867] el1_interrupt+0x2c/0x4c
<7>[96200.298433] el1h_64_irq_handler+0x14/0x20
<7>[96200.302519] el1h_64_irq+0x78/0x7c
<7>[96200.305910] arch_cpu_idle+0x14/0x20
<7>[96200.309475] do_idle+0xc0/0x150
<7>[96200.312610] cpu_startup_entry+0x20/0x60
<7>[96200.316523] secondary_start_kernel+0x130/0x140
<7>[96200.321043] __secondary_switched+0x50/0x54
<4>[96200.325220] ---[ end trace e415913f88d97c43 ]---
<3>[96200.329864] mtk_soc_eth 15100000.ethernet eth1: transmit timed out
<1>[96200.441331] Unable to handle kernel read from unreadable memory at virtual address 0000000000000048
<1>[96200.450397] Mem abort info:
<1>[96200.453191] ESR = 0x0000000096000005
<1>[96200.456927] EC = 0x25: DABT (current EL), IL = 32 bits
<1>[96200.462231] SET = 0, FnV = 0
<1>[96200.465274] EA = 0, S1PTW = 0
<1>[96200.468401] FSC = 0x05: level 1 translation fault
<1>[96200.473270] Data abort info:
<1>[96200.476138] ISV = 0, ISS = 0x00000005
<1>[96200.479959] CM = 0, WnR = 0
<1>[96200.482923] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000046c2d000
<1>[96200.489350] [0000000000000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
<0>[96200.498048] Internal error: Oops: 96000005 [#1] SMP
<7>[96200.502913] Modules linked in: ksmbd qcserial pppoe ppp_async option l2tp_ppp cdc_mbim wireguard usb_wwan sierra_net sierra rndis_host qmi_wwan pptp pppox ppp_mppe ppp_generic pl2303 nft_fib_inet mt7915e mt76_connac_lib mt76 mac80211 libchacha20poly1305 ipt_REJECT huawei_cdc_ncm chacha_neon cfg80211 cdc_ncm cdc_ether xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_policy xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_MASQUERADE xt_LOG xt_HL xt_DSCP xt_DNETMAP xt_DHCPMAC xt_CT xt_CLASSIFY xfrm_interface usbserial usbnet usblp ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda ts_fsm ts_bm tcp_scalable tcp_bbr slhc sch_cake r8152 poly1305_neon nft_tproxy nft_socket nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject_bridge nft_reject nft_redir nft_quota nft_queue nft_objref
<7>[96200.503129] nft_numgen nft_nat nft_meta_bridge nft_masq nft_log nft_limit nft_hash nft_fwd_netdev nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_dup_netdev nft_ct nft_counter nft_compat nft_chain_nat nfnetlink_queue nf_tproxy_ipv6 nf_tproxy_ipv4 nf_tables nf_socket_ipv6 nf_socket_ipv4 nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_log_syslog nf_dup_netdev nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_netlink nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack_broadcast nf_conntrack_bridge ts_kmp nf_conntrack_amanda nf_conncount macvlan libcurve25519_generic libcrc32c libchacha ipvlan iptable_raw iptable_nat iptable_mangle iptable_filter ipt_ah ipt_ECN ipheth ip_tables crc_ccitt compat_xtables compat cdc_wdm br_netfilter natflow natcap nf_nat nf_conntrack crypto_safexcel ntfs3 sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow
<7>[96200.590302] cls_basic act_skbedit act_mirred act_gact pwm_fan xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ipmac ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink jool_siit jool nf_defrag_ipv4 jool_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 nfsv4 asn1_decoder nfs msdos bonding tls ip6_gre ip_gre gre ifb nat46 nf_defrag_ipv6 ip6_vti ip_vti sit l2tp_netlink l2tp_core ipcomp6 xfrm6_tunnel esp6 ah6 xfrm4_tunnel ipcomp esp4 ah4 ip6_tunnel oid_registry tunnel6 tunnel4 ip_tunnel udp_diag tcp_diag raw_diag inet_diag tun ovpn_dco_v2 udp_tunnel ip6_udp_tunnel xfrm_user xfrm_ipcomp af_key xfrm_algo lockd sunrpc grace autofs4 dns_resolver nls_utf8 nls_iso8859_1 nls_cp437 crypto_user algif_skcipher algif_rng algif_hash algif_aead af_alg
<7>[96200.676978] sha1_generic seqiv md5 md4 echainiv des_generic libdes chacha20poly1305 cbc authencesn authenc arc4 uas usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_mtk_hcd xhci_hcd ohci_platform ohci_hcd fsl_mph_dr_of ehci_platform ehci_fsl ehci_hcd gpio_button_hotplug vfat fat exfat usbcore usb_common mii [last unloaded: ksmbd]
<7>[96200.793627] CPU: 1 PID: 22881 Comm: kworker/1:1 Tainted: G W 5.15.125 #0
<7>[96200.801616] Hardware name: GL.iNet GL-MT3000 (DT)
<7>[96200.806306] Workqueue: events mtk_pending_work
<7>[96200.810748] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<7>[96200.817696] pc : mtk_wed_fe_reset+0x30/0x90
<7>[96200.821869] lr : mtk_wed_fe_reset+0x48/0x90
<7>[96200.826039] sp : ffffffc00f0cbd10
<7>[96200.829341] x29: ffffffc00f0cbd10 x28: 0000000000000000 x27: ffffff8000835c00
<7>[96200.836466] x26: ffffff8006f42874 x25: ffffff801fec9905 x24: ffffff800009a080
<7>[96200.843592] x23: 0000000000000001 x22: ffffffc008c00888 x21: ffffffc008ba2b60
<7>[96200.850717] x20: 0000000000000001 x19: ffffff8003567108 x18: 0000000000000000
<7>[96200.857841] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
<7>[96200.864965] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
<7>[96200.872089] x11: 000000000000b343 x10: 00000000000008a0 x9 : ffffffc00f0cbb70
<7>[96200.879214] x8 : ffffff8000836500 x7 : 0000000000000001 x6 : 0000000000000001
<7>[96200.886338] x5 : 5800080540096738 x4 : 0000000000000000 x3 : ffffff8003562038
<7>[96200.893462] x2 : 0000000000000000 x1 : 000000000000030f x0 : 0000000000000000
<7>[96200.900587] Call trace:
<7>[96200.903022] mtk_wed_fe_reset+0x30/0x90
<7>[96200.906847] mtk_pending_work+0x58/0x1d0
<7>[96200.910760] process_one_work+0x210/0x3b0
<7>[96200.914759] worker_thread+0x170/0x4d0
<7>[96200.918496] kthread+0x11c/0x130
<7>[96200.921717] ret_from_fork+0x10/0x20
<0>[96200.925286] Code: 912d82b5 aa1503e0 94094a1a f8747ac0 (f9402413)
<4>[96200.931364] ---[ end trace e415913f88d97c44 ]---
@ptpt52 any specific conditions it happened on (like huge upload/download), what wan speed do you have?
So: Is there any specific event chain leading or trailing switch blackouts? Does irqbalance make problem pop up less often? So far you said flowtable offload is a trigger for problem.
The problem always occurs, both with irqbalance, even with hw/sw offload enabled.
@ptpt52
mtk_wed_fe_reset
I still do not know what stops ethernet, but this patch should fix crash that happens later: https://lore.kernel.org/lkml/20230818183313.52f9b9f3@kernel.org/T/
@ptpt52
mtk_wed_fe_reset
I still do not know what stops ethernet, but this patch should fix crash that happens later: https://lore.kernel.org/lkml/20230818183313.52f9b9f3@kernel.org/T/
great thanks. I would try that patch.
@ptpt52
mtk_wed_fe_reset
I still do not know what stops ethernet, but this patch should fix crash that happens later: https://lore.kernel.org/lkml/20230818183313.52f9b9f3@kernel.org/T/
This woul'd fix this issue (transmit queue 1 timed out?
No. Issue is hard to debug, because there is no crash - hardware just hangs. Anyway with this fix maybe driver is able to succesfuly reset hw (this is better than rebooting the device)
Let me throw in following link: http://lists.infradead.org/pipermail/linux-mediatek/2017-November/011174.html
Sounds somehow familiar to what we are experiencing...
Reading that, we could try:
As I still cannot reproduce the issue, I can also not debug it... But I might try in a few days the reproduction steps listed in that thread.
Describe the bug
While using the router Zyxel EX5601-T0 I randomly encountered the following problem: The Ethernet switch driver (mtk_soc_eth) stops working, for no reason (the crash log is attached in the Actual behavior section). The following problem seems to be the same as this Issue: #12143
This problem blocks the normal use of the router as well as the functionality of the Ethernet ports.
OpenWrt version
r23551-e21b4c9636
OpenWrt target/subtarget
mediatek/filogic
Device
Zyxel EX5601-T0
Image kind
Official downloaded image
Steps to reproduce
The mentioned issue is randomly encountered, I encountered the issue twice during an active upload stream (for example live video stream) a month apart.
Actual behaviour
KERNEL LOG:
Actual behaviour
Using the following configuration:
Upon crashing, all Ethernet ports stop responding: not allowing me to access the OpenWRT GUI via numeric IP/local DNS name. I can only access the OpenWRT GUI from the Wi-Fi interface. (The bug affects the mtk_soc_eth ethernet switch).
To restore full system functionality, I had to completely reboot the router.
Expected behaviour
The network connectivity should remain stable, and the router should not experience timeouts/crash or loss of LAN access.
Additional info
The issue occurs randomly and is not reproducible consistently. The log indicates a timeout in the transmit queue of eth1, which is related to the Mtk_soc_eth driver. Searching the web with the keyword "(mtk_soc_eth): transmit queue" I came across several similar issues. Except for this problem, the router has never presented any instability problems
I should add that I use Stubby for DoT configuration, I don't think it is the cause of the problem.
Diffconfig
No response
Terms