What's the DHCP lease interval set to? I suspect this may be our internal timer. It wraps at 5 days (according to comments), and it looks like the DHCP renewal is just around that much time as well?
If I had to guess, it looks like the DHCP server makes a very long-duration lease to the device. The device then tries to renew 5 days later and continues using the .130 IP address until 50 s after the renewal request (https://github.com/smoltcp-rs/smoltcp/blob/db18514d46d7c6c1b4305b7e4c5560372806cc84/src/socket/dhcpv4.rs#L17). At that point, it abandons the DHCP lease and unassigns the IP address. I can't figure out why it doesn't follow up with a new DHCP request, though.
30 minutes. As mentioned, the high number of the TXID indicates that it was trying to do something with DHCP before but didn't send anything.
This may have been a side-effect of https://github.com/rtic-rs/cortex-m-rtic/issues/458#issuecomment-933656256. I can't reproduce it now and I'm seeing healthy DHCP traffic.
Happened again, pretty much after the same 4 days 2 hours (around 351890 seconds).
Trace when hitting Ctrl-C at that point:
^C────────────────────────────────────────────────────────────────────────────────
stack backtrace:
0: core::ptr::read_volatile
at /rustc/9dbb26efe8a886f7bba94c2b996c9535ce07f917/library/core/src/ptr/mod.rs:1054:14
1: vcell::VolatileCell<T>::get
at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/vcell-0.1.3/src/lib.rs:33:18
2: stm32h7::generic::Reg<U,REG>::read
at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/stm32h7-0.13.0/src/generic.rs:51:18
3: stm32h7xx_hal::ethernet::eth::interrupt_handler
at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804:13
4: fls::app::eth
at src/bin/fls.rs:823:18
5: fls::app::ETH::{{closure}}
at src/bin/fls.rs:474:1
6: rtic::export::run
at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rtic-0.6.0-rc.2/src/export.rs:33:9
7: ETH
at src/bin/fls.rs:474:1
<exception entry>
8: cortex_m::asm::inline::__wfi
at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/../asm/inline.rs:182:5
9: cortex_m::asm::wfi
at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/asm.rs:55:5
10: fls::app::idle
at src/bin/fls.rs:762:43
11: main
at src/bin/fls.rs:474:1
12: Reset
(HOST) WARN call stack was corrupted; unwinding could not be completed
(HOST) INFO device halted without error
To debug, I increased the SystemTimer tick rate by a factor of 1,000 (up to 10 MHz). With this setup, the internal Instant overflows every 425 seconds. I then configured my DHCP server to lease IP addresses for 60 s, so Stabilizer nominally re-acquires the IP address every 30 seconds.
As can be seen, right as the internal Instant rolls over, Stabilizer fails to re-request the IP address. This is likely due to internal roll-over of the millisecond counter; the mechanism by which this causes a failure is not yet known.
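For reference, that period is consistent with a 32-bit tick counter wrapping (a back-of-the-envelope check; the 32-bit width is my assumption):

```rust
// Wrap period of an assumed 32-bit tick counter at the raised 10 MHz rate:
const TICK_HZ: u64 = 10_000_000;
const WRAP_SECONDS: u64 = (1u64 << 32) / TICK_HZ; // ≈ 429 s, close to the ~425 s observed
```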
During this period, I'm still able to ping the device and telemetry is still active.
Upon further analysis, this is likely an issue with how we're feeding timestamps to DHCP. Specifically, once the internal timer rolls over, our millisecond interval wraps from N -> 0. However, the DHCP client internally stores the next scheduled retry attempt as an Instant sometime past N. This can result in cases where DHCP never gets renewed.
I believe this to be a fault in how smoltcp-nal translates the time: u32 into a smoltcp::Instant. We can likely utilize embedded-time within smoltcp-nal such that smoltcp-nal stores an internal smoltcp::Instant that is monotonically incremented by the elapsed time. This ensures that the time translation occurs properly.
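A minimal sketch of that approach (my own illustration, not smoltcp-nal's actual code; the type and names are hypothetical):

```rust
use smoltcp::time::{Duration, Instant};

/// Sketch: instead of converting the raw u32 millisecond counter directly
/// into a smoltcp::Instant, keep a monotonic Instant and advance it by the
/// elapsed (wrapping) delta on every update.
struct MonotonicClock {
    last_ticks: u32, // last raw reading of the wrapping u32 ms counter
    now: Instant,    // monotonically increasing time handed to smoltcp
}

impl MonotonicClock {
    fn update(&mut self, ticks_ms: u32) -> Instant {
        // wrapping_sub yields the correct elapsed duration even when the
        // u32 counter rolls over from N back toward 0.
        let elapsed_ms = ticks_ms.wrapping_sub(self.last_ticks);
        self.last_ticks = ticks_ms;
        self.now += Duration::from_millis(elapsed_ms as u64);
        self.now
    }
}
```

With this, the Instant fed to the DHCP socket keeps increasing across the u32 wrap, so a retry scheduled past N can still fire.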
There may well be two different overflows. The one here doesn't occur after 2^32 ms (≈49.7 days) but much earlier, by a factor of ~12: 4 days 2 hours is roughly 2^32 ms / 12.
And in my case there is definitely no telemetry, ICMP, or streaming activity anymore when it locks up.
I agree that the symptoms I describe don't quite match what has been shown so far above, and I'd like to get more of a handle on the problems you're facing.
Questions:
An update: I checked that the process() ISR still runs regularly when this bug appears. That part is not hung and the code isn't crashed.
And on each occasion (~3/3) of the bug, the PC is in stm32h7xx_hal::ethernet::eth::interrupt_handler.
In reference to the PC being at eth::interrupt_handler: I'm curious if we're hitting an ISR that isn't properly being cleared, which essentially locks up your device because we sit in the interrupt handler forever. This would seem to match the symptoms (namely, DSP processing continues because it is at a higher priority, but network traffic ceases, as it is at a lower priority).
Reviewing the ethernet peripheral, it looks like ISRs can source from the MAC, DMA, or MTL. Currently, the HAL only checks a few specific, masked DMA interrupt sources and clears them. However, it looks like there are a few MTL interrupt sources that are enabled by default:
If we were to hit an MTL interrupt, it seems like the ISR bit would remain set, so we would continually sit in the eth ISR handler since we're failing to clear MTL interrupt bits.
If we recapture this with a debug probe attached, we should check the ETH_DMAISR
register to determine if an MTL interrupt has been asserted.
The MTL interrupts appear to trigger if the internal FIFOs over/underflow in the transfer from MAC -> DMA. The FIFOs are 2 KB, so if our descriptor ring were to get backed up for some reason, it's feasible we could hit an MTL interrupt that then hangs the device. We'll need to probe some of the ISR registers when this occurs again to aid in debugging; a rough sketch of what to read follows.
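For reference, a diagnostic sketch using raw volatile accesses (the address and bit positions are my reading of RM0433 and should be double-checked against the reference manual):

```rust
/// Read ETH_DMAISR (ETH DMA base 0x4002_9000, offset 0x08 per RM0433) and
/// decode the top-level interrupt sources. Bit positions assumed:
/// DC0IS = bit 0, MTLIS = bit 16, MACIS = bit 17.
unsafe fn eth_isr_sources() -> (bool, bool, bool) {
    const ETH_DMAISR: *const u32 = 0x4002_9008 as *const u32;
    let isr = core::ptr::read_volatile(ETH_DMAISR);
    (
        isr & (1 << 0) != 0,  // DMA channel 0 interrupt pending
        isr & (1 << 16) != 0, // MTL queue interrupt pending
        isr & (1 << 17) != 0, // MAC interrupt pending (incl. MMC counters)
    )
}
```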
It's possible the ICMP echoes from streaming may be causing this problem, as they are relatively large incoming ICMP-level messages containing (rejected, echoed) stream data. Wireshark shows these packets as ~590 bytes, coming in at a nominal 1 Hz rate.
That being said, I wouldn't expect this to be a source of concern. It looks like this data comes in relatively sparsely (1Hz is pretty slow) and is likely pushed through to the descriptor ring quickly.
Does this bug appear after the ~351890 s mentioned above on every boot, or is it non-deterministic? (Just so I know what to look out for…)
Upon inducing the fault on a system, @jordens attached to the faulty unit. It was observed that the unit appeared stuck in the ethernet ISR:
0x080070bc stm32h7xx_hal::ethernet::eth::interrupt_handler+2 ldr r1, [r0, #0]
0x080070be stm32h7xx_hal::ethernet::eth::interrupt_handler+4 ldr r0, [r0, #0]
0x080070c0 stm32h7xx_hal::ethernet::eth::interrupt_handler+6 movs r1, #0
~
~
~
─── Breakpoints ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[1] break at 0x080297fc in /home/rj/.cargo/git/checkouts/stabilizer-f1eefc2ad8feebcf/0a0d92b/src/hardware/mod.rs:89 for DefaultHandler
[2] break at 0x0802b8fa in /home/rj/.cargo/git/checkouts/stabilizer-f1eefc2ad8feebcf/0a0d92b/src/hardware/mod.rs:82 for HardFault
[3] break at 0x08025432 in /rustc/baba6687df3e83fdb15cc6ec239b4a1c75a30505/library/core/src/sync/atomic.rs:2364 for rust_begin_unwind
─── Expressions ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─── History ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─── Memory ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─── Registers ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
r0 0x40029160 r1 0x00000484 r2 0x06000000 r3 0x2001ff04 r4 0x2001fff7 r5 0xe000e208 r6 0x00004000 r7 0x2001ff68 r8 0x000000f0 r9 0x00000000
r10 0x20005110 r11 0x00000001 r12 0x00000000 sp 0x2001ff68 lr 0xffffffe9 pc 0x080070be xPSR 0x6100004d fpscr 0x60000010 msp 0x2001ff68 psp 0x00000000
primask 0x00 basepri 0x00 faultmask 0x00 control 0x00
─── Source ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
799 let eth_dma = &*stm32::ETHERNET_DMA::ptr();
800 eth_dma
801 .dmacsr
802 .write(|w| w.nis().set_bit().ri().set_bit().ti().set_bit());
803 let _ = eth_dma.dmacsr.read();
804 let _ = eth_dma.dmacsr.read(); // Delay 2 peripheral clocks
805 }
806
807 pub unsafe fn enable_interrupt() {
808 let eth_dma = &*stm32::ETHERNET_DMA::ptr();
─── Stack ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[0] from 0x080070be in stm32h7xx_hal::ethernet::eth::interrupt_handler+4 at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804
[1] from 0x080070be in fls::app::eth+4 at src/bin/fls.rs:851
[2] from 0x080070be in fls::app::ETH::{{closure}}+4 at src/bin/fls.rs:498
[3] from 0x080070be in rtic::export::run+4 at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rtic-0.6.0-rc.2/src/export.rs:33
[4] from 0x080070be in fls::app::ETH+20 at src/bin/fls.rs:498
[5] from 0xffffffe9
[6] from 0x08006ff4 in cortex_m::asm::inline::__wfi+2 at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/../asm/inline.rs:182
[7] from 0x08006ff4 in cortex_m::asm::wfi+2 at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/asm.rs:55
[8] from 0x08006ff4 in fls::app::idle+68 at src/bin/fls.rs:790
[9] from 0x08009f0c in fls::app::rtic_ext::main+88 at src/bin/fls.rs:498
─── Threads ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[1] id 0 from 0x080070be in stm32h7xx_hal::ethernet::eth::interrupt_handler+4 at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804
─── Variables ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>>> bt
#0 stm32h7xx_hal::ethernet::eth::interrupt_handler () at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804
#1 fls::app::eth () at src/bin/fls.rs:851
#2 fls::app::ETH::{{closure}} () at src/bin/fls.rs:498
#3 rtic::export::run (priority=1, f=...) at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rtic-0.6.0-rc.2/src/export.rs:33
#4 fls::app::ETH () at src/bin/fls.rs:498
#5 <signal handler called>
#6 0x08006ff4 in cortex_m::asm::inline:: () at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/../asm/inline.rs:182
#7 cortex_m::asm::wfi () at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/asm.rs:55
#8 fls::app::idle (c=...) at src/bin/fls.rs:790
#9 0x08009f0c in fls::app::rtic_ext::main () at src/bin/fls.rs:498
>>>
GDB was then used to extract register values.
>>> x/10wx 0x40029000
0x40029000: 0x00000000 0x00001001 0x00020000 0x00006400
0x40029010: 0x00000000 0x00000000 0x00000000 0x00000000
0x40029020: 0x00000000 0x00000000
>>>
This indicates that the MACIS flag (bit 17) is set in the ETH_DMAISR register: the word at 0x40029008 reads 0x00020000.
>>> x/10wx 0x400280b0
0x400280b0: 0x00000500 0x00000000 0x00000000 0x00000000
0x400280c0: 0x00000000 0x00000000 0x00000000 0x00000000
0x400280d0: 0x00000000 0x03e80000
>>>
This appears to indicate that MMCTXIS and MMCIS are both set (ETH_MACISR at 0x400280B0 reads 0x00000500, i.e. bits 8 and 10). These interrupts relate to the MMC (MAC Management Counters):
Investigation into MMCTXIS indicates an interrupt from the MMC TX statistics counters:
>>> x/10wx 0x40028708
0x40028708: 0x00200000 0x00000000 0x00000000 0x00000000
0x40028718: 0x00000000 0x00000000 0x00000000 0x00000000
0x40028728: 0x00000000 0x00000000
This indicates that the TXGPKTIS bit (bit 21 of ETH_MMC_TX_INTERRUPT at 0x40028708, which reads 0x00200000) is set:
This appears to be caused by these counter interrupts being "enabled by default".
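A possible mitigation, sketched below (my illustration, not necessarily the eventual HAL fix; register offsets are my reading of RM0433): mask the MMC counter interrupts during ethernet init so a counter crossing half/full can never assert MMCIS in the first place.

```rust
/// Sketch: mask all MMC statistics-counter interrupts.
/// Assumed per RM0433: ETH base 0x4002_8000,
/// ETH_MMCRIMR (RX counter interrupt mask) at +0x70C,
/// ETH_MMCTIMR (TX counter interrupt mask) at +0x710;
/// setting a mask bit suppresses that counter's interrupt.
unsafe fn mask_mmc_interrupts() {
    const ETH_MMCRIMR: *mut u32 = 0x4002_870C as *mut u32;
    const ETH_MMCTIMR: *mut u32 = 0x4002_8710 as *mut u32;
    // Blunt sketch: set every bit. A real fix should set only the
    // documented mask bits and leave reserved bits at their reset value.
    core::ptr::write_volatile(ETH_MMCRIMR, 0xFFFF_FFFF);
    core::ptr::write_volatile(ETH_MMCTIMR, 0xFFFF_FFFF);
}
```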
Looks like a DHCP issue. Gets an address, but doesn't renew. Only after days does it issue another REQUEST, and then apparently doesn't get a response. The TX ID appears to indicate a large number of past transactions, though (they don't ever seem to make it onto the wire).
This is with a recent smoltcp (db18514d4).
Doesn't really seem to be related to #441 or #433.