quartiq / stabilizer

Firmware and software for the Sinara Stabilizer module with high speed, low latency ADC/DAC data processing and powerful DSP algorithms in between
http://quartiq.de/stabilizer/
Apache License 2.0

Network stuck #450

Closed · jordens closed this issue 2 years ago

jordens commented 2 years ago

Looks like a DHCP issue. It gets an address but doesn't renew. Only after days does it issue another REQUEST, and then it apparently doesn't get a response. The TX ID appears to indicate a large number of past transactions, though they don't ever seem to make it onto the wire.

This is with a recent smoltcp (db18514d4).

Doesn't really seem to be related to #441 or #433


rj@murray:~/src/fls (master)$ cargo +nightly run --release --
    Finished release [optimized + debuginfo] target(s) in 0.04s
     Running `probe-run --chip STM32H743ZITx --speed 30000 target/thumbv7em-none-eabihf/release/fls`
(HOST) INFO  flashing program (245.88 KiB)
(HOST) INFO  success!
────────────────────────────────────────────────────────────────────────────────
INFO - Starting
INFO - EUI48: 04-91-62-d9-81-ff
INFO - Found Pounder
INFO - Assuming Pounder v1.1 or later
INFO - setup() complete
WARN - QSPI stalling
WARN - Network link DOWN
WARN - Network link UP
INFO - MQTT connected, subscribing to settings
INFO - Settings update: `iir`
INFO - Settings update: `stream_target`
INFO - Opening stream
^C────────────────────────────────────────────────────────────────────────────────
stack backtrace:
   0: ETH
        at src/bin/fls.rs:474
      <exception entry>
   1: rtic::export::lock
        at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rtic-0.6.0-rc.2/src/export.rs:142:8
   2: fls::app::<impl rtic_core::Mutex for fls::app::shared_resources::network>::lock
        at src/bin/fls.rs:474:1
   3: fls::app::idle
        at src/bin/fls.rs:759:19
   4: main
        at src/bin/fls.rs:474:1
   5: Reset
(HOST) WARN  call stack was corrupted; unwinding could not be completed
(HOST) INFO  device halted without error
ryan-summers commented 2 years ago

What's the DHCP lease interval set to? I suspect this may be our internal timer. It wraps at 5 days (according to comments), and it looks like the DHCP renewal is just around that much time as well?

https://github.com/quartiq/stabilizer/blob/27564ccb53be1e2401d25362935f686c79e81402/src/hardware/system_timer.rs#L24

If I had to guess, it looks like the DHCP server grants a very long lease to the device. The device then tries to renew 5 days later and continues using the .130 IP address until 50s after the renewal request (https://github.com/smoltcp-rs/smoltcp/blob/db18514d46d7c6c1b4305b7e4c5560372806cc84/src/socket/dhcpv4.rs#L17). At that point, it abandons the DHCP lease and unassigns the IP address. I can't figure out why it doesn't follow up with a new DHCP request, though.
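Schematically, the client-side timeline being described looks like this (illustrative only, not smoltcp's implementation; the renewal-at-half-lease rule is standard DHCP, and the 50 s figure is from the smoltcp source linked above):

// Illustrative sketch of the DHCP client timing described above.
fn dhcp_timeline(acquired_at_s: u64, lease_s: u64) -> (u64, u64) {
    let renew_at = acquired_at_s + lease_s / 2; // T1: first renewal request
    let abandon_at = renew_at + 50; // address unassigned if the renewal goes unanswered
    (renew_at, abandon_at)
}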

jordens commented 2 years ago

30 minutes. As mentioned, the high TX ID indicates that it was trying to do something with DHCP before but didn't send anything.

jordens commented 2 years ago

This may have been a side effect of https://github.com/rtic-rs/cortex-m-rtic/issues/458#issuecomment-933656256. Can't reproduce it now and I'm seeing healthy DHCP traffic.

jordens commented 2 years ago


Happened again, after pretty much the same 4 days 2 hours (around 351890 seconds).

jordens commented 2 years ago

Trace from hitting Ctrl-C at that point:

^C────────────────────────────────────────────────────────────────────────────────
stack backtrace:
   0: core::ptr::read_volatile
        at /rustc/9dbb26efe8a886f7bba94c2b996c9535ce07f917/library/core/src/ptr/mod.rs:1054:14
   1: vcell::VolatileCell<T>::get
        at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/vcell-0.1.3/src/lib.rs:33:18
   2: stm32h7::generic::Reg<U,REG>::read
        at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/stm32h7-0.13.0/src/generic.rs:51:18
   3: stm32h7xx_hal::ethernet::eth::interrupt_handler
        at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804:13
   4: fls::app::eth
        at src/bin/fls.rs:823:18
   5: fls::app::ETH::{{closure}}
        at src/bin/fls.rs:474:1
   6: rtic::export::run
        at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rtic-0.6.0-rc.2/src/export.rs:33:9
   7: ETH
        at src/bin/fls.rs:474:1
      <exception entry>
   8: cortex_m::asm::inline::__wfi
        at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/../asm/inline.rs:182:5
   9: cortex_m::asm::wfi
        at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/asm.rs:55:5
  10: fls::app::idle
        at src/bin/fls.rs:762:43
  11: main
        at src/bin/fls.rs:474:1
  12: Reset
(HOST) WARN  call stack was corrupted; unwinding could not be completed
(HOST) INFO  device halted without error
ryan-summers commented 2 years ago

To debug, I increased the SystemTimer tick rate by a factor of 1,000 (up to 10 MHz). With this setup, the internal Instant overflows every 425 seconds. I then configured my DHCP server to lease IP addresses for 60 s, so Stabilizer nominally re-acquires the IP address every 30 seconds.

As can be seen, right as the internal Instant rolls over, Stabilizer then fails to re-request the IP address. This is likely due to internal roll-over of the millisecond counter. The mechanism by which this causes a failure is not yet known.

ryan-summers commented 2 years ago

During this period, I'm still able to ping the device and telemetry is still active.

Upon further analysis, this is likely an issue with how we're feeding timestamps to DHCP. Specifically, once the internal timer rolls over, our millisecond interval wraps from N -> 0. However, the DHCP socket internally stores the next scheduled retry attempt as an Instant somewhere past N. This can result in cases where DHCP never gets renewed.

I believe this to be a fault in how smoltcp-nal translates the `time: u32` into a smoltcp::Instant. We can likely use embedded-time within smoltcp-nal so that it keeps an internal smoltcp::Instant that is monotonically incremented by the elapsed time. This ensures that the time translation occurs properly.
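As an illustration of that approach, here is a minimal sketch (the MonotonicClock name and structure are mine, not smoltcp-nal's) of folding the wrapping u32 millisecond counter into a monotonically increasing smoltcp::time::Instant:

use smoltcp::time::Instant;

/// Accumulates a wrapping u32 millisecond counter into a monotonic total so the
/// Instant handed to the network stack never jumps backwards at a counter overflow.
pub struct MonotonicClock {
    last_raw_ms: u32,
    total_ms: i64,
}

impl MonotonicClock {
    pub fn new(now_ms: u32) -> Self {
        Self { last_raw_ms: now_ms, total_ms: 0 }
    }

    /// `now_ms` is the raw (wrapping) millisecond counter from the system timer.
    /// This must be called at least once per wrap period for the math to hold.
    pub fn now(&mut self, now_ms: u32) -> Instant {
        // wrapping_sub yields the correct elapsed time even across a u32 overflow.
        let elapsed = now_ms.wrapping_sub(self.last_raw_ms);
        self.last_raw_ms = now_ms;
        self.total_ms += i64::from(elapsed);
        Instant::from_millis(self.total_ms)
    }
}

With this, the retry Instant stored by the DHCP socket is always comparable against a clock that keeps advancing, instead of lying past a value the wrapped counter can never reach.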

jordens commented 2 years ago

There may well be two different overflows. The one here doesn't occur after 2**32 ms but much earlier (by a factor of ~12).

jordens commented 2 years ago

And in my case there is definitely no telemetry, ICMP, or streaming activity anymore when it locks up.

ryan-summers commented 2 years ago

I agree that the symptoms I describe don't quite match what has been shown so far above, and I'd like to get more of a handle on the problems you're facing.

Questions:

  1. Does your setup consistently fail to renew the DHCP address after first acquisition on every boot? (e.g. 15 minutes after address acquisition, do you ever see a renewal request?)
  2. Does the system lock up after the first 30 minutes, or only after the unanswered DHCP request?
jordens commented 2 years ago
  1. The top post is misleading. As mentioned above, DHCP looks fine. It renews just fine but then ceases, together with all other traffic.
  2. The system locks up (meaning all network traffic ceases) after around 351890 seconds (to better than 1% accuracy). From the data I have, I can't say what fails first. DHCP renewal is not yet due when it locks up.
jordens commented 2 years ago

An update: I checked that the `process()` ISR still runs regularly when this bug appears. That part is not hung and the code hasn't crashed.

jordens commented 2 years ago

And on each occurrence of the bug so far (3/3), the PC is in stm32h7xx_hal::ethernet::eth::interrupt_handler.

ryan-summers commented 2 years ago

In reference to the PC being at the eth::interrupt_handler: I'm curious whether we're hitting an interrupt source that isn't being properly cleared, which essentially locks up your device because we sit in the interrupt handler. This would seem to match the symptoms (namely, DSP processing continues because it is at a higher priority, but network traffic ceases, as it is at a lower priority).

Reviewing the ethernet peripheral, it looks like the interrupt can be sourced from the MAC, the DMA, or the MTL. Currently, the HAL only checks a few specific, masked DMA interrupt sources and clears them. However, it looks like there are a few MTL interrupt sources that are on by default.

If we were to hit an MTL interrupt, it seems like the status bit would remain set, so we would continually re-enter the eth ISR since we never clear the MTL interrupt bits.

If we recapture this with a debug probe attached, we should check the ETH_DMAISR register to determine if an MTL interrupt has been asserted.
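For reference, a minimal sketch of such a check (this is not HAL code; the register address and bit positions are taken from RM0433 for the STM32H743 and should be double-checked):

// ETH_DMAISR (DMA interrupt status register) at ETH DMA base + 0x08.
const ETH_DMAISR: *const u32 = 0x4002_9008 as *const u32;

/// Reads ETH_DMAISR and reports which block is asserting the interrupt:
/// (DMA channel 0, MTL, MAC).
pub fn eth_interrupt_source() -> (bool, bool, bool) {
    // Safety: a plain volatile read of a memory-mapped status register.
    let isr = unsafe { core::ptr::read_volatile(ETH_DMAISR) };
    (
        isr & (1 << 0) != 0,  // DC0IS: DMA channel 0 interrupt
        isr & (1 << 16) != 0, // MTLIS: MTL interrupt
        isr & (1 << 17) != 0, // MACIS: MAC interrupt
    )
}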

ryan-summers commented 2 years ago

The MTL interrupts appear to trigger if the internal FIFOs over/underflow in the transfer from MAC -> DMA. The FIFOs are 2 KB, so if our descriptor ring were to get jammed up for some reason, it's feasible we could hit an MTL interrupt that then hangs the device. We'll need to probe some of the interrupt status registers when this occurs again to aid in debugging.

ryan-summers commented 2 years ago

It's possible the ICMP echoes from streaming may be causing this problem, as they are relatively large incoming ICMP-level messages containing the (rejected, echoed) stream data. Wireshark shows these packets as being ~590 bytes, coming in at a nominal 1 Hz rate.

That being said, I wouldn't expect this to be a source of concern. This data comes in relatively sparsely (1 Hz is pretty slow) and is likely pushed through the descriptor ring quickly.

dnadlinger commented 2 years ago

Does this bug appear after the ~351890 s mentioned above on every boot, or is it non-deterministic? (Just so I know what to look out for…)

ryan-summers commented 2 years ago

Upon inducing the fault on a system, @jordens attached a debugger to the faulty unit. It was observed that the unit appeared stuck in the ethernet ISR:

 0x080070bc  stm32h7xx_hal::ethernet::eth::interrupt_handler+2 ldr    r1, [r0, #0]
 0x080070be  stm32h7xx_hal::ethernet::eth::interrupt_handler+4 ldr    r0, [r0, #0]
 0x080070c0  stm32h7xx_hal::ethernet::eth::interrupt_handler+6 movs    r1, #0
~
~
~
─── Breakpoints ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[1] break at 0x080297fc in /home/rj/.cargo/git/checkouts/stabilizer-f1eefc2ad8feebcf/0a0d92b/src/hardware/mod.rs:89 for DefaultHandler
[2] break at 0x0802b8fa in /home/rj/.cargo/git/checkouts/stabilizer-f1eefc2ad8feebcf/0a0d92b/src/hardware/mod.rs:82 for HardFault
[3] break at 0x08025432 in /rustc/baba6687df3e83fdb15cc6ec239b4a1c75a30505/library/core/src/sync/atomic.rs:2364 for rust_begin_unwind
─── Expressions ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─── History ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─── Memory ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─── Registers ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         r0 0x40029160          r1 0x00000484            r2 0x06000000          r3 0x2001ff04     r4 0x2001fff7     r5 0xe000e208       r6 0x00004000        r7 0x2001ff68      r8 0x000000f0      r9 0x00000000
        r10 0x20005110         r11 0x00000001           r12 0x00000000          sp 0x2001ff68     lr 0xffffffe9     pc 0x080070be     xPSR 0x6100004d     fpscr 0x60000010     msp 0x2001ff68     psp 0x00000000
    primask 0x00           basepri 0x00           faultmask 0x00           control 0x00
─── Source ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 799      let eth_dma = &*stm32::ETHERNET_DMA::ptr();
 800      eth_dma
 801          .dmacsr
 802          .write(|w| w.nis().set_bit().ri().set_bit().ti().set_bit());
 803      let _ = eth_dma.dmacsr.read();
 804      let _ = eth_dma.dmacsr.read(); // Delay 2 peripheral clocks
 805  }
 806
 807  pub unsafe fn enable_interrupt() {
 808      let eth_dma = &*stm32::ETHERNET_DMA::ptr();
─── Stack ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[0] from 0x080070be in stm32h7xx_hal::ethernet::eth::interrupt_handler+4 at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804
[1] from 0x080070be in fls::app::eth+4 at src/bin/fls.rs:851
[2] from 0x080070be in fls::app::ETH::{{closure}}+4 at src/bin/fls.rs:498
[3] from 0x080070be in rtic::export::run+4 at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rtic-0.6.0-rc.2/src/export.rs:33
[4] from 0x080070be in fls::app::ETH+20 at src/bin/fls.rs:498
[5] from 0xffffffe9
[6] from 0x08006ff4 in cortex_m::asm::inline::__wfi+2 at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/../asm/inline.rs:182
[7] from 0x08006ff4 in cortex_m::asm::wfi+2 at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/asm.rs:55
[8] from 0x08006ff4 in fls::app::idle+68 at src/bin/fls.rs:790
[9] from 0x08009f0c in fls::app::rtic_ext::main+88 at src/bin/fls.rs:498
─── Threads ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[1] id 0 from 0x080070be in stm32h7xx_hal::ethernet::eth::interrupt_handler+4 at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804
─── Variables ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>>> bt
#0  stm32h7xx_hal::ethernet::eth::interrupt_handler () at /home/rj/.cargo/git/checkouts/stm32h7xx-hal-54d30215a9ddb2ce/33aa67d/src/ethernet/eth.rs:804
#1  fls::app::eth () at src/bin/fls.rs:851
#2  fls::app::ETH::{{closure}} () at src/bin/fls.rs:498
#3  rtic::export::run (priority=1, f=...) at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-rtic-0.6.0-rc.2/src/export.rs:33
#4  fls::app::ETH () at src/bin/fls.rs:498
#5  <signal handler called>
#6  0x08006ff4 in cortex_m::asm::inline:: () at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/../asm/inline.rs:182
#7  cortex_m::asm::wfi () at /home/rj/.cargo/registry/src/github.com-1ecc6299db9ec823/cortex-m-0.7.3/src/asm.rs:55
#8  fls::app::idle (c=...) at src/bin/fls.rs:790
#9  0x08009f0c in fls::app::rtic_ext::main () at src/bin/fls.rs:498
>>>

GDB was then used to extract register values.


>>> x/10wx 0x40029000
0x40029000:     0x00000000      0x00001001      0x00020000      0x00006400
0x40029010:     0x00000000      0x00000000      0x00000000      0x00000000
0x40029020:     0x00000000      0x00000000
>>>

This indicates that the MACIS flag is set in the DMAISR register.


>>> x/10wx 0x400280b0
0x400280b0:     0x00000500      0x00000000      0x00000000      0x00000000
0x400280c0:     0x00000000      0x00000000      0x00000000      0x00000000
0x400280d0:     0x00000000      0x03e80000
>>>


This appears to indicate that MMCTXIS and MMCIS are both set. These interrupts relate to the MMC (MAC management counters).
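For clarity, a small decode of that status word (a sketch; bit positions per RM0433 and worth re-checking):

// ETH_MACISR as read via GDB above.
const MACISR: u32 = 0x0000_0500;
const MMCIS: u32 = 1 << 8;    // MMC summary interrupt status
const MMCRXIS: u32 = 1 << 9;  // MMC receive counter interrupt status
const MMCTXIS: u32 = 1 << 10; // MMC transmit counter interrupt status

/// Decodes the MAC interrupt status word: (MMCIS, MMCRXIS, MMCTXIS).
fn decode_macisr(macisr: u32) -> (bool, bool, bool) {
    (
        macisr & MMCIS != 0,   // true for 0x500
        macisr & MMCRXIS != 0, // false for 0x500
        macisr & MMCTXIS != 0, // true for 0x500
    )
}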

ryan-summers commented 2 years ago

Investigation into MMCTXIS indicates an interrupt from the MMC TX statistics counters.

>>> x/10wx 0x40028708
0x40028708:     0x00200000      0x00000000      0x00000000      0x00000000
0x40028718:     0x00000000      0x00000000      0x00000000      0x00000000
0x40028728:     0x00000000      0x00000000

This indicates that the TXGPKTIS bit is set.


This appears to be caused by these counter interrupts being "enabled by default".
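A possible workaround, sketched below, would be to mask the MMC counter interrupts during ethernet setup so that a rolled-over statistics counter can no longer hold the ETH interrupt asserted. This is only an assumption of how a fix might look, not the HAL's actual change; the register addresses are taken from RM0433 (ETH MAC base 0x4002_8000) and should be verified:

// MMC counter interrupt mask registers: setting a mask bit disables the
// corresponding counter half/overflow interrupt.
const MMC_RX_INTERRUPT_MASK: *mut u32 = 0x4002_870C as *mut u32;
const MMC_TX_INTERRUPT_MASK: *mut u32 = 0x4002_8710 as *mut u32;

/// Safety: call only while the ethernet peripheral is clocked and no other
/// code is concurrently accessing the MMC registers.
pub unsafe fn mask_mmc_interrupts() {
    core::ptr::write_volatile(MMC_RX_INTERRUPT_MASK, !0u32);
    core::ptr::write_volatile(MMC_TX_INTERRUPT_MASK, !0u32);
}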