quartiq / booster

Firmware for the Sinara Booster RF amplifier
Apache License 2.0

Network traffic triggers watchdog #337

Closed ryan-summers closed 4 days ago

ryan-summers commented 9 months ago

While testing out some multicast UDP traffic, I observed Booster reboot twice. The service command then indicated that a watchdog reset had occurred. It's not clear whether this was related to the multicast traffic, but it seemed to only occur when executing the Stabilizer streaming HITL with a streaming IP of 224.192.168.1.

> service
Version             : Unspecified [release]
Hardware Revision   : v1.5
Rustc Version       : rustc 1.72.0 (5680fa18f 2023-08-23)
Features            :
Detected Phy        : W5500
Panic Info          : None
Watchdog Detected   : true

>
ryan-summers commented 9 months ago

This was running on 1ea1124170383a787d8dce644a263fde21fc279d

jordens commented 9 months ago

n.b. I'm also streaming multicast in the same net on 239.34.16.10. I would not use multicast in 224.0.0.0/8. Let's stick to the admin scoped 239.0.0.0/8.
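
For reference, the distinction between the two ranges can be checked programmatically. This is a minimal sketch using only the Rust standard library; the function names are made up for illustration and are not part of the Booster firmware.

```rust
use std::net::Ipv4Addr;

/// Returns true if `addr` lies in 239.0.0.0/8, the administratively
/// scoped IPv4 multicast block (RFC 2365).
fn is_admin_scoped_multicast(addr: Ipv4Addr) -> bool {
    addr.octets()[0] == 239
}

/// Returns true if `addr` lies in 224.0.0.0/8, the range that the
/// comment above recommends avoiding for streaming.
fn is_in_224_block(addr: Ipv4Addr) -> bool {
    addr.octets()[0] == 224
}
```

With these helpers, 239.34.16.10 and 239.192.168.1 count as admin scoped, while 224.192.168.1 from the original report falls in the 224.0.0.0/8 block.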

ryan-summers commented 9 months ago

This appears pretty reproducible with 239.192.168.1 in my network. I'm looking into what's happening.

ryan-summers commented 9 months ago

Disabling the watchdog shows that Booster remains operational throughout the whole event and doesn't encounter a true lockup. I believe there is so much incoming traffic on the PHY for smoltcp and the W5500 to process that it slows down the application in general. I wonder if there's a way we can handle events like this, where there's excessive Ethernet traffic.
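
One generic way to keep a main loop responsive under heavy ingress is to cap how many packets are handled per iteration, so that other work (including feeding the watchdog) still runs. The sketch below only illustrates that idea: `FakeStack`, `poll_one`, and `service_network` are hypothetical stand-ins, not the smoltcp or Booster APIs, and this is not necessarily the fix that was eventually applied.

```rust
/// Hypothetical stand-in for a network stack with queued ingress packets.
struct FakeStack {
    pending: usize,
}

impl FakeStack {
    /// Handles a single queued packet. Returns true if one was handled.
    fn poll_one(&mut self) -> bool {
        if self.pending > 0 {
            self.pending -= 1;
            true
        } else {
            false
        }
    }
}

/// Handles at most `budget` packets and returns how many were handled.
/// The caller can then feed the watchdog and run the rest of the
/// application before servicing the network again.
fn service_network(stack: &mut FakeStack, budget: usize) -> usize {
    let mut handled = 0;
    // The budget check comes first, so no packet is consumed once the
    // budget for this iteration is used up.
    while handled < budget && stack.poll_one() {
        handled += 1;
    }
    handled
}
```

The trade-off is that packets beyond the budget stay queued (or are dropped by the hardware once its buffers fill), which is usually preferable to a watchdog reset.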

jordens commented 9 months ago

It should not be seeing that traffic at all. The switch doesn't flood. It might see IGMP traffic but that's not much at all and always has been there.

ryan-summers commented 9 months ago

I see the green ethernet LED remain consistently on during the Stabilizer stream period, indicating that Booster is indeed receiving excessive packets.

I opened https://github.com/smoltcp-rs/smoltcp/issues/848 around this, but this could be due to a cheap switch that isn't handling multicast properly as well.

jordens commented 9 months ago

That doesn't seem to be the case here. I'm seeing close to zero traffic towards stabilizer (while another one is streaming multicast). As expected.

jordens commented 9 months ago

This is then likely a bug specific to the W5500 (the chip) or to w5500 (the driver crate).

ryan-summers commented 9 months ago

Do you have a router and/or managed switch in between? The switch I'm using with my local network is just some cheap unmanaged unit. I suspect that it's forwarding the multicast traffic to all of the ports regardless of subscription to the multicast group.

jordens commented 9 months ago

There is a Cisco 2960L in between. Whether that one can be called "managed" or not is debatable. But it does behave properly regarding non-flooding of multicast.

aferk commented 1 week ago

We have found a similar issue when using the Booster with an unmanaged switch that carries other traffic. The Booster seems to reboot when it receives invalid packets on the MQTT port 1883. After that, when trying to enable channels, they only seem to change to powered instead of enabled and stay this way until the Booster is power-cycled. This behavior can be reproduced by sending invalid packets via netcat, e.g. nc -u <booster-ip> 1883 < /dev/random.

We will use a VLAN for now, but it would be nice if this could be fixed to make integration into existing network structures easier.
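
The netcat reproduction above can also be approximated without netcat. The following is a rough sketch using only the Rust standard library; `send_garbage` is a made-up helper name, the xorshift generator is just a cheap source of arbitrary bytes, and the datagram sizes will not match what nc produces from /dev/random.

```rust
use std::io;
use std::net::{ToSocketAddrs, UdpSocket};

/// Sends `count` UDP datagrams of `len` arbitrary bytes each to `target`
/// and returns the total number of bytes sent.
fn send_garbage<A: ToSocketAddrs>(target: A, count: usize, len: usize) -> io::Result<usize> {
    // Bind to an ephemeral local port on any interface.
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    let mut state: u32 = 0x1234_5678; // xorshift32 seed, must be non-zero
    let mut buf = vec![0u8; len];
    let mut total = 0;
    for _ in 0..count {
        // Fill the payload with pseudo-random bytes.
        for byte in buf.iter_mut() {
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            *byte = state as u8;
        }
        total += socket.send_to(&buf, &target)?;
    }
    Ok(total)
}
```

For example, `send_garbage(("192.0.2.10", 1883), 1000, 512)` would send 1000 datagrams of 512 bytes each to port 1883, where 192.0.2.10 is a placeholder for the Booster's address.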

jordens commented 1 week ago

@aferk Contributions welcome.

ryan-summers commented 1 week ago

@aferk if you connect to the device when it's in this state and run the platform service command and attach the output here, that would be helpful.

That will tell us why the device is resetting.

aferk commented 1 week ago

Running service before sending packets:

> service
Version             : v0.5.0 [debug]
Hardware Revision   : v1.6
Rustc Version       : rustc 1.76.0 (07dca489a 2024-02-04)
Features            : 
Detected Phy        : W5500
Panic Info          : None
Watchdog Detected   : false

Sometimes I just get the Watchdog Detected: true as above, but when re-trying a few times, I do also get the more informative panic info below:

> service
Version             : v0.5.0 [debug]
Hardware Revision   : v1.6
Rustc Version       : rustc 1.76.0 (07dca489a 2024-02-04)
Features            : 
Detected Phy        : W5500
Panic Info          : panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/w5500-0.4.1/src/raw_device.rs:92:13:
attempt to subtract with overflow

Watchdog Detected   : true
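
For context on the panic message: Rust enables integer overflow checks by default in debug builds (the firmware above reports `[debug]`), so a plain `a - b` on unsigned integers panics with "attempt to subtract with overflow" when `b > a`. The sketch below is a generic illustration of that class of bug and two ways to handle it. It is not the code at raw_device.rs:92 in the w5500 crate, and the function names are hypothetical.

```rust
/// Remaining space in a buffer, or None if `used` is inconsistent with
/// `capacity` (for example because a device register returned an
/// unexpected value). A plain `capacity - used` would panic here when
/// overflow checks are enabled.
fn free_space_checked(capacity: u16, used: u16) -> Option<u16> {
    capacity.checked_sub(used)
}

/// Remaining space, clamped to zero instead of panicking or wrapping.
fn free_space_saturating(capacity: u16, used: u16) -> u16 {
    capacity.saturating_sub(used)
}
```

Which variant is appropriate depends on whether an inconsistent value should be reported to the caller as an error (`checked_sub`) or silently treated as "no space" (`saturating_sub`).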

Booster HW is version 1.6 (non-HL) from Creotec. We tried to upgrade the firmware because of this issue https://github.com/sinara-hw/Booster/issues/393, which both of our non-HL boosters regularly experience.

jordens commented 1 week ago

@aferk The old firmware is dead and rotten. You will hopefully understand that this firmware here is bound to meet the same fate as long as people don't invest in it, contribute to it, or buy where the development is funded.

ryan-summers commented 4 days ago

The panic referenced above should be fixed in the latest release of the w5500 crate, which was updated recently (in https://github.com/quartiq/booster/pull/373). I'm currently testing a fix for the watchdog event in smoltcp.