newAM / w5500-rs

Embedded rust support for the Wiznet W5500 SPI internet offload chip
MIT License
33 stars 13 forks source link

Strange packet header arriving spuriously #252

Open birkenfeld opened 1 year ago

birkenfeld commented 1 year ago

Describe the Problem

Unfortunately, I cannot now reproduce the problem, the hardware is behaving perfectly. But maybe you recognize something in the description and can give advice. Otherwise feel free to close this again.

I started with the ambientsensor project and adapted it to my chip (Raspberry Pi Pico, rp2040, connected to a small W5500 breakout module). You can see the current state here: https://github.com/birkenfeld/rp2040-w5500/blob/main/src/main.rs

(As a side node, I had to add a 500 millisecond delay between link-up and starting DHCP. Otherwise the first DISCOVER would be sent fine from the firmware perspective, but Wireshark never saw it on the wire.)

From the beginning I started seeing a problem processing OFFER replies from the DHCP server, with the log saying I got "0 bytes from 64.0.0.1:0". I checked in udp_reader, in hl/src/udp.rs:770, and saw that the RSR was correct (302 bytes), but the received header data indeed was [64, 0, 0, 1, 0, 0, 0, 0]. Added some logging at that point, which seemed to make it appear more rarely, leading me to suspect a race condition. However, the garbage data was always "64.0.0.1:0".

However, now that I tried to reproduce it today, everything works flawlessly. I haven't even touched the hardware since I'm not physically there :man_shrugging:

Anyway, any advice that might pop into your head is appreciated. And of course, MANY thanks for this excellent library. It's always a pleasure to check for Rust support of a new chip, since if a library exists, it's usually of high quality.

Logging

Log of a working session:

DEBUG RX 302 B from 192.168.0.1:67
└─ w5500_dhcp::{impl#0}::process::recv @ /home/gbrandl/devel/ext/w5500-rs/dhcp/src/fmt.rs:125

What happened before:

DEBUG RX 0 B from 64.0.0.1:0
└─ w5500_dhcp::{impl#0}::process::recv @ /home/gbrandl/devel/ext/w5500-rs/dhcp/src/fmt.rs:125
newAM commented 1 year ago

I've been thinking about this for a bit and can't think of any reason for this, but can't come up with any ideas. I haven't seen anything like this myself sadly.

Feel free to leave the issue open though, I don't mind having these sorts of things open because it helps with visibility if anyone else experiences something similar.

NicoBiernat commented 1 year ago

Hi, I'm having the same problem with the RP2040 (Wiznet W5500-EVB-Pico). Most of the UDP packets are received incorrectly (from 64.0.0.1:0 with length 0) but some are received correctly (about 1 in 10 to 1 in 50 - but very inconsistent).

I figured out that setting the compiler optimization level to 0 or 1 ("opt-level = 1" in Cargo.toml) fixes this issue and all packets arrive correctly (the problem only occurs with "opt-level = 2" and "opt-level = 3") . But deactivating most or all compiler optimizations for embedded applications is not ideal to say the least...

I'm still not sure where this bug originates from. Maybe it's some sort of SPI timing issue or race condition in either this crate or in the rp2040-hal that only occurs with optimizations enabled, or maybe (but unlikely) it could be a compiler bug.

To narrow this down I tried to set separate optimization levels for the dependencies using profile overrides. Reducing the optimizations on some or all dependencies didn't work but for some reason setting the profiles opt-level to 1 (or 0) and then overriding all dependencies to use opt-level = 3 works correctly. Unfortunately, this result confuses me even more.

Cargo.toml (excerpt):

[package]
edition = "2021"
name = "RP2040-W5500-Test"
version = "0.1.0"

[dependencies]
cortex-m = "0.7"
cortex-m-rt = "0.7"
embedded-hal = "0.2.5"

defmt = "=0.3.2"
defmt-rtt = "0.4"
panic-probe = { version = "0.3", features = ["print-defmt"] }

rp-pico = "0.7"
fugit = "0.3.6"
w5500-hl = { version = "0.9.0", features = ["embedded-hal"] }

[profile.dev]
codegen-units = 1
debug = 2
debug-assertions = true
incremental = false
opt-level = 1 # this has to be <= 1
overflow-checks = true

[profile.dev.package."*"]
opt-level = 3 # overriding the opt-level for all dependencies works fine

# other profiles omitted
My simple UDP echo code for reference (click to expand) ```rust use defmt::info; use embedded_hal::{digital::v2::OutputPin, spi}; use fugit::RateExtU32; use rp_pico::{ entry, hal::{clocks::init_clocks_and_plls, gpio::FunctionSpi, Clock, Sio, Spi, Watchdog}, pac::Peripherals, Pins, }; use w5500_hl::{ block, ll::{blocking::vdm::W5500, Registers, Sn}, net::{Eui48Addr, Ipv4Addr}, Udp, }; #[entry] fn test_w5500() -> ! { let mut pac = Peripherals::take().unwrap(); let mut watchdog = Watchdog::new(pac.WATCHDOG); let sio = Sio::new(pac.SIO); const XTAL_FREQ_HZ: u32 = 12_000_000; let clocks = init_clocks_and_plls( XTAL_FREQ_HZ, pac.XOSC, pac.CLOCKS, pac.PLL_SYS, pac.PLL_USB, &mut pac.RESETS, &mut watchdog, ) .ok() .unwrap(); let pins = Pins::new( pac.IO_BANK0, pac.PADS_BANK0, sio.gpio_bank0, &mut pac.RESETS, ); let mut w5500 = { let mut chip_select = pins.gpio17.into_push_pull_output(); chip_select.set_high().unwrap(); let _mosi = pins.gpio19.into_mode::(); let _sckl = pins.gpio18.into_mode::(); let _miso = pins.gpio16.into_mode::(); let spi_eth = Spi::<_, _, 8>::new(pac.SPI0).init( &mut pac.RESETS, clocks.peripheral_clock.freq(), 5_000_000.Hz(), &spi::MODE_0, ); W5500::new(spi_eth, chip_select) }; assert_eq!(w5500.version().unwrap(), 0x04); const UDP_PORT: u16 = 1337; w5500.set_gar(&Ipv4Addr::new(192, 168, 0, 1)).unwrap(); w5500 .set_shar(&Eui48Addr::new(0x5E, 0x4C, 0x69, 0x67, 0x00, 0x03)) .unwrap(); w5500.set_sipr(&Ipv4Addr::new(192, 168, 0, 3)).unwrap(); w5500.set_subr(&Ipv4Addr::new(255, 255, 255, 0)).unwrap(); w5500.udp_bind(Sn::Sn0, UDP_PORT).unwrap(); let mut request_buffer = [0u8; 2048]; loop { let (len, sender) = block!(w5500.udp_recv_from(Sn::Sn0, &mut request_buffer)).unwrap(); let request = &request_buffer[0..len.into()]; info!( "Received from {}:{}\nData: {}\n", sender.ip().octets, sender.port(), request, ); w5500.udp_send_to(Sn::Sn0, request, &sender).unwrap(); } } ```
newAM commented 1 year ago

That's good information!

Do you know if it is reproducible with opt-level = "s"? I will have to check, but I think that's what most of my projects have been built with.

NicoBiernat commented 1 year ago

The issue does not occur with opt-level = 0, 1, "s" and "z". Only opt-level = 2 and 3 cause this problem.

birkenfeld commented 1 year ago

Wow, nice! I didn't even think of trying this... of course problems caused by UB or misoptimization can easily disappear from seemingly unrelated changes, or the addition of logging, like they did for me. I'll make some tests of my own tomorrow, when I have access to the device.

Rustc version was 1.69 stable for me and the relevant Cargo.toml snippets were copypasted from the pico quickstart:

# cargo build/run --release
[profile.release]
codegen-units = 1
debug = 2
debug-assertions = false
incremental = false
lto = 'fat'
opt-level = 3
overflow-checks = false

[profile.release.build-override]
codegen-units = 8
debug = false
debug-assertions = false
opt-level = 0
overflow-checks = false
newAM commented 1 year ago

I really appreciate the reproduction! I don't have a lot of time these days, but I'll try to investigate this weekend!

newAM commented 11 months ago

It has been a while, but I finally got around to trying to reproduce this, but have not been able to yet :(

Here's my full code: https://github.com/newAM/w5500-issue-252

NicoBiernat commented 11 months ago

I reproduced the issue with your repo. I just had to change the SPI device and pins for my board. I also had to pin the defmt version to 0.3.2 due to some version mismatch with my picoprobe firmware (I don't think that it has anything to do with the issue though). The important part is to change the opt-level under [profile.dev] to 2 or 3 and then sending a few UDP packets. I used nc -u 192.168.0.3 1337 for that. Sometimes the packets arrive correctly but most of the time they are empty and from 64.0.0.1:0. For now, I am using opt-level = "s" or 1, which work reliably.

newAM commented 11 months ago

:facepalm: that was it! I can easily reproduce it now, thanks!

I captured the SPI bus with my LA, here's the working (opt-level=1) case (data from 10.0.0.3):

good

and here's the not working (opt-level=3) case (0B from 64.0.0.1):

bad

They're doing the same thing on the bus, but getting different data from the RX buffer :thinking:

newAM commented 11 months ago

I'm pretty sure this is some SPI bus timing being violated when the code is optimized better.

It is always 0x40, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00 when the bad UDP header arrives. 0x4000 aligns with the reset value of SN_FRAG, which is at address 0x002D, which would line up with the W5500 missing a chip-select between the reads of Sn_RX_RD and SN_RX_Buf.

image

0x30 is an invalid register, but reading it with read hardware will always return 0x01.

let mut buf: [u8; 8] = [0; 8];
w5500.read(0x2D, SN.block(), &mut buf).unwrap();
defmt::info!("buf: {:X}", buf);
// buf: [40, 0, 0, 1, 0, 0, 0, 0]
newAM commented 11 months ago

A bit of digging I found an example waveform of this on my LA. TCS here is 20ns, lower than the 30ns minimum. Also it gets asserted before SCLK is done its full cycle.

image

newAM commented 11 months ago

This seems to make everything work at all opt-levels.

diff --git a/ll/src/eh0/vdm.rs b/ll/src/eh0/vdm.rs
index d43cf79..8c3287f 100644
--- a/ll/src/eh0/vdm.rs
+++ b/ll/src/eh0/vdm.rs
@@ -96,6 +96,7 @@ where
         self.cs.set_low().map_err(Error::Pin)?;
         let result = f(&mut self.spi);
         self.cs.set_high().map_err(Error::Pin)?;
+        self.cs.set_high().map_err(Error::Pin)?;
         result
     }
 }

Should I release that as a patch? It feels like a hack, but it does resolve the issue...

Let me know what you guys think.

birkenfeld commented 11 months ago

Thanks for the digging! I'm not deep enough into the details to say whether the fix should be applied at this level. But at least it seems like it couldn't have nasty side effects...

NicoBiernat commented 11 months ago

Wow, that issue would have taken me much longer to find. Thanks for looking into it! I thought about where this should be fixed. There is this one problem that the SPI write/transfer functions return before the last clock cycle is completed. Then the CS line is pulled high before the transmission is complete. Maybe this is HAL or even hardware dependent, but I think there might be a solution planned for embedded-hal 1.0 with the "flush" function. But even with this fix, the CS could still be high for too short (<30ns) when the program runs too fast, just a little bit later. Either because of better compiler optimizations or because the MCU is just faster (I'm thinking of Cortex-M7 or similar that run at more than twice the frequency of the RP2040). So your quickfix might solve it for the RP2040 but not for faster devices (haven't tested it though). Maybe you need to ensure that the minimum high time of the CS is not violated, but this might be complicated since you need to introduce a delay which should depend on the configured SPI frequency. I currently don't know a better and more elegant solution, but I'll take your hack for the RP2040 at the moment.