lwa-project / ng_digital_processor

The Next Generation Digital Processor for LWA North Arm
Apache License 2.0

High packet loss on DRX startup #17

Closed jaycedowell closed 11 months ago

jaycedowell commented 11 months ago

The DRX pipelines take a while to start up and experience near-total packet loss (~100%) for the first several seconds. This is likely related to not having the bifrost.map disk cache, but I'm not sure.

jaycedowell commented 11 months ago

I've hacked in the disk cache, but it's not clear that it does anything helpful. Startup still shows high packet loss at the beginning. Could it be related to the Verbs transmit startup?

jaycedowell commented 11 months ago

After watching this a few times, it looks like it takes ~10 s for everyone to get up to speed, even with the map cache in use. What's interesting is that while the capture is lossy, the output bandwidth is roughly 50% higher than in the steady state. Could this be pointing to a startup condition on the transmit side of things? Is this also related to #6?

jaycedowell commented 11 months ago

Things are definitely better if RetransmitOp isn't started.

jaycedowell commented 11 months ago

After some more digging, this looks to be a startup condition associated with the first udp_verbs_transmit call. Specifically, the creation of the ethernet, ipv4, and udp headers is slow (~0.2 s total). Multiply that by eight, for the four beams and two channel ranges, and you get about 1.6 s of delay. You can cover this with a larger tengine_ring, but that gets kind of crazy (buffer factors of ~100).
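The delay estimate above can be sketched as a few lines of arithmetic. This is only a back-of-the-envelope check, and it assumes the eight outputs build their headers one after another rather than in parallel; the constant and function names are made up for illustration.

```python
# Rough estimate of the startup stall caused by slow header creation.
# Assumption: each output's first udp_verbs_transmit call runs serially.
HEADER_SETUP_S = 0.2    # approximate ethernet + ipv4 + udp header setup time
N_BEAMS = 4
N_CHANNEL_RANGES = 2


def startup_delay(per_output_s, n_beams, n_ranges):
    """Total delay if every beam/channel-range output pays the setup cost in turn."""
    return per_output_s * n_beams * n_ranges


total = startup_delay(HEADER_SETUP_S, N_BEAMS, N_CHANNEL_RANGES)
print(f"{N_BEAMS * N_CHANNEL_RANGES} outputs x {HEADER_SETUP_S} s = {total:.1f} s")
```

If the setups overlap in practice, the real stall would be shorter than this serial estimate.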

jaycedowell commented 11 months ago

Oh, and larger buffer sizes lead to "watchdog: BUG: soft lockup" errors, like with #11.

jaycedowell commented 11 months ago

Fixing verbs by getting rid of the system calls to ip helps: the ethernet/ipv4/udp headers are now created in less than 1 ms. It's not clear that this solves all of the problem, though.

jaycedowell commented 11 months ago

The "new" problem is that locking on the bifrost.map cache seems to hang sometimes. It's not clear why, but it's probably an interaction between the file locking and NFS. I could hack in a lock cache.
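For illustration only, here is a minimal sketch of a file lock that gives up after a timeout instead of hanging. It is not the bifrost.map cache's actual locking code; acquire_lock and release_lock are hypothetical helpers, and how flock behaves on a particular NFS mount is an open question this sketch does not answer.

```python
import fcntl
import os
import time


def acquire_lock(path, timeout_s=5.0, poll_s=0.05):
    """Try to take an exclusive flock on `path`.

    Returns the open file object on success, or None if the lock
    could not be taken within `timeout_s` seconds.
    """
    fh = open(path, "a")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fh
        except BlockingIOError:
            if time.monotonic() >= deadline:
                fh.close()
                return None
            time.sleep(poll_s)


def release_lock(fh):
    """Drop the lock and close the underlying file."""
    fcntl.flock(fh, fcntl.LOCK_UN)
    fh.close()
```

A caller that gets None back can log the failure and skip the cache rather than stalling pipeline startup.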

jaycedowell commented 11 months ago

The cache has been moved to /opt/.bifrost/map_cache, which fixes the locking problem by avoiding NFS.

jaycedowell commented 11 months ago

This seems to be ok now.