Closed jaycedowell closed 1 year ago
Verbs transmit helps (a lot) but it also introduces a couple of problems: there is a startup lag where the RX packet loss is high for several seconds, and there is too much scatter in the packet arrival times for the T-engine to have good packet capture.
For the first, this may be a case for splitting the sending side from the receiving side of the verbs implementation. For the second, I don't know what helps. I've tried larger buffer sizes on the receive end, splitting the packet sending into smaller batches, and sending fewer packets at a time. None of those got me close to good packet capture.
Splitting does help, as does using a verbs send implementation that waits for the previous batch of packets to be sent before queuing the next batch. I'm now getting decent packet capture (loss <~3%) on the T-engine when I'm only catching packets.
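The pacing idea described above can be sketched like this. This is a hypothetical model of the logic, not the actual verbs code: a sender refuses to queue a new batch until the previous batch's send completions have been drained, so the send queue never backs up and bursts packets onto the wire.

```python
# Hypothetical sketch of the "wait for the previous batch" send pacing.
# In the real verbs code the completion check would poll the completion
# queue; here poll_completions() just stands in for that.
class PacedSender:
    def __init__(self, transmit):
        self.transmit = transmit   # callable that "sends" a batch of packets
        self.in_flight = 0         # packets posted but not yet completed

    def poll_completions(self):
        # Stand-in for draining the completion queue; assume everything
        # posted so far has now completed.
        self.in_flight = 0

    def send_batch(self, packets):
        # The key ordering: block until the *previous* batch has completed
        # before queuing this one.
        while self.in_flight > 0:
            self.poll_completions()
        self.transmit(packets)
        self.in_flight = len(packets)

sent = []
s = PacedSender(lambda pkts: sent.extend(pkts))
for batch in ([1, 2], [3, 4], [5, 6]):
    s.send_batch(list(batch))
```

The effect is that transmit bursts are spaced by the completion time of the prior batch rather than piling up in the NIC's send queue.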
It looks like things are only happy for a short time before the T-engine starts to drop a lot of packets again. Restarting the DRX pipelines helps, but I'm not sure what that means right now.
Playing around with this some more, I was able to get ~10% loss on all four T-engine pipelines for several hours. This is looking more like a tuning problem.
Since #17 seems to be mostly resolved, I think that puts us back to this being either: 1) a problem with the capture at the T-engine or 2) a problem with how verbs is used, such that we might not be sending the packets we think we are.
The switch doesn't report that anything is being dropped, so I don't think this is a switch configuration issue. Looking at like_bmon.py on the T-engine, the missing packets are indeed missing and not late. If they were late I would say something like "skew in the packet arrival times across DRX pipelines". I'm leaning towards (2) right now.
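The missing-vs-late distinction can be made concrete with a small check over packet sequence numbers (a sketch of the kind of test like_bmon.py enables, not its actual code): a "late" packet shows up out of order, while a "missing" one never appears at all.

```python
# Hypothetical missing-vs-late classifier over packet sequence numbers.
def classify_gaps(seqs):
    """seqs: sequence numbers in arrival order.
    Returns (missing, late): 'missing' never arrived at all; 'late'
    arrived after a higher sequence number had already been seen."""
    seen = set(seqs)
    missing = [s for s in range(min(seqs), max(seqs) + 1) if s not in seen]
    late = [s for i, s in enumerate(seqs)
            if i > 0 and s < max(seqs[:i])]
    return missing, late

# Example: packet 2 arrives late (after 3), packet 4 never arrives.
print(classify_gaps([0, 1, 3, 2, 5]))   # ([4], [2])
```

Seeing only the first list populated is what points at the send side (option 2) rather than arrival-time skew.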
Here's a snapshot from the rate counter on the switch:
```
arista>show interfaces ethernet 27/1-34 counters rates
Port    Name  Intvl  In Mbps      %  In Kpps  Out Mbps      %  Out Kpps
Et27/1         5:00  38109.9  95.6%      765    2704.6   6.8%        54
Et28/1         5:00  38109.9  95.6%      765    2704.6   6.8%        54
Et29           5:00      0.0   0.0%        0   10009.4  25.1%       201
Et30           5:00      0.0   0.0%        0   16141.8  40.5%       324
Et31           5:00   9352.1  23.5%      188   21668.6  54.3%       435
Et32           5:00   9352.4  23.5%      188   20544.0  51.5%       412
Et33           5:00  10558.2  26.5%      140   21858.8  54.8%       439
Et34           5:00   9451.7  23.7%      190   20624.5  51.7%       414
```
Et27/1 and 28/1 are the SNAP2s, Et29 and 30 are the T-engine, and Et31 through 34 are ndp1/2. The output rates for the DRX pipelines on ndp1/2 are in the 180 kpps range. I would expect 24k FFT windows/s * 2 packets sent/window * 4 beams * 2 pipelines per server / 2 interfaces = 192 kpps. 180 kpps is about 94% of that.
If I ask like_bmon.py what it thinks is happening, it reports massive packet loss, which doesn't match this.
Going back to the switch and asking a different question (are there dropped packets?) leads to:
```
arista>show interfaces ethernet 27/1-34 counters discards
Port         InDiscards  OutDiscards
Et27/1                0            0
Et28/1                0            0
Et29                  0     25345730
Et30                  0     29378995
Et31         4089497277            0
Et32         8246586069            0
Et33         3923273893    596600922
Et34         8172819386    589366579
-------  --------------  -----------
Totals      24432176625   1240692226
```
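As a quick sanity check on the counter dump above, the per-port discards do sum to the totals row:

```python
# Per-port discard counters from the switch, in port order Et27/1..Et34
in_discards = [0, 0, 0, 0, 4089497277, 8246586069, 3923273893, 8172819386]
out_discards = [0, 0, 25345730, 29378995, 0, 0, 596600922, 589366579]

print(sum(in_discards))    # 24432176625
print(sum(out_discards))   # 1240692226
```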
Lots of dropped packets.
After working with verbs more over the last couple of days, I think this is fixed. The problem was an order-of-operations error: I was updating the buffers with new packets before checking whether the buffers were ready to be reused. Fixing that, along with a few cleanups in packet_writer.hpp, gets all of the packets out, and out in time.
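The shape of that ordering bug, as a hypothetical sketch (not the actual packet_writer.hpp code): send buffers are shared with the NIC, and the broken version overwrote a buffer before its previous send had completed. The fix is to check readiness first, then write.

```python
# Hypothetical model of the order-of-operations bug and its fix: a ring
# of send buffers shared with the "NIC". A buffer must not be rewritten
# until the NIC signals it is done with it.
class BufferRing:
    def __init__(self, n):
        self.ready = [True] * n    # True once the NIC is done with the slot
        self.data = [None] * n

    def poll(self):
        # Stand-in for draining the completion queue: everything handed
        # to the NIC so far is now finished.
        self.ready = [True] * len(self.ready)

    def post(self, i, packet):
        # The fix: check that slot i is free *before* updating it...
        while not self.ready[i]:
            self.poll()
        self.data[i] = packet      # ...then write the new packet
        self.ready[i] = False      # and hand the slot to the NIC

ring = BufferRing(2)
ring.post(0, "pkt-a")
ring.post(1, "pkt-b")
ring.post(0, "pkt-c")   # waits for slot 0 to complete before overwriting
```

The broken ordering would have written `"pkt-c"` into slot 0 while `"pkt-a"` was still queued for transmit, corrupting the outgoing packet.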
The only other thing I ran into was a bad ARP entry for one of the headnode interfaces on ndp1. I think this is just something that can happen because of our strange networking setup. Manually setting values in the ARP table fixes it.
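For reference, pinning the ARP entry by hand looks something like this on Linux; the address, MAC, and interface name here are placeholders, not the actual ndp1 values:

```shell
# Pin a static ARP entry for the headnode interface (placeholder values)
ip neigh replace 192.168.40.10 lladdr aa:bb:cc:dd:ee:ff dev enp1s0f0 nud permanent

# Confirm the entry is present and marked PERMANENT
ip neigh show 192.168.40.10
```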
There seems to be a problem getting the intermediate beamformer data out of ndp_drx.py via the RetransmitOp and into the T-engines. This could be causing some back pressure that is interfering with the packet capture. Is this something where verbs transmit could help?