lwa-project / ng_digital_processor

The Next Generation Digital Processor for LWA North Arm
Apache License 2.0

T-engine packet loss #30

Open jaycedowell opened 4 months ago

jaycedowell commented 4 months ago

Editor's Note: This was originally reported in #18 but it's really its own issue.

Disabling flow control on the switch for ndp1 seems to have helped with #18, but now there is continuous low-level packet loss on the T-engine.

History:

jaycedowell commented 4 months ago

I'm testing today and I see that everyone is at about 2% packet loss.

The interaction between the T-engines on the same NIC was something I looked at (which led to https://github.com/lwa-project/ng_digital_processor/commit/d16ac9d5ec747b656e68e1062297c159952f0b87). It's interesting that there seems to be even more to it for the first NIC.

I should note that this was also done under new_drx_packetizer but that shouldn't matter for the receive end.

jaycedowell commented 4 months ago

I'm focusing today on the last test case from yesterday: only one beam running. Just like yesterday, I found packet loss even with a single beam. I checked:

That all checked out so that makes me think that this is more of a problem with how we are sending the data. I guess we had hints of that with #18. So I thought more about the data flow. Even though we are running four T-engines (one T-engine per beam) we are actually sending 32 data streams from ndp1 - 4 beams x 4 sub-bands per beam[^1] x 2 DRX pipelines on ndp1. These data streams are also not arranged in a "friendly" way. Each DRX pipeline sends all of the data for a beam at once, so: beam 1 - subband 1, beam 1 - subband 2, beam 1 - subband 3, beam 1 - subband 4, beam 2 - subband 1, etc. The configuration of the system also puts beams 1 and 2 on the same NIC so you basically have 16 data streams that fire all of their data at the same switch port at the same time. Maybe this means that we need to better control how packets come out, aka, packet pacing.
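
For illustration, here's a minimal sketch of that stream layout (the names and structure are made up for the example; this is not the actual ndp_drx.py code):

```python
NBEAM, NSUBBAND, NPIPELINE = 4, 4, 2

def send_order(pipeline):
    """Order in which one DRX pipeline emits its ibeam1 streams:
    all four sub-bands of a beam back to back, then the next beam."""
    for beam in range(1, NBEAM + 1):
        for subband in range(1, NSUBBAND + 1):
            yield (pipeline, beam, subband)

streams = [s for p in range(NPIPELINE) for s in send_order(p)]
assert len(streams) == NBEAM * NSUBBAND * NPIPELINE   # 32 streams total

# With beams 1 and 2 mapped to the same T-engine NIC, 16 of those 32 streams
# all target the same switch port at the same time.
nic0_streams = [s for s in streams if s[1] in (1, 2)]
print(len(nic0_streams), "streams aimed at the NIC0 switch port")
```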

As noted at the start of this issue I've tried packet pacing before using the Bifrost set_packet_rate() method - this really didn't do anything. Maybe the thing to do is look at the aggregate packet rate from all 16 streams that each DRX pipeline is sending instead. It also might be a good idea to shuffle around the T-engines so that we break the current pairing. Move things around so that NIC0 receives beams 1 and 3 and NIC1 receives beams 2 and 4[^2].

[^1]: Even though the DRX pipelines can bring in 1536 channels we can only send out ibeam1 packets with up to 384 channels because of packet size limits.

[^2]: This is kind of a throwback to DP at LWA1 where the adder chains for beams 1 and 3 are paired and those for beams 2 and 4 are paired.

jaycedowell commented 4 months ago

This is implemented in main now and we will see what it does. Early indications are that the packet loss is much better but it needs to run for more than 15 minutes. I have also turned on flow control for switch ports 29 and 30. That may or may not be required.

Update: Beams 1 and 3 look good; beams 2 and 4 look really bad.

Update: I've bumped up the GPU clocks on ndp.

Update: Changing the GPU clocks didn't stop it from happening again.

jaycedowell commented 4 months ago

I left the system running with logging on beams 1 and 2 last night to look at the behavior with the new packet pacing implementation.

Beam 1: beam1 (attached packet-loss plot)

Beam 2: beam2 (attached packet-loss plot)

Beam 1 looks pretty good. There are some excursions to instantaneous packet loss of up to 3% but the overall value is ~0.7%. Beam 2 has this interesting episodic behavior where it drops a ton of packets for a while and then magically recovers. The period looks to be ~1,000 s.

I've now turned off flow control on the T-engine's switch ports this morning to see if that changes things at all.

jaycedowell commented 4 months ago

Yesterday I changed the axis order for the ibeam1 data in ndp_drx.py so that it iterates over beams then subbands. This further breaks up the traffic flow and gets beam 1 down to ~3.5% packet loss. One thing to note is that I haven't checked that the data are still correctly ordered after the swap.
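
For reference, a minimal numpy sketch of the kind of axis swap involved, with a made-up array shape rather than the real ibeam1 layout, plus the sort of ordering sanity check mentioned above:

```python
import numpy as np

# Hypothetical ibeam1 gulp layout: (beam, subband, time, chan).  The real layout
# in ndp_drx.py may differ; the point is only that transposing the leading axes
# changes which index varies fastest when the packetizer walks the array in order.
gulp = np.arange(4 * 4 * 8 * 384, dtype=np.float32).reshape(4, 4, 8, 384)

# Swap beam <-> subband so consecutive sends alternate between beams instead of
# emitting all four sub-bands of one beam back to back.
swapped = np.ascontiguousarray(gulp.transpose(1, 0, 2, 3))

# Sanity check that no data were reordered within a (beam, subband) block,
# only the order of the blocks themselves.
assert np.array_equal(swapped[2, 1], gulp[1, 2])
```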

Maybe it's time to revisit set_packet_rate().

Update: It's actually called set_rate_limit().

jaycedowell commented 4 months ago

I realized yesterday that my previous attempts to use set_rate_limit() never did anything because set_rate_limit() simply didn't work on UDPVerbsTransmit. After fixing that in Bifrost, and spending an inordinate amount of time fighting with this "feature" of Bifrost, I was able to get things happy again. Apart from some startup losses, the T-engines have been running without significant packet loss for about a day now.

@dentalfloss1 and @league: I hit the exact situation described in that comment on LockFile in fileutils.cpp and it was super frustrating. @GregBTaylor suggested that we maybe have LockFile print out a "waiting for lock"-style message if we cannot acquire the lock within a reasonable time period (~5 to 10 s?). We could also think about implementing the solution proposed in the comment.

jaycedowell commented 4 months ago

If anyone is curious, here's what the packet loss logs look like now.

Beam 1: beam1 (attached packet-loss plot)

Beam 2: beam2 (attached packet-loss plot)

I'm not sure what those spikes are but they seem to only be one or two sample periods long (~5 to 10 s).

jaycedowell commented 4 months ago

After running for several days it looks like more of the same. Here's Beam 1 as an example: Figure_1 (attached plot)

Spikes get up to ~35% but the overall packet loss is something like ~0.3%. Interestingly if I take an FFT of the data I get a peak at ~4 mHz (250 s period). I'm not sure what that means. Maybe nothing.
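
For reference, a periodicity check along these lines could look like the sketch below (the file name, column layout, and 5 s sample period are assumptions, not the actual logger output):

```python
import numpy as np

dt = 5.0                                  # sample period in seconds (assumed)
loss = np.loadtxt('beam1_loss.txt')       # 1-D array of fractional packet loss

# Power spectrum of the mean-subtracted loss series.
spec = np.abs(np.fft.rfft(loss - loss.mean()))**2
freq = np.fft.rfftfreq(loss.size, d=dt)   # Hz

peak = freq[np.argmax(spec[1:]) + 1]      # skip the DC bin
print(f"strongest periodicity: {peak*1e3:.2f} mHz (~{1/peak:.0f} s period)")
```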

jaycedowell commented 4 months ago

Things seem unstable after the INI today (see https://github.com/lwa-project/ng_digital_processor/issues/21 for details). The packet rates into the T-engines are also varying a lot, between ~120k and ~220k pkts/s. There's even packet loss on the DRX side of things.

jaycedowell commented 4 months ago

Rebooting ndp1 didn't help. Now I'm rebooting ndp, ndp1, and the snap2s.

Update: That seems to have helped things out. The packet rate into the T-engines is more steady, between ~185k and ~205k pkts/s.

Update: There seems to be some kind of oscillation now in how the DRX pipelines are running. They will be fine and then all of a sudden drop almost all packets, triggering a sequence reset in the pipelines.

jaycedowell commented 4 months ago

I played around with things again and it looks like the oscillation was some kind of interaction between the two packetizers (IBeam and COR) in the DRX pipelines. Disabling either stopped the cycle of massive packet loss. In the end I switched COR over to verbs and that seemed to be a workable solution.

Other things I changed:

jaycedowell commented 4 months ago

Now this is what I want to see.

Beam 1: beam1 (attached packet-loss plot)

Beam 2: beam2 (attached packet-loss plot)

dentalfloss1 commented 4 months ago

Excellent!


jaycedowell commented 4 months ago

This was working until I moved on to #21 and testing for https://github.com/lwa-project/lwana-issues/issues/15. Now we're back to cycles of massive packet loss. Was this ever really fixed or is this a result of lots of SNAP2 re-initializations?

Update: Restarting the snap2s didn't change the situation today.

jaycedowell commented 4 months ago

The latest behavior is:

  1. Things are going fine for ndp_drx.py.
  2. Something happens.
  3. RetransmitOp hits 100% CPU usage, the output data rate drops by at least a factor of two, packets start falling on the floor.
  4. The problem clears up after some amount of time.
  5. Go back to (1).

It's not clear what is causing (2). From watching top I can't really see anything odd happening. I've tried changing the sending order (again), reducing the number of UDPVerbsTransmit objects we use, and increasing the gulp size into RetransmitOp but none of those break the cycle.

So what is causing (2)? Something running on the system? Some interaction between the beams and the correlator output? Is RetransmitOp just too slow to cover temporary slowdowns?

Update: I changed some things (gulp size, COR packet rate limiting, ring buffer factors). Didn't do much. I also rebooted ndp1. That was much more productive.

jaycedowell commented 4 months ago

I was working on the cyclic packet loss again today and I noticed that when things went south packets seemed to go everywhere. I first became suspicious of this when I noticed that the discard counters were incrementing on the ports the data recorders are attached to even though I wasn't sending any DRX data to them. That led me to tcpdump the traffic coming into DR1, and that is when I noticed the COR packets and F-engine traffic. When this happens I also see that the switch's MAC address table gets refreshed[^1]. It's like the switch forgets who is on what port and then sends traffic to all the wrong places. A configuration problem on the servers with the split networking? A configuration problem on the switch? A hardware problem with the switch? I don't know.

To try to get around this I added static MAC entries to the switch config. That seems to help but we'll have to wait to see if this is a long term fix.

In the meantime I also re-implemented the "busy wait" style packet pacing, and I think we are back to where we were on March 7-9.
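
For context, "busy wait" pacing here means something of this general shape (a sketch only, not the actual pipeline code; the batch size, target rate, and send callback are placeholders):

```python
import time

def paced_send(packets, target_pps, batch=32, send=lambda pkt: None):
    """Send `batch` packets, then spin on the wall clock until the aggregate
    rate falls back to `target_pps`.  `send` stands in for the real transmit
    call; no sleep() so we never give up the core and overshoot the deadline."""
    interval = batch / float(target_pps)       # seconds per batch at the target rate
    next_deadline = time.perf_counter()
    for i, pkt in enumerate(packets):
        send(pkt)
        if (i + 1) % batch == 0:
            next_deadline += interval
            while time.perf_counter() < next_deadline:
                pass                           # busy wait
```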

Update: I've started logging on all four T-engines to see how things are running.

[^1]: Maybe this is where that ~250 s period comes from that I noted in the March 14 update.

jaycedowell commented 4 months ago

Eh, not great: beams (attached plot)

Looks like we are at March 14 levels again.

Update: Looks like flow control was turned off again. Let's try turning that back on for ndp0, orville, and the data recorders.

Update: Now I'm seeing discards on interfaces 29 (ndp0), 31 (ndp1), and 32 (ndp1). Maybe I need to enable flow control on 31/32 as well?

jaycedowell commented 3 months ago

Here's the latest: beams (attached plot)

Better? Maybe.

jaycedowell commented 3 months ago

Looking better after about a day: beams_240402 (attached plot)

jaycedowell commented 3 months ago

The new thing is for everything to work until an observation comes in. The observation seems to set up some kind of oscillation in the packet pacing which causes the data rate to rollercoaster.

Update: I've been watching it and it doesn't look like the switch is dropping. Maybe it's all about the packet pacing setup?

Update: Yeah, I'm thinking that this is another problem with the packet pacing. From the ndp-drx-0 log:

```
2024-04-12 16:23:42 [INFO    ] Changing packet pacing parameter from 7500 to 6750 (found 416722.7 pkts/s)
2024-04-12 16:24:03 [INFO    ] Changing packet pacing parameter from 6750 to 6000 (found 428818.7 pkts/s)
2024-04-12 16:24:25 [INFO    ] Changing packet pacing parameter from 6000 to 5250 (found 404263.9 pkts/s)
2024-04-12 16:24:47 [INFO    ] Changing packet pacing parameter from 5250 to 4500 (found 412552.8 pkts/s)
2024-04-12 16:25:08 [INFO    ] Changing packet pacing parameter from 4500 to 3750 (found 411186.2 pkts/s)
2024-04-12 16:25:30 [INFO    ] Changing packet pacing parameter from 3750 to 3000 (found 396029.5 pkts/s)
2024-04-12 16:25:51 [INFO    ] Changing packet pacing parameter from 3000 to 2250 (found 419248.8 pkts/s)
2024-04-12 16:26:12 [INFO    ] Changing packet pacing parameter from 2250 to 1500 (found 405714.9 pkts/s)
2024-04-12 16:26:34 [INFO    ] Changing packet pacing parameter from 1500 to 750 (found 395001.9 pkts/s)
2024-04-12 16:26:55 [INFO    ] Changing packet pacing parameter from 750 to 0 (found 408694.6 pkts/s)
```
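
Reading that log, it looks like the pacing parameter gets stepped down by a fixed amount whenever the measured rate stays high, so it ratchets all the way to 0 if the rate never responds. A toy reconstruction of that behavior (purely illustrative; the real adjustment logic in the pipeline may differ, and the threshold below is an assumption):

```python
# Toy reconstruction of the log above, not the real controller.
PACE_STEP = 750
RATE_THRESHOLD = 390e3        # pkts/s -- assumed trigger level, not from the code

pace = 7500
measured_rate = 410e3         # stuck high, e.g. because the bottleneck is elsewhere
while pace > 0 and measured_rate > RATE_THRESHOLD:
    new_pace = max(pace - PACE_STEP, 0)
    print(f"Changing packet pacing parameter from {pace} to {new_pace} "
          f"(found {measured_rate:.1f} pkts/s)")
    pace = new_pace
```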

Update: Restarting the DRX pipelines didn't help. I'll try the T-engines next.

Update: Restarting the T-engines did help. I'm not sure what to make of that right now since that would point away from the DRX pipelines.

jaycedowell commented 3 months ago

Based on my April 12 notes I'm trying new capture code on the T-engines that uses the _mm_loadu_si128 and _mm_stream_si128 intrinsics to see if that improves performance.

jaycedowell commented 2 months ago

After looking at this some more I developed a theory that the problem stems from the 10G links to the data recorders. It seems like the sequence of events is:

  1. Things are fine.
  2. A beamformer observation comes in/starts up.
  3. There is congestion on the DR port since the server wants to send at 40G and the DR is on 10G.
  4. This creates a backup in the switch, consuming all of its packet buffer since it is shared across all switch ports (see the rough numbers sketched after this list).
  5. We all fall down.
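
A rough back-of-the-envelope check of steps 3 and 4 (the shared-buffer size is an assumed, typical value for this class of switch, not a measured number for our hardware):

```python
# How long a shared switch buffer survives a 40G -> 10G burst.
line_in     = 40e9          # bits/s the server can push toward the DR port
line_out    = 10e9          # bits/s the 10G DR link can drain
buffer_bits = 12e6 * 8      # assume ~12 MB of shared packet buffer

excess = line_in - line_out                 # 30 Gb/s piling up in the buffer
time_to_fill = buffer_bits / excess
print(f"shared buffer fills in ~{time_to_fill*1e3:.1f} ms")   # ~3 ms
```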

We tested this last week when we swapped the 10G card on DR1/3 for a 40G card we had in the lab. The switch looked happy during observations. @ctaylor-physics did some testing and didn't see any packet loss running on beams 1 and 3, nor did he see time tag errors in the recorded data.

We've gone ahead and ordered a 40G card for DR2/4. I'm optimistic that this will allow us to finally close this issue.

jaycedowell commented 1 week ago

Things were better with the new 40G cards/links to the data recorders but the packet loss still seems to be there. Part of it looks to be caused by flow control on the snap ➡ ndp1 links and part of it seems to be related to T-engine load.

Updates:

jaycedowell commented 1 week ago

I've now made a few more tweaks to the BIOS setup on ndp:

And... I'm not sure that that did a lot. Most of the time the packet loss reported for the T-engines is 0% but it has its moments where it jumps up to ~30% on all beams. I'll try an INI to put everything back into a clean state.

[^1]: Well, that guide says "Power" but I managed to pick "Performance" anyways.

jaycedowell commented 1 week ago

I've rolled back most of the BIOS changes I made except for setting CF states to off and turning off TSME. I couldn't really tell any difference in the performance by playing around with any of these settings. The only thing that really changed is that the previous set caused problems like this one:

```
pcieport 0000:80:01.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
pcieport 0000:80:01.1: AER: aer_status: 0x00000040, aer_mask: 0x00000000
pcieport 0000:80:01.1: AER:    [ 6] BadTLP
```

Working on this more I'm now seeing that the packet loss on the T-engine is kind of like flipping a coin: sometimes it starts up ok and sometimes it doesn't. You can even have it in a situation where it is running with no loss, restart the ndp-tengine-[0-4] processes, and then be back to them dropping packets. This makes me think that it isn't a network/DRX pipeline/performance issue but something about how the pipelines run on ndp. Something like a bad memory binding somewhere that only happens because we're trying to run four pipelines. So I tried wrapping the T-engines with numactl to force memory binding to whatever the relevant socket is and that seems robust (there will be a commit soon with the new service files).

I think this is ultimately a Bifrost/ibverb-support problem with how I've split up the packet capture into PacketCaptureMethod and PacketCaptureThread. The split makes it such that PacketCaptureMethod is created/initialized before PacketCaptureThread. That means that the call to hwloc_set_membind() can't ensure that the right NUMA node is used for PacketCaptureMethod. This might be fixed with https://github.com/ledatelescope/bifrost/commit/b0a73e8914f0f7800e1108411f0cc6582a4cd969 but that needs to be demonstrated.