jaycedowell opened 4 months ago
I'm testing today and I see that everyone is at about 2% packet loss.
The interaction between the T-engines on the same NIC was something I looked at (which led to https://github.com/lwa-project/ng_digital_processor/commit/d16ac9d5ec747b656e68e1062297c159952f0b87). It's interesting that there seems to be even more to it for the first NIC.
I should note that this was also done under `new_drx_packetizer` but that shouldn't matter for the receive end.
I'm focusing today on that last test case from yesterday: only one beam running. The same as yesterday, I found packet loss with only one beam running. I checked:

- `numastat -p ndp_tengine`
- `lstopo`
That all checked out, so that makes me think that this is more of a problem with how we are sending the data. I guess we had hints of that with #18. So I thought more about the data flow. Even though we are running four T-engines (one T-engine per beam) we are actually sending 32 data streams from ndp1: 4 beams x 4 sub-bands per beam[^1] x 2 DRX pipelines on ndp1. These data streams are also not arranged in a "friendly" way. Each DRX pipeline sends all of the data for a beam at once, so: beam 1 - subband 1, beam 1 - subband 2, beam 1 - subband 3, beam 1 - subband 4, beam 2 - subband 1, etc. The configuration of the system also puts beams 1 and 2 on the same NIC, so you basically have 16 data streams that fire all of their data at the same switch port at the same time. Maybe this means that we need to better control how the packets come out, i.e., packet pacing.
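As a sketch of the difference (with hypothetical labels, not the actual `ndp_drx.py` variables), beam-major ordering fires all of one beam's sub-bands at the same switch port back-to-back, while an interleaved order spreads them out:

```python
# Sketch (hypothetical labels): compare a beam-major send order with an
# interleaved order.  Both cover the same 16 (beam, subband) streams per
# pipeline; only the burst pattern at the switch differs.
BEAMS = range(1, 5)      # 4 beams
SUBBANDS = range(1, 5)   # 4 sub-bands per beam

# Current order: beam 1 sub 1, beam 1 sub 2, ..., then beam 2 sub 1, ...
beam_major = [(b, s) for b in BEAMS for s in SUBBANDS]

# Interleaved order: sub 1 of every beam, then sub 2 of every beam, ...
interleaved = [(b, s) for s in SUBBANDS for b in BEAMS]

print(beam_major[:5])   # [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1)]
print(interleaved[:5])  # [(1, 1), (2, 1), (3, 1), (4, 1), (1, 2)]
```

Either order delivers the same set of streams; the interleaving just avoids concentrating a whole beam's traffic into one burst.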
As noted at the start of this issue, I've tried packet pacing before using the Bifrost `set_packet_rate()` method - it really didn't do anything. Maybe the thing to do is look at the aggregate packet rate from all 16 streams that each DRX pipeline is sending instead. It also might be a good idea to shuffle around the T-engines so that we break the current pairing. Move things around so that NIC0 receives beams 1 and 3 and NIC1 receives beams 2 and 4[^2].
[^1]: Even though the DRX pipelines can bring in 1536 channels, we can only send out `ibeam1` packets with up to 384 channels because of packet size limits.
[^2]: This is kind of a throwback to DP at LWA1, where the adder chains for beams 1 and 3 are paired and those for beams 2 and 4 are paired.
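A minimal sketch of the aggregate-rate idea, assuming a hypothetical `paced_send()` helper (not the Bifrost API): instead of pacing each of the 16 streams on its own, hold their combined output to one target packet rate.

```python
import time

def paced_send(packets, max_pkts_per_s, send):
    """Hypothetical helper: pace the combined output of all streams to a
    single aggregate packet rate.  Illustrative only; not the Bifrost
    set_packet_rate()/set_rate_limit() implementation."""
    interval = 1.0 / max_pkts_per_s
    next_slot = time.perf_counter()
    for pkt in packets:
        now = time.perf_counter()
        if now < next_slot:
            time.sleep(next_slot - now)   # wait for the next send slot
        send(pkt)
        next_slot += interval             # one slot per packet, aggregate

# Example: pace 1000 dummy packets at 100k pkts/s (~10 ms total).
sent = []
paced_send(range(1000), 100_000, sent.append)
print(len(sent))  # 1000
```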
This is implemented in main now and we will see what it does. Early indications are that the packet loss is much better but it needs to run for more than 15 minutes. I have also turned on flow control for switch ports 29 and 30. That may or may not be required.
Update: Beams 1 and 3 look good; beams 2 and 4 look really bad.
Update: I've bumped up the GPU clocks on ndp.
Update: Changing the GPU clocks didn't stop it from happening again.
I left the system running with logging on beams 1 and 2 last night to look at the behavior with the new packet pacing implementation.
Beam 1:
Beam 2:
Beam 1 looks pretty good. There are some excursions to instantaneous packet loss of up to 3% but the overall value is ~0.7%. Beam 2 has this interesting episodic behavior where it drops a ton of packets for a while and then magically recovers. The period looks to be ~1,000 s.
I've now turned off flow control on the T-engine's switch ports this morning to see if that changes things at all.
Yesterday I changed the axis order for the `ibeam1` data in `ndp_drx.py` so that it iterates over beams then subbands. This further breaks up the traffic flow and gets beam 1 down to ~3.5% packet loss. One thing to note is that I haven't checked that the data are still correctly ordered after the swap.
Maybe it's time to revisit `set_packet_rate()`.

Update: It's actually called `set_rate_limit()`.
I realized yesterday that my previous attempts to use `set_rate_limit()` never worked because `set_rate_limit()` never worked on `UDPVerbsTransmit`. After fixing that feature in Bifrost and spending an inordinate amount of time fighting with this "feature" of Bifrost I was able to get things happy again. Apart from some startup losses, the T-engines have been running without significant packet loss for about a day now.
@dentalfloss1 and @league: I hit the exact situation described in that comment on `LockFile` in `fileutils.cpp` and it was super frustrating. @GregBTaylor suggested that we maybe have `LockFile` print out a "waiting for lock"-style message if we cannot acquire the lock within a reasonable time period (~5 to 10 s?). We could also think about implementing the solution proposed in the comment.
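A rough Python sketch of the suggested behavior (the real `LockFile` is C++ in `fileutils.cpp`; the `acquire_with_warning()` name and timings here are made up): poll for the lock non-blockingly and emit a "waiting for lock" message once the wait exceeds a threshold.

```python
import fcntl
import sys
import time

def acquire_with_warning(fileobj, warn_after=5.0, poll=0.1):
    """Hypothetical sketch of the suggested LockFile behavior: try to take
    an exclusive lock, and print a "waiting for lock" message if it cannot
    be acquired within warn_after seconds."""
    start = time.monotonic()
    warned = False
    while True:
        try:
            fcntl.flock(fileobj, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return
        except BlockingIOError:
            if not warned and time.monotonic() - start >= warn_after:
                print(f"waiting for lock on {fileobj.name}...",
                      file=sys.stderr)
                warned = True
            time.sleep(poll)

# Example usage on a scratch file (uncontended, so it returns immediately):
with open("/tmp/ndp_demo.lock", "w") as f:
    acquire_with_warning(f)
    print("lock acquired")
```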
If anyone is curious these are what the packet loss logs look like now.
Beam 1:
Beam 2:
I'm not sure what those spikes are but they seem to only be one or two sample periods long (~5 to 10 s).
After running for several days it looks like more of the same. Here's Beam 1 as an example:
Spikes get up to ~35% but the overall packet loss is something like ~0.3%. Interestingly if I take an FFT of the data I get a peak at ~4 mHz (250 s period). I'm not sure what that means. Maybe nothing.
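For reference, picking out that period from a loss log is a one-liner with an FFT. A sketch with synthetic data standing in for the real series (the 5 s sample period is an assumption based on the earlier note about sample periods):

```python
import numpy as np

# Sketch: recover the dominant period of a packet-loss time series with
# an FFT.  Synthetic data with a built-in 250 s oscillation stands in
# for the real log.
dt = 5.0                               # sample period in seconds (assumed)
t = np.arange(0, 5000, dt)
loss = 0.003 + 0.002 * np.sin(2 * np.pi * t / 250.0)

spec = np.abs(np.fft.rfft(loss - loss.mean()))   # drop the DC term
freqs = np.fft.rfftfreq(loss.size, d=dt)
f_peak = freqs[np.argmax(spec)]
print(f"{f_peak*1e3:.1f} mHz -> {1/f_peak:.0f} s period")  # 4.0 mHz -> 250 s period
```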
Things seem unstable after the INI today (see https://github.com/lwa-project/ng_digital_processor/issues/21 for details). The packet rates into the T-engines are also varying a lot, between ~120k and ~220k pkts/s. There's even packet loss on the DRX side of things.
Rebooting ndp1 didn't help. Now I'm rebooting ndp, ndp1, and the snap2s.
Update: That seems to have helped things out. The packet rate into the T-engines is more steady, between ~185k and ~205k pkts/s.
Update: There seems to be some kind of oscillation now in how the DRX pipelines are running. They will be fine and then all of a sudden drop almost all packets, triggering a sequence reset in the pipelines.
I played around with things again and it looks like the oscillation was some kind of interaction between the two packetizers (IBeam and COR) in the DRX pipelines. Disabling either stopped the cycle of massive packet loss. In the end I switched COR over to verbs and that seemed to be a workable solution.
Other things I changed:
- Switched from the "busy wait" packet pacing to the `set_rate_limit()` solution. I have kept the packet rate reporting for now.
- Set the `set_rate_limit()` call to 420k pkts/s. That's about 110% of the expected rate.
- Added a `set_rate_limit()` call to the DRX output from the T-engine to see if that further makes things easier on the data recorders. I went with about 110% of the expected rate for filter code 7.

Now this is what I want to see.
Beam 1:
Beam 2:
Excellent!
This was working until I moved on to #21 and testing for https://github.com/lwa-project/lwana-issues/issues/15. Now we're back to cycles of massive packet loss. Was this ever really fixed or is this a result of lots of SNAP2 re-initializations?
Update: Restarting the snap2s didn't change the situation today.
The latest behavior is:

1. Things start out running fine in `ndp_drx.py`.
2. `RetransmitOp` hits 100% CPU usage, the output data rate drops by at least a factor of two, and packets start falling on the floor.

It's not clear what is causing (2). From watching `top` I can't really see anything odd happening. I've tried changing the sending order (again), reducing the number of `UDPVerbsTransmit` objects we use, and increasing the gulp size into `RetransmitOp`, but none of those break the cycle.

So what is causing (2)? Something running on the system? Some interaction between the beams and the correlator output? Is `RetransmitOp` just too slow to cover temporary slowdowns?
Update: I changed some things (gulp size, COR packet rate limiting, ring buffer factors). Didn't do much. I also rebooted ndp1. That was much more productive.
I was working on the cyclic packet loss again today and I noticed that when things went south, packets seemed to go everywhere. I first became suspicious of this when I noticed that the discard counters were incrementing on the ports the data recorders are attached to even though I wasn't sending any DRX data to them. That led me to `tcpdump` the traffic coming into DR1, and that is when I noticed the COR packets and F-engine traffic. When this happens I also see that the switch's MAC address table gets refreshed[^1]. It's like the switch forgets who is on what port and then sends traffic to all the wrong places. A configuration problem on the servers with the split networking? A configuration problem on the switch? A hardware problem with the switch? I don't know.
To try to get around this I added static MAC entries to the switch config. That seems to help but we'll have to wait to see if this is a long term fix.
In the meantime I also re-implemented the "busy wait" style packet pacing and I think we are back to where we were on March 7-9.
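A minimal sketch of the "busy wait" style of pacing (illustrative only; `busy_wait_send()` is not the actual pipeline code): spin on the clock between sends instead of sleeping, trading CPU for tighter inter-packet spacing.

```python
import time

def busy_wait_send(packets, pkts_per_s, send):
    """Hypothetical sketch of busy-wait pacing: spin until each packet's
    send slot arrives rather than calling sleep(), which has too coarse
    a granularity for microsecond-scale inter-packet gaps."""
    interval = 1.0 / pkts_per_s
    deadline = time.perf_counter()
    for pkt in packets:
        while time.perf_counter() < deadline:
            pass                      # spin until the next send slot
        send(pkt)
        deadline += interval

out = []
busy_wait_send(range(100), 1_000_000, out.append)
print(len(out))  # 100
```

The cost is one core pegged at 100% while sending; the benefit is that the inter-packet gap is held far more precisely than `time.sleep()` allows.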
Update: I've started logging on all four T-engines to see how things are running.
[^1]: Maybe this is where that ~250 s period comes from that I noted in the March 14 update.
Eh, not great:
Looks like we are at March 14 levels again.
Update: Looks like flow control was turned off again. Let's try turning that back on for ndp0, orville, and the data recorders.
Update: Now I'm seeing discards on interfaces 29 (ndp0), 31 (ndp1), and 32 (ndp1). Maybe I need to flow control 31/32 as well?
Here's the latest:
Better? Maybe.
Looking better after about a day:
The new thing is for everything to work until an observation comes in. The observation seems to set up some kind of oscillation in the packet pacing which causes the data rate to rollercoaster.
Update: I've been watching it and it doesn't look like the switch is dropping. Maybe it's all about the packet pacing setup?
Update: Yeah, I'm thinking that this is another problem with the packet pacing. From the `ndp-drx-0` log:
```
2024-04-12 16:23:42 [INFO ] Changing packet pacing parameter from 7500 to 6750 (found 416722.7 pkts/s)
2024-04-12 16:24:03 [INFO ] Changing packet pacing parameter from 6750 to 6000 (found 428818.7 pkts/s)
2024-04-12 16:24:25 [INFO ] Changing packet pacing parameter from 6000 to 5250 (found 404263.9 pkts/s)
2024-04-12 16:24:47 [INFO ] Changing packet pacing parameter from 5250 to 4500 (found 412552.8 pkts/s)
2024-04-12 16:25:08 [INFO ] Changing packet pacing parameter from 4500 to 3750 (found 411186.2 pkts/s)
2024-04-12 16:25:30 [INFO ] Changing packet pacing parameter from 3750 to 3000 (found 396029.5 pkts/s)
2024-04-12 16:25:51 [INFO ] Changing packet pacing parameter from 3000 to 2250 (found 419248.8 pkts/s)
2024-04-12 16:26:12 [INFO ] Changing packet pacing parameter from 2250 to 1500 (found 405714.9 pkts/s)
2024-04-12 16:26:34 [INFO ] Changing packet pacing parameter from 1500 to 750 (found 395001.9 pkts/s)
2024-04-12 16:26:55 [INFO ] Changing packet pacing parameter from 750 to 0 (found 408694.6 pkts/s)
```
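The log is consistent with a feedback loop along these lines (the 420k target, the update rule, and `adjust_pacing()` itself are assumptions for illustration, not the actual `ndp_drx.py` code): each interval, compare the measured packet rate with a target and nudge the pacing parameter. If the measured rate never responds to the parameter, the loop winds it straight down to zero, which is the runaway visible above.

```python
def adjust_pacing(param, measured, target, step=750):
    """Hypothetical sketch of an adaptive pacing update (not the actual
    pipeline code): shrink the pacing parameter while the measured rate
    is below the target, clamping at zero."""
    if measured < target and param > 0:
        return max(param - step, 0)
    return param

# Feed in measured rates like the ones in the log; none of them respond
# to the parameter, so it just ratchets downward.
param = 7500
for rate in [416_723, 404_264, 412_553, 396_030, 405_715]:
    param = adjust_pacing(param, rate, target=420_000)
print(param)  # 3750
```

A loop like this needs some check that the parameter is actually moving the measured rate; otherwise it is an open-loop ramp to zero.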
Update: Restarting the DRX pipelines didn't help. I'll try the T-engines next.
Update: Restarting the T-engines did help. I'm not sure what to make of that right now since that would point away from the DRX pipelines.
Based on my April 12 notes, I'm trying new capture code on the T-engines that uses the `_mm_loadu_si128` and `_mm_stream_si128` intrinsics to see if that improves performance.
After looking at this some more I developed a theory that the problem stems from the 10G links to the data recorders. It seems like the sequence of events is:
We tested this last week when we swapped the 10G card on DR1/3 with a 40G card we had in the lab. Switch looked happy during observations. @ctaylor-physics did some testing and didn't see any packet loss running on beams 1 and 3 nor did he see time tag errors in the recorded data.
We've gone ahead and ordered a 40G card for DR2/4. I'm optimistic that this will allow us to finally close this issue.
Things were better with the new 40G cards/links to the data recorders but the packet loss still seems to be there. Part of it looks to be caused by flow control on the snap ➡ ndp1 links and part of it seems to be related to T-engine load.
Updates:
I've now made a few more tweaks to the BIOS setup on ndp:
And... I'm not sure that that did a lot. Most of the time the packet loss reported for the T-engines is 0%, but it has its moments where it jumps up to ~30% on all beams. I'll try an INI to put everything back into a clean state.

[^1]: Well, that guide says "Power" but I managed to pick "Performance" anyways.
I've rolled back most of the BIOS changes I made except for setting C states to off and turning off TSME. I couldn't really tell any difference in performance by playing around with any of these settings. The only thing that really changed is that the previous set caused problems like this one:
```
pcieport 0000:80:01.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
pcieport 0000:80:01.1: AER: aer_status: 0x00000040, aer_mask: 0x00000000
pcieport 0000:80:01.1: AER: [ 6] BadTLP
```
Working on this more, I'm now seeing that the packet loss on the T-engine is kind of like flipping a coin: sometimes it starts up ok and sometimes it doesn't. You can even have it in a situation where it is running with no loss, restart the `ndp-tengine-[0-4]` processes, and then be back to them dropping packets. This makes me think that it isn't a network/DRX pipeline/performance issue but something about how the pipelines run on ndp. Something like a bad memory binding somewhere that only happens because we're trying to run four pipelines. So I tried wrapping the T-engines with `numactl` to force memory binding to whatever the relevant socket is, and that seems robust (there will be a commit soon with the new service files).
I think this is ultimately a Bifrost/ibverb-support problem with how I've split up the packet capture into `PacketCaptureMethod` and `PacketCaptureThread`. The split makes it such that `PacketCaptureMethod` is created/initialized before `PacketCaptureThread`. That means that the call to `hwloc_set_membind()` can't ensure that the right NUMA node is used for `PacketCaptureMethod`. This might be fixed with https://github.com/ledatelescope/bifrost/commit/b0a73e8914f0f7800e1108411f0cc6582a4cd969 but that needs to be demonstrated.
Editor's Note: This was originally reported in #18 but it's really its own issue.

Disabling flow control on the switch for ndp1 seems to have helped with #18, but now there is continuous low-level packet loss on the T-engine.
History: