I'm working on the host-side peer of the transceivers task, which is in this repository. I'm plugging that library into the Barefoot Software Development Environment (BF SDE), the driver framework for the Sidecar Tofino ASIC. The SDE uses that crate to query and control the front IO board QSFP transceiver modules, for example to assert low-power mode.
The SDE uses a polling model -- every second, it queries all the modules we've told it about to update their presence and reset status, query their identity, etc. On the Sidecar rev B board, there are 64 such ports, 32 of which are queried through the Hubris transceivers task using the above library. So the network activity here is very bursty. Every second, there are 32 modules being queried, usually with several requests each, all basically as fast as possible. (There are several requests because the SDE detects presence, and then tries to read the memory map. Further, we need to read the map ourselves to determine how to access it.)
Here's a snippet of the log of the binary actually using that crate, running the BF SDE code for managing the modules:
It's important to note that these requests are all serialized in the library itself. There's only one controller object in the process, and it only allows a single outstanding request at a time. Once the response for that request has been received, it starts processing the next one.
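To make the "single outstanding request" property concrete, here is a minimal sketch of the idea, not the actual crate. `Transport`, `Controller`, `EchoTransport`, and `poll_once` are hypothetical names, and the one-thread-per-module polling is an assumption made purely to create contention. The controller holds a lock for the whole round trip, so however bursty the callers are, the peer should only ever see one request in flight.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical transport: sends one request and blocks until its response.
pub trait Transport: Sync {
    fn round_trip(&self, request: &[u8]) -> Vec<u8>;
}

/// Hypothetical controller. The lock is held for the full round trip, so at
/// most one request is outstanding at any time.
pub struct Controller<T: Transport> {
    transport: T,
    outstanding: Mutex<()>,
}

impl<T: Transport> Controller<T> {
    pub fn new(transport: T) -> Self {
        Self { transport, outstanding: Mutex::new(()) }
    }

    pub fn request(&self, request: &[u8]) -> Vec<u8> {
        let _guard = self.outstanding.lock().unwrap();
        self.transport.round_trip(request)
    }
}

/// Test double that echoes requests and records the peak number of
/// requests that were in flight at the same time.
#[derive(Default)]
pub struct EchoTransport {
    in_flight: AtomicUsize,
    max_in_flight: AtomicUsize,
}

impl Transport for EchoTransport {
    fn round_trip(&self, request: &[u8]) -> Vec<u8> {
        let now = self.in_flight.fetch_add(1, Ordering::SeqCst) + 1;
        self.max_in_flight.fetch_max(now, Ordering::SeqCst);
        // Give other threads a chance to overlap if the controller allowed it.
        std::thread::yield_now();
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
        request.to_vec()
    }
}

/// Simulate one polling burst (one thread per module, an assumption) and
/// return the peak number of in-flight requests seen by the transport.
pub fn poll_once(n_modules: usize, requests_per_module: usize) -> usize {
    let controller = Controller::new(EchoTransport::default());
    std::thread::scope(|s| {
        for module in 0..n_modules {
            let controller = &controller;
            s.spawn(move || {
                for req in 0..requests_per_module {
                    let reply = controller.request(&[module as u8, req as u8]);
                    assert_eq!(reply, [module as u8, req as u8]);
                }
            });
        }
    });
    controller.transport.max_in_flight.load(Ordering::SeqCst)
}
```

Under these assumptions, `poll_once(32, 3)` reports a peak of one in-flight request, which is why a full outgoing queue on the SP side is surprising. The `3` requests per module is an illustrative number only.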
At some point I started looking at the ringbuf for the transceivers task, and saw a lot of QueueFull errors:
It looks in this case as though the outgoing socket queue is entirely full. I'm not sure how this is possible, for two reasons. First, as I mentioned, the host only sends one request at a time and there's only one controller running on the system. Second, the transceivers task should just be dropping packets it can't send. (That's a normal part of the protocol; the host is responsible for retrying requests.)
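As a rough illustration of the behavior I'd expect, here is a toy model of "drop on a full queue, let the host retry". This is not the Hubris net API: `Sp`, `SendError`, `request_with_retry`, the queue capacity, and the `draining` flag are all hypothetical, and the SP simply echoes each request back.

```rust
use std::collections::VecDeque;

/// Error returned when the bounded outgoing queue has no free slot.
#[derive(Debug, PartialEq, Eq)]
pub enum SendError {
    QueueFull,
}

/// Hypothetical model of the SP side: a bounded outgoing queue, plus a flag
/// saying whether queued packets are still being transmitted.
pub struct Sp {
    tx: VecDeque<Vec<u8>>,
    capacity: usize,
    pub draining: bool,
    pub dropped: usize,
}

impl Sp {
    pub fn new(capacity: usize) -> Self {
        Self { tx: VecDeque::with_capacity(capacity), capacity, draining: true, dropped: 0 }
    }

    /// Enqueue a packet, or fail without blocking if the queue is full.
    pub fn try_send(&mut self, packet: Vec<u8>) -> Result<(), SendError> {
        if self.tx.len() >= self.capacity {
            return Err(SendError::QueueFull);
        }
        self.tx.push_back(packet);
        Ok(())
    }

    /// Handle one request by echoing it back; if the queue is full the
    /// response is dropped and counted.
    pub fn handle(&mut self, request: &[u8]) {
        if self.try_send(request.to_vec()).is_err() {
            self.dropped += 1;
        }
    }

    /// Transmit the oldest queued packet, if the queue is draining at all.
    pub fn transmit(&mut self) -> Option<Vec<u8>> {
        if self.draining { self.tx.pop_front() } else { None }
    }
}

/// Host side: resend the request until a matching response arrives or the
/// attempts run out. Non-matching (stale) packets are discarded.
pub fn request_with_retry(sp: &mut Sp, request: &[u8], max_attempts: usize) -> Option<Vec<u8>> {
    for _ in 0..max_attempts {
        sp.handle(request);
        while let Some(reply) = sp.transmit() {
            if reply == request {
                return Some(reply);
            }
        }
    }
    None
}
```

In this model, a full queue that is still draining costs one dropped response and one retry. Only a queue that never drains makes every retry fail, which resembles what the ringbuf seems to show here.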
At this point, the transceivers task is actually not responding to host requests at all. It really does appear that the queue is full and all packets are being dropped. We can see that because the NDX and GEN columns are changing, if we look at the ringbuf output again:
Also note that the system is still running NDP correctly, and responding to pings. This is a snoop capture of the interface that allows communication between the host and SP, while pinging from another shell.
The link-local address ending in 6a7e is the host, and the one ending in 2734 is the SP. So the host sends a ping, the SP sends an NS to resolve the host's IP, gets a response, and sends the echo reply back to it. So the net task itself doesn't appear borked. Indeed, its ringbuf appears "minimal" (quoting @mkeeter!):
I should note that I first noticed the QueueFull errors earlier this evening, but had to run before I could write this up. At that time, I saw similar errors, but the task appeared to still be responding. That's actually where the noisy log output at the beginning of this issue came from. I have another dump from that state here.
I've attached a dump I took here.
No tasks appear faulted:
CC @mkeeter