cbiffle opened this issue 2 years ago (status: Open)
Interesting!
Another issue is: even if we add a timeout, UdpSocket only tries to send the packet at the front of the queue. If that's unaddressable, it'll clog up the queue for the duration of the timeout even if other packets in the queue can be sent.
I wonder how Linux handles this...
I'm comfortable with the socket only trying to send the front of the queue, because it provides the property that packets sent on a socket emerge (at least, from the network stack) in the order they were sent. They may of course get reordered elsewhere, but that's a nice property to start with, and in systems like mine where reordering won't happen it's a real plus.
> even if we add a timeout
I don't think adding a timeout per se is the right solution because, with rare exceptions, the choice of such a timeout is likely to be wrong. I can expand on this if you like. Providing either an enumeration of handling/eviction policies or the pluggable strategy I suggested seems like a better option. (It's possible you were using "add a timeout" as shorthand for the entire space of options, in which case, ignore me -- I just want to make sure a timeout doesn't appear here as a reflex.)
> I wonder how Linux handles this...
I don't know off the top of my head, and my initial searches are not bearing fruit, but I would not be surprised if the answer were "a timeout." ;-)
> it provides the property that packets sent on a socket emerge (at least, from the network stack) in the order they were sent
it shouldn't matter if packets for different dest IPs get reordered though, right? as long as all packets for the same IP are not reordered between themselves.
> Providing either an enumeration of handling/eviction policies or the pluggable strategy I suggested seems like a better option. (It's possible you were using "add a timeout" as shorthand for the entire space of options, in which case, ignore me -- I just want to make sure a timeout doesn't appear here as a reflex.)
Yeah, I worded it poorly; I think the single FIFO queue is an issue for many such strategies. For example, if we did "if queue full, drop oldest packet": say the queue size is 32, and the evil client fills it with 32 poison packets, then a legit client wants a request served. The first legit response will overwrite one poison packet, but there are still 31 others in the queue in front of it, so it won't get sent. The legit client would have to do 32 retries to unclog the queue.
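To make that failure mode concrete, here's a toy model of the scenario (a plain `VecDeque` standing in for the socket's tx queue; this is not smoltcp code): with a drop-oldest policy, the legit packet still lands behind 31 poison packets.

```rust
use std::collections::VecDeque;

// Toy model of a drop-oldest FIFO; not smoltcp's actual tx ring.
const QUEUE_SIZE: usize = 32;

fn enqueue_drop_oldest(queue: &mut VecDeque<&'static str>, pkt: &'static str) {
    if queue.len() == QUEUE_SIZE {
        queue.pop_front(); // evict the oldest packet to make room
    }
    queue.push_back(pkt);
}

fn main() {
    let mut queue = VecDeque::new();
    // The evil client fills the queue with unaddressable "poison" packets.
    for _ in 0..QUEUE_SIZE {
        enqueue_drop_oldest(&mut queue, "poison");
    }
    // A legit response evicts one poison packet but lands at the back...
    enqueue_drop_oldest(&mut queue, "legit");
    // ...so 31 poison packets are still ahead of it at the head of the queue.
    assert_eq!(queue.iter().filter(|p| *p == &"poison").count(), 31);
    assert_eq!(queue.back(), Some(&"legit"));
}
```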
> I would not be surprised if the answer were "a timeout." ;-)
:smiling_face_with_tear:
> it shouldn't matter if packets for different dest IPs get reordered though, right? as long as all packets for the same IP are not reordered between themselves.
...yeah, I feel like that might be okay? Implementing that sounds a lot more expensive than always sending front of queue, but maybe there's a trick I haven't thought of.
> The legit client would have to do 32 retries to unclog the queue.
You're right, that would still be unpleasant! Now that you mention it, it seems like it would be nice if unaddressable packets could be bumped to the back of the queue. That way we could have many concurrent NDP enquiries in flight, and a "cache eviction" mechanism could allow new traffic to escape if permitted.
(After writing that, it occurs to me that there would be subtleties in determining when to return "no activity" from `poll` if unaddressable packets get bumped to the back of the queue, since rate-limiting the front packet and checking its addressability would no longer suffice; you'd kinda need to keep track of how many such packets you evaluated. Hmmmmm.)
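For what it's worth, here's a rough sketch of how that bump-to-back pass with the "how many did we evaluate" bookkeeping could look. This is generic Rust over a toy queue with made-up helpers (`neighbor_known`, `transmit`), not smoltcp's actual egress path:

```rust
use std::collections::VecDeque;

// Toy packet and made-up helpers; in smoltcp these roles are played by the
// neighbor cache lookup and the device transmit path.
struct Packet { dst: u32, payload: Vec<u8> }

fn neighbor_known(dst: u32) -> bool { dst != 0 } // placeholder policy
fn transmit(_p: &Packet) { /* hand off to the device */ }

/// Try to egress the queue once. Returns true if anything was actually
/// transmitted, i.e. whether there was "activity" to report from poll.
fn egress(queue: &mut VecDeque<Packet>) -> bool {
    let mut activity = false;
    // Examine each packet at most once per pass; without this cap, a queue
    // full of unaddressable packets would make the loop spin forever.
    let to_examine = queue.len();
    for _ in 0..to_examine {
        let pkt = queue.pop_front().unwrap();
        if neighbor_known(pkt.dst) {
            transmit(&pkt);
            activity = true;
        } else {
            // Unaddressable: bump it to the back so later packets get a turn.
            // Note: this can reorder packets destined for the same peer.
            queue.push_back(pkt);
        }
    }
    activity
}
```

With this shape, "no activity" falls out of the pass itself: once every queued packet has been examined without a transmission, `egress` returns false instead of spinning on the same head packet.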
Yeah, the `SocketMeta` `NeighborState` "silencing" assumes each socket only talks to one peer, which is not true for non-TCP sockets. I also ran into this for the WIP `DnsSocket`.
Is that assumption even valid for TCP? If some other node sends an unexpected packet to the port, doesn't the socket need to establish addressability to send RST? Or does that avoid the full NDP/addressability lookup because it's inside the stack and can simply reuse the original MAC?
RSTs are sent by `Interface` when no socket matches. If it's unaddressable it gets dropped, but that shouldn't happen because the incoming packet would've filled the neighbor cache.
This is biting us quite regularly now -- I'm surprised more users don't appear to be hitting it. It's quite simple to produce if there's ever a case where a machine that was once addressable leaves the network, due to partition or disconnection.
(We are, for what it's worth, still on v0.8, but I don't see anything in the changelog that suggests this has been fixed in v0.9.)
I've tried out a fix in the form of evicting the oldest packet from the outgoing queue on a socket in the event that the queue is discovered full -- basically, switching from a "drop newest" policy to a "drop oldest" policy -- but the naive version of this is insufficient because the socket meta `neighbor_state` remains permanently stuck in `Waiting` for the original, unaddressable, address. Because of the way the code is factored, it is currently difficult for the layer that manages the `neighbor_state` to discover that an eviction has occurred in the `tx_buffer` in a different type.

Patching the `Meta::egress_permitted` routine to set the state to `Active` on timeout increases the chances of noticing that the unaddressable packet has been removed and that future neighbor cache checks should use a different address, but if unaddressable packets are being generated regularly you can wind up with another one at the head of the queue.
For instance: given a task that sends UDP broadcasts every 500 ms, altering it to send an unaddressable UDP packet just before each broadcast reproduces the bug reliably. Altering smoltcp to attempt to make space in the UDP socket `tx_buffer` before returning `QueueFull` will occasionally leave the queue with a valid packet at the head, but only if the queue depth is odd, since packets are being sent in pairs. This seems like it introduces a new error-prone case, and I don't love it.
So, it's starting to feel like the right thing to do is amend `NeighborState` with a field for tracking how long we've been waiting on the particular neighbor. If this exceeds some limit, we must assume we have a poison packet at the head of the queue, and take recovery action, which would consist of (at minimum) setting the `neighbor_state` back to `Active`. (I'd love to increment a counter when this happens, but smoltcp's trace infrastructure is heavily text-message oriented rather than event-oriented, so that'd be hard to integrate.)
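To be concrete about the shape of what this proposal means, here's a sketch; the variant and field names are illustrative and do not match smoltcp's internal definitions:

```rust
// Illustrative only -- names and fields here are stand-ins, not smoltcp's
// internals; Instant and IpAddress substitute for smoltcp's types.
type Instant = u64; // milliseconds
type IpAddress = [u8; 4];

enum NeighborState {
    Active,
    Waiting {
        neighbor: IpAddress,
        since: Instant, // when we first started waiting on this neighbor
    },
}

// e.g. 3x the current 3-second neighbor discovery period
const NEIGHBOR_WAIT_LIMIT_MS: u64 = 9_000;

impl NeighborState {
    /// Called from the egress path. If we've been waiting on one neighbor
    /// for too long, assume the head packet is poison and recover: the
    /// caller should drop that packet, and the state flips back to Active.
    fn check_timeout(&mut self, now: Instant) -> bool {
        if let NeighborState::Waiting { since, .. } = self {
            if now - *since > NEIGHBOR_WAIT_LIMIT_MS {
                *self = NeighborState::Active;
                return true;
            }
        }
        false
    }
}
```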
An extended solution might add an "unreachable" entry type to the neighbor cache, such that when we have to evict a packet due to neighbor resolution timeout, we could cache that fact and turn further transmissions to that address into prompt errors for a certain time period. This would reduce the cost of software that repeatedly attempts to send to an unaddressable peer by amortizing the wait-for-neighbor delay over the number of sends that are denied promptly. Our software doesn't behave this way, though, and I'm assuming other users' software doesn't either, since apparently nobody's sending to unaddressable peers but me.
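Sketched out, that "unreachable" caching idea might look something like this (again illustrative; the enum and helper below are hypothetical, not smoltcp's neighbor-cache API):

```rust
// Hypothetical shape of a neighbor-cache lookup result; this is not
// smoltcp's actual API, just the idea.
enum NeighborAnswer {
    Found([u8; 6]),             // hardware address is known
    NotFound,                   // kick off discovery and wait
    RateLimited,                // a discovery request went out recently
    Unreachable { until: u64 }, // discovery already timed out: fail fast
}

fn lookup_for_send(answer: NeighborAnswer, now: u64) -> Result<[u8; 6], &'static str> {
    match answer {
        NeighborAnswer::Found(mac) => Ok(mac),
        // The interesting case: instead of silencing the socket and waiting
        // for discovery again, sends to a known-unreachable peer are denied
        // promptly until the cached entry expires.
        NeighborAnswer::Unreachable { until } if now < until => Err("neighbor unreachable"),
        _ => Err("would block on neighbor discovery"),
    }
}
```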
The thing I'm proposing is basically a timeout, which is a thing I was trying to avoid. However, in this case, I think there's a good justification for treating a neighbor that is not promptly discoverable as undiscoverable -- because neighbor discovery is a local network operation, it's more predictable than a lot of network operations, and waiting e.g. 3x the current 3-second period should cover most cases.
I think there are two somewhat orthogonal issues here:

A. Packets are not dropped from the socket queue on neighbor discovery timeout.
B. `NeighborState` assumes "1 socket = 1 peer".

Some interesting things:

- Fixing just A doesn't fully solve the problem: if a socket is sending packets to 2 peers and one of them is undiscoverable, packets to the discoverable peer get delayed for many seconds when the socket is silenced, which is still not great. At least the socket doesn't get stuck forever, though.
- Fixing just B doesn't fully solve the problem either. Stuck packets no longer delay packets for other peers, but they'll still permanently waste space in the socket queue, and can still clog it if there are too many.
- Fixing B means no longer tracking which socket is waiting for which neighbor, since sockets can now wait for many.

I think we need to fix both. B is a complex refactor, though; perhaps A is more doable short-term, and would mitigate the problem a lot.
Agreed. Packets getting stuck behind other packets is a result of smoltcp's admirably simple handling of queues, where only the head element is ever acted upon. Bumping the head element to the rear of the queue if it needs neighbor discovery is one option, I suppose, and that doesn't compromise the simplicity too much. In particular it's still a FIFO, just an occasionally-cycled FIFO. As you noted, if all packets in the queue are discovered to not be ready, the socket would silence itself.
If the neighbor cache is extended with a PENDING entry type, adding an UNREACHABLE entry type would reduce the cost of repeated sends to an unaddressable peer; based on Linux's behavior I believe they do this with some timeout.
The potential mass-unsilencing behavior doesn't bother me, personally; we already wind up polling every socket for architectural reasons. (Architectural reasons that could likely be fixed, but currently they're lower priority than the thing where smoltcp sockets clog up and cannot be used again until a reset.)
Bumping packets to the back of the queue would mean packets destined for the same peer could get reordered, though. I was thinking something more along the lines of: iterate all packets in the queue front-to-back, try to send them all, but if one fails leave it in-place.
Maintaining the order of packets to a single destination would be a nice property. Random-access dequeueing with the current ring data structures is possible -- thankfully they are not fancy concurrent Lamport queues or something -- but would likely require the payloads to be copied to remain contiguous, lest it create fragmentation that blocks further packets.
Memcpy is faster than "stuck forever" so this sounds acceptable to me.
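A sketch of that "iterate front-to-back, leave failures in place" pass, over a toy `VecDeque` rather than smoltcp's ring buffers (which is where the compaction/memcpy mentioned above would come in); per-destination ordering is preserved because packets never move relative to each other:

```rust
use std::collections::VecDeque;

// Toy packet and made-up helpers again; not smoltcp's actual types.
struct Packet { dst: u32, payload: Vec<u8> }

fn neighbor_known(dst: u32) -> bool { dst != 0 } // placeholder policy
fn transmit(_p: &Packet) { /* hand off to the device */ }

/// Walk the whole queue front-to-back, sending what we can and leaving
/// unaddressable packets exactly where they are, so packets to the same
/// destination are never reordered relative to each other.
fn egress_all(queue: &mut VecDeque<Packet>) -> usize {
    let mut sent = 0;
    queue.retain(|pkt| {
        if neighbor_known(pkt.dst) {
            transmit(pkt);
            sent += 1;
            false // sent: remove it from the queue
        } else {
            true // still waiting on neighbor discovery: keep it in place
        }
    });
    sent
}
```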
Also, I wanted to say that I greatly appreciate your thoughtful and detailed discussion of this issue. :-)
Update from our side:
We've implemented a rather hacked-up watchdog on top of smoltcp to catch and unclog this situation. The implementation is a little scattered around our network stack, but the part that takes action is here. Essentially what we do is:
```rust
// Reset the queue by closing + reopening it. This will lose
// packets in the RX queue as well; they're collateral damage
// because `smoltcp` doesn't expose a way to flush just the TX
// side.
let s = self.get_socket_mut(socket_index).unwrap_lite();
let e = s.endpoint();
s.close();
s.bind(e).unwrap_lite();
```
(`unwrap_lite` is an idiom in our embedded stack -- it's `unwrap` but without debug formatting. Read it as `unwrap`.)
It's a little gross, yes, and results in loss of any other outgoing packets, plus any incoming packets on the same socket. However, in our application, we strongly emphasize application-layer end-to-end controls for things like redelivery/retransmission, and expect some amount of packet loss at L3 and below (we're all UDP, in fact). So for us, it is relatively easy to recover from losing a few packets in either direction -- certainly easier to recover from than a socket being stuck forever!
Anyway, I post this in case it's a useful workaround for anyone else struggling with this issue.
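(For anyone adapting this: the detection side of our watchdog is scattered and application-specific, but the general shape is roughly the following. The names here are made up for illustration and are not smoltcp API or code from our stack -- the idea is just to track how long sends have been failing with a full queue and fire the reset once a deadline passes.)

```rust
// Hypothetical watchdog bookkeeping; names are illustrative only.
struct TxWatchdog {
    stuck_since: Option<u64>, // timestamp (ms) when QueueFull was first seen
    limit_ms: u64,
}

impl TxWatchdog {
    fn new(limit_ms: u64) -> Self {
        TxWatchdog { stuck_since: None, limit_ms }
    }

    /// Call after each send attempt. Returns true once the socket has been
    /// stuck (repeatedly reporting a full queue) past the deadline, at which
    /// point the close + re-bind reset above gets applied.
    fn note_send_result(&mut self, now_ms: u64, queue_full: bool) -> bool {
        if !queue_full {
            // The queue had room, so the head of the queue is draining;
            // clear the stuck marker.
            self.stuck_since = None;
            return false;
        }
        let since = *self.stuck_since.get_or_insert(now_ms);
        now_ms.saturating_sub(since) >= self.limit_ms
    }
}
```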
I can confirm that this issue is present in UDP sockets with ARP requests as well.
For us it's undesirable to miss/lose incoming packets (RX), even if we set the buffer to a single packet.
For re-transmission, we also plan to add this in our protocol on top of the embassy-net implementation, but it seems cumbersome to handle `smoltcp` re-transmission at the level of our own protocol.
Would it be acceptable to expose the queue, or a way to build custom clean-up logic, which could be timeout- or counter-based, or simply drop the packet if the destination is unaddressable? This would leave the implementation to the user, whether it's at the application or protocol level.
I'm just getting into the `smoltcp` code-base, and any guidance is welcome.
I still want to bump this issue. It's a major issue if you expect some of your nodes to go offline (like IoT devices and such).
Would a setting for `EvictionPolicy` be a good approach to solve this issue and leave it to the user to decide what to do?
E.g.
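Something along these lines, purely as an illustration (no `EvictionPolicy` type exists in smoltcp today; the names below are made up):

```rust
// Purely illustrative -- smoltcp has no EvictionPolicy today.
pub enum EvictionPolicy {
    /// Current behavior: new packets are rejected while the queue is full.
    DropNewest,
    /// Evict the oldest queued packet to make room for a new one.
    DropOldest,
    /// Drop the head packet once neighbor discovery for it has been
    /// pending longer than the given number of milliseconds.
    DropUnaddressableAfterMs(u64),
}

pub struct UdpSocketConfig {
    pub eviction_policy: EvictionPolicy,
}
```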
Steps:
Expected:
Socket would keep working -- perhaps the unaddressable packets would be dropped eventually.
Observed:
Socket is now permanently clogged, unaddressable packets are tried forever, no further data can be sent.
Musings:
So, I really appreciate the fact that smoltcp does not discard packets after some arbitrary timeout, since timeouts are always wrong. I also appreciate that there is not some fixed "packet retry counter" that can be exhausted. I've used libraries that made both of these decisions, and I had to fix it.
That being said, this is a case where the stack can be DDoS'd pretty easily, so I've been thinking about how to fix it without introducing arbitrary timeouts or counters. My initial thoughts were: