cbiffle opened this issue 2 years ago (status: Open)
Interesting!
Another issue is: even if we add a timeout, UdpSocket only tries to send the packet at the front of the queue. If that's unaddressable, it'll clog up the queue for the duration of the timeout even if other packets in the queue can be sent.
I wonder how Linux handles this...
I'm comfortable with the socket only trying to send the front of the queue, because it provides the property that packets sent on a socket emerge (at least, from the network stack) in the order they were sent. They may of course get reordered elsewhere, but that's a nice property to start with, and in systems like mine where reordering won't happen it's a real plus.
> even if we add a timeout
I don't think adding a timeout per se is the right solution because, with rare exceptions, the choice of such a timeout is likely to be wrong. I can expand on this if you like. Providing either an enumeration of handling/eviction policies or the pluggable strategy I suggested seems like a better option. (It's possible you were using "add a timeout" as shorthand for the entire space of options, in which case, ignore me -- I just want to make sure a timeout doesn't appear here as a reflex.)
> I wonder how Linux handles this...
I don't know off the top of my head, and my initial searches are not bearing fruit, but I would not be surprised if the answer were "a timeout." ;-)
> it provides the property that packets sent on a socket emerge (at least, from the network stack) in the order they were sent
it shouldn't matter if packets for different dest IPs get reordered though, right? as long as all packets for the same IP are not reordered between themselves.
> Providing either an enumeration of handling/eviction policies or the pluggable strategy I suggested seems like a better option. (It's possible you were using "add a timeout" as shorthand for the entire space of options, in which case, ignore me -- I just want to make sure a timeout doesn't appear here as a reflex.)
Yeah, I worded it poorly; I think the single FIFO queue is an issue for many such strategies. For example, if we did "if queue full, drop oldest packet": say the queue size is 32, and the evil client fills it with 32 poison packets, then a legit client wants a request served. The first legit response will overwrite one poison packet, but there are still 31 others in the queue in front of it, so it won't get sent. The legit client would have to do 32 retries to unclog the queue.
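To make that failure mode concrete, here's a toy model of the scenario (a plain `VecDeque` standing in for the socket's tx queue; this is not smoltcp code): with a drop-oldest policy, the legit packet still lands behind 31 poison packets.

```rust
use std::collections::VecDeque;

// Toy model of a drop-oldest FIFO; not smoltcp's actual tx ring.
const QUEUE_SIZE: usize = 32;

fn enqueue_drop_oldest(queue: &mut VecDeque<&'static str>, pkt: &'static str) {
    if queue.len() == QUEUE_SIZE {
        queue.pop_front(); // evict the oldest packet to make room
    }
    queue.push_back(pkt);
}

fn main() {
    let mut queue = VecDeque::new();
    // The evil client fills the queue with unaddressable "poison" packets.
    for _ in 0..QUEUE_SIZE {
        enqueue_drop_oldest(&mut queue, "poison");
    }
    // A legit response evicts one poison packet but lands at the back...
    enqueue_drop_oldest(&mut queue, "legit");
    // ...so 31 poison packets are still ahead of it at the head of the queue.
    assert_eq!(queue.iter().filter(|p| *p == &"poison").count(), 31);
    assert_eq!(queue.back(), Some(&"legit"));
}
```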
> I would not be surprised if the answer were "a timeout." ;-)
:smiling_face_with_tear:
> it shouldn't matter if packets for different dest IPs get reordered though, right? as long as all packets for the same IP are not reordered between themselves.
...yeah, I feel like that might be okay? Implementing that sounds a lot more expensive than always sending front of queue, but maybe there's a trick I haven't thought of.
> The legit client would have to do 32 retries to unclog the queue.
You're right, that would still be unpleasant! Now that you mention it, it seems like it would be nice if unaddressable packets could be bumped to the back of the queue. That way we could have many concurrent NDP enquiries in flight, and a "cache eviction" mechanism could allow new traffic to escape if permitted.
(After writing that, it occurs to me that there would be subtleties in determining when to return "no activity" from `poll` if unaddressable packets get bumped to the back of the queue, since rate-limiting the front packet and checking its addressability would no longer suffice; you'd kinda need to keep track of how many such packets you evaluated. Hmmmmm.)
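For what it's worth, here's a rough sketch of how that bump-to-back pass with the "how many did we evaluate" bookkeeping could look. This is generic Rust over a toy queue with made-up helpers (`neighbor_known`, `transmit`), not smoltcp's actual egress path:

```rust
use std::collections::VecDeque;

// Toy packet and made-up helpers; in smoltcp these roles are played by the
// neighbor cache lookup and the device transmit path.
struct Packet { dst: u32, payload: Vec<u8> }

fn neighbor_known(dst: u32) -> bool { dst != 0 } // placeholder policy
fn transmit(_p: &Packet) { /* hand off to the device */ }

/// Try to egress the queue once. Returns true if anything was actually
/// transmitted, i.e. whether there was "activity" to report from poll.
fn egress(queue: &mut VecDeque<Packet>) -> bool {
    let mut activity = false;
    // Examine each packet at most once per pass; without this cap, a queue
    // full of unaddressable packets would make the loop spin forever.
    let to_examine = queue.len();
    for _ in 0..to_examine {
        let pkt = queue.pop_front().unwrap();
        if neighbor_known(pkt.dst) {
            transmit(&pkt);
            activity = true;
        } else {
            // Unaddressable: bump it to the back so later packets get a turn.
            // Note: this can reorder packets destined for the same peer.
            queue.push_back(pkt);
        }
    }
    activity
}
```

With this shape, "no activity" falls out of the pass itself: once every queued packet has been examined without a transmission, `egress` returns false instead of spinning on the same head packet.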
Yeah, the `SocketMeta` `NeighborState` "silencing" assumes each socket only talks to one peer, which is not true for non-TCP sockets. I also ran into this for the WIP `DnsSocket`.
Is that assumption even valid for TCP? If some other node sends an unexpected packet to the port, doesn't the socket need to establish addressability to send RST? Or does that avoid the full NDP/addressability lookup because it's inside the stack and can simply reuse the original MAC?
RSTs are sent by `Interface` when no socket matches. If it's unaddressable it gets dropped, but that shouldn't happen because the incoming packet would've filled the neighbor cache.
This is biting us quite regularly now -- I'm surprised more users don't appear to be hitting it. It's quite simple to produce if there's ever a case where a machine that was once addressable leaves the network, due to partition or disconnection.
(We are, for what it's worth, still on v0.8, but I don't see anything in the changelog that suggests this has been fixed in v0.9.)
I've tried out a fix in the form of evicting the oldest packet from the outgoing queue on a socket in the event that the queue is discovered full -- basically, switching from a "drop newest" policy to a "drop oldest" policy -- but the naive version of this is insufficient because the socket meta `neighbor_state` remains permanently stuck in `Waiting` for the original, unaddressable, address. Because of the way the code is factored, it is currently difficult for the layer that manages the `neighbor_state` to discover that an eviction has occurred in the `tx_buffer` in a different type.

Patching the `Meta::egress_permitted` routine to set the state to `Active` on timeout increases the chances of noticing that the unaddressable packet has been removed and that future neighbor cache checks should use a different address, but if unaddressable packets are being generated regularly you can wind up with another one at the head of the queue.
For instance: given a task that sends UDP broadcasts every 500 ms, altering it to send an unaddressable UDP packet just before each broadcast reproduces the bug reliably. Altering smoltcp to attempt to make space in the UDP socket `tx_buffer` before returning `QueueFull` will occasionally leave the queue with a valid packet at the head, but only if the queue depth is odd, since packets are being sent in pairs. This seems like it introduces a new error-prone case, and I don't love it.
So, it's starting to feel like the right thing to do is amend `NeighborState` with a field for tracking how long we've been waiting on the particular neighbor. If this exceeds some limit, we must assume we have a poison packet at the head of the queue, and take recovery action, which would consist of (at minimum) setting the `neighbor_state` back to `Active`. (I'd love to increment a counter when this happens, but smoltcp's trace infrastructure is heavily text-message oriented rather than event-oriented, so that'd be hard to integrate.)
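To be concrete about the shape of what this proposal means, here's a sketch; the variant and field names are illustrative and do not match smoltcp's internal definitions:

```rust
// Illustrative only -- names and fields here are stand-ins, not smoltcp's
// internals; Instant and IpAddress substitute for smoltcp's types.
type Instant = u64; // milliseconds
type IpAddress = [u8; 4];

enum NeighborState {
    Active,
    Waiting {
        neighbor: IpAddress,
        since: Instant, // when we first started waiting on this neighbor
    },
}

// e.g. 3x the current 3-second neighbor discovery period
const NEIGHBOR_WAIT_LIMIT_MS: u64 = 9_000;

impl NeighborState {
    /// Called from the egress path. If we've been waiting on one neighbor
    /// for too long, assume the head packet is poison and recover: the
    /// caller should drop that packet, and the state flips back to Active.
    fn check_timeout(&mut self, now: Instant) -> bool {
        if let NeighborState::Waiting { since, .. } = self {
            if now - *since > NEIGHBOR_WAIT_LIMIT_MS {
                *self = NeighborState::Active;
                return true;
            }
        }
        false
    }
}
```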
An extended solution might add an "unreachable" entry type to the neighbor cache, such that when we have to evict a packet due to neighbor resolution timeout, we could cache that fact and turn further transmissions to that address into prompt errors for a certain time period. This would reduce the cost of software that repeatedly attempts to send to an unaddressable peer by amortizing the wait-for-neighbor delay over the number of sends that are denied promptly. Our software doesn't behave this way, though, and I'm assuming other users' software doesn't either, since apparently nobody's sending to unaddressable peers but me.
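Sketched out, that "unreachable" caching idea might look something like this (again illustrative; the enum and helper below are hypothetical, not smoltcp's neighbor-cache API):

```rust
// Hypothetical shape of a neighbor-cache lookup result; this is not
// smoltcp's actual API, just the idea.
enum NeighborAnswer {
    Found([u8; 6]),             // hardware address is known
    NotFound,                   // kick off discovery and wait
    RateLimited,                // a discovery request went out recently
    Unreachable { until: u64 }, // discovery already timed out: fail fast
}

fn lookup_for_send(answer: NeighborAnswer, now: u64) -> Result<[u8; 6], &'static str> {
    match answer {
        NeighborAnswer::Found(mac) => Ok(mac),
        // The interesting case: instead of silencing the socket and waiting
        // for discovery again, sends to a known-unreachable peer are denied
        // promptly until the cached entry expires.
        NeighborAnswer::Unreachable { until } if now < until => Err("neighbor unreachable"),
        _ => Err("would block on neighbor discovery"),
    }
}
```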
The thing I'm proposing is basically a timeout, which is a thing I was trying to avoid. However, in this case, I think there's a good justification for treating a neighbor that is not promptly discoverable as undiscoverable -- because neighbor discovery is a local network operation, it's more predictable than a lot of network operations, and waiting e.g. 3x the current 3-second period should cover most cases.
I think there are two somewhat orthogonal issues here:

A. Packets are not dropped from the socket queue on neighbor discovery timeout.
B. `NeighborState` assumes "1 socket = 1 peer".

Some interesting things:

- Fixing just A doesn't fully solve the problem: if a socket is sending packets to 2 peers and one of them is undiscoverable, packets to the discoverable peer get delayed for many seconds when the socket is silenced, which is still not great. At least the socket doesn't get stuck forever, though.
- Fixing just B doesn't fully solve the problem either. Stuck packets no longer delay packets for other peers, but they'll still permanently waste space in the socket queue, and can still clog it if there are too many.
- Fixing B means no longer tracking which socket is waiting for which neighbor, since sockets can now wait for many.

I think we need to fix both. B is a complex refactor, though; perhaps A is more doable short-term, and would mitigate the problem a lot.
Agreed. Packets getting stuck behind other packets is a result of smoltcp's admirably simple handling of queues, where only the head element is ever acted upon. Bumping the head element to the rear of the queue if it needs neighbor discovery is one option, I suppose, and that doesn't compromise the simplicity too much. In particular it's still a FIFO, just an occasionally-cycled FIFO. As you noted, if all packets in the queue are discovered to not be ready, the socket would silence itself.
If the neighbor cache is extended with a PENDING entry type, adding an UNREACHABLE entry type would reduce the cost of repeated sends to an unaddressable peer; based on Linux's behavior I believe they do this with some timeout.
The potential mass-unsilencing behavior doesn't bother me, personally; we already wind up polling every socket for architectural reasons. (Architectural reasons that could likely be fixed, but currently they're lower priority than the thing where smoltcp sockets clog up and cannot be used again until a reset.)
Bumping packets to the back of the queue would mean packets destined for the same peer could get reordered, though. I was thinking something more along the lines of: iterate all packets in the queue front-to-back, try to send them all, but if one fails leave it in-place.
Maintaining the order of packets to a single destination would be a nice property. Random-access dequeueing with the current ring data structures is possible -- thankfully they are not fancy concurrent Lamport queues or something -- but would likely require the payloads to be copied to remain contiguous, lest it create fragmentation that blocks further packets.
Memcpy is faster than "stuck forever" so this sounds acceptable to me.
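A sketch of that "iterate front-to-back, leave failures in place" pass, over a toy `VecDeque` rather than smoltcp's ring buffers (which is where the compaction/memcpy mentioned above would come in); per-destination ordering is preserved because packets never move relative to each other:

```rust
use std::collections::VecDeque;

// Toy packet and made-up helpers again; not smoltcp's actual types.
struct Packet { dst: u32, payload: Vec<u8> }

fn neighbor_known(dst: u32) -> bool { dst != 0 } // placeholder policy
fn transmit(_p: &Packet) { /* hand off to the device */ }

/// Walk the whole queue front-to-back, sending what we can and leaving
/// unaddressable packets exactly where they are, so packets to the same
/// destination are never reordered relative to each other.
fn egress_all(queue: &mut VecDeque<Packet>) -> usize {
    let mut sent = 0;
    queue.retain(|pkt| {
        if neighbor_known(pkt.dst) {
            transmit(pkt);
            sent += 1;
            false // sent: remove it from the queue
        } else {
            true // still waiting on neighbor discovery: keep it in place
        }
    });
    sent
}
```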
Also, I wanted to say that I greatly appreciate your thoughtful and detailed discussion of this issue. :-)
Update from our side:
We've implemented a rather hacked-up watchdog on top of smoltcp to catch and unclog this situation. The implementation is a little scattered around our network stack, but the part that takes action is here. Essentially what we do is:
```rust
// Reset the queue by closing + reopening it. This will lose
// packets in the RX queue as well; they're collateral damage
// because `smoltcp` doesn't expose a way to flush just the TX
// side.
let s = self.get_socket_mut(socket_index).unwrap_lite();
let e = s.endpoint();
s.close();
s.bind(e).unwrap_lite();
```
(`unwrap_lite` is an idiom in our embedded stack -- it's `unwrap` but without debug formatting. Read it as `unwrap`.)
It's a little gross, yes, and results in loss of any other outgoing packets, plus any incoming packets on the same socket. However, in our application, we strongly emphasize application-layer end-to-end controls for things like redelivery/retransmission, and expect some amount of packet loss at L3 and below (we're all UDP, in fact). So for us, it is relatively easy to recover from losing a few packets in either direction -- certainly easier to recover from than a socket being stuck forever!
Anyway, I post this in case it's a useful workaround for anyone else struggling with this issue.
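(For anyone adapting this: the detection side of our watchdog is scattered and application-specific, but the general shape is roughly the following. The names here are made up for illustration and are not smoltcp API or code from our stack -- the idea is just to track how long sends have been failing with a full queue and fire the reset once a deadline passes.)

```rust
// Hypothetical watchdog bookkeeping; names are illustrative only.
struct TxWatchdog {
    stuck_since: Option<u64>, // timestamp (ms) when QueueFull was first seen
    limit_ms: u64,
}

impl TxWatchdog {
    fn new(limit_ms: u64) -> Self {
        TxWatchdog { stuck_since: None, limit_ms }
    }

    /// Call after each send attempt. Returns true once the socket has been
    /// stuck (repeatedly reporting a full queue) past the deadline, at which
    /// point the close + re-bind reset above gets applied.
    fn note_send_result(&mut self, now_ms: u64, queue_full: bool) -> bool {
        if !queue_full {
            // The queue had room, so the head of the queue is draining;
            // clear the stuck marker.
            self.stuck_since = None;
            return false;
        }
        let since = *self.stuck_since.get_or_insert(now_ms);
        now_ms.saturating_sub(since) >= self.limit_ms
    }
}
```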
I can confirm that this issue is present in UDP sockets with ARP requests as well.
For us it's undesirable to miss/lose incoming packets (RX), even if we set the buffer to a single packet.
For re-transmission, we also plan to add this in our protocol on top of the embassy-net implementation, but it seems cumbersome to handle `smoltcp` re-transmission at the level of our own protocol.
Would it be acceptable to expose the queue, or a way to build custom clean-up logic, which could be timeout- or counter-based, or simply drop the packet if the destination is unaddressable? This would leave the implementation to the user, whether it's at the application or protocol level.
I'm just getting into the `smoltcp` code-base, and any guidance is welcome.
I still want to bump this issue. It's a major issue if you expect some of your nodes to go offline (like IoT devices and such).
Would a setting for `EvictionPolicy` be a good approach to solve this issue and leave it to the user to decide what to do?
E.g.
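Something along these lines, purely as an illustration (no `EvictionPolicy` type exists in smoltcp today; the names below are made up):

```rust
// Purely illustrative -- smoltcp has no EvictionPolicy today.
pub enum EvictionPolicy {
    /// Current behavior: new packets are rejected while the queue is full.
    DropNewest,
    /// Evict the oldest queued packet to make room for a new one.
    DropOldest,
    /// Drop the head packet once neighbor discovery for it has been
    /// pending longer than the given number of milliseconds.
    DropUnaddressableAfterMs(u64),
}

pub struct UdpSocketConfig {
    pub eviction_policy: EvictionPolicy,
}
```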
Steps:
Expected:
Socket would keep working -- perhaps the unaddressable packets would be dropped eventually.
Observed:
Socket is now permanently clogged, unaddressable packets are tried forever, no further data can be sent.
Musings:
So, I really appreciate the fact that smoltcp does not discard packets after some arbitrary timeout, since timeouts are always wrong. I also appreciate that there is not some fixed "packet retry counter" that can be exhausted. I've used libraries that made both of these decisions, and I had to fix it.
That being said, this is a case where the stack can be DDoS'd pretty easily, so I've been thinking about how to fix it without introducing arbitrary timeouts or counters. My initial thoughts were: