Closed nemethf closed 6 months ago
However, it never sends the PATH_CHALLENGE for new_path. I'm guessing this is because poll_transmit() always has something better to send. If I modify poll_transmit to prioritize both PATH_CHALLENGEs, then the connection migration works as expected.
Looking at populate_packet(), it seems to actually prioritize PATH_CHALLENGE a decent amount? In particular, it should definitely be sent before DATAGRAM or STREAM frames, so it's surprising that this doesn't happen. Are you sure other packets are being sent on the new path? Can you dig into how/why we're not hitting this path in populate_packet():
if let Some(token) = self.path.challenge {
Can you dig into how/why we're not hitting this path in populate_packet():
if let Some(token) = self.path.challenge {
This might not be the only reason, but this if condition evaluates to true:
if self.in_flight.bytes + bytes_to_send >= self.path.congestion.window() {
It definitely seems strange. Cwnd is per path, which is okay. But the server should be able to send a path_challenge even if there are lots of unacknowledged packets on prev_path. And since the prev_path is unreachable, the situation won't change. (Well, maybe it will change when a timeout occurs for the in-flight packets, but by that time the unsent path_challenge will expire as well. Or so it seems.)
But that happens after the migration, right, at which point self.path should contain the new path?
That's right: self.path is the new path, but in_flight contains info from the previous path as well. In my case: in_flight:20328 bytes_to_send:1200 cwnd:12000 space_idx:2.
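To make the arithmetic concrete, here is a minimal sketch that plugs the logged values into the quoted condition. The function congestion_blocked is a hypothetical stand-in written for illustration and is not Quinn's actual code; only the comparison itself and the three numbers come from this thread.

```rust
/// Hypothetical stand-in for the quoted check; not Quinn's actual code.
/// Returns true when sending `bytes_to_send` would reach or exceed the window.
fn congestion_blocked(in_flight: u64, bytes_to_send: u64, cwnd: u64) -> bool {
    in_flight + bytes_to_send >= cwnd
}

fn main() {
    // Values from the log line above: in_flight still counts bytes
    // that were sent on the previous path.
    let blocked = congestion_blocked(20_328, 1_200, 12_000);
    println!("with bytes carried over from prev_path: blocked = {blocked}");

    // Assumption for comparison: nothing in flight is charged to the new path.
    let fresh = congestion_blocked(0, 1_200, 12_000);
    println!("with nothing in flight on the new path: blocked = {fresh}");
}
```

Since 20328 + 1200 is well above 12000, the check blocks every transmit on the new path, including the packet that would carry the PATH_CHALLENGE.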
Okay, so because we haven't seen ACKs from the previous path we don't want to send more data, but we need to challenge/confirm the new path in order to be able to continue receiving ACKs?
Yes, that's my conclusion. It sounds plausible to me.
(Also, those ACKs might never arrive since the client might never receive the corresponding packets.)
it never sends the PATH_CHALLENGE for new_path
How long did you wait? The path validation timeout is equal to 3 probe timeouts, so there should be plenty of time for the server to send a tail loss probe, receive an ACK, and free up congestion window space by judging packets as delivered or lost in response.
Even if that does occur, though, this behavior seems needlessly pessimal. Perhaps we should track "bytes in flight" per path rather than globally, ensuring that we can always send immediately on a new path. Will discuss in the implementers' slack to make sure I'm not missing something.
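A rough sketch of what per-path accounting could look like, assuming each path owns both its congestion window and its in-flight counter. PathState and its methods are hypothetical names chosen for illustration, not Quinn's types, and the sketch ignores loss detection and ACK processing entirely.

```rust
/// Hypothetical per-path state; names are illustrative, not Quinn's.
#[derive(Debug)]
struct PathState {
    /// Congestion window for this path, in bytes.
    cwnd: u64,
    /// Bytes sent on this path that are neither acknowledged nor declared lost.
    in_flight: u64,
}

impl PathState {
    /// A newly observed path starts with nothing in flight.
    fn new(cwnd: u64) -> Self {
        Self { cwnd, in_flight: 0 }
    }

    /// Mirrors the check discussed above, but against this path's own counter.
    fn can_send(&self, bytes: u64) -> bool {
        self.in_flight + bytes < self.cwnd
    }

    /// Charge sent bytes to this path only.
    fn on_sent(&mut self, bytes: u64) {
        self.in_flight += bytes;
    }
}

fn main() {
    // The previous path is saturated with unacknowledged data.
    let mut prev_path = PathState::new(12_000);
    prev_path.on_sent(20_328);

    // The new path has its own counter, so a 1200-byte packet fits.
    let new_path = PathState::new(12_000);

    println!("prev_path can send: {}", prev_path.can_send(1_200));
    println!("new_path can send: {}", new_path.can_send(1_200));
}
```

With the counters separated, unacknowledged data on prev_path no longer prevents the first packets, and therefore the PATH_CHALLENGE, from going out on new_path.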
it never sends the PATH_CHALLENGE for new_path
How long did you wait? The path validation timeout is equal to 3 probe timeouts, so there should be plenty of time for the server to send a tail loss probe, receive an ACK, and free up congestion window space by judging packets as delivered or lost in response.
I don't know, but I did see the creation of path challenges three times in migrate(). I think the problem is that even if the server can send something on the old path, the client won't be able to receive it, and therefore won't send an ACK back.
Even if that does occur, though, this behavior seems needlessly pessimal.
Exactly. Although I haven't measured it, my patch in the original issue description was able to bring down the outage to a reasonably small duration. (When a manual address change initiates the migration and both paths are operational, the outage is even smaller.)
Perhaps we should track "bytes in flight" per path rather than globally, ensuring that we can always send immediately on a new path.
If "bytes in flight" are used in flow-control as well, then I think naively maintaining per path values has a danger of overloading the receiver.
I think the problem is that even if the server can send something on the old path, the client won't be able to receive it, and therefore won't send an ACK back.
The server should send a tail loss probe (and receive the resulting ACK) on the new path. The PTO is shorter than the path validation timeout, so the new path should still be active when the PTO expires.
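For a sense of scale, the sketch below computes a PTO using the formula from RFC 9002 (smoothed RTT plus the larger of four times the RTT variance and the timer granularity, plus the peer's max ACK delay) and then takes three times that value as the path validation timeout, as described earlier in this thread. The function names and the sample RTT figures are assumptions for illustration; they are not taken from Quinn or from the attached capture.

```rust
use std::cmp::max;
use std::time::Duration;

/// Timer granularity recommended by RFC 9002 (kGranularity).
const GRANULARITY: Duration = Duration::from_millis(1);

/// PTO as defined in RFC 9002; hypothetical helper, not Quinn's API.
fn probe_timeout(srtt: Duration, rttvar: Duration, max_ack_delay: Duration) -> Duration {
    srtt + max(rttvar * 4, GRANULARITY) + max_ack_delay
}

/// Path validation timeout as stated in this thread: three probe timeouts.
fn path_validation_timeout(pto: Duration) -> Duration {
    pto * 3
}

fn main() {
    // Assumed sample values, chosen only to make the arithmetic easy to follow.
    let srtt = Duration::from_millis(100);
    let rttvar = Duration::from_millis(25);
    let max_ack_delay = Duration::from_millis(25);

    let pto = probe_timeout(srtt, rttvar, max_ack_delay);
    let validation = path_validation_timeout(pto);

    println!("PTO: {pto:?}, path validation timeout: {validation:?}");
}
```

Under these assumed numbers the first probe fires a third of the way into the validation window, which is the margin the comment above relies on.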
If "bytes in flight" are used in flow-control as well
They are not. Congestion control and flow control are independent.
The server should send a tail loss probe (and receive the resulting ACK) on the new path. The PTO is shorter than the path validation timeout, so the new path should still be active when the PTO expires.
I've attached a .pcap file. Maybe it helps to reveal what's going on. It was captured on the server side. Packet No. 787 is the client's first packet to arrive at the server after the primary path was cut.
My client host has two network interfaces. I bind the client to 0.0.0.0:0 and bring down the primary interface with
ip link set dev eth0 down
during a download. This does not result in a connection migration. Firstly, this is because the client has nothing to send, so the server won't be notified about the new path. So I changed the client config to make it send pings periodically:
Now when the server notices the new path, it creates two path challenges in Connection::migrate(), one for the prev_path, one for the new_path. However, it never sends the PATH_CHALLENGE for new_path. I'm guessing this is because poll_transmit() always has something better to send. If I modify poll_transmit() to prioritize both PATH_CHALLENGEs, then the connection migration works as expected. I copy my patch below. But this is most probably not the right fix, because it is probably not a good idea to send a PATH_CHALLENGE frame alone in a packet and before other more important frames.