mozilla / neqo

Neqo, the Mozilla Firefox implementation of QUIC in Rust
https://firefox-source-docs.mozilla.org/networking/http/http3.html
Apache License 2.0
1.84k stars 123 forks

test: multiple test timeout on Mac #1342

Closed dlrobertson closed 6 months ago

dlrobertson commented 2 years ago

The following tests hang on my MacBook Pro (macOS 12.3.1):

martinthomson commented 2 years ago

Hmm, both of these try to fill the congestion window, which does include looping (something most other tests don't have). Are you able to provide a log? Maybe with something like:

RUST_LOG=neqo cargo test -p neqo-transport --lib -- migrate_ack_delay >mad.log 2>&1

I'm curious about where these get stuck.

dlrobertson commented 2 years ago

Ah! Good point! Totally should have checked that and added it. See the log here

martinthomson commented 2 years ago

Thanks for that. It reveals something really interesting. We're stuck in this code:

https://github.com/mozilla/neqo/blob/0961ba9dc1e36778effd8e7607665dbfa700e48c/neqo-transport/src/connection/tests/mod.rs#L352-L367

This is looping infinitely because we are getting Output::Callback(t) where t < DEFAULT_RTT. We want to send more packets, but we are waiting on the packet pacer. The current time is one time unit before the time that the pacer will release more bytes we can send.

We arrived in this state after we returned an Instant from the pacer that was the result of calculating the next time. That was likely rounded down, which resulted in us falling short by a partial time unit. Instead of waiting 6.50208ms, we only waited 6.502041ms, coming up short. From this state the pacer relentlessly wants us to wait 41ns more (not sure about the rounding here, the granularity of the times here is weird) but doing now += t doesn't do anything, probably because that small a time increment rounds to zero somehow.

Put differently, the time is Instant(136457048959). The pacer will release bytes at Instant(136457048960). The connection returns a Duration(41ns), but when you add Instant(136457048959) + Duration(41ns) for some reason you get Instant(136457048959).

The only way this could happen is if there are two values of Instant, x and y, where y > x and the Duration d = y - x satisfies d > 0 but x + d < y. That's perverse, but I guess it's possible if you consider that Instant and Duration might use incompatible types in the underlying implementation.
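A minimal simulation of how this can happen (not neqo code, and the tick values are illustrative): assume Instant is stored as mach ticks with a mach_timebase_info of numer = 125, denom = 3 (one tick ≈ 41.67ns, as on Apple Silicon), while Duration is exact nanoseconds, and both conversions use truncating integer division:

```rust
// Hypothetical simulation of the rounding hazard. NUMER/DENOM mimic a
// mach_timebase_info of 125/3; the tick values are made up for illustration.
const NUMER: u64 = 125;
const DENOM: u64 = 3;

fn ticks_to_nanos(t: u64) -> u64 {
    t * NUMER / DENOM // integer division rounds down
}

fn nanos_to_ticks(n: u64) -> u64 {
    n * DENOM / NUMER // integer division rounds down
}

fn main() {
    let x: u64 = 136_457_048_958; // ticks
    let y: u64 = x + 1; // one tick later

    // duration_since computes in nanoseconds; truncation yields 41ns.
    let d_nanos = ticks_to_nanos(y) - ticks_to_nanos(x);
    assert_eq!(d_nanos, 41);

    // Adding the Duration back converts 41ns to ticks: 41 * 3 / 125 == 0.
    let x_plus_d = x + nanos_to_ticks(d_nanos);
    assert_eq!(x_plus_d, x); // x + d == x even though d > 0
    assert!(x_plus_d < y); // the perverse case: x + (y - x) < y
    println!("d = {}ns, but adding it advances x by 0 ticks", d_nanos);
}
```

With two different resolutions and both conversions truncating, a positive Duration can round back down to zero ticks, so adding it to an Instant does nothing.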

I guess we'll have to work around this, probably by adding some sort of quantum to the value returned by process_output(), so that the Duration it returns, when added to the now that was passed in, produces an Instant that is greater than that input.

martinthomson commented 2 years ago

OK, I can confirm that on a mac, this perverse scenario can indeed occur. The implementation here uses a u64 for Instant with a resolution based on whatever mach_timebase_info returns, but Duration uses nanoseconds resolution. That easily produces this sort of problem.

It seems like the best way to avoid having Instant + Duration have no effect for a small value of the Duration is to include a fudge factor. We could independently implement some of the Instant logic and work out exactly what the scaling factor is, but that seems worse than just adding a small increment.

For this machine, it looks like the increment is ~270ns. That's a small enough time increment that we shouldn't have any trouble adding that much. 1µs might do, though I have no way of knowing whether other machines might use a larger quantum.

Just as a test, @dlrobertson, do you think you could try this patch?

diff --git a/neqo-transport/src/connection/mod.rs b/neqo-transport/src/connection/mod.rs
index b9e2339b..388785ba 100644
--- a/neqo-transport/src/connection/mod.rs
+++ b/neqo-transport/src/connection/mod.rs
@@ -1001,21 +1001,21 @@ impl Connection {
             }
         }

         match self.output(now) {
             SendOption::Yes(dgram) => Output::Datagram(dgram),
             SendOption::No(paced) => match self.state {
                 State::Init | State::Closed(_) => Output::None,
                 State::Closing { timeout, .. } | State::Draining { timeout, .. } => {
                     Output::Callback(timeout.duration_since(now))
                 }
-                _ => Output::Callback(self.next_delay(now, paced)),
+                _ => Output::Callback(self.next_delay(now, paced) + Duration::from_nanos(1)),
             },
         }
     }

     /// Process input and generate output.
     #[must_use = "Output of the process function must be handled"]
     pub fn process(&mut self, dgram: Option<Datagram>, now: Instant) -> Output {
         if let Some(d) = dgram {
             self.input(d, now, now);
             self.process_saved(now);

I'm wondering if that is enough to address the rounding error. Obviously, we'll want to do something a little more comprehensive and considered (there are other Durations we generate), but that might help isolate this.

dlrobertson commented 2 years ago

\o/ It works! Tests fail elsewhere, but the tests run!

0s113ms TRACE ack_time for ap = Some(Instant { t: 45259695734 })
0s113ms TRACE [Client be2aa4cb553b0b1c69c3931206336c] Delayed ACK timer Instant { t: 45259695734 }
0s113ms TRACE [Client be2aa4cb553b0b1c69c3931206336c] Idle/keepalive timer Instant { t: 45979215734 }
0s113ms TRACE [LossRecovery] next_timeout loss=None pto=Some(Instant { t: 45267859324 })
0s113ms TRACE [Client be2aa4cb553b0b1c69c3931206336c] Loss recovery timer Instant { t: 45267859324 }
0s113ms DEBUG [Client be2aa4cb553b0b1c69c3931206336c] delay duration 9.850416ms
0s113ms TRACE [Client be2aa4cb553b0b1c69c3931206336c] process_output Confirmed Instant { t: 45259695734 }
0s113ms TRACE [LossRecovery] timeout Instant { t: 45259695734 }
0s113ms TRACE detect lost ap: now=Instant { t: 45259695734 } delay=112.5ms
0s113ms TRACE lost=3, time sent Instant { t: 45253635489 } is before lost_delay 112.5ms
0s113ms TRACE lost=5, time sent Instant { t: 45254835489 } is before lost_delay 112.5ms
0s113ms TRACE [Client be2aa4cb553b0b1c69c3931206336c] output Instant { t: 45259695734 }
0s113ms DEBUG [LossRecovery] get send profile Instant { t: 45259695734 }
0s113ms DEBUG [Client be2aa4cb553b0b1c69c3931206336c] output_path send_profile SendProfile { limit: 255, pto: None, probe: (), paced: false }
0s113ms DEBUG Building Short dcid 080000000197cc3e
0s113ms DEBUG capacity=255
0s113ms DEBUG limit=255 len=0
0s113ms DEBUG 255 > 0 && 13 < 255 - 0
0s113ms DEBUG Packet Number
0s113ms DEBUG [Client be2aa4cb553b0b1c69c3931206336c] pn=20 type=Short pri-path:080000000197cc3e 192.0.2.1:443->192.0.2.1:443
  TX -> Ack { largest_acknowledged: 3, ack_delay: 2500, first_ack_range: 3, ack_ranges: [] }
0s113ms TRACE Packet build pn=20 hdr=20080000000197cc3e14 body=020349c40003
0s113ms TRACE [epoch 3 Write] encrypt pn=20 hdr=20080000000197cc3e14 body=020349c40003
0s113ms TRACE [epoch 3 Write] encrypt ct=8041a76d9fdef90c604dace0173b2d5595517095e5f5
0s113ms TRACE [epoch 3 Write] HP sample=6d9fdef90c604dace0173b2d55955170 mask=e66a83e0366d1cbdc4b771cb785bf2fc
0s113ms TRACE Packet built 26080000000197cc3e7e8041a76d9fdef90c604dace0173b2d5595517095e5f5
0s113ms DEBUG HANDSHAKE
0s113ms DEBUG [LossRecovery] packet ap-20 sent
0s113ms TRACE [Pacer 0/1357] spend 32 over 6685, 100ms
0s113ms TRACE [Server be2aa4cb553b0b1c69c3931206336c] pri-path:0800000001a014fd 192.0.2.1:443->192.0.2.1:443 input 26080000000197cc3e7e8041a76d9fdef90c604dace0173b2d5595517095e5f5
0s113ms TRACE [Server be2aa4cb553b0b1c69c3931206336c] Received unverified packet Short: [9]: 26080000000197cc3e [23]: 7e8041a76d9fdef90c604dace0173b2d5595517095e5f5
0s113ms TRACE unmask hdr=26080000000197cc3e7e8041a7
0s113ms TRACE [epoch 3 Read] HP sample=6d9fdef90c604dace0173b2d55955170 mask=e66a83e0366d1cbdc4b771cb785bf2fc
0s113ms TRACE unmasked hdr=20080000000197cc3e14
0s113ms TRACE [epoch 3 Read] decoded header: [32, 8, 0, 0, 0, 1, 151, 204, 62, 20]
0s113ms TRACE [epoch 3 Read] decrypt pn=20 hdr=20080000000197cc3e14 body=8041a76d9fdef90c604dace0173b2d5595517095e5f5
0s113ms DEBUG [Server be2aa4cb553b0b1c69c3931206336c] pn=20 type=Short pri-path:0800000001a014fd 192.0.2.1:443->192.0.2.1:443
  -> RX Ack { largest_acknowledged: 3, ack_delay: 2500, first_ack_range: 3, ack_ranges: [] }
0s113ms INFO [Server be2aa4cb553b0b1c69c3931206336c] Rx ACK space=ap, ranges=[0..=3]
0s113ms DEBUG [LossRecovery] ACK for ap - largest_acked=3.
0s113ms TRACE remove_packet outstanding == 0 for space ap
0s113ms TRACE RTT latest=100ms -> estimate=100ms~50ms
0s113ms TRACE detect lost ap: now=Instant { t: 45260895734 } delay=112.5ms
0s113ms TRACE [NewReno CongCtrl 1357/13370 ssthresh 18446744073709551615] app limited=true, bytes_in_flight:1357, cwnd: 13370, state: SlowStart pacing_burst_size: 2674
thread 'connection::tests::ackrate::migrate_ack_delay' panicked at 'assertion failed: self.bytes_in_flight >= pkt.size', neqo-transport/src/cc/classic_cc.rs:165:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
0s113ms TRACE [AgentIo] take output
0s113ms TRACE [AgentIo] take output
test connection::tests::ackrate::migrate_ack_delay ... FAILED

failures:

failures:
    connection::tests::ackrate::migrate_ack_delay

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 431 filtered out; finished in 0.12s

Thanks for the detailed comments! Certainly very interesting. I think I'm at like 50% comprehension 😆 , but I think I've got the gist of it, so I'll re-read a few times.

martinthomson commented 2 years ago

Hmm, that new problem is a little harder for me to diagnose. I can't replicate it locally either, but it doesn't seem to be down to differences in Instant and Duration implementation. I probably need to just open up my old mac and spend some time on it.

As for fixing the time issue, that's quite a bit harder than I made out. Our tests all depend on an assumption that time math is precise (consistency of relationships like t1 + d == t2 and t2 - t1 == d are central). Adding a nanosecond broke a bunch of assertions on Linux, where the relationships between Instant and Duration are strong.
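Those invariants can be written down directly. On platforms where Instant and Duration share a nanosecond representation (e.g. Linux), the assertions below hold exactly, and an unconditional one-nanosecond fudge breaks them:

```rust
use std::time::{Duration, Instant};

fn main() {
    let t1 = Instant::now();
    let d = Duration::from_nanos(41);
    let t2 = t1 + d;

    // The exact-arithmetic relationships the tests rely on.
    assert_eq!(t2 - t1, d); // t2 - t1 == d
    assert_eq!(t1 + d, t2); // t1 + d == t2

    // Unconditionally adding 1ns to a returned delay breaks equalities
    // like this one, which is why the test patch failed on Linux.
    assert_ne!(t1 + d + Duration::from_nanos(1), t2);
}
```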

What we really want here is for Instant::duration_since() to round up - not down - when it cannot be precise. That's a change to the rust std library, which I don't rate as having a high chance of success. And an attempt would take more time than I have to invest right now.

I guess we can see if we can fudge things ourselves for testing purposes, perhaps by implementing a duration_since that has a conditional fudge (or something, I don't know, it's hard).
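One hedged sketch of such a conditional fudge (a hypothetical helper, not neqo API): compute duration_since as usual, then bump the result only when adding it back falls short of the later Instant:

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: like Instant::duration_since, but rounds up rather
// than down, so that `earlier + result >= later` always holds.
fn duration_since_round_up(later: Instant, earlier: Instant) -> Duration {
    let mut d = later.duration_since(earlier);
    // On platforms where the subtraction truncated, adding d back falls
    // short of `later`; bump by a nanosecond until it no longer does.
    // On platforms with exact arithmetic this loop body never runs.
    while earlier + d < later {
        d += Duration::from_nanos(1);
    }
    d
}

fn main() {
    let earlier = Instant::now();
    let later = earlier + Duration::from_micros(10);
    let d = duration_since_round_up(later, earlier);
    assert!(earlier + d >= later);
}
```

Because the fudge is conditional, this should leave the exact t1 + d == t2 relationships intact on Linux while still unsticking the pacer loop on platforms with coarser Instant resolution.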

dlrobertson commented 2 years ago

> Hmm, that new problem is a little harder for me to diagnose. I can't replicate it locally either, but it doesn't seem to be down to differences in Instant and Duration implementation. I probably need to just open up my old mac and spend some time on it.

Let me know if there are things I can try or if there is extra info that could help debug it.

> As for fixing the time issue, that's quite a bit harder than I made out. Our tests all depend on an assumption that time math is precise (consistency of relationships like t1 + d == t2 and t2 - t1 == d are central).

:+1: To get the test output listed above, I had to comment out a few sections.

> What we really want here is for Instant::duration_since() to round up - not down - when it cannot be precise. That's a change to the rust std library, which I don't rate as having a high chance of success.

Do they take bribes? 😆 Maybe send a few slices of pizza their way 🍕 Instant and Duration aren't exceedingly large, so in theory we could roll our own. Something similar was done for smoltcp (because there was no guarantee std would be there for the crate to use).
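A rough sketch of what rolling our own could look like (hypothetical names, merely in the spirit of smoltcp's types): keep a single u64 nanosecond representation for both instants and durations, so addition and subtraction are exact by construction:

```rust
// Hypothetical time types, not neqo or smoltcp API. With one integer
// representation shared by instants and durations, x + (y - x) == y holds
// by construction; no cross-type rounding can occur.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct NanoInstant(u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct NanoDuration(u64);

impl NanoInstant {
    fn duration_since(self, earlier: NanoInstant) -> NanoDuration {
        NanoDuration(self.0 - earlier.0)
    }
    fn add(self, d: NanoDuration) -> NanoInstant {
        NanoInstant(self.0 + d.0)
    }
}

fn main() {
    let x = NanoInstant(136_457_048_959);
    let y = NanoInstant(136_457_049_000);
    let d = y.duration_since(x);
    assert_eq!(x.add(d), y); // exact round-trip, unlike std on macOS
}
```

The cost is that such a type can't be fed directly to APIs expecting std Instant, so a real version would need conversion at the edges.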

> I guess we can see if we can fudge things ourselves for testing purposes, perhaps by implementing a duration_since that has a conditional fudge (or something, I don't know, it's hard).

I'm new, so I don't know much, but if this is only really used for testing, this seems reasonable.

larseggert commented 8 months ago

@dlrobertson is this still an issue? (I have not been able to reproduce.)

dlrobertson commented 8 months ago

> @dlrobertson is this still an issue? (I have not been able to reproduce.)

Haven't been able to get tests running on my mac for a bit... I'll try again soon

larseggert commented 6 months ago

@dlrobertson any update?

dlrobertson commented 6 months ago

Just re-ran the tests and they don't seem to fail for me any more \o/

larseggert commented 6 months ago

Thanks for checking!