private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

Lack of execution consistency #1359

Closed huitema closed 1 year ago

huitema commented 2 years ago

For applications based on picoquic, there are multiple cases of tests passing on Windows and macOS but failing on Ubuntu, or passing with GCC but failing with Clang, or vice versa. When executing a test simulation, it appears that different packets are lost, or that the timing changes. That's rather annoying.

victorstewart commented 2 years ago

this might be due to compiler flags that relax some conditions in the name of performance.

even if so, it might not matter to application correctness, and could be worth it in the name of performance.

maybe there's some way to make the expected test results fuzzy, aka correct within a range?

victorstewart commented 2 years ago

i remember when we tried enabling -Ofast it changed/broke some of the test math. though it still resulted in a perfectly operable transport, at least as far as the download test in quicperf.

huitema commented 1 year ago

It is annoying, because people test on one platform, and then the CI tests on GitHub run on a different set of platforms. I also saw an issue recently in which tests were passing on x86 and failing on M12.

But on the other hand, fixing that probably requires replacing all floating point computations by fixed point using integers. That's a lot of work, and it is difficult to test...

huitema commented 1 year ago

Trying to do an inventory of places where we use "double".

1) void picoquic_update_pacing_rate(picoquic_cnx_t * cnx, picoquic_path_t* path_x, double pacing_rate, uint64_t quantum). The pacing rate is defined in bytes per second and could be passed as a uint64_t. The code computes the integer variables pacing_packet_time_nanosec, pacing_packet_time_microsec and pacing_bucket_max. Those computations could probably be done directly using uint64_t, gaining speed and repeatability.

2) similar code in void picoquic_update_pacing_data(picoquic_cnx_t* cnx, picoquic_path_t * path_x, int slow_start).

3) double picoquic_test_gauss_random(uint64_t* random_context), a utility that is only used in tests, to compute a random jitter in picoquictest_sim_link_jitter(). Transforming it to integers is plausible, and would definitely improve test consistency.

4) picoquictest_sim_link_t* picoquictest_sim_link_create(double data_rate_in_gps, uint64_t microsec_latency, uint64_t* loss_mask, uint64_t queue_delay_max, uint64_t current_time). The gbps notation is convenient, but changing that API to handle uint64_t bits_per_second would be simple.

5) The leaky bucket implementation in picoquictest_sim_link_submit()

6) The test int ack_disorder_test_one(char const * log_name, int64_t horizon_delay, double range_average_max)

7) The unit test for the random generator, int random_public_tester_test()

8) Lots of double based computations in the congestion control algorithms.

9) Lots of double-based transforms for display in log files.

10) Time checks and variation checks in the stress tests.

These vary in importance, probably.

huitema commented 1 year ago

It seems that we are chasing the wrong bug. We have a number of "logging" tests that are part of the CI tests and run just fine on multiple platforms and multiple compilers. They would fail if the execution changed randomly between those platforms. The only tests that regularly cause issues are http_corrupt and http_corrupt_rdpn. Here is a graph showing a series of successive simulated execution times for the http_corrupt_rdpn test:

[image: graph of successive simulated execution times for the http_corrupt_rdpn test]

It is very hard to obtain consistent execution for this kind of test. The random number generator is seeded with a constant value, but some of the packet content is random -- for example, the various Connection Identifiers, or the randomized values of the initial packet numbers. These random values probably cause small differences in execution, which then create a cascade of changes. The varying simulated times shown in the graph are probably caused by the cumulative effects of random events, such as timers.

It is simpler to modify the test and just assume that the simulated times are meaningless if the exchanges are corrupted.