Fix flaky tests - Githubissues

cajun-rat commented 2 years ago

This might be the last 2. See https://github.com/uptane/aktualizr/issues/45

pattivacek commented 2 years ago

Thanks for fixing the rotation bug! I think that was actually affecting multiple tests, but the rotation one is the most susceptible.

For the redirect test, it looks to me like we should expect the update to fail in that case. We're limiting to 10 redirects with CURLOPT_MAXREDIRS, which is good. Instead of reverting the test, can we change it so that we expect it to fail?

What confuses me, though, is that it apparently succeeds sometimes. I mean, it succeeds in CI all the time. I fear the problem goes deeper, but I have no insight on it.

pattivacek commented 2 years ago

After some more consideration, this should be two tests:

Expect success with <10 redirects, let's say 8.
Expect failure with >10 redirects; the number doesn't really matter, so 1000 is fine.

I think we have to split test_treehub_update_after_image_download_failure into two to do that (one that expects success and one that expects failure) if I'm understanding the code correctly. I am a bit confused if multiple handlers really means that the test is running multiple times; it's kind of hard to sort that out.

mike-sul commented 2 years ago

I think we have to split test_treehub_update_after_image_download_failure into two to do that (one that expects success and one that expects failure) if I'm understanding the code correctly. I am a bit confused if multiple handlers really means that the test is running multiple times; it's kind of hard to sort that out.

test_treehub_update_after_image_download_failure runs a few times, once for each handler. iirc, the goal of the test to verify whether an update eventually succeeds (retries) after an initial download failure cause by different reasons (hence different handlers). In the case of RedirectHandler, a download is supposed to fail because it exceeds maximum allowed redirects, but the following download try (no any failure injection) should be successful.

I don't think that there is a need in another test for the Expect success with <10 redirects, let's say 8. since there is already such test - test_customrepo_update_redirect.

pattivacek commented 2 years ago

test_treehub_update_after_image_download_failure runs a few times, once for each handler. iirc, the goal of the test to verify whether an update eventually succeeds (retries) after an initial download failure cause by different reasons (hence different handlers). In the case of RedirectHandler, a download is supposed to fail because it exceeds maximum allowed redirects, but the following download try (no any failure injection) should be successful.

Okay, great, thanks for jumping in! Do you have a suspicion why the test is flaky, though?

I don't think that there is a need in another test for the Expect success with <10 redirects, let's say 8. since there is already such test - test_customrepo_update_redirect.

Good call, thanks!

mike-sul commented 2 years ago

Do you have a suspicion why the test is flaky, though?

No, I'll check it out in the evening.

mike-sul commented 2 years ago

I don't know why test_treehub_failure is flaky, I couldn't reproduce it locally.

But, after looking closely into it, I believe that RedirectHandler is supposed to be commented because it actually doesn't test the scenario it is supposed to test, which is "update must succeed on next update cycle after unsuccessful ostree download during the current update cycle". So, what it really does is just making sure that libostree->libcurl supports redirects (redirects happens internally within libcurl and the first update cycle is successful.

On the other hand, this test might be useful in the future when ostree/libostree will allow configuring a maximum number of redirects at the libcurl it uses (as we do in the aktualizr's httpclient), in this case the test/handler can be enabled with number_of_redirects=<MAX_OSTREE_CURL_REDIRECTS> + 1 and it will test the expected scenario.

BTW, it is kind of security issue, once an ostree server is compromised a hacker can setup a server that returns 301 eternally so aklite gets stuck at a download phase and won't be able to pick up a new update.

mike-sul commented 2 years ago

Just in case, @with_aktualizr fixture has output_logs param, so if to enable it @with_aktualizr(start=False, run_mode='once', output_logs=True) then a test will output aktualizr's logs.

pattivacek commented 2 years ago

Thanks for the clarification @mike-sul! Based on what you've said, I have an idea why it's failing sometimes but not always. The test first checks if the image is pending, then it retries and expects it to succeed. The test expects the 1000 redirects to force a failure, but in fact it does not, so if libaktualizr and libostree can sort them out before the 120 seconds of wait_for_completion() are up, then the test will fail, because it's expecting the update to be pending, not complete. If the 1000 redirects take long enough to process that it exceeds the timeout, but it gets sorted out in time for the second check, it will succeed.

Clearly this test doesn't do what I'd originally expected. It should be commented out again. However, I'd like to leave a more helpful comment based on your insight about curl redirect limitation support in libostree. I also don't like restoring the ticket reference to a Jira instance none of us have access to.

mike-sul commented 2 years ago

Interestingly that CURLOPT(CURLOPT_MAXREDIRS, CURLOPTTYPE_LONG, 68), is set in my local curl.h, so I wondering why it keeps making requests to the test ostree server on my setup. I tried 500000 redirects and the test fails because of timeout (120s) but libostree->libcurl never exited because it reached the redirect limit (68). It managed to make 257542 redirects :).

132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: multiprocessing.pool.RemoteTraceback: 
132: """
132: Traceback (most recent call last):
132:   File "/home/mike/work/foundries/projects/aktualizr/tests/test_treehub_failure.py", line 43, in test_treehub_update_after_image_download_failure
132:     aktualizr.wait_for_completion()
132:   File "/home/mike/work/foundries/projects/aktualizr/tests/test_fixtures.py", line 285, in wait_for_completion
132:     self._process.wait(timeout)
132:   File "/usr/lib/python3.6/subprocess.py", line 1469, in wait
132:     raise TimeoutExpired(self.args, timeout)
132: subprocess.TimeoutExpired: Command '['src/aktualizr_primary/aktualizr', '-c', '/tmp/tmpcx6_jmxb/config.toml', '--run-mode', 'once']' timed out after 120 seconds

pattivacek commented 2 years ago

Interestingly that CURLOPT(CURLOPT_MAXREDIRS, CURLOPTTYPE_LONG, 68), is set in my local curl.h, so I wondering why it keeps making requests to the test ostree server on my setup.

Are you sure that's actually getting used? According to https://curl.se/libcurl/c/CURLOPT_MAXREDIRS.html, the default is unlimited.

mike-sul commented 2 years ago

Are you sure that's actually getting used? According to https://curl.se/libcurl/c/CURLOPT_MAXREDIRS.html, the default is unlimited.

You are right, I am mistaken, this is just the option declaration and 68 is just its order number :).

uptane / aktualizr

Fix flaky tests #46