Closed cajun-rat closed 2 years ago
Thanks for fixing the rotation bug! I think that was actually affecting multiple tests, but the rotation one is the most susceptible.
For the redirect test, it looks to me like we should expect the update to fail in that case. We're limiting to 10 redirects with CURLOPT_MAXREDIRS, which is good. Instead of reverting the test, can we change it so that we expect it to fail?
What confuses me, though, is that it apparently succeeds sometimes. I mean, it succeeds in CI all the time. I fear the problem goes deeper, but I have no insight on it.
After some more consideration, this should be two tests:
I think we have to split test_treehub_update_after_image_download_failure into two to do that (one that expects success and one that expects failure) if I'm understanding the code correctly. I am a bit confused about whether multiple handlers really means that the test is running multiple times; it's kind of hard to sort that out.
test_treehub_update_after_image_download_failure runs a few times, once for each handler.
iirc, the goal of the test is to verify that an update eventually succeeds (via retries) after an initial download failure caused by different reasons (hence the different handlers). In the case of RedirectHandler, the download is supposed to fail because it exceeds the maximum allowed redirects, but the following download attempt (without any failure injection) should succeed.
I don't think there is a need for another test for the "Expect success with <10 redirects, let's say 8." case, since such a test already exists: test_customrepo_update_redirect.
Okay, great, thanks for jumping in! Do you have a suspicion why the test is flaky, though?
Good call, thanks!
Do you have a suspicion why the test is flaky, though?
No, I'll check it out in the evening.
I don't know why test_treehub_failure is flaky; I couldn't reproduce it locally.
But, after looking closely into it, I believe that RedirectHandler is supposed to be commented out because it doesn't actually test the scenario it is supposed to test, which is "update must succeed on next update cycle after unsuccessful ostree download during the current update cycle". What it really does is just make sure that libostree->libcurl supports redirects (the redirects happen internally within libcurl and the first update cycle is successful).
On the other hand, this test might be useful in the future, when ostree/libostree allows configuring a maximum number of redirects for the libcurl it uses (as we do in aktualizr's httpclient). In that case the test/handler can be enabled with number_of_redirects=<MAX_OSTREE_CURL_REDIRECTS> + 1 and it will test the expected scenario.
BTW, this is kind of a security issue: once an ostree server is compromised, an attacker can set up a server that returns 301 forever, so aklite gets stuck in the download phase and won't be able to pick up a new update.
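To make the redirect-limit scenario above concrete, here is a minimal sketch of how a client-side cap interacts with a server that answers 301 a fixed number of times. Everything in it (`make_redirecting_server`, `download`, `TooManyRedirects`) is a hypothetical name made up for illustration; it only models the counting and is not the aktualizr, libostree, or libcurl code.

```python
from typing import Callable, Optional, Tuple

# Hypothetical model: (HTTP status, Location header or None).
Response = Tuple[int, Optional[str]]


class TooManyRedirects(Exception):
    """Raised when the client-side redirect cap is exceeded."""


def make_redirecting_server(number_of_redirects: int) -> Callable[[str], Response]:
    """Return a fake fetch() that answers 301 a fixed number of times, then 200."""
    remaining = number_of_redirects

    def fetch(url: str) -> Response:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return 301, url  # redirect back to the same object
        return 200, None

    return fetch


def download(fetch: Callable[[str], Response], url: str,
             max_redirects: Optional[int]) -> int:
    """Follow redirects and return how many were followed.

    max_redirects=None models a client with no cap at all.
    """
    followed = 0
    while True:
        status, location = fetch(url)
        if status != 301:
            return followed
        if max_redirects is not None and followed >= max_redirects:
            raise TooManyRedirects(f"gave up after {followed} redirects")
        followed += 1
        url = location
```

Under these assumptions, a server configured with `max_redirects + 1` redirects makes the capped client raise, while a client with no cap simply follows all of them (for example 1000) and the first cycle succeeds. A server that never stops redirecting would keep an uncapped client in the loop indefinitely, which is the stuck-download concern raised above.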
Just in case: the @with_aktualizr fixture has an output_logs param, so if you enable it with @with_aktualizr(start=False, run_mode='once', output_logs=True), the test will output aktualizr's logs.
Thanks for the clarification @mike-sul! Based on what you've said, I have an idea why it's failing sometimes but not always. The test first checks if the image is pending, then it retries and expects it to succeed. The test expects the 1000 redirects to force a failure, but in fact it does not, so if libaktualizr and libostree can sort them out before the 120 seconds of wait_for_completion() are up, then the test will fail, because it's expecting the update to be pending, not complete. If the 1000 redirects take long enough to process that it exceeds the timeout, but it gets sorted out in time for the second check, it will succeed.
Clearly this test doesn't do what I'd originally expected. It should be commented out again. However, I'd like to leave a more helpful comment based on your insight about curl redirect limitation support in libostree. I also don't like restoring the ticket reference to a Jira instance none of us have access to.
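The timing hypothesis above can be written down as a small model. This is only a sketch of the reasoning, not of the real test: `predicted_outcome` is a hypothetical helper, and the assumption that the second check gets its own window of the same length as the first is mine, not something stated in the test code.

```python
def predicted_outcome(redirect_processing_s: float, timeout_s: float = 120.0) -> str:
    """Predict the test result from how long the redirects take to process.

    Assumption (hypothetical): the second check has its own window of
    timeout_s seconds, so everything is done by 2 * timeout_s at the latest.
    """
    if redirect_processing_s <= timeout_s:
        # The update finishes inside the first wait, so the test's
        # "still pending" expectation is violated.
        return "fail: update already complete at first check"
    if redirect_processing_s <= 2 * timeout_s:
        # The first check sees a pending update, the retry sees it complete.
        return "pass: pending at first check, complete at second"
    return "fail: update still not complete at second check"
```

If the hypothesis holds, the same test flips between pass and fail depending only on how fast the machine works through the redirects, which would explain results that differ between CI and a local run.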
Interestingly, CURLOPT(CURLOPT_MAXREDIRS, CURLOPTTYPE_LONG, 68), is set in my local curl.h, so I wonder why it keeps making requests to the test ostree server on my setup. I tried 500000 redirects and the test fails because of the timeout (120s), but libostree->libcurl never bailed out on reaching the redirect limit (68). It managed to make 257542 redirects :).
132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: 127.0.0.1 - - [01/Dec/2021 13:47:46] "GET /objects/41/5ce9717fc7a5f4d743a4f911e11bd3ed83930e46756303fd13a3eb7ed35892.filez HTTP/1.0" 301 -
132: multiprocessing.pool.RemoteTraceback:
132: """
132: Traceback (most recent call last):
132: File "/home/mike/work/foundries/projects/aktualizr/tests/test_treehub_failure.py", line 43, in test_treehub_update_after_image_download_failure
132: aktualizr.wait_for_completion()
132: File "/home/mike/work/foundries/projects/aktualizr/tests/test_fixtures.py", line 285, in wait_for_completion
132: self._process.wait(timeout)
132: File "/usr/lib/python3.6/subprocess.py", line 1469, in wait
132: raise TimeoutExpired(self.args, timeout)
132: subprocess.TimeoutExpired: Command '['src/aktualizr_primary/aktualizr', '-c', '/tmp/tmpcx6_jmxb/config.toml', '--run-mode', 'once']' timed out after 120 seconds
Are you sure that's actually getting used? According to https://curl.se/libcurl/c/CURLOPT_MAXREDIRS.html, the default is unlimited.
You are right, I am mistaken, this is just the option declaration and 68 is just its order number :).
This might be the last 2. See https://github.com/uptane/aktualizr/issues/45