Closed rochacbruno closed 1 year ago
Related smash issue https://github.com/PulpQE/pulp-smash/issues/527
More history reference about this issue: https://github.com/PulpQE/pulp-smash/issues/976#issuecomment-384420122
Another concern (perhaps worth a separate issue) is that pulp-smash waits over an hour for the test to timeout:
21:36:14 tests/rpm/api_v2/test_updateinfo.py::OpenSuseErrataTestCase::test_01_create_sync_publish <- ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_2_tests/tests/rpm/api_v2/test_updateinfo.py FAILED [ 94%]
22:38:08 tests/rpm/api_v2/test_updateinfo.py::OpenSuseErrataTestCase::test_02_relogin_suggested <- ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_2_tests/tests/rpm/api_v2/test_updateinfo.py SKIPPED [ 94%]
Edit: What about something like this where poll_limit is set aggressively but tests can override the default:
```diff diff --git a/pulp_smash/api.py b/pulp_smash/api.py index 44fbb2e..1bd0236 100644 --- a/pulp_smash/api.py +++ b/pulp_smash/api.py @@ -482,7 +482,7 @@ class Client(): ) -def poll_spawned_tasks(cfg, call_report, pulp_host=None): +def poll_spawned_tasks(cfg, call_report, pulp_host=None, poll_limit=100): """Recursively wait for spawned tasks to complete. Yield response bodies. Recursively wait for each of the spawned tasks listed in the given `call @@ -493,6 +493,7 @@ def poll_spawned_tasks(cfg, call_report, pulp_host=None): :param call_report: A dict-like object with a `call report`_ structure. :param pulp_host: The host to poll. If ``None``, a host will automatically be selected by :class:`Client`. + :param poll_limit int: A limit to the number of times Pulp is polled (default is 100) :returns: A generator yielding task bodies. :raises: Same as :meth:`poll_task`. @@ -504,11 +505,11 @@ def poll_spawned_tasks(cfg, call_report, pulp_host=None): else: hrefs = [call_report['_href']] for href in hrefs: - for final_task_state in poll_task(cfg, href, pulp_host): + for final_task_state in poll_task(cfg, href, pulp_host, poll_limit): yield final_task_state -def poll_task(cfg, href, pulp_host=None): +def poll_task(cfg, href, pulp_host=None, poll_limit=100): """Wait for a task and its children to complete. Yield response bodies. Poll the task at ``href``, waiting for the task to complete. When a @@ -519,14 +520,14 @@ def poll_task(cfg, href, pulp_host=None): :param href: The path to a task you'd like to monitor recursively. :param pulp_host: The host to poll. If ``None``, a host will automatically be selected by :class:`Client`. + :param poll_limit int: A limit to the number of times Pulp is polled (default is 100) :returns: An generator yielding response bodies. :raises pulp_smash.exceptions.TaskTimedOutError: If a task takes too long to complete. """ - # 900 * 2s == 1800s == 30m + # for poll_limit of 100: 100 * 2s == 200s == 3.3m # NOTE: The timeout counter is synchronous. We query Pulp, then count down, # then query pulp, then count down, etc. This is… dumb. - poll_limit = 900 poll_counter = 0 json_client = Client(cfg, json_handler, pulp_host=pulp_host) while True: ```
I thought that overriding the timeout was already built in?
For now we are going to set a skip
on this test until we have a fixture for it. (and prioritize this issue)
@omaciel I am no expert on pulp-smash but it looks like it's hardcoded. Let me know if I should open an issue to make it configurable.
If I remember correctly, this test takes a long time to complete because of CPU usage requirements. Pulp (or some component thereof) takes a long, long time to parse the metadata in the target repository.
The test passes locally, but takes long time to finish, we should find a way to make it faster (local fixture or smaller repo)
At the time that this test was written, I decided against adding a fixture to Pulp Fixtures. The reason is that there are no tools in the Fedora or RHEL repos for generating SUSE-compliant repositories, and simply copying files directly into Pulp Fixtures has a few additional drawbacks. The simplest solution to this issue is to find a smaller SUSE repository from which to sync, where that repository must have certain types of content. The more elegant solution is to generate a SUSE repository using official tooling as part of Pulp Fixtures.
Another concern (perhaps worth a separate issue) is that pulp-smash waits over an hour for the test to timeout:
First, know that the API client waits up to half an hour for any single asynchronous task to time out. That test case waited for over an hour because there were two separate tests.
Second, we can try shortening this timeout, but I have two concerns about such a change.
First, shortening the timeout might kill correctly-functioning tests. Pulp Smash initially had a timeout much shorter than 30 minutes, but correctly-functioning tests were being prematurely killed. While a half hour for a single test seems ridiculous, remember that the VMs we use for testing are puny compared to a typical developers' host. This is true whether the VMs come from OpenStack or from Travis.
Second, shortening the timeout will make it hard to write punishing tests. What if we want to write an internal test that syncs an entire RHEL repo? How long will that test take to complete?
Let's step back and think about what's going on here: Pulp Smash is waiting around for asynchronous tasks to complete because Pulp says that the asynchronous task is running. In other words, Pulp Smash believes Pulp, and that's a problem.
If we're going to make Pulp Smash better about dealing with hung tasks, I would suggest that we do so in as targeted a fashion as possible. @daviddavis suggestion of making the timeout short and adding a new parameter for customizing the timeout is a reasonable solution. But let's brainstorm for other solutions before adopting it and moving on. For example, what if Pulp Smash had this logic?
# pseudocode
elapsed_time_limit = 120 # seconds
while task['state'] not in TASK_END_STATES: # e.g. finished or failed
if elapsed_time() > elapsed_time_limit and task['state'] == 'WAITING':
raise TaskWaitingException(formatter(task))
sleep(1)
task = client.get(task)
The pseudocode above is predicated on the idea that when Pulp incorrectly states that a task is still running, it does so by leaving tasks in the WAITING state. If the WAITING state is an indicator that something is wrong, then let's take advantage of that info.
As an aside, know that the timeout logic is awful. It's a hack. From the source code:
The timeout counter is synchronous. We query Pulp, then count down, then query pulp, then count down, etc. This is… dumb.
This implementation issue means that timeouts may easily exceed 30 minutes. If we're trying to clamp down on excessive timeouts, it'd be great to fix this.
@goosemania
Adding Reference from @goosemania for updating the fixture:
the test :
is failing with:
when trying to sync from: https://download.opensuse.org/update/leap/42.3/oss/
The test passes locally, but takes long time to finish, we should find a way to make it faster (local fixture or smaller repo)
complete traceback: