Timeout error on OpenSuseErrataTestCase sync operation

rochacbruno commented 6 years ago

the test :

tests/rpm/api_v2/test_updateinfo.py::OpenSuseErrataTestCase::test_01_create_sync_publish <- ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_2_tests/tests/rpm/api_v2/test_updateinfo.py FAILED

is failing with:

pulp_smash.exceptions.TaskTimedOutError: Task /pulp/api/v2/tasks/007c7852-068f-49a3-b8e9-63a38bffc7d0/ is ongoing after 900 polls.

when trying to sync from: https://download.opensuse.org/update/leap/42.3/oss/

The test passes locally, but takes long time to finish, we should find a way to make it faster (local fixture or smaller repo)

complete traceback:

```pytb 17:58:14 =================================== FAILURES =================================== 17:58:14 ______________ SyncInvalidFeedTestCase.test_task_error_traceback _______________ 17:58:14 17:58:14 self = 17:58:14 17:58:14 def test_task_error_traceback(self): 17:58:14 """Assert the task's "error" and "traceback" fields are non-null.""" 17:58:14 for key in {'error', 'traceback'}: 17:58:14 with self.subTest(key=key): 17:58:14 > self.assertIsNotNone(self.tasks[0][key]) 17:58:14 E AssertionError: unexpectedly None 17:58:14 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_2_tests/tests/puppet/api_v2/test_sync_publish.py:213: AssertionError 17:58:14 ______________ OpenSuseErrataTestCase.test_01_create_sync_publish ______________ 17:58:14 17:58:14 self = 17:58:14 17:58:14 def test_01_create_sync_publish(self): 17:58:14 """Create, sync, and publish an openSUSE repository.""" 17:58:14 cfg = config.get_config() 17:58:14 if not selectors.bug_is_fixed(3377, cfg.pulp_version): 17:58:14 self.skipTest('https://pulp.plan.io/issues/3377') 17:58:14 17:58:14 # Create, sync, and publish a repository. 17:58:14 client = api.Client(cfg, api.json_handler) 17:58:14 body = gen_repo() 17:58:14 body['importer_config']['download_policy'] = 'on_demand' 17:58:14 body['importer_config']['feed'] = OPENSUSE_FEED_URL 17:58:14 body['distributors'] = [gen_distributor()] 17:58:14 repo = client.post(REPOSITORY_PATH, body) 17:58:14 self.addCleanup(client.delete, repo['_href']) 17:58:14 > sync_repo(cfg, repo) 17:58:14 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_2_tests/tests/rpm/api_v2/test_updateinfo.py:621: 17:58:14 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_smash/pulp2/utils.py:483: in sync_repo 17:58:14 return api.Client(cfg).post(urljoin(repo['_href'], 'actions/sync/')) 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_smash/api.py:443: in post 17:58:14 return self.request('POST', url, **kwargs) 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_smash/api.py:481: in request 17:58:14 requests.request(method, **request_kwargs), 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_smash/api.py:135: in safe_handler 17:58:14 _handle_202(client._cfg, response, client.pulp_host) # pylint:disable=protected-access 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_smash/api.py:83: in _handle_202 17:58:14 tasks = tuple(poll_spawned_tasks(cfg, call_report, pulp_host)) 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_smash/api.py:507: in poll_spawned_tasks 17:58:14 for final_task_state in poll_task(cfg, href, pulp_host): 17:58:14 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 17:58:14 17:58:14 cfg = PulpSmashConfig(pulp_version='2.16.4', pulp_auth=['admin', 'admin'], hosts=[PulpHost(hostname='host-172-16-46-13.opens...erify': '/etc/pki/ca-trust/source/anchors/pulp-cacert.pem'}, 'pulp resource manager': {}})], pulp_selinux_enabled=True) 17:58:14 href = '/pulp/api/v2/tasks/007c7852-068f-49a3-b8e9-63a38bffc7d0/' 17:58:14 pulp_host = PulpHost(hostname='host-172-16-46-13.openstacklocal', roles={'amqp broker': {'service': 'qpidd'}, 'pulp cli': {}, 'pul...'api': {'scheme': 'https', 'verify': '/etc/pki/ca-trust/source/anchors/pulp-cacert.pem'}, 'pulp resource manager': {}}) 17:58:14 17:58:14 def poll_task(cfg, href, pulp_host=None): 17:58:14 """Wait for a task and its children to complete. Yield response bodies. 17:58:14 17:58:14 Poll the task at ``href``, waiting for the task to complete. When a 17:58:14 response is received indicating that the task is complete, yield that 17:58:14 response body and recursively poll each child task. 17:58:14 17:58:14 :param cfg: A :class:`pulp_smash.config.PulpSmashConfig` object. 17:58:14 :param href: The path to a task you'd like to monitor recursively. 17:58:14 :param pulp_host: The host to poll. If ``None``, a host will automatically 17:58:14 be selected by :class:`Client`. 17:58:14 :returns: An generator yielding response bodies. 17:58:14 :raises pulp_smash.exceptions.TaskTimedOutError: If a task takes too 17:58:14 long to complete. 17:58:14 """ 17:58:14 # 900 * 2s == 1800s == 30m 17:58:14 # NOTE: The timeout counter is synchronous. We query Pulp, then count down, 17:58:14 # then query pulp, then count down, etc. This is… dumb. 17:58:14 poll_limit = 900 17:58:14 poll_counter = 0 17:58:14 json_client = Client(cfg, json_handler, pulp_host=pulp_host) 17:58:14 while True: 17:58:14 task = json_client.get(href) 17:58:14 if cfg.pulp_version < Version('3'): 17:58:14 task_end_states = _TASK_END_STATES 17:58:14 else: 17:58:14 task_end_states = _P3_TASK_END_STATES 17:58:14 if task['state'] in task_end_states: 17:58:14 # This task has completed. Yield its final state, then recursively 17:58:14 # iterate through children and yield their final states. 17:58:14 yield task 17:58:14 for spawned_task in task['spawned_tasks']: 17:58:14 for descendant_tsk in poll_task(cfg, spawned_task['_href'], pulp_host): 17:58:14 yield descendant_tsk 17:58:14 break 17:58:14 poll_counter += 1 17:58:14 if poll_counter > poll_limit: 17:58:14 raise exceptions.TaskTimedOutError( 17:58:14 > 'Task {} is ongoing after {} polls.'.format(href, poll_limit) 17:58:14 ) 17:58:14 E pulp_smash.exceptions.TaskTimedOutError: Task /pulp/api/v2/tasks/007c7852-068f-49a3-b8e9-63a38bffc7d0/ is ongoing after 900 polls. 17:58:14 17:58:14 ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_smash/api.py:549: TaskTimedOutError ```

rochacbruno commented 6 years ago

More history reference about this issue: https://github.com/PulpQE/pulp-smash/issues/976#issuecomment-384420122

daviddavis commented 6 years ago

Another concern (perhaps worth a separate issue) is that pulp-smash waits over an hour for the test to timeout:

21:36:14 tests/rpm/api_v2/test_updateinfo.py::OpenSuseErrataTestCase::test_01_create_sync_publish <- ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_2_tests/tests/rpm/api_v2/test_updateinfo.py FAILED [ 94%]
22:38:08 tests/rpm/api_v2/test_updateinfo.py::OpenSuseErrataTestCase::test_02_relogin_suggested <- ../../.virtualenvs/pulp-2-tests/lib64/python3.5/site-packages/pulp_2_tests/tests/rpm/api_v2/test_updateinfo.py SKIPPED [ 94%]

Edit: What about something like this where poll_limit is set aggressively but tests can override the default:

```diff diff --git a/pulp_smash/api.py b/pulp_smash/api.py index 44fbb2e..1bd0236 100644 --- a/pulp_smash/api.py +++ b/pulp_smash/api.py @@ -482,7 +482,7 @@ class Client(): ) -def poll_spawned_tasks(cfg, call_report, pulp_host=None): +def poll_spawned_tasks(cfg, call_report, pulp_host=None, poll_limit=100): """Recursively wait for spawned tasks to complete. Yield response bodies. Recursively wait for each of the spawned tasks listed in the given `call @@ -493,6 +493,7 @@ def poll_spawned_tasks(cfg, call_report, pulp_host=None): :param call_report: A dict-like object with a `call report`_ structure. :param pulp_host: The host to poll. If ``None``, a host will automatically be selected by :class:`Client`. + :param poll_limit int: A limit to the number of times Pulp is polled (default is 100) :returns: A generator yielding task bodies. :raises: Same as :meth:`poll_task`. @@ -504,11 +505,11 @@ def poll_spawned_tasks(cfg, call_report, pulp_host=None): else: hrefs = [call_report['_href']] for href in hrefs: - for final_task_state in poll_task(cfg, href, pulp_host): + for final_task_state in poll_task(cfg, href, pulp_host, poll_limit): yield final_task_state -def poll_task(cfg, href, pulp_host=None): +def poll_task(cfg, href, pulp_host=None, poll_limit=100): """Wait for a task and its children to complete. Yield response bodies. Poll the task at ``href``, waiting for the task to complete. When a @@ -519,14 +520,14 @@ def poll_task(cfg, href, pulp_host=None): :param href: The path to a task you'd like to monitor recursively. :param pulp_host: The host to poll. If ``None``, a host will automatically be selected by :class:`Client`. + :param poll_limit int: A limit to the number of times Pulp is polled (default is 100) :returns: An generator yielding response bodies. :raises pulp_smash.exceptions.TaskTimedOutError: If a task takes too long to complete. """ - # 900 * 2s == 1800s == 30m + # for poll_limit of 100: 100 * 2s == 200s == 3.3m # NOTE: The timeout counter is synchronous. We query Pulp, then count down, # then query pulp, then count down, etc. This is… dumb. - poll_limit = 900 poll_counter = 0 json_client = Client(cfg, json_handler, pulp_host=pulp_host) while True: ```

omaciel commented 6 years ago

I thought that overriding the timeout was already built in?

rochacbruno commented 6 years ago

For now we are going to set a skip on this test until we have a fixture for it. (and prioritize this issue)

daviddavis commented 6 years ago

@omaciel I am no expert on pulp-smash but it looks like it's hardcoded. Let me know if I should open an issue to make it configurable.

Ichimonji10 commented 6 years ago

If I remember correctly, this test takes a long time to complete because of CPU usage requirements. Pulp (or some component thereof) takes a long, long time to parse the metadata in the target repository.

The test passes locally, but takes long time to finish, we should find a way to make it faster (local fixture or smaller repo)

At the time that this test was written, I decided against adding a fixture to Pulp Fixtures. The reason is that there are no tools in the Fedora or RHEL repos for generating SUSE-compliant repositories, and simply copying files directly into Pulp Fixtures has a few additional drawbacks. The simplest solution to this issue is to find a smaller SUSE repository from which to sync, where that repository must have certain types of content. The more elegant solution is to generate a SUSE repository using official tooling as part of Pulp Fixtures.

Ichimonji10 commented 6 years ago

Another concern (perhaps worth a separate issue) is that pulp-smash waits over an hour for the test to timeout:

First, know that the API client waits up to half an hour for any single asynchronous task to time out. That test case waited for over an hour because there were two separate tests.

Second, we can try shortening this timeout, but I have two concerns about such a change.

First, shortening the timeout might kill correctly-functioning tests. Pulp Smash initially had a timeout much shorter than 30 minutes, but correctly-functioning tests were being prematurely killed. While a half hour for a single test seems ridiculous, remember that the VMs we use for testing are puny compared to a typical developers' host. This is true whether the VMs come from OpenStack or from Travis.

Second, shortening the timeout will make it hard to write punishing tests. What if we want to write an internal test that syncs an entire RHEL repo? How long will that test take to complete?

Let's step back and think about what's going on here: Pulp Smash is waiting around for asynchronous tasks to complete because Pulp says that the asynchronous task is running. In other words, Pulp Smash believes Pulp, and that's a problem.

If we're going to make Pulp Smash better about dealing with hung tasks, I would suggest that we do so in as targeted a fashion as possible. @daviddavis suggestion of making the timeout short and adding a new parameter for customizing the timeout is a reasonable solution. But let's brainstorm for other solutions before adopting it and moving on. For example, what if Pulp Smash had this logic?

# pseudocode
elapsed_time_limit = 120  # seconds
while task['state'] not in TASK_END_STATES:  # e.g. finished or failed
    if elapsed_time() > elapsed_time_limit and task['state'] == 'WAITING':
        raise TaskWaitingException(formatter(task))
    sleep(1)
    task = client.get(task)

The pseudocode above is predicated on the idea that when Pulp incorrectly states that a task is still running, it does so by leaving tasks in the WAITING state. If the WAITING state is an indicator that something is wrong, then let's take advantage of that info.

As an aside, know that the timeout logic is awful. It's a hack. From the source code:

The timeout counter is synchronous. We query Pulp, then count down, then query pulp, then count down, etc. This is… dumb.

This implementation issue means that timeouts may easily exceed 30 minutes. If we're trying to clamp down on excessive timeouts, it'd be great to fix this.

rochacbruno commented 5 years ago

@goosemania

bherrin3 commented 5 years ago

Adding Reference from @goosemania for updating the fixture:

https://github.com/openSUSE/libzypp/blob/master/zypp/parser/yum/schema/updateinfo.rnc

pulp / Pulp-2-Tests

Timeout error on OpenSuseErrataTestCase sync operation #6