pulp / pulpcore

Pulp 3 pulpcore package https://pypi.org/project/pulpcore/
GNU General Public License v2.0

Pulp sync gets stuck forever after specific connection issues #5439

Open hao-yu opened 3 months ago

hao-yu commented 3 months ago

Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=2247646

Description of problem: After specific temporary Sat<->Caps connection issues, the Capsule sync gets stuck forever, pending on artifact downloads that will never complete.

Particular scenarios for the connectivity issues are described in step 5 of the reproducer below.

I think the reproducer is applicable to Satellite itself (syncing from some upstream repo) as well, since the scenarios are quite generic - just restrict connectivity during a repo sync task.

Version-Release number of selected component (if applicable): Sat 6.12 / python39-pulpcore-3.18.16-1.el8pc.noarch (I expect any Sat / pulp3 version is affected)

How reproducible: 100%

Steps to Reproduce:

  1. Have a Capsule with Immediate download policy and no content downloaded.

  2. Have one CV with one bigger repo (say, RHEL9 BaseOS) so there is syncable content for the Capsule (optionally, do a Complete Sync of that one repo after purging all repos and content from the Capsule).

  3. Launch the Capsule (Complete) Sync.

  4. Check that the dynflow step tracks a pulp task that is downloading packages, like:

    message: Downloading Artifacts
    code: sync.downloading.artifacts
    state: running
    done: 259

  5. While the package download is progressing, run one of the scenarios below on the Capsule to mimic connectivity issues:

5a. drop packets to/from Satellite:

    SATELLITE=ip.address.of.satellite
    iptables -F
    iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
    iptables -A INPUT -p tcp -s $SATELLITE -j DROP

5b. reject packets to/from Satellite:

    SATELLITE=ip.address.of.satellite
    iptables -F
    iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
    iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

5c. apply rate limiting and drop packets over the limit:

    SATELLITE=ip.address.of.satellite
    LIMIT=10/s
    iptables -F
    iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
    iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
    iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
    iptables -A INPUT -p tcp -s $SATELLITE -j DROP

5d. apply rate limiting and reject packets over the limit:

    SATELLITE=ip.address.of.satellite
    LIMIT=10/s
    iptables -F
    iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
    iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
    iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
    iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

  6. Wait some time, at least until the Capsule logs something like:

    Nov 2 13:43:36 pmoravec-caps612 pulpcore-worker-3[72785]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pmoravec-sat612.domain.redhat.com:443 ssl:default [Connect call failed ('1.2.3.4', 443)])

Once that is logged, run "iptables -F" (to stop the temporary connectivity problem you are mimicking).

  7. Monitor the progress of the pulp task - it will now download packages at the usual high speed.

  8. Once the progress does not move for some time, check the pulp task status (a quick way to inspect it from the Capsule is sketched right after this list) - either it is done (as is the whole Caps sync), or it is stuck forever (waiting on the download of the few packages affected by the "Backing off download_wrapper" error, which will never complete).
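For step 8, one way to inspect the running pulp task and its progress directly on the Capsule is via pulpcore-manager shell. This is just a convenience sketch; the Task / ProgressReport field names are assumed from this pulpcore version:

    # run this inside: PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager shell
    from pulpcore.app.models import Task

    # list pulp tasks that are still running, together with their progress reports
    for task in Task.objects.filter(state="running"):
        print(task.pulp_id, task.name, task.state)
        for report in task.progress_reports.all():
            print("  ", report.message, report.state, report.done, "/", report.total)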

Actual results: Scenario 5b. leads to a failed Caps sync that is unable to download some package - that is expected. /var/log/messages contains a backtrace about it (BT in Additional info).

Scenarios 5a., 5c. and sometimes 5d. lead to a stuck Capsule sync. 5d. gets stuck only when iptables are flushed before the exception from "BT in Additional info" appears. If you are too slow, pulp detects too many connection problems and cancels the sync like in Scenario 5b. Per my tests, the BT from Additional info pops up 13 seconds after the "Backing off download_wrapper" message, so those 13s are the window of opportunity to hit the bug in Scenario 5d.

Scenarios 5a. and 5c. (packet drop) seem to get stuck every time (just flush iptables any time after the "Backing off .." error).

(I think Scenario 5b also means that if you trigger a Capsule Sync, restart some Satellite services and are unlucky, you hit exactly 5b and the Capsule gets stuck forever - that can be difficult to troubleshoot.)

Expected results: the pulp task should somehow terminate in a reasonable time in any Scenario. Even canceling the task due to temporary connection problems is definitely a better outcome than a Capsule sync stuck forever.

Additional info: BT from the pulp task giving up sync attempts (expected behaviour):

(from /var/log/messages on pmoravec-caps612, pulpcore-worker-4[72181], Nov 2 13:30:21)

    result = func(*args, **kwargs)
      File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 576, in synchronize
        repo_version = dv.create() or repo.latest_version()
      File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py", line 161, in create
        loop.run_until_complete(pipeline)
      File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
        return future.result()
      File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
        await asyncio.gather(*futures)
      File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
        await self.run()
      File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 183, in run
        pb.done += task.result()  # download_count
      File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 209, in _handle_content_unit
        await asyncio.gather(*downloaders_for_content)
      File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/models.py", line 89, in download
        download_result = await downloader.run(extra_data=self.extra_data)
      File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 273, in run
        return await download_wrapper()
      File "/usr/lib/python3.9/site-packages/backoff/_async.py", line 133, in retry
        ret = await target(*args, **kwargs)
      File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 258, in download_wrapper
        return await self._run(extra_data=extra_data)
      File "/usr/lib/python3.9/site-packages/pulp_rpm/app/downloaders.py", line 114, in _run
        async with self.session.get(
      File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 1138, in __aenter__
        self._resp = await self._coro
      File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 535, in _request
        conn = await self._connector.connect(
      File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
        proto = await self._create_connection(req, traces, timeout)
      File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
        _, proto = await self._create_direct_connection(req, traces, timeout)
      File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
        raise last_exc
      File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
        transp, proto = await self._wrap_create_connection(
      File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
        raise client_error(req.connection_key, exc) from exc

BT of a worker stuck forever in download attempts:

Thread 1 (Thread 0x7ff0931f1740

8 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328068b10, for file /usr/lib64/python3.9/selectors.py, line 469, in select

15 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280688e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 1869, in _run_once

22 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4469a0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 601, in run_forever

29 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4465e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 634, in run_until_complete

36 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a461dd0, for file /usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py, line 161, in create

43 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327fb7990, for file /usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py, line 576, in synchronize

51 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328051300, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 410, in _perform_task

59 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a431040, for file /usr/lib64/python3.9/multiprocessing/process.py, line 108, in run

66 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280509b0, for file /usr/lib64/python3.9/multiprocessing/process.py, line 315, in _bootstrap

75 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a492240, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 71, in _launch

82 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c8040, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 19, in __init__

95 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a496380, for file /usr/lib64/python3.9/multiprocessing/context.py, line 277, in _Popen

102 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4961e0, for file /usr/lib64/python3.9/multiprocessing/context.py, line 224, in _Popen

109 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a483d40, for file /usr/lib64/python3.9/multiprocessing/process.py, line 121, in start

116 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x55932803e2a0, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 298, in supervise_task

123 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a46d040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 366, in run_forever

130 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c7040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py, line 28, in worker

138 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a529840, for file /usr/lib/python3.9/site-packages/click/core.py, line 754, in invoke

148 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a5a9580, for file /usr/lib/python3.9/site-packages/click/core.py, line 1395, in invoke

155 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327e5b4f0, for file /usr/lib/python3.9/site-packages/click/core.py, line 1053, in main

165 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff083448b20, for file /usr/lib/python3.9/site-packages/click/core.py, line 1128, in __call__

(Some "futures" object stays incomplete forever due to an improperly caught exception..?)
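To illustrate the suspected failure mode with a standalone sketch (plain asyncio, not pulpcore code): if one awaitable handed to asyncio.gather() never finishes and never raises, gather() itself never returns, which would look exactly like a sync task that neither fails nor makes progress:

    import asyncio

    async def finishes_quickly(n):
        await asyncio.sleep(0.1)
        return n

    async def never_finishes():
        # stands in for a read on a connection whose packets are silently dropped
        await asyncio.Event().wait()

    async def main():
        try:
            # without the outer timeout this gather() would hang forever,
            # just like the stuck "Downloading Artifacts" stage
            await asyncio.wait_for(
                asyncio.gather(finishes_quickly(1), finishes_quickly(2), never_finishes()),
                timeout=2,
            )
        except asyncio.TimeoutError:
            print("gather() did not complete - one awaitable is stuck")

    asyncio.run(main())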

hao-yu commented 3 months ago

It appears to hang in "pulpcore/download/http.py" in the following loop:

        while True:
            chunk = await response.content.read(1048576)  # 1 megabyte   <===============
            if not chunk:
                await self.finalize()
                break  # the download is done
            await self.handle_data(chunk)

I think it hangs forever trying to stream the data after the packets have been dropped. The ClientTimeout object apparently is not honored here, because the code is reading data from another object (the aiohttp stream reader, see https://docs.aiohttp.org/en/stable/streams.html).
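For context, the aiohttp knobs involved live on ClientTimeout. This is only a sketch, not pulpcore's exact session setup - pulpcore derives these values from the remote's timeout settings, and the example numbers here are made up:

    import aiohttp

    async def make_session():
        # if sock_read is left as None, nothing bounds how long a single
        # response.content.read() may wait for the next chunk of data
        timeout = aiohttp.ClientTimeout(
            total=None,       # no overall cap on the whole request
            sock_connect=30,  # example: bound establishing the TCP/TLS connection
            sock_read=None,   # example: unbounded body reads (the problematic case)
        )
        return aiohttp.ClientSession(timeout=timeout)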

I only tested scenario 5a from the description (comment #0). I am able to avoid the issue after changing the code as below and setting a sock_read_timeout (such as 20 seconds) for all the remotes.

        while True:
            chunk = await asyncio.wait_for(response.content.read(1048576), timeout=self.session.timeout.sock_read)  # 1 megabyte
            if not chunk:

This should raise a timeout for the stalled data stream and let the download be retried.
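For completeness, here is a self-contained sketch of such a guarded read loop (my sketch, not an actual pulpcore patch; the helper name and the fallback timeout are made up, and whether asyncio.TimeoutError actually gets retried depends on which exceptions the surrounding download_wrapper()/backoff logic handles):

    import asyncio
    import aiohttp

    async def read_body_with_read_timeout(response: aiohttp.ClientResponse, handle_chunk, read_timeout: float = 600.0):
        """Read the response body in 1 MB chunks, bounding each read so a silently
        dropped connection raises asyncio.TimeoutError instead of hanging forever."""
        while True:
            chunk = await asyncio.wait_for(response.content.read(1048576), timeout=read_timeout)
            if not chunk:
                break  # the download is done
            await handle_chunk(chunk)

Even if asyncio.TimeoutError is not in the set of retried exceptions, the task would then fail instead of hanging, which is still closer to the expected behaviour.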

I wrote the following script to test:

cd /tmp             
PULP_SETTINGS=/etc/pulp/settings.py pulpcore-manager shell
import os 
import asyncio
import shutil
from pulp_rpm.app.shared_utils import urlpath_sanitize 
from pulp_rpm.app.models import RpmRemote
from pulpcore.app.models import Remote
remote = Remote.objects.prefetch_related("remoteartifact_set").get(url="https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os")
rpm_remote = RpmRemote.objects.get(pk=remote.pulp_id)
temp_dir = "/tmp/pulp_download_tests"
if os.path.isdir(temp_dir):
    shutil.rmtree(temp_dir)

os.mkdir(temp_dir)
os.chdir(temp_dir)
tasks = []
async def main():
    async def check_tasks():
        d = remote.get_downloader(url="https://test/")
        while(True):
            print("Semaphore: %s" % d.semaphore)
            incompleted = []
            for t in tasks:
                if not t.done():
                    incompleted.append(t)
            if len(incompleted) == 0:
                break
            print("Incompleted task count: %d" % len(incompleted))
            await asyncio.sleep(5)
    async def download(url):
        await remote.get_downloader(url=url).run()
    for ra in remote.remoteartifact_set.all()[:100]:
        tasks.append(asyncio.create_task(download(ra.url)))
    all_tasks = tasks + [asyncio.create_task(check_tasks())]
    await asyncio.gather(*all_tasks)

asyncio.run(main())
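When the hang reproduces, it may also help to see where the stuck coroutines are suspended. A hypothetical addition to check_tasks() above, using the standard asyncio.Task.print_stack() API:

    # inside check_tasks(), right after the `incompleted` list is built:
    for t in incompleted:
        t.print_stack(limit=5)  # show the frame where each stuck coroutine is suspended

In the stuck case this should point at the response.content.read(...) line quoted above.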

And then open another window and block the traffic to the CDN:

CDN=cdn.redhat.com
iptables -F
iptables -A OUTPUT -p tcp -d $CDN -j DROP
iptables -A INPUT -p tcp -s $CDN -j DROP

ahumbe commented 2 months ago

@dralley Hi, many customers are hitting this issue (linked to the Jira: https://issues.redhat.com/browse/SAT-21126). It would be great if we could prioritize it and address it in 6.15.z and 6.16.0.