pulp / pulp_python

A Pulp plugin to support Python packages
GNU General Public License v2.0

Replicate feature is not processing a pull-through cache of pypi #718

Open PotentialIngenuity opened 1 month ago

PotentialIngenuity commented 1 month ago

Version: core 3.57.0, python 3.12.1

Describe the bug I am trying to use the /replicate/ endpoint to sync a pull-through cache to another Pulp instance. After the replication is done, I see no reference to any Python resources in the logs or through the API.

pulp python distribution list
[
  {
    "pulp_href": "/pulp/api/v3/distributions/python/pypi/018dd189-f781-7ee0-a6de-b93d044dce0b/",
    "pulp_created": "2024-02-22T15:58:16.450965Z",
    "pulp_last_updated": "2024-02-22T15:58:20.737703Z",
    "base_path": "pypi-mirror",
    "base_url": "https://pulp-api/pypi/pypi-mirror/",
    "content_guard": null,
    "hidden": false,
    "pulp_labels": {
      "content_type": "pypi-mirror"
    },
    "name": "pypi-mirror",
    "repository": null,
    "publication": null,
    "allow_uploads": true,
    "remote": "/pulp/api/v3/remotes/python/python/018dcdde-5748-7239-8e91-79ee820fec0d/"
  }
]
[
  {
    "pulp_href": "/pulp/api/v3/remotes/python/python/018dcdde-5748-7239-8e91-79ee820fec0d/",
    "pulp_created": "2024-02-21T22:51:57.129136Z",
    "pulp_last_updated": "2024-02-21T22:51:57.129154Z",
    "name": "pypi-mirror",
    "url": "https://pypi.org",
    "ca_cert": null,
    "client_cert": null,
    "tls_validation": true,
    "proxy_url": null,
    "pulp_labels": {},
    "download_concurrency": null,
    "max_retries": null,
    "policy": "on_demand",
    "total_timeout": null,
    "connect_timeout": null,
    "sock_connect_timeout": null,
    "sock_read_timeout": null,
    "headers": null,
    "rate_limit": null,
    "hidden_fields": [
      {
        "name": "client_key",
        "is_set": false
      },
      {
        "name": "proxy_username",
        "is_set": false
      },
      {
        "name": "proxy_password",
        "is_set": false
      },
      {
        "name": "username",
        "is_set": false
      },
      {
        "name": "password",
        "is_set": false
      }
    ],
    "includes": [],
    "excludes": [],
    "prereleases": false,
    "package_types": [],
    "keep_latest_packages": 0,
    "exclude_platforms": []
  }
]

To Reproduce Set up a pull-through cache and try to replicate it.

Expected behavior All python resources are created on the 2nd pulp instance

Additional context

lubosmj commented 1 month ago

I guess the main issue here is that distributions which have pull-through caching enabled contain an empty index that is supposed to list all available packages. These packages are not part of any repository. Therefore, once a replica hits the base path of a distribution, it gets nothing to work with and ignores it. Have you tried manually accessing the index via cURL or HTTPie?

cc: @gerrod3

gerrod3 commented 1 month ago

Not being able to sync from a pull-through cache was an intentional design choice when I implemented pull-through caching. I expected that most people would set their pull-through to PyPI, and thus syncing a pull-through cache would sync all of PyPI, which would be terribly slow because of how large PyPI is. Maybe it would make sense to allow syncing from a pull-through if the user supplies an includes list on their remote pointing to the pull-through; that way we know we aren't trying to sync all of PyPI.

As for the replicate feature I see two potential options for dealing with upstream pull-through distributions:

  1. Simply ignore distributions that only have a remote and no backing repository/publication since we can't sync from them.
  2. Recreate the pull-through distribution on the downstream without trying to sync from it. Now would it make sense for the downstream pull-through to point to the upstream pull-through or to point to the actual cache source? I don't know.
ipanova commented 1 month ago

probably option 1 makes more sense

lubosmj commented 1 month ago

Same for me. Replicating is about replicating content from Pulp, not entity objects.

PotentialIngenuity commented 1 month ago

What are my options right now to get the cached content to a secondary pulp instance? Does pulp_python have a copy function like rpm does? https://pulpproject.org/pulp_rpm/restapi/#tag/Rpm:-Copy

gerrod3 commented 1 month ago

Well currently your cached content is probably just living as orphaned content if you haven't been adding them to a repository. I would get a list of them first, then add them to a repo using the modify endpoint. https://pulpproject.org/pulp_python/restapi/#tag/Repositories:-Python/operation/repositories_python_python_modify

ORPHANS=$(http "https://pulp-api/pulp/api/v3/content/python/packages/?orphaned_for=0" | jq -jc '[.results[].pulp_href]')
http POST "${REPO_HREF}modify/" add_content_units:="$ORPHANS"
PotentialIngenuity commented 3 weeks ago

Thank you. That works well, but the results are paginated. Using the limit query doesn't seem to have any effect.

http --verify false GET https://localhost/pulp/api/v3/content/python/packages/ limit:=999999999

{
    "count": 335036,
    "next": "http://localhost/pulp/api/v3/content/python/packages/?limit=100&offset=100",
    "previous": null,
    "results": [
        {....
lubosmj commented 3 weeks ago

You have to append the query options to the URL, quoting the ampersand (or the whole URL) so the shell doesn't interpret it, like so: https...packages/?offset=0'&'limit=9999.
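For reference, the `limit:=...` form in the earlier command sends a JSON body field rather than a query parameter (HTTPie uses `==` for query strings). In Python, the query string can be built explicitly:

```python
from urllib.parse import urlencode

# Build the paginated listing URL with explicit query parameters.
base = "https://localhost/pulp/api/v3/content/python/packages/"
query = urlencode({"offset": 0, "limit": 9999})
url = f"{base}?{query}"
# -> https://localhost/pulp/api/v3/content/python/packages/?offset=0&limit=9999
```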

PotentialIngenuity commented 3 weeks ago

Thank you. The replication doesn't pick up the correct URL to sync from. It should be looking at the upstream URL: https://repo.company.com/pypi/python-pypi-freeze

pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulpcore.tasking.tasks:INFO: Starting task 019151cc-d244-75d6-aa21-d7c196ffeb6f
pulp [a0b73760ce164b47bc3cb0b7a696f924]: bandersnatch:INFO: Initialized release plugin blocklist_release, filtering []
pulp [a0b73760ce164b47bc3cb0b7a696f924]: bandersnatch.mirror:INFO: Syncing with https://pulp-api/pypi/python-pypi-freeze.
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulp_python.app.tasks.sync:INFO: Attempt 0 to get package list from https://pulp-api/pypi/python-pypi-freeze
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulp_python.app.tasks.sync:INFO: Syncing all packages.
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulp_python.app.tasks.sync:INFO: Attempt 1 to get package list from https://pulp-api/pypi/python-pypi-freeze
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulp_python.app.tasks.sync:INFO: Syncing all packages.
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulp_python.app.tasks.sync:INFO: Attempt 2 to get package list from https://pulp-api/pypi/python-pypi-freeze
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulp_python.app.tasks.sync:INFO: Syncing all packages.
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulp_python.app.tasks.sync:INFO: Failed to get package list using XMLRPC, trying parse simple page.
Backing off download_wrapper(...) for 0.4s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv4', 443)])
pulp [a0b73760ce164b47bc3cb0b7a696f924]: backoff:INFO: Backing off download_wrapper(...) for 0.4s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv4', 443)])
Backing off download_wrapper(...) for 1.9s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv6', 443, 0, 0)])
pulp [a0b73760ce164b47bc3cb0b7a696f924]: backoff:INFO: Backing off download_wrapper(...) for 1.9s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv6', 443, 0, 0)])
Backing off download_wrapper(...) for 3.7s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv6', 443, 0, 0)])
pulp [a0b73760ce164b47bc3cb0b7a696f924]: backoff:INFO: Backing off download_wrapper(...) for 3.7s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv6', 443, 0, 0)])
Giving up download_wrapper(...) after 4 tries (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv4', 443)])
pulp [a0b73760ce164b47bc3cb0b7a696f924]: backoff:ERROR: Giving up download_wrapper(...) after 4 tries (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv4', 443)])
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulpcore.tasking.tasks:INFO: Task 019151cc-d244-75d6-aa21-d7c196ffeb6f failed (Cannot connect to host pulp-api:443 ssl:default [Connect call failed ('IPv4', 443)])
pulp [a0b73760ce164b47bc3cb0b7a696f924]: pulpcore.tasking.tasks:INFO:   File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/tasks.py", line 75, in _execute_task
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/tasks/sync.py", line 61, in sync
    DeclarativeVersion(first_stage, repository, mirror).create()

  File "/usr/local/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py", line 161, in create
    loop.run_until_complete(pipeline)

  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/usr/local/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 220, in create_pipeline
    await asyncio.gather(*futures)

  File "/usr/local/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 41, in __call__
    await self.run()

  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/tasks/sync.py", line 152, in run
    await pmirror.synchronize(packages_to_sync)

  File "/usr/local/lib/python3.9/site-packages/bandersnatch/mirror.py", line 65, in synchronize
    await self.determine_packages_to_sync()

  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/tasks/sync.py", line 227, in determine_packages_to_sync
    result = await downloader.run()

  File "/usr/local/lib/python3.9/site-packages/pulpcore/download/http.py", line 269, in run
    return await download_wrapper()

  File "/usr/local/lib/python3.9/site-packages/backoff/_async.py", line 151, in retry
    ret = await target(*args, **kwargs)

  File "/usr/local/lib/python3.9/site-packages/pulpcore/download/http.py", line 254, in download_wrapper
    return await self._run(extra_data=extra_data)

  File "/usr/local/lib/python3.9/site-packages/pulpcore/download/http.py", line 287, in _run
    async with self.session.get(

  File "/usr/local/lib64/python3.9/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro

  File "/usr/local/lib64/python3.9/site-packages/aiohttp/client.py", line 581, in _request
    conn = await self._connector.connect(

  File "/usr/local/lib64/python3.9/site-packages/aiohttp/connector.py", line 544, in connect
    proto = await self._create_connection(req, traces, timeout)

  File "/usr/local/lib64/python3.9/site-packages/aiohttp/connector.py", line 944, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)

  File "/usr/local/lib64/python3.9/site-packages/aiohttp/connector.py", line 1257, in _create_direct_connection
    raise last_exc

  File "/usr/local/lib64/python3.9/site-packages/aiohttp/connector.py", line 1226, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(

  File "/usr/local/lib64/python3.9/site-packages/aiohttp/connector.py", line 1033, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
ggainey commented 3 weeks ago

Redacted ip4/6 addrs from prev comment - prob not a serious exposure, but seemed the polite thing to do :)

PotentialIngenuity commented 3 weeks ago

I made sure I have PYPI_API_HOSTNAME set, but the replicate still uses pulp-api.

gerrod3 commented 3 weeks ago

On your upstream Pulp, when you list your distributions, what is the hostname in base_url? If it doesn't match what you set PYPI_API_HOSTNAME to, then you need to restart your Pulp for the setting change to take effect.

Another possibility is that your downstream Pulp hasn't updated the remote's URL that it uses to sync. Check the url field of the remotes on this system and see if the hostname is correct. I'm pretty sure the replicate task is supposed to update the remote automatically, but if it isn't, you can delete the remotes and rerun the task to have it recreate them.
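A quick, illustrative way to sanity-check a distribution's base_url hostname against the value you set for PYPI_API_HOSTNAME (the function name here is made up for the example):

```python
from urllib.parse import urlsplit

def hostname_matches(base_url, expected_hostname):
    """True if the distribution's base_url points at the expected host."""
    return urlsplit(base_url).hostname == expected_hostname

# Against the distribution listing shown earlier in this thread:
hostname_matches("https://pulp-api/pypi/pypi-mirror/", "pulp-api")  # True
```

If this returns False for your PYPI_API_HOSTNAME value, the setting likely hasn't taken effect yet.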

PotentialIngenuity commented 2 weeks ago

I have made a lot of progress in the right direction. I am getting this internal server error during the replicate process now.

Worker log on downstream

pulp [d45a02c66941451b8e738701d2c0d14c]: bandersnatch.package:INFO: Fetching metadata for package: 667bot (serial 0)
pulp [d45a02c66941451b8e738701d2c0d14c]: pulp_python.app.tasks.sync:ERROR: Sync encountered an error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/bandersnatch/mirror.py", line 129, in package_syncer
    await package.update_metadata(self.master, attempts=3)
  File "/usr/local/lib/python3.9/site-packages/bandersnatch/package.py", line 61, in update_metadata
    self._metadata = await master.get_package_metadata(
  File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 220, in get_package_metadata
    metadata_response = await metadata_generator.asend(None)
  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/tasks/sync.py", line 170, in get
    async for r in super().get(path, required_serial, **kw):
  File "/usr/local/lib/python3.9/site-packages/bandersnatch/master.py", line 132, in get
    async with self.session.get(path, **kw) as r:
  File "/usr/local/lib64/python3.9/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro
  File "/usr/local/lib64/python3.9/site-packages/aiohttp/client.py", line 696, in _request
    resp.raise_for_status()
  File "/usr/local/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1070, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://<upstream_pulp>/pypi/python-pypi-freeze/pypi/5kodds-distribution/json/')

api logs on upstream

pulp [f00c10e8c5f14d4c959401c1bdbbc3d6]: django.request:ERROR: Internal Server Error: /pypi/python-pypi-freeze/pypi/5kodds-distribution/json/
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.9/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/views/decorators/csrf.py", line 56, in wrapper_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/viewsets.py", line 124, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 509, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 469, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
    raise exc
  File "/usr/local/lib/python3.9/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/pypi/views.py", line 335, in retrieve
    json_body = python_content_to_json(
  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/utils.py", line 170, in python_content_to_json
    full_metadata.update({"releases": python_content_to_releases(content_query, base_path, domain)})
  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/utils.py", line 254, in python_content_to_releases
    python_content_to_download_info(content, base_path, domain)
  File "/usr/local/lib/python3.9/site-packages/pulp_python/app/utils.py", line 289, in python_content_to_download_info
    "digests": {"md5": artifact.md5, "sha256": artifact.sha256},
AttributeError: 'NoneType' object has no attribute 'md5'
PotentialIngenuity commented 2 weeks ago

Here is the python package.

{
    "count": 1,
    "next": null,
    "previous": null,
    "results": [
        {
            "artifact": null,
            "author": "",
            "author_email": "",
            "classifiers": "[]",
            "description": "",
            "description_content_type": "",
            "download_url": "",
            "filename": "5kodds_distribution-0.1.tar.gz",
            "home_page": "",
            "keywords": "",
            "license": "",
            "maintainer": "",
            "maintainer_email": "",
            "metadata_version": "",
            "name": "5kodds-distribution",
            "obsoletes_dist": "[]",
            "packagetype": "sdist",
            "platform": "",
            "project_url": "https://pypi.org/project/5kodds-distribution/",
            "project_urls": "null",
            "provides_dist": "[]",
            "pulp_created": "2024-02-21T22:14:47.138060Z",
            "pulp_href": "/pulp/api/v3/content/python/packages/018dcdbc-4402-734f-a849-e2711f4b0ff5/",
            "pulp_last_updated": "2024-02-21T22:14:47.138070Z",
            "requires_dist": "null",
            "requires_external": "[]",
            "requires_python": "",
            "sha256": "cbfc05b303b388baaf421e34b88c336cab427e64afcb1f344afa10aef03d9d64",
            "summary": "Gaussian distributions",
            "supported_platform": "",
            "version": "0.1"
        }
    ]
}
gerrod3 commented 1 week ago

I see. Replicating an upstream repository that is on-demand fails because we don't have the artifact to get the md5 hash from. This is definitely a bug (and I'll try to find a fix for it), but I would advise against replicating on-demand repositories, since replicate currently always syncs from upstream using an immediate policy. This means every on-demand package will also have to be downloaded on the upstream during a sync.
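For illustration, the traceback above dies because `python_content_to_download_info` dereferences `artifact.md5` when on-demand content has no downloaded artifact (`artifact` is None). A defensive sketch of the idea, not the actual pulp_python patch, with the helper name invented for this example:

```python
def download_digests(artifact, content_sha256):
    """Build the 'digests' dict, tolerating a missing (on-demand) artifact."""
    if artifact is not None:
        # Immediate-policy content: the artifact carries all digests.
        return {"md5": artifact.md5, "sha256": artifact.sha256}
    # On-demand content: only the sha256 recorded at sync time is available
    # (it is stored on the content unit itself, as in the listing above).
    return {"sha256": content_sha256}
```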