pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Problem with metadata fetching #948

Open Lulu300 opened 3 years ago

Lulu300 commented 3 years ago

I previously used Bandersnatch 4.4 to mirror specific packages, and it worked perfectly. Since I updated to version 5.0, it still downloads only the package releases in my allowlist, but it downloads the metadata of all packages, regardless of my allowlist. I looked in the documentation, but I don't understand how to prevent it from downloading metadata for packages that I don't want to mirror. Can you help me, please?

Here's my config:

[plugins]
enabled =
    blocklist_project
    blocklist_release
    whitelist_project
    allowlist_release
    exclude_platform
[blocklist]
plugins =
    exclude_platform
platforms =
    windows
    macos
    freebsd
[allowlist]
packages =
    altgraph>=0.17
    ansible>=2.9.12
    asn1crypto>=0.24.0
    bcrypt>=3.2.0
    Cerberus>=1.3.2
    certifi>=2018.8.24
    cffi>=1.14.3
    chardet>=3.0.4
    chrome-gnome-shell>=0.0.0
    colorclass>=2.2.0
    cryptography>=2.6.1
    cupshelpers>=1.0
    distro>=1.3.0
    distro-info>=0.21
    easygui>=0.98.1
    entrypoints>=0.3
    httplib2>=0.11.3
    idna>=2.6
    importlib-metadata>=1.7.0
    Jinja2>=2.11.2
    jmespath>=0.10.0
    jsonpickle>=1.4.1
    keyring>=17.1.1
    keyrings.alt>=3.1.1
    MarkupSafe>=1.1.1
    msoffcrypto-tool>=4.10.2
    olefile>=0.46
    oletools>=0.55.1
    paho-mqtt>=1.5.0
    paramiko>=2.7.2
    pcodedmp>=1.2.6
    pip>=18.1
    prometheus-client>=0.8.0
    psutil>=5.7.2
    pycairo>=1.16.2
    pycparser>=2.20
    pycrypto>=2.6.1
    pycups>=1.9.73
    pycurl>=7.43.0.2
    PyGObject>=3.30.4
    pyinotify>=0.9.6
    pyinstaller>=4.0
    pyinstaller-hooks-contrib>=2020.8
    pylru>=1.2.0
    PyNaCl>=1.4.0
    pyparsing>=2.4.7
    pysftp>=0.2.9
    PySimpleSOAP>=1.16.2
    pysmbc>=1.0.15.6
    pystemd>=0.7.0
    python-apt>=1.8.4.1
    python-debian>=0.1.35
    python-debianbts>=2.8.2
    python-magic>=0.4.18
    pyxdg>=0.25
    PyYAML>=5.3.1
    raptorq>=1.4.2
    reportbug>=7.5.3-deb10u1
    requests>=2.21.0
    SecretStorage>=2.3.1
    setuptools>=40.8.0
    six>=1.12.0
    tornado>=6.0.4
    typing-extensions>=3.6.4
    unattended-upgrades>=0.1
    Unidecode>=1.1.1
    uptime>=3.0.1
    urllib3>=1.24.1
    wheel>=0.32.3
    zipp>=3.1.0
    zstandard>=0.14.0
    paho-mqtt>=1.5.1
    toml>=0.9.0
    semantic-version>=2.6.0
    setuptools-rust>=0.11.4

And here is the output when I execute bandersnatch mirror:

2021-06-17 12:13:19,437 INFO: Selected storage backend: filesystem (configuration.py:126)
2021-06-17 12:13:19,437 INFO: Selected compare method: hash (configuration.py:172)
2021-06-17 12:13:19,574 INFO: Initialized project plugin blocklist_project, filtering [] (blocklist_name.py:27)
2021-06-17 12:13:19,669 INFO: Initialized release plugin allowlist_release, filtering [<Requirement('cryptography>=2.6.1')>, <Requirement('idna>=2.6')>, <Requirement('pcodedmp>=1.2.6')>, <Requirement('pygobject>=3.30.4')>, <Requirement('python-debianbts>=2.8.2')>, <Requirement('urllib3>=1.24.1')>, <Requirement('setuptools>=40.8.0')>, <Requirement('cupshelpers>=1.0')>, <Requirement('keyring>=17.1.1')>, <Requirement('psutil>=5.7.2')>, <Requirement('uptime>=3.0.1')>, <Requirement('easygui>=0.98.1')>, <Requirement('pysftp>=0.2.9')>, <Requirement('entrypoints>=0.3')>, <Requirement('wheel>=0.32.3')>, <Requirement('pycairo>=1.16.2')>, <Requirement('pysmbc>=1.0.15.6')>, <Requirement('pystemd>=0.7.0')>, <Requirement('distro>=1.3.0')>, <Requirement('pyinstaller>=4.0')>, <Requirement('toml>=0.9.0')>, <Requirement('prometheus-client>=0.8.0')>, <Requirement('colorclass>=2.2.0')>, <Requirement('keyrings-alt>=3.1.1')>, <Requirement('typing-extensions>=3.6.4')>, <Requirement('msoffcrypto-tool>=4.10.2')>, <Requirement('olefile>=0.46')>, <Requirement('pip>=18.1')>, <Requirement('python-magic>=0.4.18')>, <Requirement('six>=1.12.0')>, <Requirement('asn1crypto>=0.24.0')>, <Requirement('raptorq>=1.4.2')>, <Requirement('pycrypto>=2.6.1')>, <Requirement('pylru>=1.2.0')>, <Requirement('paho-mqtt>=1.5.1')>, <Requirement('jsonpickle>=1.4.1')>, <Requirement('pyinotify>=0.9.6')>, <Requirement('ansible>=2.9.12')>, <Requirement('requests>=2.21.0')>, <Requirement('tornado>=6.0.4')>, <Requirement('pycups>=1.9.73')>, <Requirement('distro-info>=0.21')>, <Requirement('pyparsing>=2.4.7')>, <Requirement('altgraph>=0.17')>, <Requirement('semantic-version>=2.6.0')>, <Requirement('paho-mqtt>=1.5.0')>, <Requirement('pyyaml>=5.3.1')>, <Requirement('markupsafe>=1.1.1')>, <Requirement('bcrypt>=3.2.0')>, <Requirement('jmespath>=0.10.0')>, <Requirement('setuptools-rust>=0.11.4')>, <Requirement('pynacl>=1.4.0')>, <Requirement('zstandard>=0.14.0')>, <Requirement('unidecode>=1.1.1')>, 
<Requirement('jinja2>=2.11.2')>, <Requirement('pycurl>=7.43.0.2')>, <Requirement('cerberus>=1.3.2')>, <Requirement('chrome-gnome-shell>=0.0.0')>, <Requirement('reportbug>=7.5.3-deb10u1')>, <Requirement('paramiko>=2.7.2')>, <Requirement('chardet>=3.0.4')>, <Requirement('zipp>=3.1.0')>, <Requirement('certifi>=2018.8.24')>, <Requirement('pycparser>=2.20')>, <Requirement('pyxdg>=0.25')>, <Requirement('python-debian>=0.1.35')>, <Requirement('httplib2>=0.11.3')>, <Requirement('oletools>=0.55.1')>, <Requirement('secretstorage>=2.3.1')>, <Requirement('cffi>=1.14.3')>, <Requirement('pysimplesoap>=1.16.2')>, <Requirement('pyinstaller-hooks-contrib>=2020.8')>, <Requirement('unattended-upgrades>=0.1')>, <Requirement('python-apt>=1.8.4.1')>, <Requirement('importlib-metadata>=1.7.0')>] (allowlist_name.py:170)
2021-06-17 12:13:19,672 INFO: Initialized release plugin blocklist_release, filtering [] (blocklist_name.py:110)
2021-06-17 12:13:19,687 INFO: Initialized exclude_platform plugin with ['.win32', '-win32', 'win_amd64', 'win-amd64', 'macosx_', 'macosx-', '.freebsd', '-freebsd'] (filename_name.py:85)
2021-06-17 12:13:19,937 INFO: Status file /home/admsrv/.bandersnatch/status missing. Starting over. (mirror.py:594)
2021-06-17 12:13:19,937 INFO: Syncing with https://pypi.org. (mirror.py:58)
2021-06-17 12:13:19,937 INFO: Current mirror serial: 0 (mirror.py:263)
2021-06-17 12:13:19,937 INFO: Resuming interrupted sync from local todo list. (mirror.py:270)
2021-06-17 12:13:21,566 INFO: Trying to reach serial: 10666172 (mirror.py:295)
2021-06-17 12:13:21,566 INFO: 277245 packages to sync. (mirror.py:297)
2021-06-17 12:13:21,566 INFO: No metadata filters are enabled. Skipping metadata filtering (mirror.py:77)

cooperlees commented 3 years ago

Hi,

Thanks for reporting.

Just to clarify, when you say "metadata" do you mean bandersnatch saves the JSON file to disk for every package on PyPI and not just what's in your Allow List?

Thanks.

Lulu300 commented 3 years ago

Hi, Yes exactly

RWoodring79 commented 3 years ago

I was investigating a different issue yesterday using the allowlist plugin to limit my downloads to only 3-4 packages. I was using the master branch, not the 5.0.0 tag, but in those tests, the tool only downloaded json files for the packages I expected.

It looks like you might be setting up the exclude_platform plugin incorrectly in your config file. You list exclude_platform under both the [plugins] and the [blocklist] sections. I'm not sure if or how that could be related, but it's an observation. Check out https://bandersnatch.readthedocs.io/en/latest/filtering_configuration.html#platform-specific-binaries-filtering for the latest filter config syntax.

Other observations about your config file: I think you no longer need whitelist_project since you have converted to the allowlist naming. Your bandersnatch output also seems to indicate that the blocklist_project and blocklist_release filters are not doing anything (both report filtering []).

RWoodring79 commented 3 years ago

@Lulu300 I know this is the same question you already answered, but I want to be sure I understand your problem. When you say it "downloads the metadata of all packages", does that mean

  1. metadata for every version of the packages in your allow list
  2. metadata for every package on PyPI

In the first case you would have about 75 files in the json folder and in the second case you would have over 400,000 files.
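
To tell the two cases apart, it may be enough to count the files in the mirror's JSON folder. The snippet below is a minimal sketch, not part of bandersnatch. The helper name `count_json_files` is hypothetical, and the folder location (for example `<mirror directory>/web/json`) is an assumption that should be checked against your own mirror layout.

```python
from pathlib import Path


def count_json_files(json_dir):
    """Count the regular files directly inside json_dir.

    Returns 0 when the directory does not exist, so a fresh mirror
    is reported as empty instead of raising an error.
    """
    path = Path(json_dir)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_file())


if __name__ == "__main__":
    # Assumed location; replace with the json folder of your mirror.
    print(count_json_files("/srv/pypi/web/json"))
```

A count close to the number of allowlist entries points to the first case, while a count in the hundreds of thousands points to the second.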

RWoodring79 commented 3 years ago

I did some more testing over the weekend with different filter combinations to mirror a single version of a package: [allowlist] packages=jsonschema==3.2.0.

@Lulu300 According to the CHANGES.md, the blacklist/whitelist names will no longer work in the 5.0 configuration file. If you edit your config to replace whitelist_project with allowlist_project, that may solve your issue.
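
A quick way to spot leftover legacy names is to scan the enabled plugin list in the config file. This is only a sketch using Python's standard `configparser`. The function name `legacy_plugin_names` and the prefix list are assumptions made for illustration, based on the whitelist/blacklist renaming mentioned above, and are not part of bandersnatch.

```python
import configparser

# Assumed legacy prefixes, based on the renaming described in this thread.
LEGACY_PREFIXES = ("whitelist_", "blacklist_")


def legacy_plugin_names(config_text):
    """Return enabled plugin names that still use a legacy prefix."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    enabled = parser.get("plugins", "enabled", fallback="")
    # The value is a multi-line list with one plugin name per line.
    names = [line.strip() for line in enabled.splitlines() if line.strip()]
    return [name for name in names if name.startswith(LEGACY_PREFIXES)]


SAMPLE = """
[plugins]
enabled =
    blocklist_project
    whitelist_project
    allowlist_release
"""

print(legacy_plugin_names(SAMPLE))  # prints ['whitelist_project']
```

Any name it reports would be a candidate for renaming, for example whitelist_project to allowlist_project.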

cooperlees commented 3 years ago

FWIW, a PR that fixes the filtering so that it also applies to saving the JSON metadata files is welcome.

RWoodring79 commented 3 years ago

I am still trying to find time to dig into how the filtering works, but it seems to me that allowlist_release should either require or imply the allowlist_project filter. My first thought was that the AllowListRelease class should inherit from AllowListProject rather than FilterReleasePlugin, but a way to implicitly enable the project filter would work too.
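
The inheritance idea can be illustrated with a toy sketch. The classes below are hypothetical stand-ins, not bandersnatch's real plugin classes or API, and they match versions by exact string instead of parsing requirement specifiers such as `>=`.

```python
class ProjectAllowlist:
    """Toy project-level filter: keep only allowlisted project names."""

    def __init__(self, names):
        self.names = {name.lower() for name in names}

    def allows_project(self, name):
        return name.lower() in self.names


class ReleaseAllowlist(ProjectAllowlist):
    """Toy release-level filter that implies the project-level check."""

    def __init__(self, pinned):
        # pinned maps a project name to the set of allowed version strings.
        super().__init__(pinned.keys())
        self.pinned = {name.lower(): set(v) for name, v in pinned.items()}

    def allows_release(self, name, version):
        # The project check runs first, so unknown projects are rejected
        # before any version lookup happens.
        return self.allows_project(name) and version in self.pinned[name.lower()]


release_filter = ReleaseAllowlist({"jsonschema": {"3.2.0"}})
print(release_filter.allows_project("requests"))              # prints False
print(release_filter.allows_release("jsonschema", "3.2.0"))   # prints True
```

With this shape, enabling only the release filter would still reject projects outside the allowlist, which is the behavior the original report expected for metadata.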

I hope to find time to address this as well as the issue I found with verify, but my free time mostly comes after 10PM when ambition and focus are low.