pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Packages are not automatically deleted + delete CLI bugs #1273

Open 89ao opened 1 year ago

89ao commented 1 year ago

Could you tell me how to automatically remove packages that have been removed from PyPI?

For example: https://pypi.org/project/apicolors/

apicolors was deleted from pypi.org 4 days ago (Nov 9), but my bandersnatch server had already synced it locally, and it still exists there now (Nov 11), even though my sync interval is 30 minutes.

Here is the bander.log:

2022-11-06 10:21:15,841 bandersnatch.package: INFO Fetching metadata for package: apicolors (serial 15671340)
2022-11-06 10:21:15,966 bandersnatch.mirror: INFO Downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/12/27/92bfd44c97e3ed74a028da41b3ae419d4b2c6e7233003841f2c49cafec98/apicolors-6.6.6.tar.gz
2022-11-06 10:21:17,947 bandersnatch.mirror: INFO Continuing to next candidate URL after error downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/12/27/92bfd44c97e3ed74a028da41b3ae419d4b2c6e7233003841f2c49cafec98/apicolors-6.6.6.tar.gz
2022-11-06 10:21:17,948 bandersnatch.mirror: INFO Downloading: https://files.pythonhosted.org/packages/12/27/92bfd44c97e3ed74a028da41b3ae419d4b2c6e7233003841f2c49cafec98/apicolors-6.6.6.tar.gz
2022-11-06 10:21:17,980 bandersnatch.mirror: INFO Storing index page(s): apicolors - in /opt/bandersnatch/web/simple/apicolors
2022-11-07 08:51:20,898 bandersnatch.package: INFO Fetching metadata for package: apicolors (serial 15678961)
2022-11-07 08:51:20,939 bandersnatch.mirror: INFO Downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/2d/0a/d4c6fa3f16b71d70ab2ca6387aee93a84c191fd9711daa812df0054c17b4/apicolors-6.6.7.tar.gz
2022-11-07 08:51:20,974 bandersnatch.mirror: INFO Continuing to next candidate URL after error downloading: https://mirrors.tuna.tsinghua.edu.cn//packages/2d/0a/d4c6fa3f16b71d70ab2ca6387aee93a84c191fd9711daa812df0054c17b4/apicolors-6.6.7.tar.gz
2022-11-07 08:51:20,975 bandersnatch.mirror: INFO Downloading: https://files.pythonhosted.org/packages/2d/0a/d4c6fa3f16b71d70ab2ca6387aee93a84c191fd9711daa812df0054c17b4/apicolors-6.6.7.tar.gz
2022-11-07 08:51:21,008 bandersnatch.mirror: INFO Storing index page(s): apicolors - in /opt/bandersnatch/web/simple/apicolors
2022-11-09 08:21:31,336 bandersnatch.package: INFO Fetching metadata for package: apicolors (serial 15704728)
2022-11-09 08:21:31,625 bandersnatch.package: INFO apicolors no longer exists on PyPI

And here is my bandersnatch.conf. I am using bandersnatch-6.0.0 with docker-compose.

[mirror]
directory = /opt/bandersnatch
storage-backend = filesystem
master = https://pypi.org/
json = true
timeout = 300
workers = 3
hash-index = false
stop-on-error = false
delete-packages = true
compare-method = stat
log-config = /conf/bandersnatch-log.conf
download-mirror = https://mirrors.tuna.tsinghua.edu.cn/

[plugins]
enabled =
    blocklist_project
    blocklist_release
    regex_project

[blocklist]
packages =
    uselesscapitalquiz
    tf-nightly-gpu
    tf-nightly
    tensorflow-io-nightly
    tf-nightly-cpu
    pyagrum-nightly
    appium

[filter_regex]
packages =
    .+-nightly.*

cooperlees commented 1 year ago

Bandersnatch does not support deleting during a mirror run. There is not enough metadata to know what blobs to delete. That said, I have not dug into yanking, we might have enough metadata for those - might be worth looking into.

We only have bandersnatch verify --delete, as it has to walk the file system and work out which files on the file system are not part of any JSON metadata anymore ...

Without adding more metadata to PyPI we can't make this more efficient.
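
To illustrate why this requires a full filesystem walk, here is a minimal sketch of the idea, not bandersnatch's actual verify code. It assumes a mirror layout with `web/json` and `web/packages` directories (as seen in the paths in this thread) and JSON metadata in the PyPI JSON API shape, with a `releases` mapping whose entries carry a `url`. The function names `referenced_blobs` and `orphaned_blobs` are hypothetical, and the sketch only lists candidates; it deletes nothing.

```python
import json
from pathlib import Path


def referenced_blobs(json_dir: Path) -> set[str]:
    """Collect the package-relative paths named in the local JSON metadata."""
    referenced = set()
    for meta_file in json_dir.iterdir():
        if not meta_file.is_file():
            continue
        metadata = json.loads(meta_file.read_text())
        for files in metadata.get("releases", {}).values():
            for entry in files:
                # File URLs look like https://host/packages/ab/cd/<hash>/<name>
                _, sep, tail = entry["url"].partition("/packages/")
                if sep:
                    referenced.add(tail)
    return referenced


def orphaned_blobs(web_root: Path) -> list[Path]:
    """Return files under web_root/packages that no JSON metadata mentions."""
    packages_dir = web_root / "packages"
    keep = referenced_blobs(web_root / "json")
    return sorted(
        path
        for path in packages_dir.rglob("*")
        if path.is_file()
        and path.relative_to(packages_dir).as_posix() not in keep
    )
```

Every metadata file has to be read and every blob visited before anything can be declared orphaned, which is why this approach is slow on a full mirror.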

89ao commented 1 year ago

@cooperlees Thanks, Cooper. The problem is that risky packages may someday remain available on our mirror. After PyPI officially deletes a package, I'd like the mirror to stay consistent. Please consider adding this feature, thanks!

cooperlees commented 1 year ago

This is not an easy fix. As I said, ideally we'd need to put more metadata into Warehouse (pypi.org). If you have cycles, opening an issue on warehouse (if we don't have one) asking for better metadata to allow mirroring to delete packages would be a good start.

bandersnatch will generate correct Simple API HTML + JSON, so the package manager (e.g. pip) won't know the deleted/yanked version exists. The artifacts/blobs are just sitting there wasting disk space. A verify running in the background could slowly reclaim space. Walking filesystems is slow tho, I get that :(

89ao commented 1 year ago

Thanks @cooperlees. It's not only a disk space issue. It seems that once in a while PyPI deletes risky packages, such as "rest-framework" and "apicolors" as I mentioned. We also don't want them to remain downloadable. Can "bandersnatch verify --delete" delete the outdated packages automatically? If not, we may need to write a shell script to do this manually.

cooperlees commented 1 year ago

https://bandersnatch.readthedocs.io/en/latest/#bandersnatch-verify

Yes, running a verify with --delete will keep track and delete packages. It's not smart or incremental and needs to walk project by project to do so. All enhancements welcome.

I would love to know how you imagine doing this via shell? It should be no easier than just enhancing bandersnatch's logic.

89ao commented 1 year ago

Maybe obtain the official package list and compare it to the local list? If a package no longer exists upstream, delete it locally?
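
A rough sketch of that comparison, under a few assumptions: the mirror uses the flat `web/simple/<project>/` layout (`hash-index = false`, as in the config above), the upstream project names have already been fetched by some other means (for example from the index's project listing), and names are normalized as PEP 503 describes. The helper names are made up for illustration.

```python
import re
from pathlib import Path


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def local_projects(simple_dir: Path) -> set[str]:
    """Project names that have a directory under the mirror's web/simple."""
    return {normalize(p.name) for p in simple_dir.iterdir() if p.is_dir()}


def removed_upstream(local: set[str], upstream_names: list[str]) -> list[str]:
    """Projects present locally but absent from the upstream list."""
    upstream = {normalize(name) for name in upstream_names}
    return sorted(local - upstream)
```

Note that this only identifies projects whose index directory is stale. It does not by itself say which files under `web/packages` belong to them.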

89ao commented 1 year ago

Just as an information sync, this situation has happened again, see: https://medium.com/checkmarx-security/py-torch-a-leading-ml-framework-was-poisoned-with-malicious-dependency-e30f88242964

torchtriton has already been deleted from https://pypi.org/project/torchtriton/, but bandersnatch did not delete it automatically. So we deleted it manually. Looking forward to further updates, thanks!

89ao commented 1 year ago

@cooperlees Hello cooperlees, I recently wrote a small tool to compare the local project list with the official project list () and found a bunch of projects that exist locally but no longer exist upstream, for example:

...
a-plus-b
a-simple-modu
a1g0py8128
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaa-lama-ze-lo-oved
aabs7calc
aaron
aashika-calculator
abaxador-de-arquivo
abc-reader
abc0123
abcmikivideo
abenity
abhishekwebcodett
abhishekwebcodett2
abhishekwebcodett3
abilityrequest
...

The problem is that when I try to delete them manually, it fails. Take project "aaaaaaaaaaa" for example:

[root@VM_21_104_centos /data/home/motorao/bandersnatch]# ls -al /yum/pip/web/simple/aaaaaaaaaaa/
total 18580
drwxr-xr-x 2 root root     4096 Jun  7  2022 .
drwxr-xr-x 1 root root 19009536 Jan  5 22:48 ..
-rw-r--r-- 1 root root      452 Jun  7  2022 index.html
[root@VM_21_104_centos /data/home/motorao/bandersnatch]# cat /yum/pip/web/simple/aaaaaaaaaaa/index.html
<!DOCTYPE html>
<html>
  <head>
    <title>Links for aaaaaaaaaaa</title>
  </head>
  <body>
    <h1>Links for aaaaaaaaaaa</h1>
    <a href="../../packages/6d/c1/2d60ee949b1be5382703260b0bdd4345e2711abdddc2b9e2bbb46f788ac1/aaaaaaaaaaa-1.1.1-py2.py3-none-any.whl#sha256=05ff699e6eb769bdcc489f4390a51d1056332e8d16bb0bd0ef5f15709341b88f" data-requires-python="&gt;=2">aaaaaaaaaaa-1.1.1-py2.py3-none-any.whl</a><br/>
  </body>
</html>
<!--SERIAL 14055105-->[root@VM_21_104_centos /data/home/motorao/bandersnatch]# bandersnatch delete aaaaaaaaaaa
2023-01-05 22:50:00,019 ERROR: Unable to load entry point swift_plugin = bandersnatch_storage_plugins.swift:SwiftStorage: No module named 'keystoneauth1'
2023-01-05 22:50:00,020 ERROR: /yum/pip/web/json/aaaaaaaaaaa does not exist. Pulling from PyPI
2023-01-05 22:50:00,021 INFO: Fetching https://pypi.python.org/pypi/aaaaaaaaaaa/json
2023-01-05 22:50:00,399 ERROR: /yum/pip/web/json/aaaaaaaaaaa.new does not exist - Did not get new JSON metadata
2023-01-05 22:50:00,399 ERROR: Unable to HTTP get JSON for /yum/pip/web/json/aaaaaaaaaaa

Could you help me understand why this happens?

cooperlees commented 1 year ago

So I don't have any plans to work on this. To do this correctly we need to store packages differently, change PyPI metadata or add another API to PyPI to let us know what to delete.

In the logs I see /yum/pip/web/json/aaaaaaaaaaa - Seems it's not adding /data/home/motorao/bandersnatch to the path? I haven't read the code but we must have a bug there.

If that's not the issue, then it's the fact that the package is deleted, and so is the JSON metadata, so we need to use local metadata only. If that's somehow been deleted too, we're out of luck and need to manually delete.

Fix PR with unittest covering bug/new behavior welcome!
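
For the case shown earlier in this thread, where the local JSON is gone but `web/simple/<project>/index.html` still exists, one manual workaround is to read the file links out of that local index page. The following is only a sketch under that assumption, using the relative `../../packages/...#sha256=...` link format visible in the `index.html` above; `blobs_from_local_index` is a hypothetical helper, not part of bandersnatch, and it only resolves paths without deleting anything.

```python
import os.path
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a simple index page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def blobs_from_local_index(project_dir: Path) -> list[Path]:
    """Resolve the relative file links in project_dir/index.html to disk paths."""
    collector = _LinkCollector()
    collector.feed((project_dir / "index.html").read_text())
    blobs = []
    for href in collector.hrefs:
        relative, _fragment = urldefrag(href)  # drop the "#sha256=..." part
        # Collapse the "../.." segments without touching the filesystem.
        blobs.append(Path(os.path.normpath(project_dir / relative)))
    return blobs
```

Reviewing the returned paths before removing them, along with the project's `simple` directory, keeps the manual cleanup auditable.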

89ao commented 1 year ago

Maybe the delete CLI needs a --no-json-update option so that it does not try to pull from pypi.org.

Yes, it is indeed. I'll learn how to make a fix PR and try later. Thanks a lot, @cooperlees!

cooperlees commented 1 year ago

Should just need a boolean around the code that calls pypi.org to pull the JSON in verify.py - I haven't read the code tho, and I have a terrible memory :)
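
As a purely illustrative sketch of that kind of guard (this is not bandersnatch's code, and `load_metadata`, `fetch`, and `allow_remote` are hypothetical names), the remote fetch could sit behind a boolean so that a delete falls back to local metadata only:

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_metadata(
    json_path: Path,
    fetch: Callable[[], Optional[dict]],
    allow_remote: bool = True,
) -> Optional[dict]:
    """Return project metadata, preferring the local JSON file.

    `fetch` stands in for whatever pulls the JSON from the index; it is only
    called when the local file is missing and `allow_remote` is true.
    """
    if json_path.exists():
        return json.loads(json_path.read_text())
    if not allow_remote:
        # Local-only mode: report "no metadata" instead of hitting the network.
        return None
    return fetch()
```

A CLI option such as the `--no-json-update` proposed above would then simply set `allow_remote=False`.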