pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Unmatched size of mirrored data while finishing `bandersnatch mirror` #1105

Open r00t1900 opened 2 years ago

r00t1900 commented 2 years ago

desc

I have been using bandersnatch to sync from pypi.org for almost 10 days. Today it finally reached "generating global index page..." and finished all its work, but the mirrored size is only 8822G, which does not match the size reported at https://pypi.org/stats.

details

command: bandersnatch -c bs.conf mirror

bs.conf:

[mirror]
directory = /mnt/storage/data
master = https://pypi.org
json = true
timeout = 300
workers = 10
hash-index = false
stop-on-error = false
delete-packages = true
compare-method = stat
download-mirror = https://pypi.tuna.tsinghua.edu.cn
download-mirror-no-fallback = false

[plugins]
enabled = blocklist_project

[blocklist]
packages =
  tf-nightly
  tf-nightly-gpu
  tf-nightly-cpu
  tensorflow-io-nightly
  pyagrum-nightly

As shown in the config file, I use an alternative download mirror and also block several packages. But even when I take the blocked packages into account, the numbers still do not match:

| item | from | size |
|---|---|---|
| size in pypi | / | 10.8T |
| size in tuna | / | 9.75T |
| size of blocked packages | manually calc from pypi | 1353G = 1353/1024 T = 1.32T |
| size of mirrored | df -h -B G | 8822G = 8822/1024 T = 8.61T |
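
To make the comparison explicit, here is a minimal sketch of the arithmetic behind the figures above. It assumes, as the post does, that all figures can be treated as the same unit with 1024G per T; the variable names are only illustrative.

```python
# Figures copied from the post; the G-to-T conversion uses 1024, as the post does.
G_PER_T = 1024

pypi_total_t = 10.8            # size reported on pypi.org/stats
blocked_t = 1353 / G_PER_T     # blocked packages, summed by hand from PyPI
mirrored_t = 8822 / G_PER_T    # local mirror, as reported by df

# What the mirror should hold if only the blocked packages were missing.
expected_t = pypi_total_t - blocked_t
gap_t = expected_t - mirrored_t

print(f"expected mirror size: {expected_t:.2f}T")
print(f"actual mirror size:   {mirrored_t:.2f}T")
print(f"unexplained gap:      {gap_t:.2f}T")
```

The expected size works out to roughly 9.48T, so a little under 0.9T remains unaccounted for after subtracting the blocked packages.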

questions

btw

In recent days, when a run reaches "generating global index page...", bandersnatch always begins with a Response timeout error: [pic1: image] [pic2: image]

The command I use is `bandersnatch -c bs.conf mirror`, as usual, even for incremental updates. Q: Should I run `bandersnatch verify` instead?

r00t1900 commented 2 years ago

something else

Today I found something more interesting:

cooperlees commented 2 years ago

Hi there,

The size on PyPI is a sum over the database metadata. I wouldn't be surprised if deletions are not updating it correctly or something. Could be worth a check.

Usually when this happens it's one package causing the issue. This file can be removed, and bandersnatch will try to sync again from the serial in the serial file alongside the todo. So you should be safe to delete it and let it resume.

r00t1900 commented 2 years ago

OK, maybe another reason is that some data deleted upstream cannot be synced by bandersnatch but still exists on the upstream server?

happyaron commented 2 years ago

> According to pypi.sh in tunasync-scripts, the PyPI mirror hosted by tuna uses exactly the same configuration as mine, at least for the [blocklist] part. But why is the size shown in the tuna server status 9.75T, not the 9.48T calculated above?

This might be related to the fact that bandersnatch does not automatically remove files that are gone upstream, so the mirror only does garbage collection when a full `bandersnatch verify` run is performed.

cooperlees commented 2 years ago

Good call. This is 100% the sad state of bandersnatch. We don't have a good mechanism to know which files to delete, as we keep the service stateless apart from the blob store (i.e. filesystem, s3 etc.). `bandersnatch verify` has to walk the whole filesystem.
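
As a rough illustration of why this kind of garbage collection needs a full walk, here is a minimal sketch that compares files on disk with the files named in the mirrored JSON metadata. It is not bandersnatch's own implementation. It assumes the mirror keeps one JSON metadata file per project under `web/json/` and release files under `web/packages/`, and that each metadata file has a `releases` mapping whose entries carry a `url` ending in `/packages/...`. The function names are hypothetical.

```python
import json
import os
from pathlib import Path
from urllib.parse import urlparse


def referenced_paths(json_dir: Path) -> set[str]:
    """Collect the 'packages/...' paths named in every JSON metadata file."""
    wanted: set[str] = set()
    for meta_file in json_dir.iterdir():
        if not meta_file.is_file():
            continue
        with meta_file.open(encoding="utf-8") as fh:
            meta = json.load(fh)
        for files in meta.get("releases", {}).values():
            for entry in files:
                # Assumed URL shape: https://<host>/packages/ab/cd/<filename>
                wanted.add(urlparse(entry["url"]).path.lstrip("/"))
    return wanted


def unreferenced_files(web_root: Path) -> list[Path]:
    """Walk web_root/packages and return files that no metadata refers to."""
    wanted = referenced_paths(web_root / "json")
    orphans = []
    for dirpath, _dirs, filenames in os.walk(web_root / "packages"):
        for name in filenames:
            path = Path(dirpath) / name
            if path.relative_to(web_root).as_posix() not in wanted:
                orphans.append(path)
    return sorted(orphans)
```

The sketch only reports candidates and deletes nothing. On a mirror of several terabytes the walk itself is the expensive part, which is the cost described above.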

Only options I see are:

lxyeternal commented 1 month ago

I have the same issue. I don't know why so many package files are missing from the mirror. How can I make a complete mirror of PyPI?

2024-07-30 20:31:59,734 INFO: Fetching metadata for package: zwero-brain-games1 (serial 14011926) (package.py:58)
2024-07-30 20:31:59,796 INFO: zutnlp no longer exists on PyPI (package.py:66)
2024-07-30 20:31:59,796 INFO: Fetching metadata for package: zx-core-backend (serial 3916140) (package.py:58)
2024-07-30 20:31:59,901 INFO: zwdata no longer exists on PyPI (package.py:66)

cooperlees commented 1 month ago

If this is from a failed sync, go to the resume file and remove the packages from there. I don't have a better solution or the time to try to fix this, sorry.

lxyeternal commented 1 month ago

> If this is from a failed sync, go to the resume file and remove the packages from there. I don't have a better solution or time to try fix this sorry.

Is there a more detailed process? I'm not sure how to operate it specifically.
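
One way the suggested step could look in practice is sketched below. It assumes the resume file is the `todo` file in the mirror `directory` (for the config earlier in this thread that would be `/mnt/storage/data/todo`), that its first line is the target serial, and that each following line is `<package> <serial>`. Check the actual file before relying on this. `drop_from_todo` is a hypothetical helper, not part of bandersnatch.

```python
from pathlib import Path


def drop_from_todo(todo_path: Path, gone: set[str]) -> int:
    """Remove the named packages from a todo file; return how many were dropped.

    Assumed layout: line 1 is the target serial, and every later line is
    "<package> <serial>". The first line is always kept untouched.
    """
    lines = todo_path.read_text(encoding="utf-8").splitlines()
    if not lines:
        return 0
    header, entries = lines[0], lines[1:]
    kept = []
    for line in entries:
        parts = line.split()
        if parts and parts[0] in gone:
            continue  # skip packages that no longer exist on PyPI
        kept.append(line)
    todo_path.write_text("\n".join([header, *kept]) + "\n", encoding="utf-8")
    return len(entries) - len(kept)
```

With bandersnatch stopped and a backup copy of the file made, a call such as `drop_from_todo(Path("/mnt/storage/data/todo"), {"zutnlp", "zwdata"})` would drop the two packages reported as "no longer exists on PyPI" in the log above, after which `bandersnatch -c bs.conf mirror` can be started again.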