Add subcmd to use metadata to roughly calculate the size of the local bandersnatch mirror

leochen12-rgb commented 1 year ago

At present, I can get the official size of PyPI from https://pypi.org/stats/ while I am synchronizing my mirror. However, running du or duc on the local mirror directory takes too long. Is there a more convenient way to do this?

cooperlees commented 1 year ago

Howdy,

This isn't really a bandersnatch question. This is all a limitation of lots of small files on your storage backend.

The only ideas we could possibly try:

- Use the JSON metadata in parallel, check whether a simple dir exists for each package, and if so sum up the file sizes of all those packages
  - Likely to be buggy in many ways, e.g. if you use filtering, that won't be taken into account
- Use the JSON metadata in parallel and check whether the files exist, but I think this will be just as expensive as du (though I'm not sure what operations du does under the covers)
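
A rough sketch of the first idea, not a tested bandersnatch feature. It assumes the mirror keeps one PyPI-style JSON file per project under web/json/ (with a "releases" mapping whose file entries carry a "size" field) and a matching directory under web/simple/, without hash-index sharding. The function names and the worker count are made up for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def project_size(json_path: Path, simple_root: Path) -> int:
    """Sum the "size" fields for one project, or return 0 if it looks unmirrored."""
    # Assumption: the simple dir has the same name as the JSON file.
    if not (simple_root / json_path.name).is_dir():
        return 0
    try:
        metadata = json.loads(json_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Unreadable or malformed metadata: skip it rather than abort the run.
        return 0
    total = 0
    for files in metadata.get("releases", {}).values():
        for file_info in files:
            total += int(file_info.get("size") or 0)
    return total


def estimate_mirror_size(mirror_root: Path, workers: int = 16) -> int:
    """Estimate the mirror size in bytes from metadata alone."""
    json_root = mirror_root / "web" / "json"      # assumed layout
    simple_root = mirror_root / "web" / "simple"  # assumed layout
    json_files = [p for p in json_root.iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: project_size(p, simple_root), json_files))
```

Calling `estimate_mirror_size(Path("/srv/pypi"))` (a hypothetical mirror path) returns a byte count. Since it never stats the package files themselves, it ignores filtering and partially synced packages, so treat the result as an upper-bound guess rather than a replacement for du.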

Another hack I've generally recommended is making a dedicated partition or volume for each part of bandersnatch's storage, e.g. putting the simple and packages directories on their own filesystems, so that df -h can give quicker insight too.

If you use hash-index = true you could also create a volume/file system per shard to get further insight
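
If each part of the mirror does sit on its own filesystem, the df-style numbers can also be read from Python. This is a minimal sketch using the standard library's `shutil.disk_usage`; the function name and the mount points in the usage note are hypothetical.

```python
import shutil


def volume_usage(mount_points):
    """Map each mount point to the bytes used on its filesystem."""
    usage = {}
    for mount_point in mount_points:
        # disk_usage reports filesystem-wide totals, much like df,
        # so it only reflects the mirror if the volume is dedicated to it.
        usage[mount_point] = shutil.disk_usage(mount_point).used
    return usage
```

For example, `volume_usage(["/srv/pypi/web/simple", "/srv/pypi/web/packages"])` would return the used bytes per volume without walking any directory tree.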

I don't have the cycles to look into these ideas, but would take a PR that adds docs or a du-like bandersnatch command that works out the sizes quicker, if possible. But I feel we'd need to use a lower-level language than Python to get true speed here. Will leave this open in case someone smarter comes along with better ideas.

leochen12-rgb commented 1 year ago

Thank you for your reply. I look forward to a du command being added to bandersnatch.

cooperlees commented 1 year ago

Awesome. Yeah I’ll be surprised if it’s much faster and will be hard to get accurate without checking if the files exist, which is the expensive part. It might surprise us and be much quicker than du …

pypa / bandersnatch

Add subcmd to use metadata to roughly calculate the size of the local bandersnatch mirror #1305