Open r00t1900 opened 2 years ago
Howdy, this is a good idea.
But to do this more accurately than with a bunch of regexes, we need metadata stored somewhere accessible on pypi.org that bandersnatch can then use. Today (and I hope to be wrong) I know of no such metadata.
I quickly checked the JSON API for tf-nightly (the largest nightly package at ~400 GB) and nothing there indicates that it is a nightly package. Adding such metadata would require raising an issue with warehouse.
Potential Metadata Options
JSON API extension
- Maybe add an "info" field: a bool for nightly, or a package type
- This would require users to specify it
Add a classifier for Nightly or release type
- https://pypi.org/classifiers/
- This would also rely on users to add the classifier
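As a rough sketch of how a mirror could consume either option: the JSON API already exposes `info.classifiers`, so a filter could look there. Both the classifier string and the `nightly` field below are hypothetical placeholders for illustration; neither exists on PyPI today.

```python
from typing import Any

# Hypothetical classifier name, used only for illustration.
# PyPI does not define such a classifier today.
NIGHTLY_CLASSIFIER = "Release Type :: Nightly"


def is_nightly(project_json: dict[str, Any]) -> bool:
    """Return True if the project metadata marks the package as nightly.

    `project_json` is assumed to be the parsed body of the per-project
    JSON API response, which carries an "info" object.
    """
    info = project_json.get("info") or {}
    # Option 1 (hypothetical): a boolean "nightly" field inside "info".
    if info.get("nightly") is True:
        return True
    # Option 2 (hypothetical): a dedicated trove classifier.
    classifiers = info.get("classifiers") or []
    return NIGHTLY_CLASSIFIER in classifiers


sample = {"info": {"name": "tf-nightly", "classifiers": [NIGHTLY_CLASSIFIER]}}
print(is_nightly(sample))
```

Either way, the check only works for projects whose maintainers opt in by setting the field or classifier.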
Any ideas from other people reading?
That would be very nice. But adding metadata to all packages may be a really huge project. Will PyPI accept this?
[plugins]
enabled =
    regex_project
    blocklist_project
    prerelease_release

[filter_regex]
packages =
    .+-nightly(-|$)

[blocklist]
packages =
    uselesscapitalquiz

[filter_prerelease]
packages =
    duckdb
    graphscope-client
    lalsuite
    gs-engine
    gs-include
    bigdl-dllib
    bigdl-dllib-spark2
    bigdl-dllib-spark3
Some metadata would be nice. I'd suggest that PyPA enforce a naming convention or metadata label for projects that release very frequently, especially those with relatively large files. Where there is a sound use case, a request can be filed in warehouse, like the requests for size limits.
In the meantime, I'll be using the config above, which excludes all *-nightly-* and *-nightly projects, plus the pre-releases of some handpicked awful projects that spam them with nightly or even per-commit builds. The uselesscapitalquiz project is blocked because it causes a file name length overflow.
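To sanity-check the regex from the config before a long sync, here is a minimal sketch. It assumes the pattern is applied from the start of the project name (as `re.match` does); the helper name `is_filtered` and the sample project names are made up for illustration.

```python
import re

# The pattern from the [filter_regex] section of the config.
NIGHTLY_PATTERN = re.compile(r".+-nightly(-|$)")


def is_filtered(name: str) -> bool:
    """Return True if the project name would be excluded by the pattern.

    Assumption: the pattern is matched from the beginning of the name,
    like re.match, not searched for anywhere in the string.
    """
    return NIGHTLY_PATTERN.match(name) is not None


# Illustrative names only; some may not be real projects.
for name in ["tf-nightly", "tf-nightly-gpu", "tensorflow", "foo-nightlybuild", "nightly"]:
    print(f"{name}: {'filtered' if is_filtered(name) else 'kept'}")
```

The `(-|$)` tail is what keeps a name like `foo-nightlybuild` from being caught, and the leading `.+` means a project literally called `nightly` is not filtered either.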
We have size limits in place (per project and per file), but no traffic limit. Constantly refreshing a relatively large project therefore incurs huge traffic for mirrors: 10 builds of 500 MiB are far worse than one 2 GiB build.
By this I mean, for example, duckdb, which publishes a release for literally every commit.
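The traffic comparison above works out as follows. This is a back-of-the-envelope sketch that assumes a mirror downloads every published build exactly once; the function name is made up for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def sync_traffic(build_size_bytes: int, builds: int) -> int:
    """Bytes a mirror transfers if it fetches each build exactly once."""
    return build_size_bytes * builds


frequent = sync_traffic(500 * MIB, builds=10)  # ten 500 MiB builds
single = sync_traffic(2 * GIB, builds=1)       # one 2 GiB build

print(f"10 x 500 MiB = {frequent / GIB:.2f} GiB")
print(f" 1 x 2 GiB   = {single / GIB:.2f} GiB")
print(f"ratio        = {frequent / single:.2f}x")
```

Under that assumption the frequent small builds cost roughly 2.4 times the traffic of the single larger one, even though each of them stays well under a per-file size limit.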
brief
I would like to ignore all nightly-build packages and releases when mirroring PyPI with bandersnatch.
description
I found that I do not need the nightly builds of any package, and https://pypi.org/stats shows that about 1 TB of data is "-nightly-". I've manually added these packages to the blocklist, but I would like a more accurate way to match all nightly build data and skip downloading it.
appeal
I see that there is a regex plugin, but I do not know how to write the pattern, since my purpose is to ignore packages rather than acquire them, and I don't know whether there is an easier way. So I decided to ask the community for official suggestions, and I hope someone who knows can help me with this. Looking forward to the answer.