pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

How can I ignore "*-nightly" packages and releases when using bandersnatch? #1100

Open r00t1900 opened 2 years ago

r00t1900 commented 2 years ago

brief

I would like to ignore all nightly-build packages and releases when mirroring PyPI with bandersnatch.

description

I found that I don't need the nightly builds of any package, and https://pypi.org/stats shows that about 1 TB of data belongs to "-nightly-" packages. I've manually added them to the blocklist, but I would like a more accurate way to match all nightly-build data and skip downloading it.

appeal

I see that there is a regex plugin, but I don't know how to write the pattern, since my purpose is to ignore packages rather than to fetch them, and I don't know whether there is an easier way. So I decided to ask the community and the maintainers for suggestions, and I hope someone who knows can help me with this.

Looking forward to the answer.

cooperlees commented 2 years ago

Howdy, This is a good idea.

But to do this more accurately than a bunch of regexes we need metadata stored somewhere accessible @ pypi.org. Then bandersnatch can use that. Today (and I hope to be wrong) I know of no such metadata.

I quickly checked the JSON API for tf-nightly (the largest nightly package @ ~400 GB) and there is nothing that indicates it is a nightly package. Adding such metadata would require raising an issue with warehouse.

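Potential Metadata Options

  • JSON API extension

    • Maybe add an "info" field bool for nightly or a package type
    • This would require users to specify it
  • Add a classifier for Nightly or release type

Any ideas from other people reading?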
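The check described above can be sketched in Python. This is only an illustration, not part of bandersnatch: it assumes the PyPI JSON API is served at `https://pypi.org/pypi/<name>/json`, and the helper names `fetch_metadata` and `nightly_hints` are made up for this example.

```python
import json
import urllib.request

# Assumption: PyPI serves per-project metadata at this URL.
PYPI_JSON_URL = "https://pypi.org/pypi/{name}/json"


def fetch_metadata(name):
    """Download a project's JSON metadata (needs network access)."""
    url = PYPI_JSON_URL.format(name=name)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def nightly_hints(metadata):
    """Return "info" keys and classifiers that mention "nightly".

    Field values such as the project name are deliberately not inspected:
    the point is to find dedicated metadata, not a naming habit.
    """
    info = metadata.get("info", {})
    hints = [key for key in info if "nightly" in key.lower()]
    hints += [c for c in (info.get("classifiers") or []) if "nightly" in c.lower()]
    return hints
```

Calling `nightly_hints(fetch_metadata("tf-nightly"))` should return an empty list as long as the observation above still holds. A non-empty result would mean PyPI has started exposing something bandersnatch could filter on.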

r00t1900 commented 2 years ago

> Howdy, This is a good idea.
>
> But to do this more accurately than a bunch of regexes we need metadata stored somewhere accessible @ pypi.org. Then bandersnatch can use that. Today (and I hope to be wrong) I know of no such metadata.
>
> I quickly checked the JSON API for tf-nightly (the largest nightly package @ ~400 GB) and there is nothing that indicates it is a nightly package. Adding such metadata would require raising an issue with warehouse.
>
> Potential Metadata Options
>
>   • JSON API extension
>
>     • Maybe add an "info" field bool for nightly or a package type
>     • This would require users to specify it
>   • Add a classifier for Nightly or release type
>
> Any ideas from other people reading?

That would be very nice. But adding meta info to all packages may be a really huge project; will PyPI accept this?

TechCiel commented 1 year ago

[plugins]
enabled =
    regex_project
    blocklist_project
    prerelease_release

[filter_regex]
packages =
    .+-nightly(-|$)

[blocklist]
packages =
    uselesscapitalquiz

[filter_prerelease]
packages =
    duckdb
    graphscope-client
    lalsuite
    gs-engine
    gs-include
    bigdl-dllib
    bigdl-dllib-spark2
    bigdl-dllib-spark3

Some metadata would be nice. I'd suggest that PyPA enforce some naming convention or metadata label for projects that release very frequently, especially those with relatively large files. If there are other sound use cases, a request can be filed in warehouse, like those for size limits.

In the meantime, I'll be using the config above, which excludes all *-nightly-* and *-nightly projects and filters the pre-releases of some handpicked awful projects that spam their pre-releases with nightly or even commit-ly builds. uselesscapitalquiz is blocked because it causes a file name length overflow.
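To see which project names the pattern in [filter_regex] catches, it can be tried in plain Python. This is a rough sketch: it assumes the pattern is applied from the start of the name (as `re.match` does), and the sample names are only illustrations.

```python
import re

# Pattern copied from the [filter_regex] section of the config above.
NIGHTLY = re.compile(r".+-nightly(-|$)")


def is_filtered(name):
    """True if the project name would be caught by the nightly pattern.

    Assumption: the pattern is anchored at the start of the name,
    which is what re.match does.
    """
    return NIGHTLY.match(name) is not None


for name in ["tf-nightly", "tf-nightly-gpu", "nightly", "foo-nightlybuild"]:
    print(name, is_filtered(name))
```

The first two names match. A project called just "nightly" does not, because `.+` needs at least one character before "-nightly", and "foo-nightlybuild" does not, because "-nightly" must be followed by "-" or the end of the name.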

TechCiel commented 1 year ago

Though we have size limits in place (per project and per file), we have no traffic limit. So constantly refreshing a relatively large project incurs huge traffic for mirrors: 10 builds of 500 MiB are much more horrific than one 2 GiB build.

By this I mean, for example, duckdb, which does a release for literally every commit.
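The traffic comparison above works out as follows. This is back-of-the-envelope arithmetic, assuming a mirror downloads every build exactly once; `mirror_traffic` is a helper invented for this example.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def mirror_traffic(builds, size_bytes):
    """Bytes one mirror downloads if it fetches each build exactly once."""
    return builds * size_bytes


frequent = mirror_traffic(10, 500 * MIB)  # ten 500 MiB builds
single = mirror_traffic(1, 2 * GIB)       # one 2 GiB build
ratio = frequent / single

print(f"{frequent // MIB} MiB vs {single // MIB} MiB, ratio {ratio:.2f}")
```

Ten 500 MiB builds add up to 5000 MiB, about 2.4 times the 2048 MiB of the single 2 GiB build, even though each file is much smaller.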