pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Automatic PyPi package rating and removal #16923

Open LucaCappelletti94 opened 2 weeks ago

LucaCappelletti94 commented 2 weeks ago

What's the problem this feature will solve?

At this time, there are lots of dead packages hosted on PyPI.

These packages typically have no link to the source code, no README, and sometimes a single, almost empty release published long ago by a user who has not logged in since. This hurts name availability (the queue of name requests seems to have a rather large backlog at the time of writing) and makes PyPI suffer from package rot: more and more of the results I get when I search for a package name are dead projects.

Describe the solution you'd like

This may be déformation professionnelle, but I believe it should be possible to create a ranking of package quality, based on an open-source algorithm that people may contribute to. The algorithm would take a package directory and its structural metadata as input and output a score reflecting how many desirable traits the package has or lacks. Packages that fall below a certain percentile of the score distribution would be flagged, and an email would be sent to their owners advising them that the package has entered a grace period. If the owner takes no action within that period, the package would be deleted, or its ownership transferred to another user who may take over its development.
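A minimal sketch of what such a scoring step might look like, assuming a handful of traits taken from the problem description above (source link, README, number of releases, time since last release). The `PackageInfo` record, its field names, the five-year idle threshold, and the flagged fraction are all hypothetical and chosen only for illustration; they are not part of PyPI or of any existing library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PackageInfo:
    """Hypothetical metadata record; field names are illustrative only."""
    name: str
    has_source_link: bool
    has_readme: bool
    release_count: int
    last_release: date


def quality_score(pkg: PackageInfo, today: date) -> int:
    """Count desirable traits; a higher score means fewer warning signs."""
    years_idle = (today - pkg.last_release).days / 365.25
    traits = [
        pkg.has_source_link,
        pkg.has_readme,
        pkg.release_count > 1,
        years_idle < 5,  # arbitrary threshold, assumed for this sketch
    ]
    return sum(traits)


def flag_lowest(packages: list[PackageInfo], today: date, fraction: float) -> list[str]:
    """Return the names of packages whose score is in the lowest `fraction`."""
    ranked = sorted(packages, key=lambda p: quality_score(p, today))
    cutoff = int(len(ranked) * fraction)
    return [p.name for p in ranked[:cutoff]]
```

No single trait decides the outcome here: a package is flagged only relative to the rest of the distribution, which matches the percentile idea in the proposal. Whether counting traits equally is a sensible weighting is exactly the kind of question the proposed community process would have to settle.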

Examples of such rules

Examples of rules, which I stress should be defined in an open-source library and gradually improved upon by the community, could be:

I believe such simple rules could already help eliminate a large amount of dead packages.

I would be most definitely open to contributing to such a project.

User Interface

This score may be integrated into the PyPI website itself, allowing users who are comparing packages to see their ratings.

PyPI warning

A pip install operation may warn the user that the package being installed has a very low score and may be deleted.

woodruffw commented 2 weeks ago

See https://github.com/pypi/warehouse/issues/16034 for some previous discussion around this topic.

As a rough summary: "dead" packages are not something that PyPI currently has much of an opinion around -- others have proposed a more curatorial approach to the index, but that approach requires a significant amount of administrative/maintenance overhead. That overhead would draw time from other ongoing maintenance and development efforts.

I think rankings of package quality are an interesting idea, but IMO PyPI shouldn't be in the business of issuing those kinds of value judgements. This is especially true when such metrics include things like package age, since a significant "draw" of the Python packaging ecosystem is that old packages typically still work and don't need to be updated with every CPython release. In other words: if we were to begin penalizing packages based on (perceived) inactivity, we'd likely end up penalizing some of the most important packages on the index, regardless of whether they're actually abandoned or not.

(This isn't meant to imply that PyPI can't get rid of obviously dead packages. Only that an "objective" metric for identifying those packages may be difficult to obtain, outside of a handful of obvious cases!)

LucaCappelletti94 commented 2 weeks ago

Hi @woodruffw, Thank you for your answer. I understand that such a metric is hard to design, which is why I stressed its open-source aspect.

The package age I mentioned is important to consider alongside all of the other factors, not by itself: a person who has just published an empty package is not as much of a problem as a person who published an empty package 15 years ago and never looked back.

None of the ranking rules I mentioned makes sense on its own, but taken together I think they should have a decent false positive rate. I don't advocate blindly deploying such a quality metric, but rather measuring its efficacy first and then deciding whether deploying it is a good idea.

I see that in the link you provided, @di mentions that the PEP 541 process already exists, but its year-long backlog (open issues currently go back to March 2023) is only increasing. Here is an example of a package name I was interested in, which seems rather dead, and yet my request has received no reply whatsoever, nor do I believe I will receive one, as there are hundreds of such requests ahead of my own.

Without some automation helping out, a year of backlog can start to grow out of control.

woodruffw commented 2 weeks ago

> None of the ranking rules I mentioned makes sense on its own, but taken together I think they should have a decent false positive rate. I don't advocate blindly deploying such a quality metric, but rather measuring its efficacy first and then deciding whether deploying it is a good idea.

I agree, but it's my personal (non-maintainer!) opinion that such a metric, even if it's a low-FP one, doesn't belong on PyPI itself. It'd be good for such a metric to exist in a third-party context, but I don't think PyPI itself should be in the business of making value judgements around package quality.

OTOH, I think there are two things that would reasonably be within PyPI's purview:

  1. Allowing users to set status markers on projects, similar to how GitHub allows owners to set repos as "archived." This would allow maintainers to explicitly mark a project as e.g. abandoned or similar, which in turn could accelerate the PEP 541 process by reducing the need for outreach attempts.
  2. Removing trivially invalid projects. There are a lot of projects on PyPI that have no releases, and can't even be updated, since they belong to accounts that are functionally invalid (e.g. have a missing or invalid email address from before the legacy PyPI did any validation). These can be removed without any disruption since they're functionally empty, but they're also only a tiny slice compared to projects that might be flagged as "low quality" by an active metric.
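The second item describes a much narrower check than a quality score, and it could be sketched roughly as follows. This assumes a hypothetical `Project` record holding a release count and the owners' email addresses; neither the record nor the loose email pattern reflects Warehouse's actual data model or validation rules.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Deliberately loose pattern (an assumption for this sketch): it only checks
# for the rough shape "something@something.tld".
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class Project:
    """Hypothetical project record; not Warehouse's real model."""
    name: str
    release_count: int
    owner_emails: list[Optional[str]] = field(default_factory=list)


def is_trivially_invalid(project: Project) -> bool:
    """True only if the project has no releases AND no owner has a usable email."""
    if project.release_count > 0:
        return False
    has_reachable_owner = any(
        email is not None and EMAIL_SHAPE.match(email) is not None
        for email in project.owner_emails
    )
    return not has_reachable_owner
```

Both conditions must hold at once, which is what keeps this set small: a project with any release, or with any owner who can still be contacted, is left alone.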

> I see that in the link you provided, @di mentions that the PEP 541 process already exists, but its year-long backlog (open issues currently go back to March 2023) is only increasing. Here is an example of a package name I was interested in, which seems rather dead, and yet my request has received no reply whatsoever, nor do I believe I will receive one, as there are hundreds of such requests ahead of my own.

Just for context, the backlog is actually no longer increasing: it's been decreasing for a few weeks now that there's someone working on it full time: https://discuss.python.org/t/is-pep-541-still-the-correct-solution/27436/25. That trend should continue into the future.

I understand that it's frustrating to wait a long time for a support request, but things are getting better on that front.

LucaCappelletti94 commented 2 weeks ago

I am currently scraping the PyPI package metadata from the API, at an appropriately low request frequency. Hopefully, I will be able to create a public anonymised dataset and try out different approaches, so as to determine the number of problematic packages.

At my current request rate, it will take at least 10 days or so to build a first version of the dataset.
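A throttled crawl of this kind might be sketched as below. This is not the script used for the dataset: the one-second default delay and the function names are assumptions, and the only external detail relied on is PyPI's per-project JSON endpoint, `https://pypi.org/pypi/<name>/json`. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without touching the network.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Iterator


def metadata_url(name: str) -> str:
    """Build the PyPI JSON API URL for one project."""
    return f"https://pypi.org/pypi/{name}/json"


def fetch_json(url: str) -> dict:
    """Download and decode one JSON document."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def crawl(
    names: Iterable[str],
    delay_seconds: float = 1.0,  # assumed pause; pick a rate the service tolerates
    fetch: Callable[[str], dict] = fetch_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[tuple[str, dict]]:
    """Yield (name, metadata) pairs, pausing between consecutive requests."""
    for index, name in enumerate(names):
        if index > 0:
            sleep(delay_seconds)
        yield name, fetch(metadata_url(name))
```

Because the pause sits between requests, total crawl time grows linearly with the number of projects, which is why a polite request rate translates into days of wall-clock time for the whole index.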

Hopefully the percentage of problematic packages is small, but I nevertheless believe such a study would be useful. Everything will be open-sourced, of course.