pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Provide metrics for top N package storage "hogs" #4288

Closed dstufft closed 5 years ago

dstufft commented 6 years ago

What's the problem this feature will solve?

Mirroring PyPI currently takes more than 2 TB of storage, and that is continuing to grow. Some mirroring tools can blacklist projects from being mirrored, but it's difficult to know which projects to target for blacklisting without insight into which packages take up the most space.

Additionally, as operators it can be useful to see if particular packages are consuming more or less of the total space used by PyPI.

Describe the solution you'd like

Add metrics that indicate the top N packages by total space used.
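
To make the request concrete, here is a minimal sketch of a "top N by total space used" computation. It is not how Warehouse implements the metric. The function name `top_n_by_size` and the input shape (one `(project, size_in_bytes)` pair per uploaded file) are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def top_n_by_size(files, n):
    """Return the n projects with the largest total file size, largest first.

    `files` is an iterable of (project_name, size_in_bytes) pairs, one pair
    per uploaded file. This input shape is hypothetical, not Warehouse's schema.
    """
    totals = defaultdict(int)
    for project, size in files:
        # Sum every file belonging to the same project.
        totals[project] += size
    # nlargest avoids sorting the full list when n is small.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])
```

Calling `top_n_by_size(files, 100)` on a full file listing would give the kind of list a mirror operator could review before blacklisting anything.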

cooperlees commented 6 years ago

Maybe we should go with a generic https://pypi.org/stats/ page and add more stats over time, starting with this one.

Let's start with:

dstufft commented 6 years ago

👍

wayneworkman commented 6 years ago

I would be highly appreciative of a blacklist that has the biggest 100 projects. This would probably save a ton of space.

cooperlees commented 6 years ago

This is live - Just tweaking some cache config: https://pypi.org/stats/

brainwane commented 6 years ago

Is there anything left to do for this issue or shall we announce it on distutils-sig and close the issue?

di commented 6 years ago

@brainwane I think this is done!

cooperlees commented 6 years ago

Yeah the initial stuff is all done here. I may add more stats one day.

ewdurbin commented 6 years ago

https://pypi.org/stats/

ewdurbin commented 6 years ago

Hmm, I guess there is an open question: do we want to commit to keeping this around by documenting it? That would cover both where to find it and the alternate JSON representation available when sending Accept: application/json.
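
For readers who want to try the JSON representation mentioned above, the following is a rough sketch using only the Python standard library. The URL and the `Accept: application/json` header come from this thread. The response shape (`top_packages` mapping each project name to an object with a `size` field) is an assumption, not something this thread documents, and the helper names are hypothetical.

```python
import json
import urllib.request

STATS_URL = "https://pypi.org/stats/"


def build_stats_request(url=STATS_URL):
    """Build a request that asks for the JSON representation of the stats page."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch_stats(url=STATS_URL, timeout=10):
    """Fetch and decode the stats payload (performs a network call)."""
    with urllib.request.urlopen(build_stats_request(url), timeout=timeout) as resp:
        return json.load(resp)


def largest_projects(payload, n):
    """Return up to n (name, size) pairs, largest first.

    Assumes payload looks like {"top_packages": {name: {"size": bytes}}};
    that shape is an assumption and should be checked against a real response.
    """
    top = payload.get("top_packages", {})
    ranked = sorted(top.items(), key=lambda kv: kv[1]["size"], reverse=True)
    return [(name, info["size"]) for name, info in ranked[:n]]
```

A mirror operator could then call `largest_projects(fetch_stats(), 10)` and review the result before adding anything to a blacklist.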

ewdurbin commented 6 years ago

The question basically boils down to whether we want this to be an interim/internal solution for bandersnatch users, or whether we "own it" until we create a better replacement endpoint.

cooperlees commented 6 years ago

I'm happy to document it.

What does "own it" mean? I don't have a preference for where the API endpoint lives. This was @dstufft's suggestion as to where to put it. What alternatives are you thinking of?

ewdurbin commented 6 years ago

The endpoint is useful primarily for bandersnatch users and other mirror clients. If we document it and "publicize it", we'll want to ensure that it continues working until we begin and complete the process of deprecating it. Additionally, changes to this endpoint will have to remain backward compatible.

ewdurbin commented 6 years ago

@cooperlees adding a page similar to https://github.com/pypa/warehouse/blob/master/docs/api-reference/json.rst is probably good!

di commented 5 years ago

Resolved by #5072.