scikit-learn / sklearn-pypi-package

Maintain a list of reverse dependencies of sklearn #6

Closed mirekphd closed 11 months ago

mirekphd commented 1 year ago

We have just had our first container build broken by this error. The containers' package lists are very large (hundreds of packages, mostly in the form of secondary and tertiary dependencies), with many data scientists contributing their desired packages to the installation list. Yet pip is very uninformative as to the source of the problem, failing to show which package has deprecated sklearn in its requirements.

Can you perhaps start a packages blacklist with primary packages that still require sklearn and let GitHub users maintain it?

lesteve commented 1 year ago

"Yet pip is very uninformative as to the source of the problem, failing to show which package has deprecated sklearn in its requirements."

There may be a way to use pip options to get more information about where dependencies come from during the install, but I haven't found anything convincing in less than 5 minutes.
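
Without relying on pip output, one option is to ask PyPI what each requested package declares. This is only a sketch: it assumes the PyPI JSON API (`https://pypi.org/pypi/<name>/json`, field `info.requires_dist`), checks only the latest release of each package, does not follow transitive dependencies, and uses a simplified parse of requirement strings. The helper names `depends_on`, `fetch_requires_dist` and `find_culprits` are illustrative.

```python
import json
import re
import urllib.request


def _normalize(name):
    # Treat runs of "-", "_" and "." as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", name).lower()


def depends_on(requires_dist, target="sklearn"):
    """Return True if any requirement string names `target` as its project."""
    wanted = _normalize(target)
    for req in requires_dist or []:  # the field may be null on PyPI
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", req)
        if match and _normalize(match.group(1)) == wanted:
            return True
    return False


def fetch_requires_dist(package):
    """Fetch the declared requirements of the latest release from PyPI."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["info"]["requires_dist"]


def find_culprits(packages, target="sklearn"):
    """Return the packages in `packages` that directly require `target`."""
    return [p for p in packages if depends_on(fetch_requires_dist(p), target)]
```

Calling `find_culprits` with the names from the container's installation list needs network access; packages that publish no dependency metadata are reported as not depending on sklearn, so a clean result is not conclusive.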

If you have a working environment with sklearn installed you may be able to use pipdeptree to figure out which package requires sklearn, something like:

pipdeptree -r -p sklearn
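
Where installing pipdeptree is not an option, a similar reverse lookup can be approximated with the standard library. A minimal sketch, assuming Python 3.8+ and an environment where the packages are already installed; the helper names `requirement_name` and `reverse_dependencies` are illustrative, and the name parsing is a simplification of full requirement parsing:

```python
import re
from importlib.metadata import distributions


def requirement_name(requirement):
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        return None
    # Treat runs of "-", "_" and "." as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def reverse_dependencies(target):
    """List installed distributions that directly require `target`."""
    wanted = requirement_name(target)
    found = set()
    for dist in distributions():
        for requirement in dist.requires or []:  # requires may be None
            if requirement_name(requirement) == wanted:
                found.add(dist.metadata.get("Name") or "<unknown>")
    return sorted(found)


# Example: print(reverse_dependencies("sklearn"))
```

Unlike `pipdeptree -r`, this reports only direct dependents, not the full chain back to the top-level packages.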

"Can you perhaps start a packages blacklist with primary packages that still require sklearn and let GitHub users maintain it?"

This does not seem like a very workable approach: there are likely many thousands of packages depending on sklearn. For example, GitHub says more than 3,000, although this is probably an order of magnitude rather than an exact number: https://github.com/scikit-learn/scikit-learn/network/dependents?package_id=UGFja2FnZS01MjU4ODg0Mw%3D%3D

My hope so far is that people can identify which packages still depend on sklearn and open an issue in the relevant repository, so that the situation gradually improves.

hugovk commented 1 year ago

Here's a list of 1,605 PyPI packages that depend on sklearn:

Taken from the database dump from https://github.com/sethmlarson/pypi-data via:

sqlite3 'pypi.db' 'SELECT package_name FROM deps WHERE dep_name LIKE "sklearn" GROUP BY package_name;' > deps.txt
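
One detail of the query above: with no wildcards in the pattern, SQLite's `LIKE` acts as a case-insensitive exact match, so it picks up `sklearn` but not `scikit-learn` or other names that merely start with `sklearn`. A minimal sketch on an in-memory database, assuming a `deps` table with `package_name` and `dep_name` columns (inferred from the query, not checked against the real dump); the `pkg-*` rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deps (package_name TEXT, dep_name TEXT)")
conn.executemany(
    "INSERT INTO deps VALUES (?, ?)",
    [
        ("pkg-a", "sklearn"),
        ("pkg-a", "sklearn"),           # duplicate row, collapsed by GROUP BY
        ("pkg-b", "SKLearn"),           # differs only in case: matched
        ("pkg-c", "scikit-learn"),      # the real package: not matched
        ("pkg-d", "sklearn-crfsuite"),  # longer name: not matched
    ],
)


def packages_depending_on(conn, dep):
    """Return the distinct package names whose dep_name matches `dep`."""
    rows = conn.execute(
        "SELECT package_name FROM deps WHERE dep_name LIKE ? "
        "GROUP BY package_name ORDER BY package_name",
        (dep,),
    )
    return [name for (name,) in rows]


matches = packages_depending_on(conn, "sklearn")
print(matches)
```

Passing the name as a bound parameter also avoids the double-quoted string literal in the shell one-liner, which SQLite accepts only for backward compatibility.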

lesteve commented 1 year ago

Nice, thanks a lot! I guess it could also be useful to have it ordered by number of downloads (descending), which seems doable if I read the project README correctly. This would make it possible to open issues/PRs in the most-downloaded repos first.

Note that there are likely some caveats with this kind of thing:

hugovk commented 1 year ago

sqlite3 'pypi.db' 'SELECT DISTINCT downloads, package_name FROM deps INNER JOIN packages ON deps.package_name = packages.name WHERE dep_name LIKE "sklearn" ORDER BY downloads DESC;' > deps-by-downloads.txt
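
The same ranking can be run from Python's built-in `sqlite3` module. A minimal sketch, assuming the `deps` and `packages` tables have the columns used in the command above; it runs against an in-memory database seeded with three rows from the listing in this thread plus one hypothetical `unrelated-pkg` row, rather than the real `pypi.db`:

```python
import sqlite3


def rank_by_downloads(conn, dep):
    """Return (downloads, package_name) pairs for dependents of `dep`."""
    query = """
        SELECT DISTINCT packages.downloads, deps.package_name
        FROM deps
        INNER JOIN packages ON deps.package_name = packages.name
        WHERE deps.dep_name LIKE ?
        ORDER BY packages.downloads DESC
    """
    return conn.execute(query, (dep,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE packages (name TEXT, downloads INTEGER);
    CREATE TABLE deps (package_name TEXT, dep_name TEXT);
    """
)
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [
        ("mmdet", 4675),
        ("statsforecast", 13133),
        ("anndata", 3173),
        ("unrelated-pkg", 999999),  # hypothetical, does not depend on sklearn
    ],
)
conn.executemany(
    "INSERT INTO deps VALUES (?, ?)",
    [
        ("mmdet", "sklearn"),
        ("statsforecast", "sklearn"),
        ("statsforecast", "sklearn"),  # duplicate row, removed by DISTINCT
        ("anndata", "sklearn"),
        ("unrelated-pkg", "numpy"),
    ],
)

ranking = rank_by_downloads(conn, "sklearn")
for downloads, name in ranking:
    print(f"{downloads}|{name}")
```

To query the real dump, replace `":memory:"` with the path to `pypi.db` and drop the seeding statements.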

deps-by-downloads.txt

Here's the top 50, it quickly tails off:

13133|statsforecast
4675|mmdet
3173|anndata
2854|scrubadub
1686|fn-graph
1448|miceforest
1296|astro-ghost
1075|fastestimator-nightly
684|tabpy
668|atlantis
570|psmpy
544|fairdynamicrec
542|sciann
532|spatialcluster
494|junky
481|gps-building-blocks
458|python-video-annotator
377|mlrose
369|tfidf-matcher
340|biosaur2
334|mlbench-core
318|hicstuff
313|textpack
302|accuinsight
288|autoads-test
287|pykeen
275|deep-training
233|autooptimizer
227|lepmlutils
226|paddleseg
225|pydelling
222|chronometry
222|iacs-ipac-reader
220|palmari
219|napari-filament-annotator
212|sherlockpipe
209|arbok
209|fast-scores
205|pysurvival
201|segsrgan
200|hivecode
198|pydatamodel
189|ai-graphics
189|lytools
188|nolds
187|augraphy
186|acmecontentcollectors-pkg-rioatmadja2018
184|auquan-toolbox
184|nerda
181|catsim

lesteve commented 1 year ago

Thanks a lot for this! I checked the top 10:

lesteve commented 1 year ago

Also, as a side comment, it seems that packages depending on sklearn account for a small portion of all the sklearn downloads (this was already noted in previous attempts to figure out where sklearn downloads were coming from ...)

for sklearn, ~332k downloads per day

❯ sqlite3 'pypi.db' 'SELECT DISTINCT downloads, name from packages WHERE name LIKE "sklearn";'
331695|sklearn

Summing the number of downloads in the top 50 packages depending on sklearn listed in https://github.com/scikit-learn/sklearn-pypi-package/issues/6#issuecomment-1375471786, I get ~42k (~30k for the top 10 alone), so roughly 13% of the total sklearn downloads.
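
The arithmetic can be rechecked from the figures quoted in this thread. A minimal sketch, assuming the counts in the top-50 listing and the sklearn figure are in the same units: the ten largest entries sum to 30,692 (about 9.3% of 331,695) and all fifty to 42,232 (about 12.7%).

```python
# Download counts copied from the top-50 listing in this thread, in order.
top_50_downloads = [
    13133, 4675, 3173, 2854, 1686, 1448, 1296, 1075, 684, 668,
    570, 544, 542, 532, 494, 481, 458, 377, 369, 340,
    334, 318, 313, 302, 288, 287, 275, 233, 227, 226,
    225, 222, 222, 220, 219, 212, 209, 209, 205, 201,
    200, 198, 189, 189, 188, 187, 186, 184, 184, 181,
]
sklearn_downloads = 331695  # from the sqlite3 output for sklearn itself

top_10_total = sum(top_50_downloads[:10])
top_50_total = sum(top_50_downloads)

print(f"top 10: {top_10_total} ({top_10_total / sklearn_downloads:.1%})")
print(f"top 50: {top_50_total} ({top_50_total / sklearn_downloads:.1%})")
```

Either way, the listed dependents explain only a small fraction of the sklearn downloads.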

lesteve commented 11 months ago

Closing this one: the brownout period ends in a few days (December 1st) and we are not planning to do anything more about this.