Closed mirekphd closed 11 months ago
Yet pip is very uninformative as to the source of the problem, failing to show which package has deprecated sklearn in its requirements.
There may be a way to use pip options to get more info on where dependencies come from during the install, but I haven't found anything convincing in less than 5 minutes.
If you have a working environment with sklearn installed you may be able to use pipdeptree to figure out which package requires sklearn, something like:
pipdeptree -r -p sklearn
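If pipdeptree is not available, a rough equivalent can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3.8+ and an environment where the packages are already installed. The helper names (`normalize`, `requirement_name`, `direct_dependents`) are made up for illustration, and the requirement parsing is a simplified regex rather than a full requirement parser. Unlike `pipdeptree -r`, it only reports direct dependents, not the whole chain.

```python
import re
from importlib.metadata import distributions


def normalize(name):
    """Normalize a project name so that e.g. Scikit_Learn matches scikit-learn."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Return the project name at the start of a requirement string.

    Simplified parsing (an assumption of this sketch): everything up to the
    first character that cannot be part of a project name.
    """
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return match.group(1) if match else ""


def direct_dependents(target, dists=None):
    """List installed distributions that directly require `target`."""
    wanted = normalize(target)
    candidates = distributions() if dists is None else dists
    found = set()
    for dist in candidates:
        # dist.requires is None when a distribution declares no requirements
        for requirement in dist.requires or []:
            if normalize(requirement_name(requirement)) == wanted:
                name = dist.metadata["Name"]
                if name:
                    found.add(name)
    return sorted(found)


if __name__ == "__main__":
    for name in direct_dependents("sklearn"):
        print(name)
```

Matching on the exact normalized name means that packages requiring scikit-learn are not reported, only those requiring sklearn.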
Can you perhaps start a packages blacklist with primary packages that still require sklearn and let Github users maintain it?
This does not seem like a very workable approach: there are likely many thousands of packages depending on sklearn.
For example, GitHub says more than 3,000, although this is probably an order of magnitude rather than an exact number: https://github.com/scikit-learn/scikit-learn/network/dependents?package_id=UGFja2FnZS01MjU4ODg0Mw%3D%3D
My hope so far is that people can identify which packages still depend on sklearn and open an issue in the relevant repository, so that the situation will gradually improve.
Here's a list of 1,605 PyPI packages that depend on sklearn:
Taken from the database dump from https://github.com/sethmlarson/pypi-data via:
sqlite3 'pypi.db' 'SELECT package_name FROM deps WHERE dep_name LIKE "sklearn" GROUP BY package_name;' > deps.txt
Nice, thanks a lot! I guess it could also be useful to have it ordered by number of downloads (descending), which seems doable if I read the project README correctly. This would make it possible to open issues/PRs in the most-downloaded repos first.
Note that there are likely some caveats with this kind of thing:
if a package pins sklearn==0.0, you will never get an error. It would still be nice to tell the project that using scikit-learn is recommended.
sqlite3 'pypi.db' 'SELECT DISTINCT downloads, package_name FROM deps INNER JOIN packages ON deps.package_name = packages.name WHERE dep_name LIKE "sklearn" ORDER BY downloads DESC;' > deps-by-downloads.txt
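For anyone who would rather script this, the same join can be run from Python's built-in sqlite3 module. This is a minimal sketch: the table and column names (`deps.package_name`, `deps.dep_name`, `packages.name`, `packages.downloads`) are taken from the command above, while `top_dependents`, `demo_connection`, and the `pkg-*` rows are hypothetical and exist only so the example runs without the real dump. It uses an exact `=` match with a bound parameter instead of `LIKE`.

```python
import sqlite3

QUERY = """
SELECT DISTINCT packages.downloads, deps.package_name
FROM deps
INNER JOIN packages ON deps.package_name = packages.name
WHERE deps.dep_name = ?
ORDER BY packages.downloads DESC
LIMIT ?
"""


def top_dependents(conn, dep_name="sklearn", limit=50):
    """Return (downloads, package_name) rows for packages depending on dep_name."""
    return conn.execute(QUERY, (dep_name, limit)).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up rows (not real PyPI data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE packages (name TEXT, downloads INTEGER);
        CREATE TABLE deps (package_name TEXT, dep_name TEXT);
        INSERT INTO packages VALUES ('pkg-a', 100), ('pkg-b', 300), ('pkg-c', 999);
        INSERT INTO deps VALUES
            ('pkg-a', 'sklearn'),
            ('pkg-b', 'sklearn'),
            ('pkg-b', 'sklearn'),       -- duplicate row, collapsed by DISTINCT
            ('pkg-c', 'scikit-learn');  -- not a match for 'sklearn'
        """
    )
    return conn


if __name__ == "__main__":
    for downloads, name in top_dependents(demo_connection()):
        print(f"{downloads}|{name}")
```

Against the real dump, `sqlite3.connect("pypi.db")` would replace `demo_connection()`, assuming the schema matches the one used in the command.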
Here's the top 50, it quickly tails off:
13133|statsforecast
4675|mmdet
3173|anndata
2854|scrubadub
1686|fn-graph
1448|miceforest
1296|astro-ghost
1075|fastestimator-nightly
684|tabpy
668|atlantis
570|psmpy
544|fairdynamicrec
542|sciann
532|spatialcluster
494|junky
481|gps-building-blocks
458|python-video-annotator
377|mlrose
369|tfidf-matcher
340|biosaur2
334|mlbench-core
318|hicstuff
313|textpack
302|accuinsight
288|autoads-test
287|pykeen
275|deep-training
233|autooptimizer
227|lepmlutils
226|paddleseg
225|pydelling
222|chronometry
222|iacs-ipac-reader
220|palmari
219|napari-filament-annotator
212|sherlockpipe
209|arbok
209|fast-scores
205|pysurvival
201|segsrgan
200|hivecode
198|pydatamodel
189|ai-graphics
189|lytools
188|nolds
187|augraphy
186|acmecontentcollectors-pkg-rioatmadja2018
184|auquan-toolbox
184|nerda
181|catsim
Thanks a lot for this! I checked the top 10:
pip install mmdet does not install sklearn (on my machine).
Also, as a side comment, it seems like packages depending on sklearn account for a small portion of all the sklearn downloads (this was already noted in previous attempts to figure out where sklearn downloads were coming from ...)
For sklearn, ~332k downloads per day:
❯ sqlite3 'pypi.db' 'SELECT DISTINCT downloads, name from packages WHERE name LIKE "sklearn";'
331695|sklearn
Summing the number of downloads in the top 50 packages depending on sklearn listed in https://github.com/scikit-learn/sklearn-pypi-package/issues/6#issuecomment-1375471786, I get ~42k (~30k for the top 10 alone), so roughly 13% of the total sklearn downloads.
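A sum like this can be reproduced from the `downloads|package_name` lines produced by the query. This is a minimal sketch: `parse_downloads` and `download_share` are hypothetical helper names, the sample uses only the first three rows of the list above, and 331695 is the sklearn figure returned by the query on the packages table.

```python
def parse_downloads(lines):
    """Parse 'downloads|package_name' lines into (downloads, name) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        downloads, name = line.split("|", 1)
        rows.append((int(downloads), name))
    return rows


def download_share(rows, total_downloads, top_n=50):
    """Sum downloads of the top_n rows and express them as a share of the total."""
    top = sorted(rows, reverse=True)[:top_n]
    subtotal = sum(downloads for downloads, _ in top)
    return subtotal, subtotal / total_downloads


if __name__ == "__main__":
    # First three rows of the list above; a real run would read the whole file.
    sample = ["13133|statsforecast", "4675|mmdet", "3173|anndata"]
    subtotal, share = download_share(parse_downloads(sample), 331695)
    print(subtotal, f"{share:.1%}")
```

To run it on the full output, pass `open("deps-by-downloads.txt")` to `parse_downloads` instead of the sample list.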
Closing this one: the brownout period ends in a few days (December 1st), and we are not planning to do anything more about this.
We have just got our first container build broken by this error. The containers' package lists are very large (hundreds of packages, mostly in the form of secondary and tertiary dependencies), with many data scientists contributing their desired packages to the installation list. Yet pip is very uninformative as to the source of the problem, failing to show which package has deprecated sklearn in its requirements.
Can you perhaps start a packages blacklist with primary packages that still require sklearn and let Github users maintain it?