scikit-learn-contrib / hiclass

A python library for hierarchical classification compatible with scikit-learn
BSD 3-Clause "New" or "Revised" License
114 stars 20 forks source link

Make `ray` an optional depencency #45

Closed flo-ri-an closed 2 years ago

flo-ri-an commented 2 years ago

This PR intends to make ray an optional dependency. Due to the large size of ray and all its dependencies, this helps to bring down container size when ray is not needed.

hiclass will fallback to joblib.Parallel if ray is not installed and n_jobs is not 1.

I wrote a couple of test cases which pass, along with all tests that passed before the changes.

This PR at the current state is most likely not ready to be merged. Treat it as a suggestion for the time being.

Main areas of uncertainty:

mirand863 commented 2 years ago

Hi @flo-ri-an,

Thank you for the contribution! I will take a look into it. Out of curiosity, what is the difference in containers size with and without Ray?

I migrated to Ray because parallelism in Python tends to be clumsy, and it used to be the case that the more CPUs that are used require more memory as well.

flo-ri-an commented 2 years ago

I don't know exactly, MLOps just told me that it's quite hefty. I think ~150MB for ray alone, plus possibly some more dependencies? Also, a big issue seems to be that the pypi-package does not distribute sources, which makes it difficult to optimize/cache/build from source. A related issue is here: https://github.com/ray-project/ray/issues/6246

Ray will eventually be the way forward on our end as well. But until we're ready, it would be nice to get by with joblib.

mirand863 commented 2 years ago

I don't know exactly, MLOps just told me that it's quite hefty. I think ~150MB for ray alone, plus possibly some more dependencies? Also, a big issue seems to be that the pypi-package does not distribute sources, which makes it difficult to optimize/cache/build from source. A related issue is here: ray-project/ray#6246

Ray will eventually be the way forward on our end as well. But until we're ready, it would be nice to get by with joblib.

Perhaps using the conda installation could work for you. I know they distribute some files, e.g., https://anaconda.org/conda-forge/hiclass/files, but I am not 100% sure if you can use those to cache.

In my benchmark it took the same amount of time with joblib and it saved 1GB of RAM when compared with Ray. Good job!

flo-ri-an commented 2 years ago

I don't know exactly, MLOps just told me that it's quite hefty. I think ~150MB for ray alone, plus possibly some more dependencies? Also, a big issue seems to be that the pypi-package does not distribute sources, which makes it difficult to optimize/cache/build from source. A related issue is here: ray-project/ray#6246 Ray will eventually be the way forward on our end as well. But until we're ready, it would be nice to get by with joblib.

Perhaps using the conda installation could work for you. I know they distribute some files, e.g., https://anaconda.org/conda-forge/hiclass/files, but I am not 100% sure if you can use those to cache.

In my benchmark it took the same amount of time with joblib and it saved 1GB of RAM when compared with Ray. Good job!

Awesome!

We use poetry everywhere for dependency management, so it would be quite some effort to switch over to conda. Also, imho poetry is the most advanced dependency manager currently out there. We'll have to find some way around this issue once we start using ray.

flo-ri-an commented 2 years ago

black hiclass tests succeeds for me locally.

All done! ✨ 🍰 ✨
17 files left unchanged.
mirand863 commented 2 years ago

black hiclass tests succeeds for me locally.

All done! ✨ 🍰 ✨
17 files left unchanged.

@flo-ri-an Perhaps we have different versions of black. There were some lines that had to be added or removed. I also ran some more installation tests and everything worked fine, so I will merge this PR. Thank you again!

flo-ri-an commented 2 years ago

Great, thank you as well for the speedy resolution.