[FEA] Native HPO support and implementation on conda in Windows.

DaPiePiece commented 1 year ago

Hi, I'm currently working on machine learning applications in exoplanet detection, so I work with massive datasets that push the sklearn library to its limits, which is how I discovered cuML. I write this not as a professional software developer but as a physicist playing around with tools, so I may be wrong about some things or missing a few technicalities; I apologise in advance for my lack of knowledge in the field. For reference, I primarily use Spyder in the Anaconda environment on Windows, so I had to go through a WSL2 conda environment to be able to use cuML.

I'm a big fan of cuML, as it practically supersedes sklearn in its implementation, making migrating from sklearn to cuML as simple as changing the imports. However, cuML currently does not have native HPO (hyperparameter optimisation) functions such as GridSearchCV, which are present in sklearn. I would love to see a native implementation of something like GridSearchCV inside of cuML.

Currently, the alternative is to use both sklearn and cuML and juggle data between your CPU and your GPU, which, from what I've read online, can sometimes turn out slower than just using sklearn (source: https://blog.dask.org/2019/03/27/dask-cuml). A native implementation of HPO inside of cuML would:

Remove the need to import both libraries
For people like me, whose GPU is more capable than their CPU (RTX 3080 vs i7 9700K), offload all of the computation to the GPU
Massively accelerate GridSearchCV, whose number of model fits grows with the product of the number of values tried for each hyperparameter, if it is possible to use the architecture of a GPU on this problem
Remove the need for juggling data between devices, thus preventing potential bottlenecks (correct me if I am wrong on this)
Allow for multitasking while waiting for the optimisation to complete: this is a more personal one, but offloading work to my GPU only prevents me from launching games while I wait for the training, whereas running the optimisation on my CPU, if it requires a lot of computation, would seriously slow down my workflow, since I use my CPU for other programs.

As for the other request, it may already be a work in progress, but bringing cuML to the Anaconda environment on Windows would allow me to use the library natively in Spyder, which is my IDE of preference for data analysis.

I was also able to get cuML working on a Windows 10 machine with WSL2, while the doc mentions that cuML is only available for Windows 11 machines. WSL2 isn't exclusive to Windows 11 (despite running a lot better on it), so it might be judicious to update the doc.

wphicks commented 1 year ago

Thanks very much for the thoughtful writeup on this, @DaPiePiece!

It is currently possible to take advantage of cuML's GPU acceleration using sklearn's HPO implementation. That is to say, if you were to use sklearn's GridSearchCV and apply it to a cuML estimator, it should work just fine. The CPU operations in GridSearchCV are relatively lightweight, and the cuML estimator would perform the heavy-duty work of actually running its algorithm on GPU each time it is invoked by GridSearchCV. In fact, we probably could not improve on the very fine work that the sklearn team has done on most of those HPO methods.

That being said, this question comes up a lot, so it's worth considering how to better highlight the compatibility between sklearn HPO and cuML estimators. One thing we have done in limited ways in the past is to alias sklearn operations into the cuML namespace. We want to be thoughtful about that because we want to make sure sklearn gets credit for the brilliant work that they do, but here I think it might be reasonable. We would import the sklearn HPO utilities into cuML (so that you could import them from the cuML package) and update the docstring with appropriate attribution. Since cuML requires sklearn anyway, that method would then be available without a separate import, and it would be more obvious that those tools work just fine with cuML estimators.

Does that seem like a reasonable solution to you? Any concerns or anything about your use case that that does not cover?

> I was also able to get cuML working on a Windows 10 machine with WSL2, while the doc mentions that cuML is only available for Windows 11 machines. WSL2 isn't exclusive to Windows 11 (despite running a lot better on it), so it might be judicious to update the doc.

Thank you! That's very useful feedback. I'm going to be playing around with cuML on Windows in the near future (since I do not do so nearly enough), and I'll be sure to check this out then if not before. We would want to just double-check that there aren't any additional caveats we'd have to mention for Windows 10 or any potential roadbumps for the install.

DaPiePiece commented 1 year ago

Hi,

Yes, that does sound very reasonable. My source on the "slowness" of cuML and sklearn being used together might have just been wrong, and if they work well together, I guess my only issue was the confusion I had about whether it is possible to juggle data between the CPU and the GPU without extra work (in PyTorch, for example, you have to call a function to transfer data between devices).

Looking forward to seeing more Windows support! Cheers!

I'll leave the closing of the thread to the admins as I'm unsure whether they need to classify the feature request before closure. Feel free to close it if classification can be done regardless.

wphicks commented 1 year ago

> My source on the "slowness" of cuML and sklearn being used together might have just been wrong, and if they work well together, I guess my only issue was the confusion I had about whether it is possible to juggle data between the CPU and the GPU without extra work

You're quite right that in general using cuML and sklearn together can be slow, since you can end up with a lot of data transfer back and forth. Why it works in this case is that the CPU code essentially just orchestrates work that is entirely confined to the GPU. You don't end up with extra data transfers because the CPU doesn't need to actually access the data.
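To make that division of labour concrete, here is a minimal toy sketch of a grid search loop in plain Python. It is not sklearn's or cuML's code: `ToyDeviceEstimator` and `toy_grid_search` are hypothetical names invented for this illustration, and the "heavy" work is a one-feature ridge fit standing in for a GPU algorithm. What it shows is that the search loop itself only handles parameter dictionaries and scalar scores, while everything that touches the data happens inside `fit` and `score`.

```python
from itertools import product


class ToyDeviceEstimator:
    """Hypothetical stand-in for a GPU estimator (not a cuML class)."""

    fit_calls = 0  # counts how often the "heavy" work runs

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # One-feature ridge regression without intercept, in closed form.
        ToyDeviceEstimator.fit_calls += 1
        self.coef_ = sum(a * b for a, b in zip(X, y)) / (
            sum(a * a for a in X) + self.alpha
        )
        return self

    def score(self, X, y):
        # Negative mean squared error, so larger is better.
        return -sum((self.coef_ * a - b) ** 2 for a, b in zip(X, y)) / len(X)


def toy_grid_search(estimator_cls, param_grid, folds):
    """Try every parameter combination on every (train, test) fold."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for (X_train, y_train), (X_test, y_test) in folds:
            # The loop only hands the data through; it never inspects it.
            model = estimator_cls(**params).fit(X_train, y_train)
            scores.append(model.score(X_test, y_test))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# The data follows y = 2x exactly, so the smallest penalty should win.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 * v for v in X]
folds = [
    ((X[:4], y[:4]), (X[4:], y[4:])),
    ((X[2:], y[2:]), (X[:2], y[:2])),
]
best_params, best_score = toy_grid_search(
    ToyDeviceEstimator, {"alpha": [0.0, 1.0, 10.0]}, folds
)
```

The number of fits is the number of parameter combinations times the number of folds (three times two here), and each fit is where the time goes. That is why running the estimator on the GPU speeds things up even though the search loop stays on the CPU.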

> I'll leave the closing of the thread to the admins as I'm unsure whether they need to classify the feature request before closure. Feel free to close it if classification can be done regardless.

Thanks! I'm going to leave this open for now but note for my fellow contributors that the specific task that needs to be done is:

Alias sklearn HPO utilities into the cuML namespace and update the docstrings with clear attribution.

Either I or someone else on the team will handle that, and then we'll close from there. I may also spin off the Windows 10 support question into a separate issue.

Zekrom-7780 commented 1 year ago

@wphicks, can I pick this issue up?

wphicks commented 1 year ago

@Zekrom-7780 That would be wonderful! Please do reach out if you have any questions on how to proceed with it.

Zekrom-7780 commented 1 year ago

@wphicks Thanks a lot. I just finished installing RAPIDS from the RAPIDS Release Selector, read this entire issue, and also read your highlighted message:

> Alias sklearn HPO utilities into the cuML namespace and update the docstrings with clear attribution.

Could you suggest how to get started on this? Once I have an idea, I should be able to finish it pretty quickly.

wphicks commented 1 year ago

Sure! The general idea is that when we do:

from cuml import some_hpo_thing

this should return the corresponding sklearn object/class/function/... but with its docstring updated to make it clear that the imported thing is coming from sklearn. An example might look something like this:

# File: cuml/foo/__init__.py
from sklearn.foo import bar

# Assign (=, not ==) to __doc__, the attribute Python uses for docstrings.
bar.__doc__ = f"""This class is implemented in scikit-learn and imported without modification into the cuml namespace. Please be sure to cite scikit-learn for work that makes use of this class.

{bar.__doc__}"""

For a similar effort, you might check out this PR: https://github.com/rapidsai/cuml/pull/2645. Note that that one required us to actually pull sklearn code into cuML, which should not be required here. All we need is to import the relevant objects into the cuML namespace.

Zekrom-7780 commented 12 months ago

Thanks a lot for the explanation, @wphicks. Could you please point me to where the code related to your explanation lives? The codebase is extensive.

I also attempted to review the PR you shared (https://github.com/rapidsai/cuml/pull/2645), but I couldn't easily pinpoint the relevant code. Could you point me to the relevant section or explain how to find it within the large codebase?

wphicks commented 12 months ago

I can do even better than that! ;) Looks like we already started this effort here: https://github.com/rapidsai/cuml/blob/branch-23.12/python/cuml/model_selection/__init__.py#L24. For any HPO method we want available from sklearn, we'll want to do exactly the same thing. I'd pull that little string addition out into its own variable and then just add it to whatever methods you import from sklearn.
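As a rough sketch of that suggestion, assuming nothing about the actual contents of the linked file: the attribution text lives in one variable and a small helper prepends it to the docstring of each imported object. `ATTRIBUTION`, `with_attribution`, `FakeSearchCV` and `fake_split` are hypothetical names used only for illustration; the two stand-ins take the place of whatever would be imported from sklearn so that the snippet runs without sklearn installed.

```python
# Hypothetical shared attribution text, defined once and reused.
ATTRIBUTION = (
    "This object is implemented in scikit-learn and imported without "
    "modification into the cuml namespace. Please be sure to cite "
    "scikit-learn for work that makes use of it.\n\n"
)


def with_attribution(obj, note=ATTRIBUTION):
    """Prepend `note` to obj.__doc__ unless it is already there."""
    doc = obj.__doc__ or ""
    if not doc.startswith(note):
        obj.__doc__ = note + doc
    return obj


# Stand-ins for objects that would come from sklearn, e.g. via
# `from sklearn.model_selection import ...` in the real module.
class FakeSearchCV:
    """Stand-in for a class imported from scikit-learn."""


def fake_split(data):
    """Stand-in for a function imported from scikit-learn."""
    return data


# Apply the same attribution to everything re-exported.
for exported in (FakeSearchCV, fake_split):
    with_attribution(exported)
```

One thing to keep in mind with this approach: the helper modifies the object itself, not a copy, so the updated docstring is also what a user sees if they import the same object directly from its original package in the same process.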

rapidsai / cuml

[FEA] Native HPO support and implementation on conda in Windows. #5380