cuml implementation of SuperVectorizer, GapEncoder, SimilarityEncoder

skrub-data / skrub

Prepping tables for machine learning

https://skrub-data.org/

BSD 3-Clause "New" or "Revised" License

1.21k stars 97 forks source link

cuml implementation of SuperVectorizer, GapEncoder, SimilarityEncoder #369

Closed dcolinmorgan closed 3 weeks ago

dcolinmorgan commented 2 years ago

engine flag to enable cuml-based implementation of class functions

Benefits to the change: gpu-based speedup

Naive pseudocode for the new behavior (realistically much tougher to implement):

if self.engine == 'cuml':
  from cuml.cluster import KMeans
  from cuml.feature_extraction.text import CountVectorizer, HashingVectorizer
  from cuml.neighbors import NearestNeighbors
  from cuml.preprocessing import OneHotEncoder

GaelVaroquaux commented 2 years ago

It's not absolutely obvious that a naive implementation would lead to speed-ups.

Anyhow, an important question for this alley to be possible: how would we do CI (continuous integration)?

lmeyerov commented 2 years ago

Maybe a useful, more declarative question is: What might be a good path to enabling plugging in custom handlers, and limiting selection to them?

In our case, sometimes we want CPU-only, sometimes GPU-only (ex: an end-to-end GPU pipeline), and sometimes, ambivalent (e.g., care more about quality). There are other variants of this, like local vs remote, dask_cpu vs dask_cudf, competing impls of same alg, .... .

Our use is pretty limited:

E.g.,

https://github.com/graphistry/pygraphistry/blob/97b52e22b4d9c17b40a4a33fed115dd789c981e7/graphistry/feature_utils.py#L892

https://github.com/graphistry/pygraphistry/blob/97b52e22b4d9c17b40a4a33fed115dd789c981e7/graphistry/feature_utils.py#L944

jeromedockes commented 3 weeks ago

this hasn't seen any activity in a while, should we close it @GaelVaroquaux ?

my personal opinion is that this isn't really a priority in the short term, and that usage of the gpu will come from underlying libraries (scikit-learn and array api & possibly one day plugins, polars' gpu support, if we start using narwhals it has a cudf backend, the new TextEncoder uses pytorch, ...) rather than explicit options in skrub itself (at least for now)

GaelVaroquaux commented 3 weeks ago

this hasn't seen any activity in a while, should we close it @.*** ?

Yes

lmeyerov commented 3 weeks ago

@GaelVaroquaux @jeromedockes Our fork has been active around the cu-cat and pygraphistry repos

Answering the direct questions --

We see significant speedups on a variety of real-world workloads , like machine logs at 2x-50x, even on small GPUs ..
...Small datasets do not see as much benefit, but they are small so not our target. In pygraphistry, we have a similar pattern of recommending CPU mode for the small stuff so as to not hog the GPU.
We currently do local CI. I believe GitHub now support GPU runners.
This is a semi-active project. We have been preparing the pygraphistry side to turn this on by default, and updating the cu-cat side to support that.

lmeyerov commented 3 weeks ago

Also worth noting, we see similar patterns in pygraphistry around building on top of cupy/cudf, as was done here, and so our solution isn't fundamentally different as we are using this same GPU pydata ecosystem.

Cudf was too high-level so we do cupy. Changing to packages that do not exist or insufficient is hard feedback to act on. Can we get more specific review recommendations?

If there is a different interface preferred for enabling, let us know. In pygraphistry, we evolved to nowadays try to detect color and dispatch based on that (default ''just works'), and allow optional engine flags to trigger coercions. I can imagine other patterns.