skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

cuml implementation of SuperVectorizer, GapEncoder, SimilarityEncoder #369

Open dcolinmorgan opened 1 year ago

dcolinmorgan commented 1 year ago

An `engine` flag to enable a cuml-based implementation of class functions.

Benefit of the change: GPU-based speedup.

Naive pseudocode for the new behavior (realistically much tougher to implement):

if self.engine == 'cuml':
  from cuml.cluster import KMeans
  from cuml.feature_extraction.text import CountVectorizer, HashingVectorizer
  from cuml.neighbors import NearestNeighbors
  from cuml.preprocessing import OneHotEncoder
GaelVaroquaux commented 1 year ago

It's not absolutely obvious that a naive implementation would lead to speed-ups.

Anyhow, an important question for this avenue to be viable: how would we do CI (continuous integration)?

lmeyerov commented 1 year ago

Maybe a useful, more declarative question is: What might be a good path to enabling plugging in custom handlers, and limiting selection to them?

In our case, sometimes we want CPU-only, sometimes GPU-only (ex: an end-to-end GPU pipeline), and sometimes we are ambivalent (e.g., we care more about quality). There are other variants of this, like local vs remote, dask_cpu vs dask_cudf, competing implementations of the same algorithm, and so on.

Our use is pretty limited, e.g.:

https://github.com/graphistry/pygraphistry/blob/97b52e22b4d9c17b40a4a33fed115dd789c981e7/graphistry/feature_utils.py#L892

https://github.com/graphistry/pygraphistry/blob/97b52e22b4d9c17b40a4a33fed115dd789c981e7/graphistry/feature_utils.py#L944
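The "plugging in custom handlers, and limiting selection to them" idea could be sketched as a small registry keyed by capability tags (cpu, gpu, dask, ...). Everything here is hypothetical, a sketch of one possible interface rather than anything skrub or pygraphistry currently expose:

```python
# Hypothetical handler registry with restricted selection.
_HANDLERS = {}


def register_handler(name, factory, tags=()):
    """Register a factory (e.g. a sklearn or cuml encoder class) under a
    name, together with capability tags describing where it can run."""
    _HANDLERS[name] = {"factory": factory, "tags": frozenset(tags)}


def select_handler(require=()):
    """Return (name, factory) for the first registered handler whose tags
    cover every required capability; raise if none qualifies."""
    required = frozenset(require)
    for name, entry in _HANDLERS.items():
        if required <= entry["tags"]:
            return name, entry["factory"]
    raise LookupError(f"no handler matches requirements {sorted(required)}")
```

A caller wanting an end-to-end GPU pipeline would then do `select_handler(require=("gpu",))`, while an ambivalent caller passes no requirements and gets whatever was registered first.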