Closed: dcolinmorgan closed this 3 weeks ago
It's not absolutely obvious that a naive implementation would lead to speed-ups.
Anyhow, an important question for this avenue to be possible: how would we do CI (continuous integration)?
Maybe a useful, more declarative question is: What might be a good path to enabling plugging in custom handlers, and limiting selection to them?
In our case, sometimes we want CPU-only, sometimes GPU-only (e.g., an end-to-end GPU pipeline), and sometimes we are indifferent (e.g., we care more about quality than device). There are other variants of this, like local vs. remote, dask CPU vs. dask_cudf, competing implementations of the same algorithm, and so on.
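One way to make the "plug in custom handlers, and limit selection to them" idea concrete is a small registry that only accepts an allowed set of handlers. This is a hypothetical sketch: `HandlerRegistry`, `register`, and `resolve` are illustrative names, not skrub API.

```python
class HandlerRegistry:
    """Hypothetical registry: plug in named handlers, optionally limit the set."""

    def __init__(self, allowed=None):
        self._handlers = {}
        # If `allowed` is given, only those handler names may be registered
        # (covers the CPU-only / GPU-only deployment cases).
        self._allowed = set(allowed) if allowed is not None else None

    def register(self, name, fn):
        if self._allowed is not None and name not in self._allowed:
            raise ValueError(f"handler {name!r} not in allowed set {self._allowed}")
        self._handlers[name] = fn

    def resolve(self, preference):
        # `preference` is an ordered list, e.g. ["gpu", "cpu"]; the first
        # registered handler wins, so "indifferent" callers still get something.
        for name in preference:
            if name in self._handlers:
                return self._handlers[name]
        raise LookupError(f"no registered handler among {preference!r}")


# CPU-only pipeline: restrict selection up front.
registry = HandlerRegistry(allowed={"cpu"})
registry.register("cpu", lambda X: ("cpu", X))
handler = registry.resolve(["gpu", "cpu"])  # no "gpu" registered; falls back to "cpu"
```

A GPU-only or "ambivalent" deployment would just pass a different `allowed` set and preference order.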
Our use is pretty limited:
E.g.,
this hasn't seen any activity in a while, should we close it @GaelVaroquaux ?
My personal opinion is that this isn't really a priority in the short term, and that GPU usage will come from underlying libraries rather than explicit options in skrub itself (at least for now): scikit-learn and the array API (and possibly, one day, plugins), polars' GPU support, narwhals' cudf backend if we start using narwhals, the new TextEncoder using pytorch, and so on.
Yes
@GaelVaroquaux @jeromedockes Our fork has been active around the cu-cat and pygraphistry repos
Answering the direct questions --
We see significant speedups on a variety of real-world workloads, like machine logs at 2x-50x, even on small GPUs.
Small datasets do not see as much benefit, but they are small, so they are not our target. In pygraphistry, we have a similar pattern of recommending CPU mode for small data so as not to hog the GPU.
We currently do local CI. I believe GitHub now supports GPU runners.
This is a semi-active project. We have been preparing the pygraphistry side to turn this on by default, and updating the cu-cat side to support that.
Also worth noting: we see similar patterns in pygraphistry around building on top of cupy/cudf, as was done here, so our solution isn't fundamentally different; we are using the same GPU PyData ecosystem.
cudf was too high-level, so we use cupy. Being told to change to packages that do not exist or are insufficient is hard feedback to act on. Can we get more specific review recommendations?
If a different interface is preferred for enabling this, let us know. In pygraphistry, we evolved to detect the available engine and dispatch based on that (the default "just works"), with optional engine flags to trigger coercions. I can imagine other patterns.
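The detect-and-dispatch pattern described above can be sketched roughly as follows; the function name and the cudf/pandas pair are my own illustrative assumptions, with "detection" reduced to checking importability:

```python
import importlib.util


def resolve_engine(engine="auto"):
    """Pick a dataframe engine.

    'auto' detects what is installed (the default "just works" path);
    'cudf' or 'pandas' are explicit coercions via the engine flag.
    """
    if engine == "auto":
        # Detection: prefer the GPU engine when cudf is importable.
        return "cudf" if importlib.util.find_spec("cudf") else "pandas"
    if engine in ("cudf", "pandas"):
        return engine
    raise ValueError(f"unknown engine: {engine!r}")
```

Real detection would likely also check whether a GPU device is actually present, not just whether the package imports.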
engine flag to enable cuml-based implementation of class functions
Benefits to the change: GPU-based speedup
Naive pseudocode for the new behavior (realistically much tougher to implement):
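As one possible shape for that naive behavior (a sketch under my own assumptions: `select_backend` and the flag values are illustrative, not the proposed API), the flag would map to a backend, requiring cuml only when explicitly asked for:

```python
import importlib.util


def select_backend(engine="auto"):
    """Map an engine flag to a backend name.

    engine='auto'    -> 'cuml' when importable, else 'sklearn'
    engine='cuml'    -> require cuml; raise if it is missing
    engine='sklearn' -> force the CPU implementation
    """
    has_cuml = importlib.util.find_spec("cuml") is not None
    if engine == "cuml":
        if not has_cuml:
            # Explicit request: fail loudly rather than silently fall back.
            raise ImportError("engine='cuml' requested but cuml is not installed")
        return "cuml"
    if engine == "sklearn":
        return "sklearn"
    if engine == "auto":
        return "cuml" if has_cuml else "sklearn"
    raise ValueError(f"unknown engine: {engine!r}")
```

A class exposing `engine` in `__init__` would call something like this in `fit` and instantiate the cuml or scikit-learn estimator accordingly; the realistic version is much tougher (matching hyperparameters, output types, and device placement across the two implementations).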