ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

GPU-Accelerated Backend #9986

Closed adamamer20 closed 3 weeks ago

adamamer20 commented 3 weeks ago

Which new backend would you like to see in Ibis?

I think it would be useful to have a GPU-accelerated backend for operations on big DFs. In this paper they tested DuckDB against other GPU-accelerated databases and the performance difference is significant. Since Ibis used to support pandas, RAPIDS cuDF would be an obvious choice as it probably wouldn't need too much refactoring.

lostmygithubaccount commented 3 weeks ago

hi @adamamer20, thanks for opening this issue! apologies for the slow reply

you are right that with the pandas backend, you could use cuDF (and in theory other pandas-API compatible tools) -- this was shown here: https://voltrondata.com/blog/ibis-cudf-pandas
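For reference, a minimal sketch of turning on the cuDF pandas accelerator mode before pandas is imported. `cudf.pandas.install()` is cuDF's entry point for this mode; the wrapper `enable_cudf_pandas` is a hypothetical helper, and the sketch assumes that a missing `cudf` package should simply mean running on plain pandas.

```python
import importlib.util


def enable_cudf_pandas() -> bool:
    """Turn on cudf.pandas if cuDF is installed; return whether it was enabled.

    Hypothetical helper for illustration, not part of cuDF or Ibis.
    """
    if importlib.util.find_spec("cudf") is None:
        return False  # no cuDF here: pandas stays on the CPU
    import cudf.pandas

    cudf.pandas.install()  # must run before pandas is first imported
    return True


# Call this first, then import pandas as usual:
#   accelerated = enable_cudf_pandas()
#   import pandas as pd
```

Even when this is enabled, the code still goes through the pandas API, so the limits described in the next paragraph apply.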

while supporting GPU-based execution engines is important to us, the pandas backend with Ibis (and the pandas API in general) leaves a ton of performance on the table, largely negating the purpose of using hardware acceleration. the pandas API assumes that all data fits in memory and that execution is single-threaded and eager

Ibis is an independently governed open source project, with its main sponsor being Voltron Data -- BlazingSQL (mentioned in the paper you link) was effectively merged (idk the exact corporate language here) into Voltron Data: https://voltrondata.com/news/fundinglaunch. a lot of the founders and engineers at Voltron Data that many of the Ibis contributors work with were largely responsible for RAPIDS and cuDF and BlazingSQL

separately, Polars is working with the RAPIDS team at NVIDIA to bring a new version of GPU execution that presumably improves on the pandas API version: https://pola.rs/posts/polars-on-gpu/. our tentative plan is to leverage this via the Polars backend for Ibis once it becomes available for single-node NVIDIA GPU execution

so our general thinking on this is:

we also have maintainers who were heavily involved in Dask and may have more thoughts on cuDF via Dask, though my understanding would be you generally still suffer the performance hit of the pandas API

adamamer20 commented 3 weeks ago

hey @lostmygithubaccount, thanks for the detailed response! In my tests, pandas-cudf was also slower than eager Polars, glad to hear it wasn't due to my implementation. I hadn't heard about Theseus, that looks promising. From what I understand it is not publicly available yet, right? I would keep the issue open until RAPIDS Polars comes out or Theseus is a supported backend, but you can close it if you'd like (since it's not dependent on Ibis itself).

lostmygithubaccount commented 3 weeks ago

yep, Theseus is not public (and probably won't be anytime soon) -- the general thinking is these modern single-node OLAP engines like DuckDB, DataFusion, and eventually Polars (once it implements its new "streaming" engine that works like the other two) are sufficient for 90-99% of data use cases, as they allow you to scale up to ~10TB size queries

it's unclear if single-node GPU w/ current tooling would be of much use in real world scenarios. we'll keep an eye on developments with Polars and look to add via the Polars backend if it's compelling

I'll close this out because there's nothing to do immediately, just waiting and watching. Ibis already does support Theseus (the repo for that backend is private, though could be made public eventually)

adamamer20 commented 1 week ago

Wanted to leave a note that Polars GPU engine has just been released in open beta

https://pola.rs/posts/gpu-engine-release/

lostmygithubaccount commented 1 week ago

yep! in theory this should "just work" through the Ibis backend already (we might need to bump the supported version of Polars or something), though I don't have an NVIDIA GPU readily available to try it out
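As a rough illustration of the Polars side of this, here is a minimal sketch that collects a LazyFrame on the GPU engine when the `cudf_polars` package is importable and otherwise uses the default engine. The helper names `gpu_engine_available` and `collect_preferring_gpu` are hypothetical, not part of Polars or Ibis; `collect(engine="gpu")` is the Polars call for the GPU engine. It is untested on real GPU hardware.

```python
import importlib.util


def gpu_engine_available() -> bool:
    """True if cudf_polars (the package behind the Polars GPU engine) is importable."""
    return importlib.util.find_spec("cudf_polars") is not None


def collect_preferring_gpu(lazy_frame):
    """Collect a polars.LazyFrame on the GPU engine if possible, else on the default engine.

    Hypothetical helper: it only relies on the object having a `collect` method.
    """
    if gpu_engine_available():
        return lazy_frame.collect(engine="gpu")
    return lazy_frame.collect()


# Example usage (requires polars to be installed):
#   import polars as pl
#   lf = pl.LazyFrame({"k": ["a", "b", "a"], "v": [1, 2, 3]})
#   print(collect_preferring_gpu(lf.group_by("k").agg(pl.col("v").sum())))
```

Checking for the package up front avoids an import error on machines without the GPU extras installed; whether Ibis itself exposes an engine option depends on the Ibis version in use.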

lostmygithubaccount commented 1 week ago

I believe this will allow you to use it: https://github.com/ibis-project/ibis/pull/10151