rht opened this issue 2 months ago
This seems amazing! A single API with a Polars-like feel, 20+ backends, and great performance; it should be a priority. The API seems very expressive and it also supports geospatial operations. It seems that you can also reach the chosen backend's native functions, although it's not very straightforward (https://ibis-project.org/reference/scalar-udfs#ibis.expr.operations.udf.scalar.builtin). There is no mention of a GPU backend (maybe we can open an issue to ask them about it?), but they might use RAPIDS-polars when it comes out.
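For reference, a minimal sketch of what that looks like, assuming DuckDB (Ibis' default backend) is installed and using DuckDB's built-in jaccard() string-similarity function; the table and column names here are just illustrative:

```python
import ibis
from ibis import udf

# Expose a backend-native function that Ibis itself doesn't provide.
@udf.scalar.builtin
def jaccard(a: str, b: str) -> float:
    """DuckDB's built-in jaccard() string-similarity function."""

t = ibis.memtable({"name": ["alice", "alicia", "bob"]})
expr = t.mutate(sim=jaccard(t.name, "alice"))
print(expr.execute())  # executed by the default DuckDB backend
```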
Regarding pandas performance, I believe it might always be a bottleneck. We might consider ditching it completely in the future.
I read a bit about Ibis, and its greatest strength is the use of DuckDB as a backend (which might be even more performant than lazy Polars). However, I also read online that some find the Ibis API a bit counterintuitive. Also, if DuckDB is objectively faster than any other backend, why would you ever switch to another one? The real selling point is that, by decoupling the API from the backend, you can quickly switch backends in the future if anything faster comes along, without changing the code you have written (see the sketch below). We need to acknowledge that a lot of people are used to the pandas API currently. Maybe we could have pandas (perhaps switching to Dask or Modin), Polars, and Ibis as the three interfaces.
EDIT: Another pro of Ibis is the possibility of using a distributed engine like PySpark.
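To illustrate that decoupling, here is a rough sketch with a hypothetical agent table, assuming the duckdb and polars extras are installed; the PySpark backend would slot in the same way:

```python
import ibis

agents = ibis.memtable({"id": [1, 2, 3], "wealth": [10, 5, 7]})
expr = agents.filter(agents.wealth > 6).order_by("wealth")

ibis.set_backend("duckdb")   # local single-node engine
print(expr.execute())

ibis.set_backend("polars")   # same expression, different engine, no code changes
print(expr.execute())
```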
The problem is that we already have an abstract DF interface, which is yet another Ibis. Ibis themselves have long tried to support pandas because it is crucial to growing their user base. Do you resonate with Ibis' gripes while writing the concrete pandas backend for mesa-frames?
> We need to acknowledge that a lot of people are used to the pandas API currently. Maybe we could have pandas (perhaps switching to Dask or Modin), Polars, and Ibis as the three interfaces.
I'd say the main reason for using mesa-frames is performance, and people would be willing to use a faster option than pandas as long as the syntax is not as tedious as CUDA/FLAME 2. What is the fastest you can get while still on pandas syntax (e.g. Dask or Modin), and is it faster than lazy Polars? We need to know this before proceeding.
> I'd say the main reason for using mesa-frames is performance, and people would be willing to use a faster option than pandas as long as the syntax is not as tedious as CUDA/FLAME 2. What is the fastest you can get while still on pandas syntax (e.g. Dask or Modin), and is it faster than lazy Polars? We need to know this before proceeding.
I think both Dask and Modin are definitely slower than Polars / DuckDB on a single node. I think performance is key. However, I fear modelers might not use mesa-frames because it's yet another API to learn. But having a single DF backend would simplify development greatly. We could also have a single Ibis backend and return values in a pandas DataFrame for convenience; pandas now supports the PyArrow backend, so that shouldn't be too slow. I have to do some experiments to compare performance between the different approaches, but this would be the easiest.
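A minimal sketch of that convenience path, assuming a recent pandas (>= 2.0, for ArrowDtype) and a hypothetical agent table:

```python
import ibis
import pandas as pd

agents = ibis.memtable({"id": [1, 2, 3], "wealth": [10, 5, 7]})
expr = agents.filter(agents.wealth > 6)

# Execute on whatever backend is active, then hand back a pandas DataFrame
# whose columns use pyarrow-backed dtypes instead of NumPy ones.
df = expr.to_pyarrow().to_pandas(types_mapper=pd.ArrowDtype)
print(df.dtypes)
```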
Opened an issue on GPU Acceleration
@rht check out the answer to the issue.
The potential with Ibis is quite promising. If Theseus gets released in the future, we could run Ibis models on multi-node GPUs, which would be a game-changer for companies using ABMs in production, since it could enable much larger-scale simulations. You could transition smoothly from prototyping on a single CPU to running on multi-node GPUs without code changes. Interestingly, I read that GPUs provide the most significant speed-up for join operations, which are our main bottleneck. Adopting Ibis also seems like a good way to future-proof the codebase and reduce complexity. We can implement an option to return DFs in various formats anyway (Ibis, pandas, lazy Polars, Polars, etc.). This would allow users to write the actual method implementations with an API they're familiar with while we benefit from Ibis under the hood.
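Something like this rough sketch could work (a hypothetical helper, not existing mesa-frames API, using Arrow as the interchange format):

```python
import ibis
import polars as pl

def collect(expr: ibis.Table, output: str = "polars"):
    """Execute an Ibis expression and return it in the requested format."""
    if output == "ibis":
        return expr                              # stay lazy, let the caller decide
    if output == "pandas":
        return expr.to_pandas()                  # materialized by the active backend
    if output == "polars":
        return pl.from_arrow(expr.to_pyarrow())  # Arrow table -> Polars DataFrame
    raise ValueError(f"unknown output format: {output!r}")
```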
Polars is adding GPU support, which Ibis gets "for free", covering the single-node GPU query-engine case.
Which means the current mesa-frames will soon support single-node GPU via Polars. What Ibis brings to the table is limited to multi-node GPU support.
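Based on the Polars/RAPIDS announcement, the GPU engine is expected to be opt-in at collect time; a minimal sketch, assuming the GPU extra is installed and a CUDA-capable device is available:

```python
import polars as pl

lf = pl.LazyFrame({"id": [1, 2, 3], "wealth": [10, 5, 7]})
result = (
    lf.filter(pl.col("wealth") > 6)
      .sort("wealth")
      .collect(engine="gpu")  # falls back to the CPU engine if GPU execution isn't supported
)
print(result)
```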
yep, Theseus is not public (and probably won't be anytime soon) -- the general thinking is these modern single-node OLAP engines like DuckDB, DataFusion, and eventually Polars are sufficient for 90-99% of data use cases, as they allow you to scale up to ~10TB size queries
Unless there is another multi-node GPU library that Ibis already supports, I think adding an Ibis implementation now won't provide any immediate benefit to the user.
Also, to ease maintenance cost, supporting Ibis means we have to drop support for Polars, since Ibis already wraps Polars, and maintaining 3 backends is tedious.
> Unless there is another multi-node GPU library that Ibis already supports, I think adding an Ibis implementation now won't provide any immediate benefit to the user.
There is still the benefit of distributed CPU with the Spark backend. But if Theseus gets released in the future (and I don't see why it shouldn't), we would get multi-node GPU for free.
> Also, to ease maintenance cost, supporting Ibis means we have to drop support for Polars, since Ibis already wraps Polars, and maintaining 3 backends is tedious.
Yes, completely agree on this one.
I'm creating an ibis branch, and I'm going to start making PRs there.
https://ibis-project.org/ has various backends: Dask, pandas, Polars, and many more. But apparently, the pandas backend is so problematic to maintain that they have decided to remove it.