pola-rs / tpch

MIT License

Are TPCH Benchmark results actual or not? #94

Closed anmyachev closed 2 months ago

anmyachev commented 3 months ago

Hi!

Polars' performance is very impressive. I would like to know if the results are up to date, because I did not find the library versions used. Do you have this information?

[Screenshot: TPC-H benchmark results chart from the Polars website]
stinodego commented 3 months ago

I have actually just rerun the benchmarks for the latest versions of Polars, DuckDB, pandas, Dask, and PySpark. We will be publishing the results in the near future.

YarShev commented 3 months ago

@stinodego, which Modin engine do you use for benchmarking? Note that Modin on Ray is the most mature.

stinodego commented 3 months ago

We do not run Modin at the moment.

YarShev commented 3 months ago

As far as I can see, Modin is in the picture. Is there a reason you do not run it at the moment?

stinodego commented 3 months ago

Modin was a bit annoying to run because Ray does not yet support Python 3.12 (we were using the Ray backend). It's also the smallest of our competitors and not really comparable in terms of performance, so we dropped it for now. We can always re-add the queries later without much hassle.

YarShev commented 3 months ago

Modin is a library in the data processing ecosystem, as are cuDF, Ibis, and many more. It would be great to see their performance too.

anmyachev commented 2 months ago

> I have actually just rerun the benchmarks for the latest versions of Polars, DuckDB, pandas, Dask, and PySpark. We will be publishing the results in the near future.

@stinodego It's good to hear that. I just tried to reproduce these results on an n2-highmem-16 machine at scale factor 10, including IO, and got the following results (for 8 queries):

It looks like the pandas speedup was a little less than twofold. However, to be sure, I need to know which queries were included in the total time shown on the graph. Could you share this information as well?
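When comparing per-query totals across machines like this, a consistent timing harness helps. The following is a minimal sketch only: the tiny in-memory `lineitem` table and the Q1-style aggregation stand in for the real scale-factor-10 Parquet data and the full TPC-H query set used by the benchmark repo.

```python
import time

import pandas as pd

# Hypothetical miniature lineitem table; the real benchmark reads
# scale-factor-10 Parquet files, so IO weighs in very differently.
lineitem = pd.DataFrame(
    {
        "l_returnflag": ["N", "N", "R"],
        "l_linestatus": ["O", "O", "F"],
        "l_quantity": [17.0, 36.0, 8.0],
        "l_extendedprice": [100.0, 200.0, 50.0],
    }
)

start = time.perf_counter()
# TPC-H Q1-style aggregation: group by flag/status and sum measures.
result = (
    lineitem.groupby(["l_returnflag", "l_linestatus"], as_index=False)
    .agg(
        sum_qty=("l_quantity", "sum"),
        sum_price=("l_extendedprice", "sum"),
    )
)
elapsed = time.perf_counter() - start
print(f"q1-style query took {elapsed:.4f}s")
print(result)
```

Summing `elapsed` over the same fixed subset of queries for every library is what makes the totals comparable, which is why knowing which queries went into the published number matters.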

stinodego commented 2 months ago

> Modin is a library that is in the ecosystem of data processing as well as cudf, ibis and many more. It would be great to see the performance of those too.

As far as I understand, Ibis is not an engine but just a front-end, so benchmarking it makes no sense. Benchmarking cuDF doesn't make sense either, as it's a GPU engine - that would be comparing apples to oranges.

I have no objection to including Modin once it can run on Python 3.12. We already have some infrastructure in place, I just have to re-add it.

> It looks like the pandas speed up was a little less than twice as fast. However, to be sure, I need to know which requests were taken into account in the total time shown on the graph. Could you also share this information?

I'm not exactly sure how the numbers on the Polars website that you screenshotted came about - though I think it's probably the sum of the first 7 or 8 queries. There is a bit more background here. That was posted on January 1st, 2023, so I'm sure all dataframe libraries have made great improvements since then. Specifically, pandas is now run with PyArrow data types and copy-on-write optimizations enabled.

I think the original question has been answered adequately, so I'll close this issue. Keep an eye out for a new blogpost on the Polars website with updated benchmarks in the next two weeks or so.