rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Run TPC-H on Databricks #315

Open jacobtomlinson opened 8 months ago

jacobtomlinson commented 8 months ago

Reproduce, on Databricks, the TPC-H work that @rjzamora has been looking into with Dask+cuDF and the TPC-H work Coiled has been doing with Dask.

https://tpch.coiled.io/

rjzamora commented 8 months ago

cc @charlesbluca

skirui-source commented 6 months ago

https://medium.com/coiled-hq/tpc-h-benchmarks-for-query-optimization-with-dask-expressions-c42921aafbd3

skirui-source commented 6 months ago

Initially I was running into this bug, which I resolved by downgrading pytest from v8.0.1 to v7.4.4.
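For reference, the pin can be applied with pip (conda would work equally well; the v7.4.4 version comes from the comment above):

pip install "pytest==7.4.4"  # stay below pytest 8 to avoid the bug linked above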

skirui-source commented 4 months ago

@jacobtomlinson just to clarify, is the goal here to:

  1. run TPC-H benchmarks with dask-expr + cudf in Databricks?
  2. run TPC-H benchmarks with dask + coiled in Databricks?

You also mentioned something about running benchmarks for cudf.pandas; is that part of this?

skirui-source commented 4 months ago

@rjzamora with your recent work merged into rapidsai/cudf, what are the steps to recreate the test environment you used to produce the benchmark results?

jacobtomlinson commented 4 months ago

The primary goal is dask-expr + cudf on Databricks. Secondary goal is cudf.pandas on Databricks.

rjzamora commented 4 months ago

Thanks for working on this, @skirui-source!

> what are the steps to recreate the test environment you used to produce the benchmark results?

I haven't updated my tpch-rapids branch of coiled/benchmarks to align with main, but the latest version of cudf-24.06 seems to work fine with that branch. You can also use cudf-24.04, but you need to set the DASK_DATAFRAME__QUERY_PLANNING=True environment variable before running to use dask-expr under the hood (setting it to False will disable dask-expr in both 24.04 and 24.06; it is on by default in 24.06).
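If you are driving this from Python rather than the shell, the same switch can be flipped through Dask's config system. A minimal sketch; the config key is the in-process equivalent of the environment variable above, and it must be set before dask.dataframe is first imported:

import dask

# Equivalent to exporting DASK_DATAFRAME__QUERY_PLANNING=True; must run
# before the first `import dask.dataframe` for dask-expr to take effect.
dask.config.set({"dataframe.query-planning": True})

import dask.dataframe as dd  # now backed by dask-expr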

To run locally, I started with a 24.04 or 24.06 RAPIDS environment, then set up the benchmarks repo (a sketch of those steps follows).
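The exact setup commands were dropped from the original comment, so the following is only a plausible reconstruction; in particular the fork URL and the install step are assumptions, not confirmed details:

# Hypothetical setup sketch: clone the fork carrying the tpch-rapids branch
# (URL assumed) and install the benchmark suite into the RAPIDS environment.
git clone https://github.com/rjzamora/benchmarks.git
cd benchmarks
git checkout tpch-rapids
pip install -e .  # assumed install step; defer to the repo's README if it differs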

Then, from within benchmarks, you can execute a query:

export DASK_DATAFRAME__QUERY_PLANNING=True
pytest --benchmark  tests/tpch/test_dask.py::test_query_1 --local --rapids --scale 100

The timing result will be appended to a benchmark.db file. I've been using the following Python code to convert that file to a pandas DataFrame (but I'm sure there is a better way):

import pandas as pd
import sqlite3

# The benchmark suite appends each run to this SQLite file; the
# per-query timings live in the test_run table.
con = sqlite3.connect("./benchmark.db")
df = pd.read_sql_query("SELECT * FROM test_run", con)
print(df[["name", "duration"]])
con.close()
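One marginally tidier variant (same output; contextlib.closing guarantees the connection is closed even if the read fails):

from contextlib import closing
import sqlite3

import pandas as pd

with closing(sqlite3.connect("./benchmark.db")) as con:
    df = pd.read_sql_query("SELECT name, duration FROM test_run", con)
print(df)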

Important notes on the data:

I believe I needed to jump through a few annoying hoops to both generate the data and modify the code to handle differences between my data and the S3 data used by Coiled. For example, my data currently uses a single Parquet file for each table, while Coiled uses a directory of files (which definitely makes more sense beyond sf100). This is why my benchmarks branch has a _read_parquet "hack" in benchmarks/tests/tpch/dask_queries.py, where I add a ".parquet" suffix to the table path.
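For illustration, a wrapper along the lines described might look like this. It is a sketch, not the actual code from the tpch-rapids branch, and the use of dd.read_parquet as the underlying reader is an assumption:

import dask.dataframe as dd

def _read_parquet(path, **kwargs):
    # Sketch: local data stores each table as a single file such as
    # "lineitem.parquet", whereas the Coiled S3 data uses a directory of
    # files per table, so the suffix is appended only when missing.
    if not str(path).endswith(".parquet"):
        path = f"{path}.parquet"
    return dd.read_parquet(path, **kwargs)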

> The primary goal is dask-expr + cudf on Databricks. Secondary goal is cudf.pandas on Databricks.

I would definitely focus on dask-expr + cudf for now. I believe pandas->arrow->pandas conversion is still a problem in cudf.pandas, and dask/dask-expr will do this a lot.
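For context, cudf.pandas is activated by installing it before pandas is imported; this is the standard activation mechanism from the cuDF docs, not something specific to this thread:

import cudf.pandas
cudf.pandas.install()  # must run before the first `import pandas`

import pandas as pd  # pandas API, accelerated by cuDF where supported

Scripts can equivalently be launched with `python -m cudf.pandas script.py`.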

skirui-source commented 4 months ago

Initial batch of results (tested on dgx14):

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 10

In [3]: print(df[["name", "duration"]])
name  duration
1    test_query_1  9.345021
2    test_query_2       NaN
3    test_query_3  6.615860
4    test_query_4  4.828701
5    test_query_5  7.316751
6    test_query_6  5.451433
7    test_query_7  7.176501
8    test_query_8  7.961834
9    test_query_9  6.981508
10  test_query_10  6.611681
11  test_query_11  3.811042
12  test_query_12  5.117650
13  test_query_13       NaN
14  test_query_14  5.732556
15  test_query_15  5.727635
16  test_query_16  4.815220
17  test_query_17  6.078448
18  test_query_18  7.685056
19  test_query_19  5.934201
20  test_query_20  5.236853
21  test_query_21  7.247067
22  test_query_22  4.751128

(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 100


In [3]: print(df[["name", "duration"]])
name   duration
23   test_query_1  14.323813
24   test_query_2   7.302116
25   test_query_3        NaN
26   test_query_4  18.434659
27   test_query_5  19.057925
28   test_query_6   5.927966
29   test_query_7  16.077179
30   test_query_8  16.161171
31   test_query_9  27.749285
32  test_query_10  15.439956
33  test_query_11   5.020720
34  test_query_12   9.845365
35  test_query_13        NaN
36  test_query_14   6.785045
37  test_query_15  11.023882
38  test_query_16        NaN
39  test_query_17  21.482339
40  test_query_18  28.699488
41  test_query_19   7.627631
42  test_query_20        NaN
43  test_query_21  24.507490
44  test_query_22   5.938541
(rapids-24.06) skirui@dgx14:/datasets/skirui/GitRepos/coiled-benchmarks$ py.test --benchmark tests/tpch/test_dask.py --local --rapids --scale 1

             name     duration
45   test_query_1     7.262008
46   test_query_2     4.105303
47   test_query_3     4.405016
48   test_query_4          NaN
49   test_query_5     4.382684
50   test_query_6     3.901139
51   test_query_7          NaN
52   test_query_8     4.996450
53   test_query_9     4.708150
54  test_query_10          NaN
55  test_query_11     3.718096
56  test_query_12     3.900132
57  test_query_13          NaN
58  test_query_14     4.186984
59  test_query_15     4.032958
60  test_query_16          NaN
61  test_query_17     4.383530
62  test_query_18     4.013830
63  test_query_19  3597.592139
64  test_query_20     5.858231
65  test_query_21     5.473735
66  test_query_22     5.003113