rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[QST] How to improve performance of cudf.pandas `df1.add(df2)` #14548

Closed · dask-user closed this issue 9 months ago

dask-user commented 10 months ago

In my example below, I'm seeing that the cudf.pandas extension does have an impact, but it's a <2x speedup. I would have expected that for something as basic as adding two like-shaped frames together, this would have had a significantly more profound impact. Is this expected? If not, is there some other way I can realize a better performance improvement?

Thank you.

Environment configuration (using colab linked in cudf.pandas docs):

>>> import cudf
>>> cudf.__version__
'23.10.02'
>>> import pandas
>>> pandas.__version__
'1.5.3'
>>> !nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Running the following script:

%load_ext cudf.pandas

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(250000, 1000))
df2 = pd.DataFrame(np.random.randn(250000, 1000))

x = df1.add(df2)  # timing this call below tag=(1)
y = df1.add(df2).add(df1).add(df2)  # timing this call below tag=(2)

Timings (measured with %%timeit):

CPU timings (without the cudf.pandas extension)
tag=(1)
708 ms ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

tag=(2)
2.24 s ± 382 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

GPU timings (with the cudf.pandas extension); GPU usage confirmed with %%cudf.pandas.profile
tag=(1)
392 ms ± 49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

tag=(2)
1.26 s ± 161 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
shwina commented 10 months ago

Thanks for reporting. In general, you'll see optimal performance with tall, skinny DataFrames: ideally millions of rows and tens of columns. I suspect the main reason you're seeing less-than-ideal performance is the large number of columns you're testing with.
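
If you want to see the effect of shape directly, here's a rough way to compare the two layouts while keeping the total element count fixed (sizes trimmed so everything fits comfortably in GPU memory; exact numbers will vary by GPU):

%load_ext cudf.pandas

import numpy as np
import pandas as pd

# wide: many columns, relatively few rows (similar to your example)
wide1 = pd.DataFrame(np.random.randn(100_000, 1_000))
wide2 = pd.DataFrame(np.random.randn(100_000, 1_000))

# tall and skinny: same number of elements, far fewer columns
tall1 = pd.DataFrame(np.random.randn(4_000_000, 25))
tall2 = pd.DataFrame(np.random.randn(4_000_000, 25))

%timeit wide1.add(wide2)  # one small operation per column -> overhead dominates
%timeit tall1.add(tall2)  # a handful of large operations -> the GPU stays busy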

dask-user commented 10 months ago

Thanks for the insight. Reshaping to 10,000,000 x 25 did indeed result in a 20x speedup. That is rather unfortunate, though, as many time-series-analysis use cases naturally fit the original shape, e.g. weather, oil/gas well drilling, power generation, etc. (in fact, often wider and shorter). Is this an inherent limitation of working on GPUs, or just due to implementation details of cudf? I'm assuming the latter, because the original-shaped dataset on CuPy provides a similar ~20x speedup over NumPy. If that assumption is correct, is addressing this on the cudf roadmap? Thanks.

dask-user commented 10 months ago

It looks like this actually swings significantly in raw pandas' favor at a certain point (pandas ~9x faster). This is a very commonly shaped dataframe in power-trading analysis (n days on the rows, the number of power-generating units in the USA on the columns). On the same-shaped array, CuPy is 20x faster than NumPy (and than the pandas timing in this example, too).

%load_ext cudf.pandas

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(5000, 40000))
df2 = pd.DataFrame(np.random.randn(5000, 40000))

x = df1.add(df2)  # timing this call

Timing (measured with %timeit):

CPU
526 ms ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

GPU
4.68 s ± 425 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
shwina commented 10 months ago

Thanks for the exploration!

This is very much an intentional design trade-off we've made in cuDF, and not so much a limitation of GPUs generally, as you have seen in your experiments.

I hope you'll excuse a bit of detail in the rest of my response. I'll try to address a couple of different questions you may be having at this point, and provide some suggestions for performance at the end:

1. Why is cuDF slower than NumPy and CuPy when I have lots of columns?

In short, it's because cuDF stores data in memory differently compared to those libraries. cuDF's memory layout is optimized for a different set of use cases.

NumPy and CuPy are both optimized for homogeneous data, i.e., all the values share the same dtype. The data is stored in a single "block" of memory. When all the data is stored in a single block, operating on it all at once can be very fast.

In contrast, cuDF is optimized for heterogeneous data: some numeric columns, some datetime columns, some strings, and so forth. Each column is stored in a separate "block". Effectively, this means that if you try to operate on the entire DataFrame at once as if it were an array, the operation is done column by column, and is not as fast.

Concretely, cuDF follows the Arrow Columnar Format.
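
A rough CPU-side analogy (pure NumPy, purely illustrative) of why column-at-a-time processing is slower than operating on one contiguous block:

import numpy as np

a = np.random.randn(1_000, 5_000)
b = np.random.randn(1_000, 5_000)

# one contiguous block: a single vectorized operation over all elements
%timeit a + b

# column-by-column: one call per column, so per-call overhead adds up
cols_a = [np.ascontiguousarray(a[:, i]) for i in range(a.shape[1])]
cols_b = [np.ascontiguousarray(b[:, i]) for i in range(b.shape[1])]
%timeit [ca + cb for ca, cb in zip(cols_a, cols_b)]

On the GPU the per-column cost is even more pronounced, because each column becomes its own kernel launch.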

2. OK, but why is it slower than pandas?

For this very special case of homogeneous data, it so happens that pandas also stores all the data in a single block, making it effectively as fast as NumPy.

Concretely, pandas uses a memory layout called the "BlockManager". You can read about the advantages and disadvantages of that memory layout here.
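
If you're curious, you can peek at this yourself (this uses pandas internals, which are private and may change between versions) to confirm that a homogeneous DataFrame really is backed by a single 2-D block:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 5_000))  # all float64
print(df._mgr.nblocks)                 # expect 1: everything lives in one block
print(df._mgr.blocks[0].values.shape)  # one big 2-D array backing the frame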

3. What can I do about it?

  1. If you have homogeneous, numerical data, then an n-dimensional array as provided by NumPy or CuPy is absolutely the better way to store and operate on that data. cuDF is not meant to be a replacement for CuPy; they're just two different tools for two different kinds of tasks. In fact, cuDF uses CuPy internally for some operations.

  2. ⚡ Reshape the data! You mentioned that the data you're looking at has datetimes on the y-axis and power-generating units on the x-axis.

Here's how your data is currently organized - it's very slow with cuDF:

df = pd.DataFrame(np.random.rand(1000, 5000), index=pd.date_range("2001-01-01", periods=1000, name="date"))
print(df)
                0         1         2         3         4         5         6         7     ...      4992      4993      4994      4995      4996      4997      4998      4999
date                                                                                        ...
2001-01-01  0.956683  0.274487  0.797921  0.332391  0.243612  0.494871  0.577922  0.544569  ...  0.000151  0.798692  0.060587  0.030397  0.605180  0.532482  0.015798  0.620942
2001-01-02  0.726735  0.809853  0.052142  0.183206  0.235439  0.359681  0.724841  0.607033  ...  0.142227  0.728023  0.684641  0.967947  0.368438  0.125415  0.192330  0.853860
2001-01-03  0.555346  0.891815  0.270726  0.296449  0.350540  0.455533  0.359708  0.889074  ...  0.351938  0.066515  0.916541  0.864787  0.071087  0.917996  0.474812  0.658023
2001-01-04  0.342348  0.733050  0.995950  0.489302  0.058484  0.712600  0.986021  0.329928  ...  0.495343  0.954487  0.402061  0.026346  0.603795  0.345557  0.261188  0.896929
2001-01-05  0.129015  0.415968  0.769102  0.013509  0.854183  0.557628  0.306117  0.847056  ...  0.455228  0.146815  0.507599  0.211905  0.599186  0.451411  0.485051  0.266114
...              ...       ...       ...       ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...       ...       ...       ...
2003-09-23  0.305755  0.719076  0.728252  0.459826  0.027935  0.555344  0.176992  0.516568  ...  0.656465  0.992528  0.742249  0.483864  0.304309  0.518897  0.356416  0.365874
2003-09-24  0.849155  0.504432  0.384092  0.937825  0.968336  0.673551  0.704029  0.114176  ...  0.788920  0.121971  0.067199  0.166384  0.897912  0.877682  0.376886  0.009965
2003-09-25  0.631917  0.095934  0.750285  0.483370  0.501030  0.927874  0.855700  0.450188  ...  0.819597  0.012943  0.517688  0.772057  0.941095  0.680312  0.260357  0.799556
2003-09-26  0.500482  0.148217  0.902447  0.257304  0.258049  0.427197  0.420026  0.335528  ...  0.875103  0.483942  0.513929  0.442715  0.908143  0.698122  0.425169  0.316536
2003-09-27  0.707737  0.490878  0.267552  0.470075  0.480105  0.582617  0.562214  0.740129  ...  0.300245  0.647184  0.084224  0.027469  0.816284  0.006615  0.571881  0.367855

[1000 rows x 5000 columns]

Here's how you could reshape the data into a tall-and-skinny dataframe and keep the same information:

df = df.reset_index()
df = df.melt("date", var_name="power_generating_unit_id")
df["power_generating_unit_id"] = df["power_generating_unit_id"].astype("int32")
print(df)
              date power_generating_unit_id     value
0       2001-01-01                        0  0.552872
1       2001-01-02                        0  0.493862
2       2001-01-03                        0  0.567609
3       2001-01-04                        0  0.875945
4       2001-01-05                        0  0.751430
...            ...                      ...       ...
4999995 2003-09-23                     4999  0.185239
4999996 2003-09-24                     4999  0.531474
4999997 2003-09-25                     4999  0.857289
4999998 2003-09-26                     4999  0.119676
4999999 2003-09-27                     4999  0.386075

[5000000 rows x 3 columns]

You still have the same information, just represented differently. For example, to get the value for a given date and power_generating_unit_id, you can do:

print(df[(df['date'] == '2001-01-01') & (df['power_generating_unit_id'] == 5)])
           date power_generating_unit_id     value
5000 2001-01-01                         5  0.754101

Now, operations like add are fast:

%timeit df["value"] + df["value"]  # note: microseconds
901 µs ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
shwina commented 10 months ago

I'll add that when data is stored in this way, it becomes natural and easy to do "database-like" operations like groupbys:

# get the mean power generated per unit:
df.groupby("power_generating_unit_id").mean()
                             value
power_generating_unit_id
0                         0.497459
1                         0.510917
2                         0.511179
3                         0.504013
4                         0.513688
...                            ...
4995                      0.502024
4996                      0.496629
4997                      0.494932
4998                      0.509441
4999                      0.504814

And operations like that are blazing fast with cuDF.
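
And if you ever need the wide layout back (say, for plotting), you can pivot the long frame; something like this, with the column names from the example above:

# reshape back to dates-by-units; fine for occasional use, just don't do the
# heavy numerical work in this layout
wide = df.pivot(index="date", columns="power_generating_unit_id", values="value")
print(wide.shape)  # (1000, 5000)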

vyasr commented 9 months ago

@dask-user let us know if the above suggestions are helpful to you!

dask-user commented 9 months ago

Hi @vyasr and @shwina, thanks for the info. The above suggestions do explain the issue, thank you. In that case it seems that cudf.pandas is not quite the right tool for us.

Are there any plans to address wide, ndarray-style operations in the future? I do believe that a hybrid of that style and long, relational-format operations is where pandas' strengths lie vs. other tools like duckdb and polars.

Also, out of curiosity, is there anything I can do to further optimize the operations @shwina provided? It seems that duckdb is outperforming the GPU-accelerated operations in the benchmarks I'm running. See below.

Thanks.

# on colab notebook linked in cudf.pandas docs

df = pd.DataFrame(np.random.rand(1000, 50000), index=pd.date_range("2001-01-01", periods=1000, name="date"))
df = df.reset_index()
df = df.melt("date", var_name="power_generating_unit_id")
df["power_generating_unit_id"] = df["power_generating_unit_id"].astype("int32")

%timeit (df['value'] + df['value']).sum()  # timed with pure pandas and cudf.pandas
%timeit duckdb.sql('select sum(value + value) from df')  # time this without cudf.pandas extension loaded
# pandas          200 ms
# cudf.pandas     12.7 ms
# duckdb          1.12 ms

%timeit df.groupby("power_generating_unit_id")['value'].mean().mean()
%timeit duckdb.sql('select mean(value) from (select power_generating_unit_id, mean(value) as value from df group by power_generating_unit_id)')
# pandas          943 ms
# cudf.pandas     37.6 ms
# duckdb          0.982 ms
shwina commented 9 months ago

In your duckdb timings, what you're measuring is simply the time it takes to create the relation object, not the time it takes to execute the query. IPython/Jupyter has some tricky behaviour here that might be throwing you off: the call to %timeit returns before the object is printed, and printing the duckdb object is what actually executes the query.

I believe the right way to measure duckdb's execution time here is to invoke show() or execute() on the result. Here's what I see when I do that:

%timeit (df['value'] + df['value']).sum()  # timed with pure pandas and cudf.pandas
%timeit duckdb.sql('select sum(value + value) from df').execute()  # time this without cudf.pandas extension loaded

# pandas          200 ms
# duckdb          167 ms
# cudf.pandas     12.8 ms
%timeit df.groupby("power_generating_unit_id")['value'].mean().mean()  # timed with pure pandas and cudf.pandas
%timeit duckdb.sql('select mean(value) from (select power_generating_unit_id, mean(value) as value from df group by power_generating_unit_id)').execute()  # time this without cudf.pandas extension loaded

# pandas          1.03 s
# duckdb          724 ms
# cudf.pandas     46 ms
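
(Materializing the result also forces execution; for example, duckdb.sql(...).df() converts the relation to a pandas DataFrame, which can be handy if you want to keep any post-processing in pandas.)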

Keep in mind, though, that the Colab instance has just a single hyperthreaded core, so you may be able to get better performance from DuckDB on a machine with more CPU cores.

shwina commented 9 months ago

As for the question of supporting wide DataFrames like the one in your example, unfortunately that's not on the roadmap for cuDF. It would mean choosing a completely different memory layout and sacrificing significant performance on "classical" operations.

Apologies if that's not the answer you were looking for! I'd definitely look into whether CuPy would be a good fit for you if you're interested in leveraging GPUs for these kinds of operations.
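
For reference, a minimal CuPy sketch for your wide, homogeneous case might look something like this (illustrative only; sizes and timings will depend on your GPU):

import cupy as cp

a = cp.random.randn(5_000, 40_000)
b = cp.random.randn(5_000, 40_000)
# (if the data starts in pandas, cp.asarray(df.to_numpy()) moves it to the GPU)

# one fused kernel over the whole block; synchronize so the kernel time is
# actually included in the measurement
%timeit x = a + b; cp.cuda.Device().synchronize()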

dask-user commented 9 months ago

Understood, that makes sense, thank you. And yes, I see what you're referring to regarding the duckdb timing; the post-%timeit printing executing the query was indeed throwing me off.