[PERF] looping through dataframe is 100x slower than when running without cudf

magnus-ekman commented 3 months ago

Describe the bug

I have a case where I loop through each element in a dataframe and call a function for each element. With cudf.pandas, this takes on the order of 100x longer than with plain pandas. I recognize that best practice is to write vectorized functions, but there are cases where it is just easier to loop through each element. I don't expect a speedup compared to the non-cudf implementation, but it would be good if there wasn't a huge slowdown.

Steps/Code to reproduce bug

Code run in a Jupyter notebook:

%load_ext cudf.pandas
import pandas as pd
import numpy as np
matrix = np.zeros((100, 100))
df = pd.DataFrame(matrix)

%%time
def func(acc, val):
    acc += val
    return acc
acc = 0.0
for col in df.columns:
    for idx in df.index:
        val = df[col][idx]
        acc = func(acc, val)
print(acc)

Expected behavior

Without cudf, this takes about 60 ms. With cudf, it takes about 10 seconds. I would expect performance with cudf to be comparable to performance without cudf.

Environment overview (please complete the following information)

- Bare-metal
- pip install

Environment details

Not sure where to find that script. Here is my basic setup:

- Platform: x86 + A100 GPU
- OS: Ubuntu 22.04.4 LTS
- cuDF: cudf-cu12, version 24.6.1
- CUDA: Cuda compilation tools, release 12.3, V12.3.107
- Python: 3.10.12
- Running in a Jupyter notebook

galipremsagar commented 2 months ago

Hi @magnus-ekman ,

Thank you for the report. The cause is scalar access in cudf: reading a single value out of a column is inherently slower than in pandas. Here is an example:

# Pandas
In [1]: import pandas as pd

In [2]: s = pd.Series([10, 1, 2, 3, 4, 5])

In [3]: %timeit s[2]
4.73 μs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# cudf
In [1]: import cudf

In [2]: s = cudf.Series([10, 1, 2, 3, 4, 5])

In [3]: %timeit s[2]
1.66 ms ± 1.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Your example amplifies this slowdown because the loop performs a scalar access for each of the 10,000 elements (100 × 100). We at NVIDIA are actively working on alleviating this.
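As a rough sanity check, the per-access timings above can be scaled to the 100 × 100 loop from the report. This is only a back-of-the-envelope sketch: it assumes the `%timeit` numbers measured on a six-element Series carry over unchanged, and it ignores the extra `df[col]` column lookup done on every iteration.

```python
# Per-access timings taken from the %timeit output in this comment.
pandas_per_access_s = 4.73e-6  # 4.73 microseconds
cudf_per_access_s = 1.66e-3    # 1.66 milliseconds

# The reproducer touches every element of a 100 x 100 DataFrame.
n_accesses = 100 * 100

pandas_total_s = n_accesses * pandas_per_access_s
cudf_total_s = n_accesses * cudf_per_access_s
ratio = cudf_per_access_s / pandas_per_access_s

print(f"pandas estimate: {pandas_total_s * 1e3:.1f} ms")
print(f"cudf estimate:   {cudf_total_s:.1f} s")
print(f"per-access slowdown: about {ratio:.0f}x")
```

The estimates (tens of milliseconds versus tens of seconds) land in the same order of magnitude as the 60 ms and 10 seconds reported above.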

However, as a temporary workaround, you can disable GPU acceleration for a block of code like this:

from cudf.pandas.module_accelerator import disable_module_accelerator
with disable_module_accelerator():
    # your pandas code
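Another way to keep an element-wise loop while avoiding per-element scalar access is to convert the DataFrame to a NumPy array once and iterate over that. This is not a suggestion made in this thread, only a minimal sketch: the helper name `loop_over_host_array` is hypothetical, and it assumes that a single `to_numpy()` call is much cheaper than thousands of scalar lookups under cudf.pandas, which has not been measured here.

```python
import numpy as np
import pandas as pd


def func(acc, val):
    # Same per-element function as in the original report.
    acc += val
    return acc


def loop_over_host_array(df):
    # Hypothetical helper: one bulk conversion to a NumPy array,
    # then plain Python iteration over that array.
    values = df.to_numpy()
    acc = 0.0
    for col in range(values.shape[1]):
        for row in range(values.shape[0]):
            acc = func(acc, values[row, col])
    return acc


# Small non-zero example so the result is easy to check by hand.
df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))
total = loop_over_host_array(df)

# Vectorized equivalent, for the cases where the loop is not needed.
vectorized = float(df.to_numpy().sum())
print(total, vectorized)
```

The loop visits elements column by column, as in the original reproducer, so any order-dependent `func` sees values in the same order.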

magnus-ekman commented 2 months ago

Thanks. I have a (perhaps silly) question on the workaround that is related to this slowdown. When I work in a Jupyter notebook, I like to simply type "df" in a cell and execute the cell to get the DataFrame printed in a nicely formatted way. Doing so is super slow with cudf. If I try to apply your suggested workaround, I don't get a print-out. It works if I instead do "print(df)", but it will not be as nicely formatted. Any ideas of how to solve this?

bdice commented 2 months ago

@magnus-ekman I think that issue with showing df might be the same as #15747.

@galipremsagar Maybe we can work on accelerating the fancy repr in the nearer term, since it should be easier to solve than the broader problem of scalar access.

rapidsai / cudf

[PERF] looping through dataframe is 100x slower than when running without cudf #16491