rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] cudf.pandas dataframe.__repr__ slow in jupyterlab for large datasets #15747

Open AjayThorve opened 1 month ago

AjayThorve commented 1 month ago

Describe the bug
Calling dataframe.__repr__ in a notebook cell either takes a very long time or results in a kernel failure for large datasets.

Steps/Code to reproduce bug
In a JupyterLab environment, run the following cells:


# [cell 1]
%load_ext cudf.pandas

# [cell 2]
import pandas as pd
import numpy as np

# Define the number of rows and columns
num_rows = 25_000_000
num_columns = 12

# Create a DataFrame with random data
df = pd.DataFrame(np.random.randint(0, 100, size=(num_rows, num_columns)),
                  columns=[f'Column_{i}' for i in range(1, num_columns + 1)])

# [cell 3]
df


Expected behavior
The dataframe should render quickly, as it does when working directly with cudf or pandas.

Note
This works as expected in a Python interactive shell, or when calling print(df) in a notebook.
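One way to narrow this down might be to time the rendering paths separately. The sketch below assumes the difference lies between the plain-text path (`__repr__`, used by print(df) and the interactive shell) and the HTML path (`_repr_html_`, which Jupyter prefers when an object defines it); this issue does not confirm that. The helper name `time_renderers` is hypothetical and not part of cudf or pandas.

```python
import time


def time_renderers(obj, method_names=("__repr__", "_repr_html_")):
    """Time each rendering method that `obj` defines.

    Hypothetical helper for illustration only. Returns a dict that maps
    the method name to the elapsed wall-clock time in seconds.
    """
    timings = {}
    for name in method_names:
        method = getattr(obj, name, None)
        if not callable(method):
            continue  # skip renderers this object does not provide
        start = time.perf_counter()
        method()  # render once and discard the resulting string
        timings[name] = time.perf_counter() - start
    return timings
```

Running `time_renderers(df)` in a new cell after cell 2, ideally with a smaller `num_rows` first to avoid the kernel failure, should show whether one of the two paths accounts for most of the delay.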

vyasr commented 1 month ago

cf #13297