Describe the bug
dask_cudf groupby mean is numerically unstable
Steps/Code to reproduce bug
import numpy as np
import cudf
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

for i in range(100):
    # Same seed on every trial, so the input data is identical each time
    np.random.seed(3)
    size = 100
    groups = 20
    df = cudf.DataFrame()
    df['asset'] = np.random.randint(1, groups, size)
    df['num'] = np.random.rand(size)
    cdf = dask_cudf.from_cudf(df, npartitions=16)
    # Single-GPU result vs. distributed result
    gt = df.groupby('asset').mean()
    dm = cdf.groupby('asset').mean().compute().reset_index().sort_values('index')
    print('trial', i, (gt['num']-dm['num']).abs().max())
Expected behavior
The distributed and non-distributed groupby means should be the same and stable (independent of the trial number). Instead, the code above prints differences that vary randomly from trial to trial, even though the input data is identical.
Environment overview (please complete the following information)
DGX-1 machine
I believe this is just due to the nature of floating point precision. Closing, but feel free to reopen if you have an updated example that shows issues.
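To illustrate the floating-point explanation, here is a minimal CPU-only sketch using NumPy. It assumes that a distributed mean is built from per-partition sums and counts that may be combined in a different order on each run; that is an assumption about how the reduction is scheduled, not a confirmed description of dask_cudf internals. The helper `partitioned_mean` is hypothetical and exists only for this example.

```python
import numpy as np

# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
assoc_differs = a != b

def partitioned_mean(values, npartitions, order):
    """Hypothetical helper: mean from per-partition (sum, count) pairs,
    combined in the partition order given by `order`."""
    parts = np.array_split(values, npartitions)
    total, count = 0.0, 0
    for idx in order:
        total += parts[idx].sum()
        count += parts[idx].size
    return total / count

rng = np.random.default_rng(3)
values = rng.random(100)
reference = values.mean()

# Combine the same 16 partitions in a random order on each trial
# (assumed stand-in for a non-deterministic distributed reduction).
diffs = []
for trial in range(100):
    order = rng.permutation(16)
    diffs.append(abs(partitioned_mean(values, 16, order) - reference))

max_diff = max(diffs)
print(assoc_differs, max_diff)

# Compare with a tolerance instead of expecting bit-identical results.
all_close = all(np.isclose(d, 0.0, atol=1e-12) for d in diffs)
```

Any differences that appear here are bounded by rounding error, far below the scale of the data, which is why comparing with a tolerance (for example `np.isclose` or `np.allclose`) is usually more appropriate than expecting exact equality between the two results.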