Closed · madsbk closed this issue 2 years ago
lolwat. That's crazy. This looks like a great place to try @shwina's Python memory profiler to see where that allocation is coming from.
This looks like it's increasingly approximating a large self-join with only a few keys: `n` grows far beyond the number of distinct `id` values in the random dataset generator, so every key matches many rows on both sides. With `k` distinct keys, an inner join of two `n`-row frames produces roughly `n**2 / k` rows, i.e. the output grows quadratically in `n`. Are we sure this isn't expected behavior with `n=2e6`?
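A minimal sketch of that blowup with a synthetic key column (hypothetical, not the `randomdata` generator): with `k` distinct keys spread evenly over `n` rows, the inner join returns exactly `n**2 / k` rows.

```python
import cudf
import numpy as np

# Each of the k distinct keys appears n/k times on both sides, so the
# inner join on "id" returns k * (n/k)**2 = n**2/k = 1,000,000 rows.
n, k = 10_000, 100
left = cudf.DataFrame({"id": np.arange(n) % k})
right = cudf.DataFrame({"id": np.arange(n) % k})
print(f"{len(left.merge(right, on='id')):,}")  # 1,000,000
```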
Here's how the actual `randomdata` merge scales:

```python
import cudf
from dask.utils import format_bytes  # assumption: format_bytes came from dask.utils

for n in (10000, 50000, 100000, 200000):
    df_a = cudf.datasets.randomdata(nrows=n)
    df_b = cudf.datasets.randomdata(nrows=n)
    res = df_a.merge(df_b, on=['id'])
    print(f"Output size: {format_bytes(res.memory_usage().sum())}")
    print(f"{n:,}", f"{len(res):,}")
    print()
```
```
Output size: 33.94 MiB
10,000 889,769

Output size: 848.07 MiB
50,000 22,231,725

Output size: 3.32 GiB
100,000 89,010,357

Output size: 13.27 GiB
200,000 356,134,437
```
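(Sanity check on the last line, assuming `randomdata`'s default `id`/`x`/`y` columns: the merged frame has five 8-byte columns, so 356,134,437 rows × 40 bytes ≈ 13.27 GiB. The printed sizes are just the natural size of a quadratically growing output.)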
Oh yeah, if the data is heavily skewed then this join can definitely blow up.
@beckernick you are absolutely right, I assumed that the `"id"` values in `randomdata` were unique. Thanks for making this clear and sorry for the noise.
**Describe the bug**
Merging two DataFrames of size 48 MB requires an extra 132 GB of device memory!
**Steps/Code to reproduce bug**
The following code reproduces the issue. First it registers an RMM resource adaptor that prints the size of the allocation on failure, and then it merges two 48 MB DataFrames.
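(The original snippet wasn't preserved here; below is a minimal sketch of the described setup, assuming RMM's `FailureCallbackResourceAdaptor` as the failure-reporting adaptor and the `randomdata` merge from the discussion above.)

```python
import rmm
import cudf

def on_alloc_failure(nbytes: int) -> bool:
    # Invoked by RMM when a device allocation fails; report the size
    # and return False so the out-of-memory error still propagates.
    print(f"Failed to allocate {nbytes / 2**30:.2f} GiB")
    return False

# Wrap the current memory resource so failed allocations are reported.
rmm.mr.set_current_device_resource(
    rmm.mr.FailureCallbackResourceAdaptor(
        rmm.mr.get_current_device_resource(), on_alloc_failure
    )
)

n = 2_000_000  # each frame is ~48 MB (3 columns x 8 bytes x 2e6 rows)
df_a = cudf.datasets.randomdata(nrows=n)
df_b = cudf.datasets.randomdata(nrows=n)
res = df_a.merge(df_b, on=["id"])  # the allocation that blows up
```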
The output:
**Expected behavior**
Perform the merge with a peak memory use of around 100-200 MB.
**Environment**
My workstation and DGX-15, nightly conda:
cc @randerzander