rapidsai / distributed-join

Investigate device memory usage outside memory pool #51

Open gaohao95 opened 3 years ago

gaohao95 commented 3 years ago

By default, the memory pool size is the total GPU memory minus 500 MB. In some runs that hit OOM, we observed that using a smaller memory pool resolves the OOM. This indicates that the program uses a lot of device memory outside of the memory pool. Tracking down which allocations happen outside of the memory pool and making sure they are made within the pool should fix such issues.
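
To make the reasoning concrete, here is a rough host-side sketch of the arithmetic. Once the pool reserves its share up front, anything allocated outside the pool has to fit in whatever is left. The function names and the numbers in the usage note are hypothetical and not taken from the repository. In a real run the total would come from something like `cudaMemGetInfo`, and the CUDA context itself also takes part of the leftover, which this sketch ignores.

```cpp
#include <cstddef>

// Hypothetical helper: pool size when `reserve_bytes` is left outside the pool.
// The default described above corresponds to a reserve of about 500 MB.
std::size_t pool_size_with_reserve(std::size_t total_bytes, std::size_t reserve_bytes)
{
  // Guard against underflow if the reserve exceeds the device memory.
  return total_bytes > reserve_bytes ? total_bytes - reserve_bytes : 0;
}

// Hypothetical helper: do allocations made outside the pool still fit
// once the pool has claimed `pool_bytes` of the device memory?
bool outside_pool_fits(std::size_t total_bytes,
                       std::size_t pool_bytes,
                       std::size_t outside_bytes)
{
  // Memory the pool has not claimed; zero if the pool covers everything.
  std::size_t const leftover = total_bytes > pool_bytes ? total_bytes - pool_bytes : 0;
  return outside_bytes <= leftover;
}
```

For example, on an assumed 16 GiB device with 800 MiB allocated outside the pool, `outside_pool_fits` returns false with the 500 MiB reserve and true once the reserve is raised to 1 GiB, which matches the observation that a smaller pool avoids the OOM.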

gaohao95 commented 3 years ago

One potential cause of this OOM is that Thrust does not use the RMM memory pool during table generation: https://github.com/rapidsai/distributed-join/blob/bc7563a5f5ef2a76bfa0ac275e304a9540f3fa3d/generate_dataset/generate_dataset.cuh#L231