Open gaohao95 opened 3 years ago
One potential cause of this OOM is that Thrust does not use the RMM memory pool during table generation: https://github.com/rapidsai/distributed-join/blob/bc7563a5f5ef2a76bfa0ac275e304a9540f3fa3d/generate_dataset/generate_dataset.cuh#L231
By default, the memory pool size is the total GPU memory minus 500 MB. In some runs that hit OOM, we observed that using a smaller memory pool resolves the issue. This indicates that the program allocates a significant amount of device memory outside the memory pool. Tracking down which allocations happen outside the pool and making sure they go through the pool instead should fix such issues.
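As a rough illustration of routing Thrust allocations through the pool, here is a minimal sketch. It assumes an RMM version that provides `rmm::exec_policy` and the templated `rmm::mr::pool_memory_resource`; header paths and constructor signatures differ between RMM releases. The 1 GiB pool size, the element count, and the `keys` buffer are placeholders for illustration and are not taken from `generate_dataset.cuh`.

```cuda
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <thrust/sequence.h>
#include <thrust/sort.h>

#include <cstddef>
#include <cstdint>

int main()
{
  // Upstream resource that the pool draws from (plain cudaMalloc/cudaFree).
  rmm::mr::cuda_memory_resource cuda_mr;

  // Assumption: 1 GiB initial pool size, chosen only for this example.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
    &cuda_mr, std::size_t{1} << 30};

  // Make the pool the default resource for RMM allocations on this device.
  rmm::mr::set_current_device_resource(&pool_mr);

  rmm::cuda_stream_view stream = rmm::cuda_stream_default;

  {
    // The buffer itself is allocated from the current device resource (the pool).
    rmm::device_uvector<std::int64_t> keys(std::size_t{1} << 20, stream);

    // rmm::exec_policy makes Thrust request its temporary storage from the
    // current device resource instead of calling cudaMalloc directly.
    thrust::sequence(rmm::exec_policy(stream), keys.begin(), keys.end());
    thrust::sort(rmm::exec_policy(stream), keys.begin(), keys.end());

    stream.synchronize();
  }  // keys is released back to the pool before pool_mr is destroyed

  return 0;
}
```

With `thrust::device` or no explicit policy, the temporary storage for algorithms such as `thrust::sort` would typically come from `cudaMalloc`, which competes with the pool for the small amount of memory left outside it. Comparing the free memory reported by `cudaMemGetInfo` before and after a step is one way to spot such allocations.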