When we create a graph from just edge lists (no vertex lists), we find unique vertices from the given edge lists. This can significantly increase the peak memory requirement under the current implementation.
We currently find unique vertices locally in each GPU. We create a local copy of the edges (for each partition) and use sort & unique to find the unique vertices. We can cut the memory footprint here with hash-based binning: in each pass, we copy only the edges that belong to the current bin and find the unique vertices for that bin.
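A minimal CPU sketch of the binning idea, assuming a simple edge-pair representation (the actual implementation operates on device memory, e.g., with Thrust sort/unique; `find_unique_vertices_binned` and the hash choice here are illustrative, not the PR's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using vertex_t = int64_t;

// Find the locally unique vertices one hash bin at a time, so the temporary
// vertex buffer holds ~1/num_bins of the endpoints instead of all of them.
std::vector<vertex_t> find_unique_vertices_binned(
    std::vector<std::pair<vertex_t, vertex_t>> const& edges, int num_bins) {
  std::vector<vertex_t> uniques;
  for (int bin = 0; bin < num_bins; ++bin) {
    std::vector<vertex_t> bin_vertices;
    for (auto const& [src, dst] : edges) {
      // Copy only the endpoints that hash to the current bin.
      if (static_cast<int>(std::hash<vertex_t>{}(src) % num_bins) == bin) {
        bin_vertices.push_back(src);
      }
      if (static_cast<int>(std::hash<vertex_t>{}(dst) % num_bins) == bin) {
        bin_vertices.push_back(dst);
      }
    }
    // Bins partition the vertex space, so per-bin sort & unique still yields
    // the full set of locally unique vertices.
    std::sort(bin_vertices.begin(), bin_vertices.end());
    bin_vertices.erase(std::unique(bin_vertices.begin(), bin_vertices.end()),
                       bin_vertices.end());
    uniques.insert(uniques.end(), bin_vertices.begin(), bin_vertices.end());
  }
  return uniques;
}
```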
Once we find locally unique vertices, we need to shuffle the unique vertices to different GPUs. In each GPU, multiple locally unique vertex vectors are gathered, and there can be many duplicate vertices across the vectors. These duplicates can significantly increase the peak memory usage, especially when the average vertex degree is low and the number of GPUs is large. This PR performs shuffling separately for different bins, keeping the temporary memory usage at edge list size * max(1/# bins, 2/# local edge partitions).
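A hypothetical single-process sketch of the per-bin shuffle; vectors stand in for the GPUs and for the all-to-all exchange the real code performs over the communicator, and `shuffle_binned` plus the owner mapping are assumptions for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

using vertex_t = int64_t;

// Shuffle locally unique vertices to their owning ranks one bin at a time;
// only the current bin's (possibly duplicated) vertices are resident at once.
std::vector<std::vector<vertex_t>> shuffle_binned(
    std::vector<std::vector<vertex_t>> const& local_uniques,  // one per rank
    int num_ranks, int num_bins) {
  std::vector<std::vector<vertex_t>> owned(num_ranks);
  for (int bin = 0; bin < num_bins; ++bin) {
    std::vector<std::vector<vertex_t>> recv(num_ranks);  // stand-in for all-to-all
    for (auto const& vec : local_uniques) {
      for (auto v : vec) {
        auto h = std::hash<vertex_t>{}(v);
        if (static_cast<int>(h % num_bins) != bin) { continue; }
        recv[(h / num_bins) % num_ranks].push_back(v);  // illustrative owner map
      }
    }
    // Deduplicate each received bin before appending; these cross-vector
    // duplicates are what inflate peak memory in the unbinned shuffle.
    for (int r = 0; r < num_ranks; ++r) {
      std::sort(recv[r].begin(), recv[r].end());
      recv[r].erase(std::unique(recv[r].begin(), recv[r].end()), recv[r].end());
      owned[r].insert(owned[r].end(), recv[r].begin(), recv[r].end());
    }
  }
  return owned;
}
```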
We can control # bins to trade off compute (we need to scan the edge list # bins times) against peak memory usage.
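For example (plugging hypothetical numbers into the formula above), with 8 local edge partitions, going from 2 bins (max(1/2, 2/8) = 1/2) to 4 bins (max(1/4, 2/8) = 1/4) halves the temporary footprint, but increasing # bins further buys nothing because the 2/# local edge partitions term becomes the floor, while each extra bin still costs one more scan over the edge list.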