Open ax7e opened 1 year ago
Also, for the code distance[j]=iteration+1
, maybe the right way is to use compare-and-swap?
__global__ void linear_bfs(const int n, const int* row_offset, const int*const column_index, int*const distance, const int iteration,const int*const in_queue,const int in_queue_count, int*const out_queue, int*const out_queue_count)
{
// Compute index of corresponding vertex in the queue.
int global_tid = blockIdx.x*blockDim.x + threadIdx.x;
do
{
if(global_tid >= in_queue_count) continue;
// Get vertex from the queue.
const int v = in_queue[global_tid];
// Load row range of vertex v.
const int r = row_offset[v];
const int r_end = row_offset[v+1];
for(int offset = r; offset < r_end; offset++)
{
const int j = column_index[offset];
if(distance[j] == bfs::infinity)
{
distance[j]=iteration+1;
// Enqueue vertex.
const int ind = atomicAdd(out_queue_count,1);
assert(ind < n);
out_queue[ind]=j;
}
}
global_tid += gridDim.x*blockDim.x;
}
while(__syncthreads_or(global_tid < in_queue_count));
}
https://github.com/kt0d/gpu-bfs/blob/11a2b56684108aba986ee0fcbd7f3c2e50ee0df1/bfs.cu#L195-L198
According to API synchronization behavior, this cudaMemcpy
call is synchronous with respect to host. And since it's not cudaMemcpyAsync
, it uses default CUDA stream, so it shouldn't start copying memory before the kernel finishes execution.
See also this thread: cudaDeviceSynchronize needed between kernel launch and cudaMemcpy ?.
But it's possible that cudaDeviceSynchronize
calls are missing somewhere else in the code.
https://github.com/kt0d/gpu-bfs/blob/11a2b56684108aba986ee0fcbd7f3c2e50ee0df1/bfs_kernels.cu#L56-L63
If you mean setting new value of distance[j]
using atomicCAS
, then it may be a good idea.
Unfortunately I don't have access to Nvidia GPU at the moment, so I can't really test any of the above. I've used The Unlicense license so you're welcome to modify this program and publish it in any way you like.
Thanks for your great work! The following code in bfs_kernels.cu:
Maybe we should add an cudaDeviceSynchronize after the kernel call? This problem also occurs around quad_bfs.