kt0d / gpu-bfs

Implementation of BFS graph traversal for CUDA
The Unlicense

Is synchronization needed? #1

Open ax7e opened 1 year ago

ax7e commented 1 year ago

Thanks for your great work! Consider the following code in bfs_kernels.cu:

        linear_bfs<<<num_of_blocks, BLOCK_SIZE>>>(graph.nnz, d_row_offset, d_column_index,
                                                  d_distance, iteration, d_in_queue,
                                                  h_queue_count, d_out_queue, d_queue_count);
        // Get queue count
        checkCudaErrors(cudaMemcpy(&h_queue_count, d_queue_count,sizeof(int), cudaMemcpyDeviceToHost));

Maybe we should add a cudaDeviceSynchronize after the kernel call? The same issue also appears around quad_bfs.

ax7e commented 1 year ago

Also, for the assignment distance[j] = iteration + 1, maybe the right way is to use a compare-and-swap?

__global__ void linear_bfs(const int n, const int* row_offset,
                           const int* const column_index, int* const distance,
                           const int iteration, const int* const in_queue,
                           const int in_queue_count, int* const out_queue,
                           int* const out_queue_count)
{
    // Compute index of corresponding vertex in the queue.
    int global_tid = blockIdx.x * blockDim.x + threadIdx.x;

    do
    {
        // In a do-while, continue jumps to the loop condition below,
        // so out-of-range threads still participate in __syncthreads_or.
        if(global_tid >= in_queue_count) continue;
        // Get vertex from the queue.
        const int v = in_queue[global_tid];
        // Load row range of vertex v.
        const int r = row_offset[v];
        const int r_end = row_offset[v + 1];
        for(int offset = r; offset < r_end; offset++)
        {
            const int j = column_index[offset];
            if(distance[j] == bfs::infinity)
            {
                distance[j] = iteration + 1;
                // Enqueue vertex.
                const int ind = atomicAdd(out_queue_count, 1);
                assert(ind < n);
                out_queue[ind] = j;
            }
        }
        global_tid += gridDim.x * blockDim.x;
    }
    while(__syncthreads_or(global_tid < in_queue_count));
}
kt0d commented 1 year ago

https://github.com/kt0d/gpu-bfs/blob/11a2b56684108aba986ee0fcbd7f3c2e50ee0df1/bfs.cu#L195-L198 According to API synchronization behavior, this cudaMemcpy call is synchronous with respect to the host. And since it is not cudaMemcpyAsync, it runs on the default CUDA stream, so it should not start copying memory before the kernel finishes execution. See also this thread: cudaDeviceSynchronize needed between kernel launch and cudaMemcpy ?. But it's possible that cudaDeviceSynchronize calls are missing somewhere else in the code.

https://github.com/kt0d/gpu-bfs/blob/11a2b56684108aba986ee0fcbd7f3c2e50ee0df1/bfs_kernels.cu#L56-L63 If you mean setting new value of distance[j] using atomicCAS, then it may be a good idea.

Unfortunately I don't have access to an Nvidia GPU at the moment, so I can't really test any of the above. I've released this under The Unlicense, so you're welcome to modify this program and publish it in any way you like.