kokkos / kokkos

Kokkos C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
https://kokkos.org
Other
1.88k stars 422 forks source link

Kokkos and cuda streams for pipelining computations #532

Closed pkestene closed 6 years ago

pkestene commented 7 years ago

Hello,

I have a question regarding asynchronous copy, and overlapping GPU computation with memory operation.

I have the following code, based on a "outer" functor in OpenMP exec space, and each OpenMP thread launch some computation on GPU.

Application is build with nvcc_wrapper using option "-default-stream per-thread" so that each OpenMP thread will be using a different cuda stream.

void operator( ... ) {

// launch computation on GPU GPUComputeFunctor func(); kokkos::parallel_for(N, func);

// retrieve results on host kokkos::deep_copy(host_subview, gpu_subview);

}

Ideally, I would like to have the above deep_copy happening in the same "cuda stream" as the associated compute kernel. By default, the memory copy is synchronous, and all the copies from the different OpenMP threads actually use the same cuda streams.

I have seen in kokkos sources that some deep_copy are using cudaMemcpyAsync API, but I not able to tell a given depp_copy to happend in the same cuda stream as the associated comute functor.

What is the right way of doing this in Kokkos ? Thank you.

mhoemmen commented 7 years ago

Kokkos will support asynchronous deep_copy through the three-argument version of deep_copy, that takes an execution space instance as its first argument. That instance corresponds to a CUDA stream. I'm not sure of the current status of this feature.

mhoemmen commented 7 years ago

Furthermore, I'm not sure whether Kokkos supports the -default-stream per-thread option. If not, then you might have to create those streams explicitly. That option only makes sense if you have a small number of CPU threads.

crtrott commented 7 years ago

While Kokkos supports the --default-strem per-thread option, and async deep copy works with that, we don't have the right stuff in place to make it obvious how to do a deep copy asynchronous with respect to "a" specific Cuda stream. There is an example inside the kokkos/example/tutorial/AdvancedViews directory for asynchronous deep copy, but I think it basically only works right now for a single Cuda execution stream. This will all be addressed as part of the Execution Space instances feature we are working hard on. That is practically the top priority in terms of big features for Kokkos, and is a deliverable for us inside of Sandia. So this is going to happen, and soonish. I actually just found that I desperately need that to support a ECP (DOE Exascale Computing Project) application of which I am part. So we are very very motivated to make that happen soon.

mhoemmen commented 7 years ago

@crtrott sorry for getting some of that stuff wrong!

pkestene commented 7 years ago

@mhoemmen I've tried to use the 3 argument version of deep copy, using OpenMP as an exec space, but it creates new cuda streams (different from the "compute" streams) which does not the desired behavior.

I don't know if it simple or possible to retrieve a existing cuda stream "handle" when using the "--default-strem per-thread". Maybe the low-level cuda driver api can do that ?

I guess one possible solution would be to explicitely create/manage cuda streams in the kokkos cuda internal layer as an alternative to using "--default-strem per-thread" ...

mhoemmen commented 7 years ago

See @crtrott 's comments above. This feature is not ready yet.

pkestene commented 7 years ago

I understand it's not ready, and that it takes time to implement.

Just for sharing, I just figured out, that by replacing the regular deep_copy by a direct call to Kokkos::Impl::DeepCopyAsyncCuda in which the call to cudaMemcpyAsync is modified like this:

CUDA_SAFE_CALL( cudaMemcpyAsync( dst , src , n , cudaMemcpyDefault , cudaStreamPerThread ) );

the last arg gets the default stream per thread, and we get the correct behavior.

crtrott commented 7 years ago

Ah that is good to know. Assuming you modified that particular code you should actually then get the same behaviour with just calling

Kokkos::deep_copy(Kokkos::Cuda(),dest,src);