Add Kokkos Views implementation to target GPUs

sergisiso commented 4 years ago

Continues #31

This PR splits manual_version/psykal_kokkos into two different PSy-layer Kokkos implementations:

time_step_rawpointers_kokkos.cpp: Kokkos is used only for parallel dispatch; the memory layout and padding are handled by dl_esm_inf. This is more memory efficient (no extra copies), but it cannot run on the GPU.
time_step_views_kokkos.cpp: uses the Kokkos View container, so the memory layout and padding are handled by Kokkos. It requires at least one extra host copy (Fortran arrays -> Views) and sometimes a second copy (View host mirror -> device), but it can run on a GPU.

At the moment the second implementation does all the copies on every iteration, which takes most of the time and makes it slower than the raw pointer implementation. There are multiple ways to avoid doing the copies at each time step, but this will be tackled in a follow-up PR.

[Image: nemolite2d_kokkos_copyalways]
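To make the trade-off concrete, here is a minimal sketch in plain C++ that deliberately does not use Kokkos or dl_esm_inf. Everything in it (`PaddedField`, `update`, `step_raw`, `step_with_copies`, `checksum`, `make_field`) is a hypothetical stand-in invented for illustration, not code from this PR. It assumes a column-major field with a padded leading dimension, and contrasts a kernel that works directly on the caller's buffer with one that copies into a separately laid-out container on every step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Fortran-style field: column-major storage
// with a padded leading dimension (ld >= nx). Not the dl_esm_inf type.
struct PaddedField {
    int nx, ny, ld;
    std::vector<double> data;
    PaddedField(int nx_, int ny_, int ld_)
        : nx(nx_), ny(ny_), ld(ld_),
          data(static_cast<std::size_t>(ld_) * ny_, 0.0) {}
    std::size_t idx(int i, int j) const {
        return static_cast<std::size_t>(j) * ld + i;
    }
};

// Toy per-point update shared by both variants (illustrative only).
inline double update(double v) { return 0.5 * v + 1.0; }

// "Raw pointer" style: the kernel indexes the caller's padded buffer
// directly, so no extra copies are made.
void step_raw(double* f, int nx, int ny, int ld) {
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) {
            std::size_t k = static_cast<std::size_t>(j) * ld + i;
            f[k] = update(f[k]);
        }
}

// "Views" style: the container owns its own (here row-major, unpadded)
// layout, so data is copied in and out around the kernel.
// Returns the number of full-field copies performed in this step.
int step_with_copies(PaddedField& f) {
    std::vector<double> view(static_cast<std::size_t>(f.nx) * f.ny);
    for (int j = 0; j < f.ny; ++j)            // copy in
        for (int i = 0; i < f.nx; ++i)
            view[static_cast<std::size_t>(i) * f.ny + j] = f.data[f.idx(i, j)];
    for (double& v : view) v = update(v);     // kernel on the copy
    for (int j = 0; j < f.ny; ++j)            // copy out
        for (int i = 0; i < f.nx; ++i)
            f.data[f.idx(i, j)] = view[static_cast<std::size_t>(i) * f.ny + j];
    return 2;
}

// Sum over the logical (unpadded) region only.
double checksum(const PaddedField& f) {
    double s = 0.0;
    for (int j = 0; j < f.ny; ++j)
        for (int i = 0; i < f.nx; ++i) s += f.data[f.idx(i, j)];
    return s;
}

// Fill the logical region with small integers; padding stays zero.
PaddedField make_field(int nx, int ny, int ld) {
    PaddedField f(nx, ny, ld);
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) f.data[f.idx(i, j)] = i + j * nx;
    return f;
}
```

Running both variants for the same number of steps on identically initialised fields should give identical checksums, while the copying variant performs two full-field copies per step. In this toy model, hoisting the copy-in before the time loop and the copy-out after it would be one way to remove that per-step cost; whether that matches the approach taken in the follow-up PR is not stated here.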

sergisiso commented 4 years ago

This is ready for review, probably by @arporter or @rupertford. As mentioned, the new implementation already runs with correct results on the GPU, but I made no attempt to reduce the large number of copy-in/out operations in this PR.

sergisiso commented 4 years ago

@arporter This is ready for the next review.

Do these Kokkos versions produce checksums that match with other versions?

Yes, this was my main focus for this PR :) Both implementations (rawpointers, views), with any device they can use (serial, OpenMP, Cuda), produce the same checksums as the psykal_serial and psykal_cpp implementations.

stfc / PSycloneBench

Add Kokkos Views implementation to target GPUs #46