spirit-code / spirit

Atomistic Spin Simulation Framework
http://spirit-code.github.io
MIT License

Core: improve CUDA code with C++11 features #529

Open GPMueller opened 5 years ago

GPMueller commented 5 years ago

See https://devblogs.nvidia.com/power-cpp11-cuda-7 for a great summary. To use lambdas properly in the backend wrapper functions, the --expt-extended-lambda flag needs to be passed to nvcc, and the CUDA Toolkit has to be version 7.5 or newer.

The generic parallel lambda application function might look like

template<typename Lambda, typename... Args>
__global__
void cu_parallel_execution_kernel(size_t N, Lambda lambda, Args... args)
{
    // One thread per index; apply the lambda to every valid index
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if( idx < N )
    {
        lambda(idx, args...);
    }
}
template<typename Lambda, typename... Args>
void cu_parallel_execution(size_t N, Lambda lambda, Args... args)
{
    // Launch with 1024 threads per block and enough blocks to cover N
    cu_parallel_execution_kernel<<<(N+1023)/1024, 1024>>>(N, lambda, args...);
    CU_CHECK_AND_SYNC();
}
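
A minimal standalone check of this helper (a sketch, independent of spirit's types and of the project's CU_CHECK_AND_SYNC macro, compiled with something like nvcc -std=c++11 --expt-extended-lambda) could look like

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the project's error-check macro, for this sketch only
#define CU_CHECK_AND_SYNC() cudaDeviceSynchronize()

// cu_parallel_execution_kernel and cu_parallel_execution as defined above

int main()
{
    size_t N = 1000;
    double * data;
    cudaMallocManaged(&data, N * sizeof(double));
    for( size_t i = 0; i < N; ++i )
        data[i] = 1.0;

    // Double every element on the device via a __device__ lambda
    cu_parallel_execution(N, [] __device__ (size_t idx, double * d) { d[idx] *= 2.0; }, data);

    printf("data[0] = %f\n", data[0]); // expect 2.000000
    cudaFree(data);
}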

Backend wrapper functions would then look like

void normalize_vectors(vectorfield & vf)
{
    // Normalize every vector of the field in place
    auto lambda = [=] __device__ (size_t idx, Vector3 * v) {
        v[idx].normalize();
    };
    size_t n = vf.size();
    cu_parallel_execution(n, lambda, vf.data());
}

or

void add_c_cross(const scalar & c, const vectorfield & a, const vectorfield & b, vectorfield & out)
{
    // out[i] += c * (a[i] x b[i])
    auto lambda = [=] __device__ (size_t idx, scalar c, const Vector3 * a, const Vector3 * b, Vector3 * out)
    {
        out[idx] += c*a[idx].cross(b[idx]);
    };
    size_t n = out.size();
    cu_parallel_execution(n, lambda, c, a.data(), b.data(), out.data());
}
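
Since extended lambdas capture by value, the device pointers could also be captured directly instead of being forwarded as kernel arguments, which would shrink the signatures further (a sketch, assuming the helper above):

void normalize_vectors(vectorfield & vf)
{
    Vector3 * v = vf.data();
    // The raw pointer is captured by value into the device lambda,
    // so no extra arguments need to be passed through the kernel
    auto lambda = [v] __device__ (size_t idx) {
        v[idx].normalize();
    };
    cu_parallel_execution(vf.size(), lambda);
}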
GPMueller commented 5 years ago

See also https://devtalk.nvidia.com/default/topic/1043958/cuda-programming-and-performance/passing-lambda-functions-as-arguments-to-kernels/