This change ensures that QUDA is initialized before any pinned memory is allocated. Otherwise, when GPU-pinned memory is allocated, each MPI process on a node creates its own CUDA context on the default GPU (i.e., GPU 0). This wastes ~150 MiB of memory on GPU 0 per process, which becomes significant when running 8 processes per node (e.g., one per GPU on an 8-GPU node). Because initialize_quda assigns the relevant GPU to each process, the CUDA context is now created on the correct GPU and the wasted memory is eliminated.
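
To make the memory argument concrete, here is a rough model of where CUDA contexts end up with and without this change. It is only a sketch under stated assumptions: one MPI process per GPU, local rank i assigned to GPU i, and a flat 150 MiB per context (the approximate figure quoted above). The function names are hypothetical and are not part of QUDA or of this change.

```python
from collections import Counter

# Approximate per-context overhead quoted in the description (assumption: flat cost).
CONTEXT_MIB = 150


def contexts_per_gpu(num_ranks, init_before_pinned_alloc):
    """Count CUDA contexts per GPU for num_ranks processes on one node.

    Assumes one process per GPU, with local rank i assigned to GPU i.
    """
    counts = Counter()
    for local_rank in range(num_ranks):
        assigned_gpu = local_rank
        if not init_before_pinned_alloc:
            # Pinned allocation runs first, so a context appears on the
            # default device (GPU 0) before the process selects its own GPU.
            counts[0] += 1
            if assigned_gpu != 0:
                # Later device selection creates a second context.
                counts[assigned_gpu] += 1
        else:
            # The device is selected first, so there is one context per process.
            counts[assigned_gpu] += 1
    return counts


def stray_mib_on_gpu0(num_ranks, init_before_pinned_alloc):
    """Memory on GPU 0 held by contexts of processes that do not use GPU 0."""
    counts = contexts_per_gpu(num_ranks, init_before_pinned_alloc)
    return (counts[0] - 1) * CONTEXT_MIB


if __name__ == "__main__":
    print("without early init:", stray_mib_on_gpu0(8, False), "MiB")
    print("with early init:   ", stray_mib_on_gpu0(8, True), "MiB")
```

Under these assumptions, an 8-process node carries seven stray contexts on GPU 0 without the change and none with it. The real overhead depends on the driver and GPU model.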