plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Writing the coordination with openACC #1075

Open Iximiel opened 1 month ago

Iximiel commented 1 month ago

As I did with CUDA (#1028), and tried to do with ArrayFire (#1049) and pytorch, I have rewritten the COORDINATION CV using openACC as the accelerator.

Here's the result, using the new benchmark tool:

[benchmark plot: sc_100]

It is slower than CUDA, but writing in openACC may feel more familiar: it looks like openMP, and you can let the compiler work out how to parallelize the loops, so you do not need the `<<<>>>` syntax to launch kernels as in CUDA. It is also much more flexible than the tensor libraries. About compilation I have mixed feelings, as you can read in the spoiler below.

<details>
<summary>Details about compilation and script used</summary>

I ran everything on my workstation (NVIDIA T1000 8GB + AMD Ryzen 5 PRO 5650G). I used nvhpc 24.3, downloaded already compiled from the Nvidia site. The environment used is actually slightly complex:

- I compiled plumed from master with plain gcc+mpi.
- Then I compiled the plugin with my wild Makefile, which uses nvc++ for the accelerated part and g++ for the main body of the CV.
- Then I ran the benchmark **without nvhpc in the environment**, because it conflicts with the mpi that I used with plumed:

```bash
nsteps=100
list_of_natoms="500 2000 4000 6000 8000 10000 12000 14000 16000"
export PLUMED_NUM_THREADS=8
useDistr="line sc"
useDistr="sc"
for distr in $useDistr; do
  for natoms in $list_of_natoms; do
    fname="${distr}_wACC_${PLUMED_NUM_THREADS}threads_${natoms}_Steps${nsteps}"
    plumed benchmark --plumed="plumed.dat:cudasingleplumed.dat:accplumed.dat" \
      --natoms=${natoms} --nsteps=${nsteps} --atom-distribution=${distr} >"${fname}.out"
    grep -B1 Comparative "${fname}.out"
  done
done
rm -f bck.*
```

(I have to try to make everything run compiled with plain nvhpc. But since nvhpc does not like the keyword `auto` for deducing return types (as used in tools/MergeVectorTools.h:54), it needs some massaging of the plumed source, and I did not want to touch src for this project.)

</details>

If you look at the code, I also added a few extra headers.

These modifications are a prerequisite to the use of openACC, but are completely independent from it. If you are ok with this, I would like to open a PR with a patch to the original .h files.

GiovanniBussi commented 1 month ago

Regarding the vector and tensor with a generic type: I tried to do the same a few years ago, and I remember that with the Intel compiler the performance was measurably affected (to my surprise). Maybe you can double check this. In case it's true, maybe we can duplicate the code. Otherwise I am also happy with a more general version; it would be useful in other parts of the code as well.

GiovanniBussi commented 1 month ago

> (I have to try to make everything run compiled with plain nvhpc. But since nvhpc does not like the keyword `auto` for deducing return types (as used in tools/MergeVectorTools.h:54), it needs some massaging of the plumed source, and I did not want to touch src for this project.)

If it's limited to this, maybe we can adjust the code. It would be ideal if we could also install nvc++ in one job on GitHub Actions to test for this.

Iximiel commented 1 month ago

> Regarding the vector and tensor with a generic type: I tried to do the same a few years ago, and I remember that with the Intel compiler the performance was measurably affected (to my surprise). Maybe you can double check this. In case it's true, maybe we can duplicate the code. Otherwise I am also happy with a more general version; it would be useful in other parts of the code as well.

Ok, so I will set up the PR as a WIP; then I will produce some benchmarks.

> If it's limited to this, maybe we can adjust the code. It would be ideal if we could also install nvc++ in one job on GitHub Actions to test for this.

I'm trying to do it in #1076