shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

Parallelization of exact exchange calculations #261

Closed jcoulter12 closed 1 year ago

jcoulter12 commented 1 year ago

Hi Shankar,

I'm testing a calculation with the HSE06 functional in JDFTx, and I'm wondering about the best strategy for parallelizing it. Following advice posted in past issues, I'm running electronic-scf so that the exchange is computed with the ACE method.

I have the option to run this calculation on CPUs or GPUs. Would this calculation benefit from GPUs? Additionally, when using CPUs, my usual strategy with JDFTx is to set MPI tasks = nStates and then spread all available cores across those tasks as threads. Is this still advisable in a calculation where the dominant cost is the ACE calculation?

Thanks as always for any advice! Jenny

shankar1729 commented 1 year ago

Hi Jenny,

If your system size is big enough for the band FFTs to occupy your GPU, you should be able to get good speedups from the GPU. For example, a bulk system may not be able to leverage the GPU effectively in the current EXX implementation, but a supercell / slab should generally be able to do so.

The EXX implementations are parallelized over bands as well, so you can use more processes than nStates. This does involve wavefunction communications between processes, so there will be a loss of efficiency beyond some (hopefully large) number of processes. Benchmark a single electronic step for your system with varying CPU and GPU process counts to determine this.

For the CPU case, you may have better performance running in pure MPI mode if your system is small, but may benefit from hybrid MPI/threads for larger supercell / slab systems. (Terminology clarification: MPI tasks = processes, and in the hybrid mode, you would be using all available cores with appropriate number of threads.)
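For concreteness, a launch sketch along these lines (the input/output file names and process counts here are just placeholders to adapt to your machine):

```bash
# Pure MPI on CPU: one process per core, a single thread each
mpirun -n 64 jdftx -c 1 -i hse06.in -o pureMPI.out

# Hybrid MPI + threads on CPU: fewer processes, several threads each,
# keeping (processes x threads per process) equal to the available cores
mpirun -n 16 jdftx -c 4 -i hse06.in -o hybrid.out

# GPU build: typically one process per GPU
mpirun -n 2 jdftx_gpu -i hse06.in -o gpu.out
```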

Finally, SCF with ACE may be faster for metals, but for gapped systems where you are not converging empty states, electronic minimize may still turn out to be faster. (The ACE trick optimizes the inner eigenvalue solve within SCF, but if a variational minimize ends up taking fewer steps than SCF cycles, the minimize could still win.)
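In input-file terms, the choice looks roughly like this (a minimal sketch assuming the rest of the input is already set up; the iteration count is a placeholder):

```
elec-ex-corr hyb-HSE06    # hybrid functional, i.e. exact exchange

# Option A: SCF, with ACE accelerating the inner eigenvalue solves
electronic-scf

# Option B: variational minimization (the behavior when electronic-scf
# is absent); can still win for gapped systems with no empty states
#electronic-minimize nIterations 200
```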

Please try it and let me know if you have further questions. Also, posting your performance experience on various hardware would provide useful guidance for future users.

Thanks! Best, Shankar

jcoulter12 commented 1 year ago

Hi Shankar,

Thanks for the helpful reply, this is just what I needed to know.

My system is a 2D material, so there's definitely enough FFT work to benefit from the GPUs. Great to know the EXX part also has band parallelism and can use more than one MPI process per state. My system is also slightly metallic, so for now I've been playing with SCF/ACE; still, your point about elec-min converging faster in some cases is helpful.

In exploring what works best for my system, I ran a couple of tests on CPUs and GPUs, which I'll leave here in case they're helpful to anyone else in the future. The k-mesh I picked for these tests gave nStates = 12, which may be worth noting since that number divides evenly across the process counts used below.

For GPUs (A100s), going from 1 to 2 GPUs gave almost ideal speedup, but using more than 2 actually caused a slowdown, which seems important to note for other users.

| nGPUs | Time/VxxIter [seconds] |
|------:|-----------------------:|
| 1     | 5830                   |
| 2     | 2977                   |
| 3     | 5517                   |
| 4     | 5975                   |

For CPUs, I used Intel Ice Lake nodes with 64 cores/node. Running with pure MPI parallelism (-c 1):

| MPI processes | Time [seconds] |
|--------------:|---------------:|
| 64            | 4621           |
| 128           | 2676           |
| 256           | 1667           |

Not exactly ideal scaling, but there was meaningful benefit up to 256 cores. Both sets of runs used the same DFT parameters, so we can also see that 2 GPUs were roughly comparable to ~128 CPU cores.

One additional test was to see whether the CPU runs could benefit from shifting to hybrid MPI/threads. This did not work well in my case, though perhaps I shifted too far: running on 128 cores with 4 MPI processes and 32 threads per process (-n 4, -c 32), one VxxLoop took ~7 hours.

As usual, glad to have such a nicely GPU-accelerated DFT package! Unless you have more questions about these quick benchmarks, feel free to close the issue.

Thanks, Jenny

shankar1729 commented 1 year ago

This is great, thanks! For the GPUs, was the code compiled with CudaAwareMPI=yes against a CUDA-aware MPI? If not, it would be worth looking into, as that could be a significant communication bottleneck.

Also, does the cluster have 2 GPUs per node? If so, it could be that the MPI supports fast inter-GPU communication within a node, but is not using a high-speed interconnect / direct GPU copies for transfers between GPUs on different nodes.

Best, Shankar

jcoulter12 commented 1 year ago

Hi Shankar,

My copy of JDFTx was not compiled with CudaAwareMPI -- for a long time, I didn't have access to a CUDA-aware MPI on the cluster at Harvard and lacked the motivation to set it up myself. Apparently it's now available, and I was able to compile JDFTx with it. However, when I run the GPU executable, it segfaults just after the first pseudopotential is read in, and the stack trace unfortunately doesn't give me any good hints about what went wrong. It could be an issue with JDFTx, but it's also very likely an issue with something on the cluster I'm using.

One thing related to this that might be useful for someone else to know: I could not build JDFTx unless I specified -D CUDA_NVCC_FLAGS="-std=c++11", because the build failed while compiling the CUDA-accelerated parts of JDFTx with the error: "error: C++11 is required for this Thrust feature; please upgrade your compiler or pass the appropriate -std=c++XX flag to it."
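For reference, my configure step ended up looking roughly like this (the source path is a placeholder for my checkout, and EnableCUDA=yes is simply the flag I was already using for the GPU build):

```bash
cmake -D EnableCUDA=yes \
      -D CudaAwareMPI=yes \
      -D CUDA_NVCC_FLAGS="-std=c++11" \
      ../jdftx/jdftx
make -j8
```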

As for GPUs per node: we have 4 A100s per node.

Thanks again for the help with this! Jenny

shankar1729 commented 1 year ago

Most likely this means the CUDA-aware MPI requires some extra flags to work correctly, either on the mpirun command line or as environment variables. Is there any documentation on the cluster about CUDA-aware MPI usage?

Also, is this MPI OpenMPI- or MPICH-based?
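If it's OpenMPI-based, a generic starting point would be something like the sketch below; the exact MCA settings are cluster-specific, so treat this as a guess and defer to the site documentation (the input/output names are again just placeholders):

```bash
# Check whether the MPI library itself was built with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# If the stack uses UCX, request the UCX PML explicitly at launch
mpirun --mca pml ucx -n 2 jdftx_gpu -i hse06.in -o gpu.out
```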

Best, Shankar

jcoulter12 commented 1 year ago

Hi Shankar,

Turns out there was an issue with the modules I was using. Once I was able to fix that, there was a huge boost from CUDA-aware MPI.

For my 2-GPU test case, it reduced the runtime as follows:

|                        | Time/VxxIter [seconds] |
|------------------------|-----------------------:|
| without CUDA-aware MPI | 2977                   |
| with CUDA-aware MPI    | 437                    |

That is just amazing. Thanks for this tip! Feel free to close the issue unless there's any other test you're curious to see.

Best, Jenny

shankar1729 commented 1 year ago

Okay, that's more like what I would have expected! Thanks for the performance benchmarks.

Best, Shankar