shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

JDFTx Fails with MPI_Abort under all conditions on Polaris #336

Closed: ColinBundschu closed this issue 5 days ago

ColinBundschu commented 3 months ago

All of the jobs are failing with the following error:

MPICH ERROR [Rank 1] [job id f1ce81c3-6a27-482f-bc31-942246dcb469] [Wed Jun  5 19:47:53 2024] [x3109c0s19b1n0] - Abort(1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
MPICH ERROR [Rank 3] [job id f1ce81c3-6a27-482f-bc31-942246dcb469] [Wed Jun  5 19:47:53 2024] [x3109c0s19b1n0] - Abort(1) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
x3109c0s19b1n0.hsn.cm.polaris.alcf.anl.gov: rank 3 exited with code 255
x3109c0s19b1n0.hsn.cm.polaris.alcf.anl.gov: rank 1 died from signal 15

Here is the stack trace that is dumped:

cbu@polaris-login-04:~/dft_out/FeN4C10/clean> cat jdftx-stacktrace
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z10printStackb+0x27) [0x147e12f863a7]
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z14stackTraceExiti+0xd) [0x147e12f86acd]
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z8choleskyRK6matrixb+0x372) [0x147e12f94d72]
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z11orthoMatrixRK6matrix+0xdc) [0x147e12f9549c]
/home/cbu/jdftx/build/libjdftx_gpu.so(_ZN8ElecVars14orthonormalizeEiP6matrix+0x141) [0x147e131323c1]
/home/cbu/jdftx/build/libjdftx_gpu.so(_ZN16LatticeMinimizer4stepERK15LatticeGradientd+0xcb6) [0x147e1322d1f6]
/home/cbu/jdftx/build/libjdftx_gpu.so(_ZN16LatticeMinimizer8minimizeERK14MinimizeParams+0x64) [0x147e1322e464]
/home/cbu/jdftx/build/jdftx_gpu() [0x40868e]
/lib64/libc.so.6(__libc_start_main+0xef) [0x147e02a3e24d]
/home/cbu/jdftx/build/jdftx_gpu() [0x407ffa]
cbu@polaris-login-04:~/dft_out/FeN4C10/clean>
shankar1729 commented 3 months ago

What do you mean by all conditions? Weren't the jobs working previously?

ColinBundschu commented 3 months ago

Previously I had only run them for a small, fixed number of ionic steps. These runs go all the way to convergence, so each one runs for between 5 and 20 iterations and then crashes.

shankar1729 commented 3 months ago

I think you may just want to install gcc and openmpi in your directory and build jdftx with that. I'm not sure debugging on the very restrictive toolchain at Polaris is worth it any more.
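
For reference, a minimal sketch of such a build, assuming the EnableCUDA and CudaAwareMPI options from the JDFTx GPU build instructions; the module names and source path below are placeholders and would need to be adapted to Polaris:

# Load a gcc-based environment (placeholder module names); put a locally
# installed OpenMPI first in PATH so CMake's FindMPI picks it up.
module load gcc cudatoolkit-standalone
mkdir -p ~/jdftx/build-gcc && cd ~/jdftx/build-gcc
CC=gcc CXX=g++ cmake \
    -D EnableCUDA=yes \
    -D CudaAwareMPI=yes \
    /path/to/jdftx/source
make -j 16
make test    # run the bundled test suite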

ColinBundschu commented 3 months ago

I agree. Would using gcc allow full GPU support? I ask because I know that the gcc they run at Polaris does not support GPUs. I assumed there is a very good reason for that, but given how things have been going, I am not sure there is.

shankar1729 commented 3 months ago

There shouldn't be one. We have used gcc + CUDA 12 + Cray MPI on A100 and H100 systems at NERSC and NREL for thousands of calculations and have not had such issues. The problem seems to be nvhpc.

ColinBundschu commented 3 months ago

So you don't know what this is all about then? https://docs.alcf.anl.gov/polaris/compiling-and-linking/gnu-compilers-polaris/

shankar1729 commented 3 months ago

Don't know why. NERSC, with the same Cray + Nvidia hardware, recommends the GNU toolchain for GPUs:

https://docs.nersc.gov/systems/perlmutter/software/#compilers

ColinBundschu commented 3 months ago

I don't know what the people running Polaris are smoking... it compiled and ran all GPU tests with gcc without issue. I will test full calculations now, but I am frankly quite frustrated with how Polaris is managed.

shankar1729 commented 3 months ago

:)

Great, let me know! Would be happy to undo some past changes and mark nvhpc as officially broken for jdftx to avoid future issues.

ColinBundschu commented 3 months ago

Ok, so the jobs compiled with gcc are failing in exactly the same way as those built with the nvidia compilers.

ColinBundschu commented 3 months ago

Ok, my working theory is that when the number of k-points exceeds the number of GPUs, the GPU runs out of memory. I am now testing whether a 1:1 mapping solves the issue. Am I correct that there is no benefit to having more GPUs than k-points?

shankar1729 commented 3 months ago

That is correct: don't exceed one GPU per k-point. See if the 1:1 case works, but the error you posted previously did not look like a memory issue; it looked like incorrect math leading to a non-Hermitian overlap matrix (assuming the Cholesky in the stack trace is actually what causes the failure).

Maybe also send a full example log file with these errors.
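
A minimal sketch of one way to enforce a strict one-rank-per-GPU mapping on a 4-GPU Polaris node; the gpu_bind.sh wrapper and the input/output file names are hypothetical, and PMI_LOCAL_RANK assumes the Cray MPICH launcher:

#!/bin/bash
# gpu_bind.sh: give each local MPI rank its own GPU before exec'ing jdftx_gpu.
num_gpus=4                              # A100s per Polaris node
local_rank=${PMI_LOCAL_RANK:-0}         # local rank id set by the launcher
export CUDA_VISIBLE_DEVICES=$(( local_rank % num_gpus ))
exec "$@"

# Launch with exactly as many ranks as k-points (here 4), one rank per GPU:
# mpiexec -n 4 --ppn 4 ./gpu_bind.sh ~/jdftx/build/jdftx_gpu -i FeN4C10.in -o FeN4C10.out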

ColinBundschu commented 3 months ago

By full example log file, do you mean the jdftx output file? Other than the jdftx output file, the stack trace, and the error I posted above, there is no additional output from the program that I am aware of.

Assuming the jdftx output file is what you are after, here is an example from the gcc-compiled version of jdftx. There was no stack trace dumped with this failed run. It is possible the stack trace I sent earlier was a lingering file from a different error we had already debugged.

Here is the std out from PBS:

NUM_OF_NODES= 1 TOTAL_NUM_RANKS= 4 RANKS_PER_NODE= 4 THREADS_PER_RANK= 1

Here is the std err from PBS:

MPICH ERROR [Rank 3] [job id ef025f29-61ad-4057-9162-fdbdbef994cd] [Thu Jun  6 17:18:51 2024] [x3109c0s1b0n0] - Abort(1) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
x3109c0s1b0n0.hsn.cm.polaris.alcf.anl.gov: rank 3 exited with code 255
x3109c0s1b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 15

The jdftx output file is attached: example.txt

ColinBundschu commented 3 months ago

So far the runs with at least as many GPUs as k-points are progressing much further than any of the other runs did (they are currently still in flight). I also noticed from the profiling that the runs need about 18.5 GB of memory for the state. It seems plausible that a GPU handling 2 k-points would need double that, about 37 GB. Since each GPU has 40 GB, that is extremely close to running out of memory. Perhaps there is enough memory at first, but the state grows a little over time and causes a crash. That could explain the intermittent nature and partial completion of the runs. It seems like a simple and complete enough explanation to pass Occam's razor.
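
If it helps to confirm the memory theory, one simple check (a sketch; the log file name is arbitrary) is to sample GPU memory on the compute node while the job runs:

# Record per-GPU memory usage every 10 seconds alongside the run:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 10 > gpu_mem.log &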

shankar1729 commented 3 months ago

That's true: it could be a bad condition caused by the MPI GPU buffers. I've noticed issues with Cray's CUDA-aware MPI when I got close to full GPU memory usage. So, rather than the state getting bigger with time, this is more likely a memory leak of sorts in the MPI library. I have not seen this happen elsewhere, or with openmpi's CUDA-aware support.

So, yes, if the one-k-point-per-GPU strategy works out, that's the best approach. If it does, also benchmark the nvhpc and gcc versions on the same job for both correctness and performance.

ColinBundschu commented 3 months ago

Ok, so of the two tests, one ran to convergence without issue (yay!), and the other seems to have stopped due to convergence problems. From what I can tell, the convergence problems in the one that did not finish are entirely due to how I am using jdftx and not a problem with jdftx itself, since jdftx decided to stop on its own rather than simply halting midway through.

For my sanity, can you verify that my assessment is correct and that the following is indeed not a problem with the GPU/code? (I think the issue here is that my ionic alpha is too large and it is overstepping, leading to extremely large forces. This suggests I am too far from the minimum for this optimization algorithm to make intelligent step predictions and should use something like SD. Please let me know if you have any insights here, but just confirming that it's not a GPU issue is enough!) dft_outFeN4C10.txt

shankar1729 commented 3 months ago

Yes, this does not seem to be a GPU-related bug. Also, the issue seems to be electronic, not fluid, convergence.

Your Fermi smearing seems really high (0.1 Eh): maybe that led to unphysical / bad geometries earlier in your ionic convergence? Try from a good initial geometry with a smearing of 0.01 Eh or less.

Also, speed up your calculation by cutting down the vacuum size substantially and using embedded slab-mode truncation.
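
A sketch of what those two changes might look like as JDFTx input commands (command names per the JDFTx documentation; the input file name and the truncation center are placeholders that depend on your geometry):

# Append to the calculation's input file:
cat >> FeN4C10.in << 'EOF'
elec-smearing Fermi 0.01          # reduce the smearing width to 0.01 Hartree
coulomb-interaction Slab 001      # truncate the Coulomb interaction along z
coulomb-truncation-embed 0 0 0.5  # center of the embedded truncation region (placeholder)
EOF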

ColinBundschu commented 3 months ago

Ah, good catch on the truncation. I am used to my oxide slabs, which are larger than half the unit cell in the z direction. Do you have a recommendation for the vacuum size? I will also try smaller smearing.

shankar1729 commented 3 months ago

I'd usually pick 5-7 Angstroms on each side, i.e. about 20-25 bohrs combined on both sides beyond the z-range of the atomic positions.
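
For reference on the unit conversion: 1 Angstrom is about 1.89 bohr, so 5-7 Angstroms per side is 10-14 Angstroms of combined vacuum, or roughly 19-26 bohrs.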