Closed ColinBundschu closed 2 months ago
What do you mean all conditions? Weren't the jobs working previously?
Previously I had only run them for a small, fixed number of ionic steps. This time they are running to convergence, so they go for somewhere between 5 and 20 iterations and then crash.
I think you may just want to install gcc and openmpi in your directory and build jdftx with that. I'm not sure debugging on the very restrictive toolchain at Polaris is worth it any more.
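Roughly, the build I have in mind would look like the sketch below (not a recipe: the install paths are placeholders for wherever you put your own gcc/openmpi, and the CMake option names EnableCUDA / CudaAwareMPI are from my memory of the JDFTx build instructions, so verify them against the docs before using):

# put your own toolchain first on PATH (paths are placeholders)
export PATH=$HOME/sw/gcc/bin:$HOME/sw/openmpi/bin:$PATH
# configure JDFTx against that toolchain with GPU support
cmake \
  -D CMAKE_C_COMPILER=$HOME/sw/gcc/bin/gcc \
  -D CMAKE_CXX_COMPILER=$HOME/sw/gcc/bin/g++ \
  -D EnableCUDA=yes \
  -D CudaAwareMPI=yes \
  ../jdftx-git/jdftx     # path to the JDFTx source checkout
make -j 16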
I agree. Would using gcc allow full GPU support? I ask because I know that the gcc toolchain they run at Polaris does not support GPUs. I assume there is a very good reason for that, but given how things have been going I am not sure there actually is one.
There shouldn't be one. We have used gcc + CUDA 12 + Cray MPI with A100 and H100 systems at NERSC and NREL for thousands of calculations, and have not had such issues. The problem seems to be nvhpc.
So you don't know what this is all about then? https://docs.alcf.anl.gov/polaris/compiling-and-linking/gnu-compilers-polaris/
Don't know why. NERSC, with the same Cray + NVIDIA hardware, recommends the GNU toolchain for GPUs:
https://docs.nersc.gov/systems/perlmutter/software/#compilers
I don't know what the people running Polaris are smoking... jdftx compiled and ran all GPU tests using gcc without issue. I will test full calculations now, but I am frankly quite frustrated with how Polaris is managed.
:)
Great, let me know! Would be happy to undo some past changes and mark nvhpc as officially broken for jdftx to avoid future issues.
Ok, so the jobs compiled with gcc are failing in the exact same way as the ones compiled with the NVIDIA compilers.
Ok, my working theory is that when the number of k-points exceeds the number of GPUs, the GPU runs out of memory. I am now testing to see if a 1:1 mapping solves this issue. Am I correct that there is no benefit to having more GPUs than k-points?
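Concretely, the 1:1 test is being launched roughly as sketched below (the mpiexec options mirror the ALCF example job scripts, and the GPU-affinity wrapper, executable, and file names are placeholders for whatever the actual job uses):

NNODES=1
NRANKS_PER_NODE=4                        # one MPI rank per GPU on a 4-GPU Polaris node
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))
# the job's k-point count is <= NTOTRANKS, so each GPU holds at most one k-point
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} \
    ./set_affinity_gpu_polaris.sh ./jdftx_gpu -i totalE.in -o totalE.out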
That is correct: don't exceed one GPU per k-point. See if the 1:1 case works; however, the error you posted previously did not look like a memory issue, but rather like incorrect math leading to a non-Hermitian overlap matrix (assuming the Cholesky call in the stack trace is actually what causes the failure).
Maybe also send a full example log file with these errors.
By full example log file, do you mean the jdftx output file? Other than the jdftx output file, the stack trace, and the error I posted above, there is no additional output from the program that I am aware of.
Assuming that the jdftx output file is what you are after, here is an example using the gcc-compiled version of jdftx. There was no stack trace dumped with this failed run. It is possible the stack trace I sent earlier was a lingering file from a different error we had already debugged.
Here is the std out from PBS:
NUM_OF_NODES= 1 TOTAL_NUM_RANKS= 4 RANKS_PER_NODE= 4 THREADS_PER_RANK= 1
Here is the std err from PBS:
MPICH ERROR [Rank 3] [job id ef025f29-61ad-4057-9162-fdbdbef994cd] [Thu Jun 6 17:18:51 2024] [x3109c0s1b0n0] - Abort(1) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
x3109c0s1b0n0.hsn.cm.polaris.alcf.anl.gov: rank 3 exited with code 255
x3109c0s1b0n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 15
The output file for jdftx is attached: example.txt
So far the runs with at least as many GPUs as k-points are progressing much further than all of the other runs did (they are currently still in flight). Additionally, the profiling shows that each run needs about 18.5 GB of memory for the state. It seems plausible that a GPU holding 2 k-points would need roughly double that, about 37 GB. Since each GPU has only 40 GB, that is extremely close to running out of memory. Perhaps at first there is enough memory, but the state grows slightly over time and eventually causes a crash. That would explain the intermittent nature and partial completion of the runs, and it seems like a simple and complete enough explanation to pass Occam's razor.
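One way to check this would be to watch per-GPU memory while a job with 2 k-points on one GPU runs (a sketch; run it from a second shell on the compute node, e.g. in an interactive job):

# report per-GPU memory usage every 10 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 10
# if memory.used creeps toward the 40 GB limit over the 5-20 iterations before the crash,
# that would support the out-of-memory explanation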
That's true: it could be a bad condition caused by the MPI GPU buffers. I've noticed issues with Cray CUDA-aware MPI when I got close to full GPU memory usage. So, rather than the state getting bigger with time, this is more likely a memory leak of sorts from the MPI library. I have not seen this happen elsewhere, or with openmpi's CUDA-aware support.
So, yes, if the one-k-point-per-GPU approach works out, that's the best strategy. If it does, also benchmark the nvhpc and gcc versions on the same job for both correctness and performance.
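Something along these lines would do for the comparison (a sketch only: the build-directory layout, affinity wrapper, and file names are placeholders; adapt to however your runs are organized):

# run the same input with both builds, one rank per GPU
for build in gcc nvhpc; do
    mpiexec -n 4 --ppn 4 \
        ./set_affinity_gpu_polaris.sh ./build-${build}/jdftx_gpu \
        -i totalE.in -o totalE.${build}.out
done
# then compare the converged energies and the per-iteration timings reported in the two output files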
Ok, so of the two tests, one ran to convergence without issue (yay!), and the other seems to have stopped due to convergence issues. From what I can tell, the convergence issues in the one that did not finish are entirely due to how I am using jdftx and not a problem with jdftx itself, since jdftx decided to stop running and did not simply halt midway through.
For my sanity, can you verify that my assessment is correct and that the following is indeed not a problem with the GPU/code? (I think the issue here is that my ionic alpha is too large and it is overstepping, leading to extremely large forces. This suggests I am too far from the minimum for this optimization algorithm to make intelligent step predictions and should use something like SD. Please let me know if you have any insights here, but just confirming that it's not a GPU issue is enough!) dft_outFeN4C10.txt
Yes, this does not seem to be a GPU-related bug. Also, the issue seems to be electronic, not fluid, convergence.
Your Fermi smearing seems really high (0.1 Eh): maybe that led to unphysical / bad geometries earlier in your ionic convergence? Try from a good initial geometry with a smearing of 0.01 Eh or less.
Also, speed up your calculation by cutting down the vacuum size substantially and using embedded slab-mode truncation.
Ah, good catch on the truncation. I am used to my oxide slabs, which are larger than half the unit cell in the z direction. Do you have a recommendation on the vacuum size? I will also try smaller smearing.
I'd usually pick 5-7 Angstroms on each side, i.e., about 20-25 bohrs combined on both sides beyond the z-range of the atomic positions.
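Putting the smearing and truncation suggestions together, the relevant input additions would look something like the sketch below. The command names are from my recollection of the JDFTx input documentation, the file name surface.in is a placeholder, and the truncation center is a placeholder that needs to be set to the center of your slab in whatever coordinate system your input uses; the vacuum itself is reduced by shrinking the c lattice vector in your lattice definition.

cat >> surface.in << 'EOF'
elec-smearing Fermi 0.01          # smaller smearing, 0.01 Eh or less
coulomb-interaction Slab 001      # slab-mode Coulomb truncation along the z direction
coulomb-truncation-embed 0 0 0    # embedded truncation; replace 0 0 0 with the slab center
EOF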