shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

Intermittent hanging on wavefunction saving #335

Closed · ColinBundschu closed 1 month ago

ColinBundschu commented 1 month ago

I have been extremely hesitant to submit this bug due to its intermittent nature and the possibility that it was related to the way I set up my filesystem. However, after spending months trying to track it down on my end, I am now nearly certain that the problem comes from JDFTx itself.

What happens is that about once every 500 wavefunction writes, JDFTx hangs and does not write the wavefunction at all. The last line in the output file is Dumping 'filename.wfns' ... without the usual done or even a newline. I have verified this happens on both NFS and Ceph over extremely reliable (InfiniBand) connections. Working with the team at Jetstream2, we have verified that there do not appear to be any underlying hardware or software issues at play beyond JDFTx simply ceasing to write. The event logs contain no information we could discern (note that we were primarily looking for filesystem or connection issues).

test.txt

Unfortunately, this is an extremely serious problem if I plan to use Polaris, since the short wall times mean I need to run small numbers of ionic/lattice steps with each job submission. When a write fails, it leaves the calculation without a wavefunction to resume from, erasing significant progress. Since each calculation requires around 100 steps to converge, this means approximately 20% of calculations will fail.
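
(Treating each write as an independent 1-in-500 failure over ~100 writes per calculation gives 1 - (499/500)^100 ≈ 0.18, which is where my ~20% figure comes from.)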

shankar1729 commented 1 month ago

Does this happen only for multi-node jobs, or for single node jobs as well?

Try compiling with the flag MPISafeWrite turned on, if you haven't already. It is intended to work around MPI-IO issues, which are my best guess for the hangs.
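
If it's not already set, and assuming the usual CMake option syntax for JDFTx build flags, the reconfigure would look something like this (the source/build paths are just placeholders):

```
# From your build directory; the path to the JDFTx source tree is a placeholder
cmake -D MPISafeWrite=yes /path/to/jdftx/jdftx
make -j4
```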

Besides that, there is almost nothing I can do to debug/fix an issue that occurs this infrequently. Is it only happening on a specific machine? Also, if you are running hundreds of steps, the cost to restart without the wfns should not be too bad as long as you have the updated lattice/ionpos. You also have the option of adding iteration numbers to the output filename (see dump-name, and the example below), but this may cause a disk-space issue if you start writing 100 copies of the wavefunction.
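
For example, something like this in the input file (I believe $VAR and $ITER are the relevant substitution variables, but double-check the dump-name documentation for your version; the base name here is made up):

```
# Hypothetical base name; $ITER adds the step number, $VAR the dumped quantity
dump-name mycalc.$ITER.$VAR
```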

shankar1729 commented 1 month ago

One final thing: if you get a hang, can you ssh into that node, attach gdb to each process, and see where it's stuck?
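
Something along these lines should work (the pgrep pattern is just a guess at your process name):

```
# On the hung node: find the JDFTx process IDs (pattern is an assumption)
pgrep -a jdftx

# Attach to each PID and dump backtraces for every thread
gdb -p <PID>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
```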

ColinBundschu commented 1 month ago

These are single-node CPU jobs, so I don't think it's using MPI at all. It happens on all of the machines I use. The cost shouldn't be too bad to restart without the wavefunctions, but in practice it's actually almost as bad as starting over. This is very surprising to me, but I have numerous test cases I can demonstrate it with.

Yes, I can ssh into the nodes that are hung. In fact, I have a few that are hung right now. If you want, I can copy your ssh key over and let you ssh in to debug it directly. The nodes are internet-accessible Ubuntu instances.

shankar1729 commented 1 month ago

Sure, I can take a look. I'll email you a public key; reply to that with login details for a hung instance.