Closed sweitzner closed 7 years ago
Thank you for reporting this! I have fixed two of the wrapper scripts by disabling OpenMP for these builds so that they rely only on MKL/sequential. Using MKL/sequential simply makes such a build pure MPI. If you actually run fewer ranks per node than you have cores, it might be beneficial to rely on MKL/parallel (with internal OpenMP-based parallelization). I will very soon update the build/run recipe to include more information on how to run with MPI and OpenMP.
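To illustrate the point about running fewer ranks per node than cores, here is a minimal Python sketch that derives a thread count per rank. The helper name `thread_env` is hypothetical and not part of XCONFIGURE or QE. It assumes cores are split evenly across ranks and that `OMP_NUM_THREADS` and `MKL_NUM_THREADS` are the variables controlling threading (the latter only matters for an MKL/parallel build).

```python
def thread_env(cores_per_node: int, ranks_per_node: int) -> dict:
    """Suggest thread settings for a given number of MPI ranks per node.

    Hypothetical helper: divides the cores of a node evenly among the ranks.
    """
    if ranks_per_node < 1 or ranks_per_node > cores_per_node:
        raise ValueError("ranks_per_node must be between 1 and cores_per_node")
    # Integer division: any leftover cores simply stay idle.
    threads = cores_per_node // ranks_per_node
    return {
        "OMP_NUM_THREADS": str(threads),
        # Has no effect when linked against MKL/sequential.
        "MKL_NUM_THREADS": str(threads),
    }


if __name__ == "__main__":
    # One rank per core: effectively pure MPI, one thread per rank.
    print(thread_env(16, 16))
    # Four ranks on a 16-core node: four threads per rank.
    print(thread_env(16, 4))
```

With one rank per core the result is a single thread per rank, which matches the pure-MPI behavior of the MKL/sequential builds.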
Regarding your question, OpenMP does not prevent codes from running on multiple nodes. In general, it is beneficial to build hybrid applications (MPI+OpenMP). However, people often perceive OpenMP as less beneficial in the case of QE (which has little to do with OpenMP in particular). In fact, when you aim to scale out to the maximum number of nodes beneficial for a particular workload, OpenMP may help you scale further. This is because a high number of ranks potentially adds more communication, and a high number of ranks per node may also increase memory consumption.
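The trade-off between ranks and threads can be sketched by listing the hybrid layouts that fully occupy a set of nodes. This is only an illustration: the function name `hybrid_layouts` is hypothetical, and the example values (2 nodes, 16 cores per node) are assumed, not taken from the cluster discussed here.

```python
def hybrid_layouts(nodes: int, cores_per_node: int) -> list:
    """List (total_ranks, ranks_per_node, threads_per_rank) combinations.

    Only layouts where ranks_per_node * threads_per_rank equals
    cores_per_node are returned, so every core is used exactly once.
    """
    layouts = []
    for ranks_per_node in range(1, cores_per_node + 1):
        if cores_per_node % ranks_per_node == 0:
            threads_per_rank = cores_per_node // ranks_per_node
            layouts.append((nodes * ranks_per_node, ranks_per_node, threads_per_rank))
    return layouts


if __name__ == "__main__":
    # Assumed example: 2 nodes with 16 cores each.
    for total, per_node, threads in hybrid_layouts(2, 16):
        print(f"{total:3d} ranks total, {per_node:2d} per node, {threads:2d} threads each")
```

Layouts near the top of the list use few ranks with many threads each (less MPI communication and fewer per-rank copies of data), while the last entry is the pure-MPI case with one thread per rank.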
Hmm. If I use the configure script as is (with OpenMP) and request two full nodes, could this be why QE hangs? The job starts successfully as far as the scheduler is concerned, but pw.x never actually starts to run. At this point I'm just trying to figure out whether this is a problem with our cluster, with the way I am building the code, or with the code itself. I feel like I would have heard something about the latter, however.
Thanks for taking a look at this!
For sure, I am running pw.x as a hybrid application (MPI with OpenMP). The problem you observed is not a general issue of whether OpenMP is used or not. Does it hang in the Davidson diagonalization stage (soon after passing the initialization phase)?
No, actually the calculation does not appear to even start. At least, nothing is written to the console and the temporary / scratch folder for wavefunction data is not created. It just appears to hang.
If it helps, I can check on my end tomorrow (this requires your input file). Feel free to guess my email address (form: forename.name @ company . com). You may also share more details, such as the compiler version.
That would be great, thank you!
Closing the issue as it appears to be independent of XCONFIGURE.
I had a problem running QE 6.0 on multiple nodes on our cluster after configuring with "configure_qe_snb.sh", and I think it's because the code is being built with OpenMP. Looking at that particular configure script, it seems that it is indeed being built with OpenMP. Can this be updated? Am I correct that OpenMP prevents codes being run on multiple nodes?