shankar1729 / jdftx

JDFTx: software for joint density functional theory
http://jdftx.org

The problem of multi-node parallelism #321

Closed — qingyawang closed this 1 month ago

qingyawang commented 2 months ago

Hi Shankar,

I am a beginner with JDFTx. I have successfully installed it and completed the tutorials, but those simple examples run on a single node, so I had not encountered the following problem before running on multiple nodes. When I performed a constant-potential calculation on a Cu surface (nStates=5), I requested 4 nodes from the server, with 28 cores per node. I then specified 4 processes using mpirun -n 4, and 28 threads per process using -c 28 after the jdftx command. However, the output file shows that all 4 processes run on a single node while the other three requested nodes do no work, which makes the computation very inefficient. When I ran the same system on a single 30-core node, specifying 5 processes with 6 threads per process completed the calculation without trouble. Do you have any suggestions for this kind of problem with parallelism across multiple nodes? Attached below are my jdftx.lsf job script, output.out file, input.in file, and CMakeCache.txt file. Looking forward to your suggestions!
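For concreteness, the launch line in the job script is essentially the following (a sketch; the full script is in the attached jdftx.lsf):

```bash
# 4 MPI processes, 28 JDFTx threads per process, as described above
mpirun -n 4 jdftx -c 28 -i input.in -o output.out
```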

Best, Yqwang

Cu.zip

qingyawang commented 2 months ago

Sorry, I forgot to attach an image of the job information, which shows that the job only uses one node.

[screenshot: job information]

shankar1729 commented 2 months ago

Hi Yawang,

I'm not familiar with your job submission system (BSUB); this is likely an issue with how the job file requests resources. My guess is that your -n 112 asks for 112 cores spanning 4 nodes, but your mpirun line then asks for -n 4, which may make the job system think you are only using 4 cores. The -c 28 is passed only to JDFTx, which the job system knows nothing about.

In SLURM, we would have put the -c 28 in an SBATCH line; perhaps BSUB has an equivalent for specifying the number of cores to allocate per process. Specifically, in SLURM we would have asked for -n 4 -c 28 instead of -n 112. You may want to reach out to your cluster's support for help with launching hybrid MPI+threads jobs.
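For example, a minimal SLURM sketch of this hybrid layout would be (BSUB will need its own equivalent directives, so treat these flags as an illustration to confirm with your cluster support):

```bash
#!/bin/bash
#SBATCH -n 4    # 4 MPI processes
#SBATCH -c 28   # 28 cores (threads) per process
srun jdftx -c 28 -i input.in -o output.out
```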

Best, Shankar

qingyawang commented 2 months ago

Thank you for your advice. After talking to the cluster support, I managed to distribute the 4 processes across the four allocated nodes by adding the -machinefile option to the mpirun (Intel MPI) command; the test results are in the output_122_e8.out file. Compared to using 4 cores (-n 4 -c 1), the efficiency is significantly improved, which shows that each process no longer occupies only one core on its node. However, it still does not seem to use the full capacity of 28 cores per node: specifying -n 4 -c 18 on a single 72-core node turned out to be far more efficient than four nodes with 112 cores (output_722_e8.out file); the timings were 7 h on the single 72-core node versus 20 h on four nodes with 112 cores. By also setting the elec-step energy convergence limit to 1e-6, I further reduced the time to 4 h, which looks acceptable (output_722_e6.out file). (All these structural optimizations start from the VASP-optimized structure and run at constant potential.) So, I wonder if you have any suggestions about this performance degradation across nodes. test.zip
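For reference, the working Intel MPI invocation looks roughly like this (a sketch; the hostnames in the machine file are placeholders):

```bash
# hosts.txt lists the four allocated nodes, one hostname per line
mpirun -n 4 -machinefile hosts.txt jdftx -c 28 -i input.in -o output.out
```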

Meanwhile there are three small problems here:

  1. Based on my output file, could you check whether the parameters in my input.in file are set correctly? Should I add any other necessary or meaningful parameters? (Not that I think there is a problem with my current results; I just want to make sure my initial settings are sound. :))

  2. My current structure optimization does not first run a solvation calculation without the potential constraint to obtain the wavefunctions and then read them in for the optimization; instead, it performs the constant-potential structure optimization directly. The LCAO step then appears as shown below; does this have any effect on the results? What is your suggestion: should I follow the vacuum → solvation → constant-potential workflow, or just go straight to the constant-potential calculation?

[screenshot: LCAO step in the output]

  3. I see many JDFTx users applying free-energy corrections (ZPE and TS) from VASP frequency calculations; in principle, we should use frequency calculations at constant potential with JDFTx. Why is this? And if JDFTx is used for frequency calculations of adsorbates on a slab, is the approach the same as in your tutorial for molecules, except that the metal atoms of the substrate are fixed, as in VASP?

I am very sorry to take up your time again. Thank you very much for your help with my work, and I look forward to your guidance!

shankar1729 commented 2 months ago

In terms of thread performance, indeed you are hitting the scaling limit before 28 threads per process. If your cluster nodes have two processors with 14 cores each, I'd recommend using 14 threads per process. Shared-memory parallelization is most effective when all the threads are on the same physical processor, sharing an L3 cache. For the present job, 5 processes with 14 threads each would be the ideal parallelization (assuming you are running dual 14-core CPUs). You can share further details on your cluster architecture for further discussion if you like, e.g., the output of lscpu on the compute nodes.
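That layout would be launched along these lines (a sketch, assuming the dual 14-core CPUs above; the exact binding flags depend on your MPI and scheduler):

```bash
# 5 MPI processes (one per state), 14 threads each, ideally one socket per process
mpirun -n 5 jdftx -c 14 -i input.in -o output.out
```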

  1. Your input seems fine to me.
  2. We spent some effort to make direct GC-DFT work properly, and it seems to be fine in your case. If you see convergence issues, you can fall back to the vacuum → solvated (neutral) → charged workflow.
  3. You can do either for the frequency calculations. Technically, yes, you should compute the ZPE at fixed potential to be consistent, but it typically won't change much in most cases; people usually use VASP for this since it is already in their workflow. And yes, just fix the metal atoms and calculate only the vibrations of the molecule, as in the tutorial, to be efficient.

Best, Shankar

qingyawang commented 2 months ago

Hi Shankar,

Thank you for your advice. I tried to get information about the CPU architecture of the compute nodes in question, but unfortunately I don't have access to them. From what I understand, though, the queue I submit to has 112 cores per node, organized as follows: each compute node has two sockets of 56 cores each, and the 56 cores on each socket are split into two parts of 28 cores, so each node is divided into 4 parts. Jobs in this queue must request an integer multiple of 28 cores; most commonly one requests 4 nodes with 28 cores per node. I'm not sure whether this is the information you wanted. I have also obtained the information for our login node, shown below.

[screenshot: lscpu output for the login node]

Meanwhile, about the free-energy calculation: from your related discussion, my understanding of the frequency calculation is as follows. First, perform the structure optimization to obtain a converged structure; then fix the metal atoms of the optimized structure and run the calculation without reading the previous wavefunctions. Is this correct? What I'm trying to determine is whether the frequency calculation is simply a matter of not turning on the ionic optimization and adding the frequency-related parameters to a single-point calculation. For a constant-potential calculation, adding the command that specifies the potential would also work, correct? I have attached my input file.

input.zip

Thank you very much for your help and looking forward to your suggestions!

Best, Yqwang

shankar1729 commented 2 months ago

I see; in that case, you should just follow your cluster admins' advice on best practices for hybrid MPI+threads jobs. Ask them how best to run the 5-process case, and more generally a number of processes matching nStates that may not be a multiple of 4, 8, etc., as you mentioned.

For the vibration calculation, use the final structure as you describe. You can also read in the wavefunctions, with the caveat that the symmetries must be the same: if your starting structure is already low-symmetry and the symmetry is not reduced by the vibrational displacements, there is no problem; if not, you should not read the wavefunctions, and you should disable symmetries during the vibration calculation.

Just set the movable flag of the ionic positions to 0 for the metal atoms, while leaving it at 1 for the adsorbate atoms, when doing the final vibration calculations. The rest of the considerations are the same as in the vibrations tutorial.
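For example, in the ionpos file this looks like the following (a sketch; species and coordinates are placeholders):

```
# Last column is the moveScale flag: 0 = fixed, 1 = movable
ion Cu  0.000  0.000  0.250  0   # substrate metal atoms: fixed
ion Cu  0.333  0.333  0.375  0
ion C   0.100  0.200  0.550  1   # adsorbate atoms: movable
ion O   0.120  0.250  0.650  1
```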

Best, Shankar

qingyawang commented 2 months ago

Hi Shankar,

Thank you for your advice. I am still having some trouble with the frequency calculations. My system contains 42 atoms in total, of which only 7 remain movable after fixing all the metal atoms. In principle, the frequency calculation should therefore require 3 × 7 × 2 + 1 = 43 structures, but at the moment it computes 3 × 42 × 2 + 1 = 253 structures, which makes the frequency calculation very inefficient. I have attached all the relevant files for this attempt; could you please help me see what the problem is? Also, regarding disabling symmetry: I added the symmetries none command after this attempt and the calculation still shows the problem described above, so this must be a problem with my input file; please correct me.

freq.zip

I am very sorry for taking your time again and hope for your guidance.

Best, Yqwang

shankar1729 commented 2 months ago

You need to add the option useConstraints yes to your vibrations command in order for the move flags to take effect. See http://jdftx.org/CommandVibrations.html.

Best, Shankar

qingyawang commented 2 months ago

Thanks for the correction. Based on your suggestion, I've added symmetries none and useConstraints yes; does the following input file look OK to you?

#---------Pseudopotentials-----------
ion-species GBRV/$ID_pbe.uspp
elec-cutoff 20 100

#---------Electronic-----------------
electronic-minimize \
    nIterations 500 \
    energyDiffThreshold 1e-08

elec-ex-corr gga-PBE

kpoint-folding 3 3 1
kpoint 0.5 0.5 0.5 1.

elec-smearing Fermi 0.01

#------------Solvation---------------
fluid LinearPCM
pcm-variant CANDLE
fluid-solvent H2O
fluid-cation Na+ 0.1
fluid-anion F- 0.1

#------Free energy: Vibrations-------
vibrations \
    dr 0.01 \
    centralDiff yes \
    useConstraints yes \
    translationSym no \
    rotationSym no \
    T 298

symmetries none

#--------Fixed-potential-------------
target-mu -0.16317

#---------Geometry-------------------
include CONTCAR.ionpos
include CONTCAR.lattice

coords-type Lattice
latt-scale 1.88973 1.88973 1.88973

coulomb-interaction Slab 001
coulomb-truncation-embed 0 0 0

#-------------Outputs----------------
dump-name Wang.$VAR

# Output at the end:
dump End State Ecomponents ElecDensity EigStats Vscloc Dtot BoundCharge

Best, Yqwang

shankar1729 commented 2 months ago

Yes, that should be fine. Note that centralDiff doubles the number of calculations needed; turn it off if you want to reduce costs further.
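Concretely, with the 7 movable atoms from your earlier message (the one-sided count below is implied arithmetic, not quoted from the run):

$$N_\mathrm{central} = 2 \times 3 \times 7 + 1 = 43, \qquad N_\mathrm{one\text{-}sided} = 3 \times 7 + 1 = 22.$$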

qingyawang commented 2 months ago

Hi Shankar,

Thanks for your previous help; I have run into a new problem. In my system (adsorption on the Cu(100) surface), when I set elec-smearing Gauss 0.003675, the resulting electronic entropy contribution is about 0.01538 Hartree, i.e. 0.4185 eV, which seems relatively large to me. Do you think such an electronic entropy is acceptable in the calculation? If I want to reduce this contribution, should I switch to Cold or MP1 smearing, and would that cause convergence problems? What settings would you normally suggest?

[screenshot: output showing the electronic entropy]

Best, Yqwang

shankar1729 commented 2 months ago

That's pretty typical for Fermi / Gauss smearing. While the TS term is non-negligible, the difference in TS for reaction / adsorption energies usually works out to be small, so test its effect on a final physical prediction first.

And yes, you can use Cold or MP1 smearing to reduce the entropy effect substantially. They do tend to be slightly slower to converge, as is well documented, but it's rare for them to break convergence completely. In particular, the minimize algorithm should usually be able to cope with these smearings, even if SCF breaks due to their non-positivity.

I personally prefer Fermi / Gauss smearing for their physical meaning and their lack of negative or greater-than-one occupation factors, but MP1 in particular is also a well-tested approach.
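For reference, switching schemes is a one-line change in the input file (shown with the same width you used; whether to keep that width is your call):

```
elec-smearing MP1 0.003675    # or: elec-smearing Cold 0.003675
```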

Best, Shankar

qingyawang commented 1 month ago

Hi Shankar,

Thanks for your previous help. I have encountered the following confusion in energy calculations and would like your guidance. For a protonation reaction in electrochemistry, CO + H+ + e- → COH, I checked the literature for the reaction energy (G) and the energy barrier (G*) and found different ways of calculating them, stemming mainly from differences in the modeling approaches. The following questions therefore come from comparing different modeling approaches for protonation reactions.

A. First, I would like to confirm my understanding of the different energies in the JDFTx output: F is the Helmholtz free energy, representing the total energy of the electronic system, and G = F - μN is the grand free energy. What is not clear to me is the significance of G and why the -μN correction is needed. Is it to automatically account for the energy effect of changes in the number of electrons when we calculate reaction energies?

B. If we treat the source of protons implicitly, we only need to calculate the energies of the initial state (IS) and final state (FS) separately and then compute the reaction energy as FS - IS - H+; I illustrate this treatment in the attached figures. This approach is also used in Nicholas R. Singstock's article, and it is an intuitive correction to the CHE model. Is there a problem with my understanding? [images]

C. If we treat the proton source explicitly, as shown in the attached figure, the proton source is represented by H3O+. I would then include an H3O in my model, and in the grand canonical ensemble the electrons of the system adjust automatically so that the H3O actually behaves as H3O+. In that case, I just calculate IS, TS, and FS separately and take the differences. Is this the correct way to handle it? [image]

D. We can also treat the proton source explicitly, via adsorbed H or via surface-adsorbed H2O.

[image]

In Xiao's article, for the H* mechanism they need to correct the reaction energy to the H+ and e- basis, as shown in the figures below.

[images]

For the H2O mechanism, they argue that the calculated OH behaves as OH- (based on Bader charge analysis), so the reaction energy can be expressed directly as FS - IS. Is this treatment reasonable?

[images]

E. Considering the various treatments above, two points mainly bother me: if we treat the proton source explicitly via H3O+ or H2O*, is it reasonable to express the reaction energy directly as FS - IS? And is such a calculation comparable to the implicit treatment of the proton source and to CHE-model calculations?

These questions may be somewhat cumbersome in their formulation, and I would appreciate your advice.

Best, Yqwang

shankar1729 commented 1 month ago

Hi Yqwang,

I'll answer the pieces I can, and for the reaction modeling details, reach out to the authors of the papers you posted (Nick, Hai).

A. At fixed potential, the grand free energy is the appropriate free energy that is minimized in equilibrium. For the GC-DFT algorithm, this is therefore not just a correction: it is the energy used to define convergence. When you write reaction energies in terms of grand free energies, you do not introduce explicit μN terms based on the number of electrons transferred, since that is now built into the free energy of each configuration. (Note that you can equivalently formulate everything in terms of fixed-charge calculations and Helmholtz free energies; the grand canonical formulation is just more efficient and convenient when you are interested in reactions at a specific potential.)
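In symbols, matching the F and G reported in the output file:

$$G = F - \mu N_\mathrm{elec}, \qquad \Delta G_\mathrm{rxn} = \sum_\mathrm{FS} G \;-\; \sum_\mathrm{IS} G \quad \text{at fixed } \mu,$$

with no explicit $\mu \Delta N$ term, since the electron transfer is already absorbed into each configuration's $G$.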

For the rest, I can't speak to the specific choices the individual authors made; ask them instead. I'll just summarize: it is valid to choose any reference for any of the molecules/ions in the reaction, and the choice is a matter of what is most accurate in DFT. For example, you could use H2 gas at SHE as a reference for H+, or you could directly calculate H3O+ in solution using the solvation model. In most cases the first approach is more accurate, since ion solvation energies are typically inaccurate (though this is not too much of a problem for CANDLE). If DFT and the solvation model were both exact, these choices would give the same answer; the question of which reference is more appropriate comes down to which gives better cancellation of errors at the DFT+solvation level.
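The H2-at-SHE reference mentioned above is the usual computational hydrogen electrode relation (stated here for completeness; not specific to JDFTx):

$$\mu_{\mathrm{H^+}} + \mu_{e^-} = \tfrac{1}{2}\,\mu_{\mathrm{H_2(g)}} - e U_\mathrm{SHE},$$

where $U_\mathrm{SHE}$ is the electrode potential relative to the standard hydrogen electrode.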

In each case, in terms of grand free energies, there is exactly one correct formula for the reaction energy, based on calculating G for each species (including things like H+) and subtracting. You can rewrite this in terms of F and μN pieces if needed, but my recommendation is to work with the Gs for the cleanest picture; e.g., see the UPD example in the 2017 GC-DFT paper.

Best, Shankar