tdep-developers / tdep

The Temperature Dependent Effective Potentials (TDEP) code
MIT License

Error: Message size bigger than supported by PSM2 API in MPI_Allreduce #23

Closed: NinaStrasser closed this issue 1 year ago

NinaStrasser commented 1 year ago

I encountered an error while running the following command: mpirun extract_forceconstants -rc2 7.0. The error message indicates that the message size is larger than what the PSM2 API can handle in an MPI_Allreduce operation. This results in MPI processes aborting, which cancels my job running on a cluster. I am trying to converge the cutoff value for the second order force constants for a system that has 328 atoms in the unit cell using 309 configurations from an MD trajectory in VASP. The calculations worked fine for the cutoffs of 5 and 6 Angstroms.

Screenshot: PSM2_API_ERROR

Can you provide any recommendations for resolving this error when dealing with larger cutoff values?

flokno commented 1 year ago

Hi @NinaStrasser , can you add the full logfile?

I can only guess that you're trying to fit a gazillion force constants when your unit cell has 328 atoms and your cutoff is 7AA.

Edit: I overlooked that your calculation worked with 5 and 6 A. How many force constants do you have in each case?

flokno commented 1 year ago

One thing you can try is to use -nj2 instead of -rc2. If that helps you in any way, we can talk about what that means. You will need larger numbers; always check the number of force constants you get.

NinaStrasser commented 1 year ago

I have 1118 force constants for 5AA and 1916 force constants for 6AA

NinaStrasser commented 1 year ago

Thanks for the prompt response, Florian! I have submitted the calculation with the -nj2 setting. I will provide you with the results tomorrow since the calculation is expected to take several hours as it took approximately 4 hours for a calculation involving the 5AA cutoff.

NinaStrasser commented 1 year ago

Hi @flokno, I have now performed the calculations with different -nj2 settings; however, this did not resolve the issue.

Attachment: Extract_FCs_nj2

The same error is reproduced when I use -nj2 200, which results in 3489 second order force constants. I have also attached the log file.

Attachment: fc2.log

Do you have any other suggestions?

flokno commented 1 year ago

Hi @NinaStrasser, the actual number of force constants is the second number, i.e., 30750 in this output file. That is really quite a lot, much larger than what TDEP is designed for, so to be honest I am surprised that you can push it this far in the first place. Also, you would need way more samples to fit this properly: when you look for REPORT GRADE OF OVERDETERMINATION, that is just ~10. This should rather be in the hundreds.
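The ~10 quoted above can be roughly reproduced from the numbers given in this thread. The sketch below assumes the overdetermination ratio is simply the number of force components (3 per atom per configuration) divided by the number of force constants; that definition is an assumption for illustration and is not taken from TDEP itself, and the function names are hypothetical.

```python
import math


def overdetermination(n_atoms, n_configs, n_force_constants):
    """Equations per unknown, ASSUMING one equation per Cartesian force component."""
    n_equations = 3 * n_atoms * n_configs
    return n_equations / n_force_constants


def configs_needed(n_atoms, n_force_constants, target_ratio):
    """Smallest number of configurations that reaches target_ratio (same assumption)."""
    return math.ceil(target_ratio * n_force_constants / (3 * n_atoms))


# Numbers from this thread: 328 atoms, 309 configurations, 30750 force constants.
ratio = overdetermination(328, 309, 30750)
print(f"overdetermination ~ {ratio:.1f}")

# Configurations needed for a ratio of 100 at the same number of force constants.
print(configs_needed(328, 30750, 100))
```

Under this assumption, getting from ~10 to 100 at a fixed number of force constants means roughly ten times as many configurations, which is why reducing the number of force constants is the more practical lever.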

May I ask what structure you are looking at? Why do you have 328 atoms in the unit cell? This is larger than I have seen people using for amorphous systems.

NinaStrasser commented 1 year ago

Thanks for your explanation! Our group is mainly interested in metal-organic frameworks that often consist of hundreds of atoms in the primitive unit cell. However, in this particular case I wanted to simulate a rather complex Zn coordination polymer.

flokno commented 1 year ago

Can you update the table with the actual number of force constants? The first number is the number of irreducible pairs in the structure, the second number is the number of force constants.

Also, can you add a similar table when using the realspace cutoff rc2 in say 0.5A steps from 3 to whatever you manage to compute?
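A scan like this can be scripted. The following is a minimal sketch that only builds and prints the command lines; it reuses the mpirun extract_forceconstants -rc2 invocation quoted at the top of this issue. The upper bound of 7.0 and the helper names are assumptions for illustration, and each run would still need to be submitted from a directory containing the usual input files.

```python
def cutoff_values(start, stop, step):
    """Evenly spaced cutoffs from start to stop inclusive, avoiding float drift."""
    n_steps = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def build_commands(cutoffs):
    """One argument list per cutoff, suitable for subprocess.run or a job script."""
    return [
        ["mpirun", "extract_forceconstants", "-rc2", f"{rc:.1f}"]
        for rc in cutoffs
    ]


# 3.0 to 7.0 in 0.5 steps; the upper bound is an assumed placeholder.
commands = build_commands(cutoff_values(3.0, 7.0, 0.5))
for cmd in commands:
    print(" ".join(cmd))
```

Recording the number of irreducible pairs and the number of force constants reported by each run then gives the requested table directly.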

I think the pragmatic solution will be to use the smallest possible cutoff that still represents your system sufficiently well and that you can handle numerically.
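To get a feel for what "handle numerically" means, here is a rough size estimate. This thread does not say which buffer triggers the MPI_Allreduce error, so the sketch below assumes, purely hypothetically, a dense square matrix of double-precision values with one row and one column per force constant. The function name and that assumption are illustrative only.

```python
def dense_square_bytes(n_unknowns, bytes_per_value=8):
    """Bytes in a dense n x n matrix (HYPOTHETICAL buffer, 8-byte doubles by default)."""
    return n_unknowns * n_unknowns * bytes_per_value


# 30750 force constants is the number reported in this thread for the -nj2 run.
size = dense_square_bytes(30750)
print(f"{size} bytes = {size / 2**30:.2f} GiB")

# The size grows quadratically: half the force constants means a quarter of the bytes.
print(dense_square_bytes(30750 // 2) / size)
```

Whatever the actual buffer is, any quantity that scales with the square of the number of force constants shrinks quickly as the cutoff is reduced.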

NinaStrasser commented 1 year ago

Hi @flokno! I have now updated the table with the actual numbers of force constants.

Attachment: TDEP_Table_1

Here is the second table including the convergence tests of the realspace cutoff. I also included the overdetermination ratio.

Attachment: TDEP_Table_2

Computations with cutoffs of 6 AA and 6.5 AA did not produce the MPI error message, but they also did not finish within almost 24 h.