Different results when running the process on 2 CPU cores and 4 CPU cores

mcaroba / turbogap

The TurboGAP code


Different results when running the process on 2 CPU cores and 4 CPU cores #5

Closed by abd-adhyatma 2 years ago

abd-adhyatma commented 2 years ago

Greetings! I've been trying to use the TurboGAP code to do molecular dynamics fueled by the GAP provided in this repository and related paper. I'm running a melting process on a supercomputer, and the code seems to produce different results depending on the number of cores used. The related files can be found here.
The issue seems to stem from the process producing 0 or NaN for forces in the first step. When trying to use more cores, the issue is consistently reproduced. Any and all help with troubleshooting this issue would be greatly appreciated!

mcaroba commented 2 years ago

Sorry, I can't reproduce the issue. I tried running it with mpirun @ 1, 2, 3 and 4 cores, as well as with the serial version of the code. I always get reasonable results.

The only thing I noticed was inconsistent timings and a slowdown, both of which went away after setting export OMP_NUM_THREADS=1. This is probably because the standard LAPACK library tries to use its own threading, which oversubscribes the cores. But this should not lead to the problem you are describing. (I also don't know if I can fix it inside TurboGAP, since it depends on the user's environment and LAPACK installation.)
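
For reference, a minimal sketch of the workaround described above. The rank count and the `turbogap md` invocation are assumptions for illustration; substitute whatever command line you normally use. If either program is missing, the snippet only prints the command it would have run.

```shell
# Limit every MPI rank to a single thread so that a threaded LAPACK/BLAS
# cannot start extra threads on cores already occupied by other ranks.
export OMP_NUM_THREADS=1

NP=4                  # number of MPI ranks (assumption; match your job size)
CMD="turbogap md"     # assumed invocation; replace with your actual command

if command -v mpirun >/dev/null 2>&1 && command -v turbogap >/dev/null 2>&1; then
    # $CMD is left unquoted on purpose so it splits into program and arguments.
    mpirun -np "$NP" $CMD
else
    echo "would run: OMP_NUM_THREADS=$OMP_NUM_THREADS mpirun -np $NP $CMD"
fi
```

On a batch system the export line would normally go in the job script, before the launch command.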

What I can think of is that maybe you are running TurboGAP with a version of gcc different from the one used to compile it. Can you check that? You can also enable the debug flags in the makefile to get more info about where these NaNs are coming from. Let me know if you get more info and I'll look into it again.
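
One possible way to do that check is sketched below. It assumes the code was built with `gfortran` through an `mpif90` wrapper and that a `turbogap` binary is on `PATH`; none of that is confirmed in this thread, and `report_tool` is just a helper name made up for the example. Run it once in the shell used to build and once inside the job environment, then compare the output.

```shell
# Print the first line of "<tool> --version", or a note if the tool is not on PATH.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$1" "$("$1" --version 2>&1 | head -n 1)"
    else
        printf '%s: not found on PATH\n' "$1"
    fi
}

report_tool gfortran
report_tool mpif90

# Show which Fortran runtime, OpenMP, LAPACK and BLAS libraries the binary
# resolves at run time. "turbogap" being on PATH is an assumption.
if command -v turbogap >/dev/null 2>&1 && command -v ldd >/dev/null 2>&1; then
    ldd "$(command -v turbogap)" | grep -Ei 'gfortran|gomp|lapack|blas' || true
fi
```

If the two environments report different compiler versions or resolve different libraries, that mismatch is the first thing to rule out.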

abd-adhyatma commented 2 years ago

Understood. I'll look a bit more into it and I will definitely let you know if I manage to get anything significant. Thank you, Dr. Caro.

abd-adhyatma commented 2 years ago

Hi, TurboGAP team. Something strange has happened. I reinstalled TurboGAP with the intention of enabling debug flags in the makefile. For some reason, when I used this installation, the error seemed to stop occurring! I compared it by running the calculation using 16 cores with an installation that has the debug flags disabled, and you can find the results here.

It's curious. Do you perhaps have a plausible explanation as to what is causing this, and whether or not the errors have completely been removed?

Furthermore, I have just noticed that when running the above calculation with 16 cores, TurboGAP's terminal output says it finished in ~250 seconds, but in real time the run takes considerably longer. Would you also have some insight into this?

Thank you for your attention. Feel free to respond whenever convenient.

mcaroba commented 2 years ago

My prime suspect would be some incompatibility between the compiler suite used to build TurboGAP and the one used to run it. Or perhaps a similar incompatibility with the compiler suite used to build the LAPACK/BLAS libraries.

You can check if doing export OMP_NUM_THREADS=1 before running the calculation leads to more reliable timings.
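
A rough way to see whether a job is oversubscribed is to compare ranks times threads against the available cores, as in the sketch below. The helper name `fits_on_cores` and the value of 16 ranks are illustrative, and treating an unset `OMP_NUM_THREADS` as "one thread per core" is an assumption about how a threaded library might behave, not something established here.

```shell
# Succeed (exit status 0) if ranks * threads fits within the given core count.
fits_on_cores() {
    ranks=$1; threads=$2; cores=$3
    [ $((ranks * threads)) -le "$cores" ]
}

CORES=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
RANKS=16                              # MPI ranks in the job (illustrative)
THREADS=${OMP_NUM_THREADS:-$CORES}    # assumption: unset means up to one thread per core

if fits_on_cores "$RANKS" "$THREADS" "$CORES"; then
    echo "ok: $RANKS ranks x $THREADS threads on $CORES cores"
else
    echo "oversubscribed: $RANKS ranks x $THREADS threads on $CORES cores"
fi
```

Prefixing the launch command with the shell's `time` keyword gives a wall-clock figure to compare against the time the program reports.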

abd-adhyatma commented 2 years ago

I see. I will definitely look deeper into it if possible. Thank you, Dr. Caro.