plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Very slow simulation with gromacs 2021 + plumed2 2.7.2 #741

Open dipietrantonio opened 3 years ago

dipietrantonio commented 3 years ago

Dear plumed developers,

I am a Supercomputing Applications Specialist at Pawsey Supercomputing Centre. I have patched gromacs/2021 with plumed2/2.7.2 and run a test case provided by a user. Unfortunately, the simulation runs very slowly:

               Core t (s)   Wall t (s)        (%)
       Time:   101069.194     6737.963     1500.0
                         1h52:17
                 (ns/day)    (hour/ns)
Performance:        1.139       21.077

whereas, following the same build process, I have built gromacs/2020.3 patched with plumed 2.6.1 and what I get is

               Core t (s)   Wall t (s)        (%)
       Time:     9351.670      623.445     1500.0
                 (ns/day)    (hour/ns)
Performance:      692.924        0.035

Without plumed, the simulation with gromacs 2021 runs much faster, which makes me think it is a problem with plumed.

Best regards, Cristian Di Pietrantonio

GiovanniBussi commented 3 years ago

@dipietrantonio it is very difficult to judge why this is happening without knowing more about the simulation.

Giovanni

dipietrantonio commented 3 years ago

@GiovanniBussi you are right, I have been overly vague. I was waiting for the user to provide me authorization to share the test case. You can find it in the attached archive, which should provide answers to the domain-specific questions you have (I am not familiar with the domain, unfortunately).

The computing system is a GPU cluster (named Topaz) running CentOS 7.6, with 2 V100 GPUs and 2 Intel “Cascade Lake” Silver 4215 2.5 GHz 8-core processors per node. I run the experiment using 1 GPU and 15 CPU cores (as you will see in the submission script; a rough sketch is below). Let me know if you need more information.
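For reference, the run configuration amounts to something like the following. This is only a generic sketch, not the actual submission script in the archive; the partition and module names are placeholders.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=15
#SBATCH --gres=gpu:1
#SBATCH --partition=gpuq             # placeholder partition name

module load gromacs/2021-plumed      # placeholder module name
srun gmx mdrun -ntomp 15 -plumed metad.dat -deffnm md_carb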

Best regards, Cristian

test_case.tar.gz

GiovanniBussi commented 3 years ago

I checked the input and log file. The system is small (~2600 atoms), and it probably also suffers from the fact that GMX 2021 can run everything on the GPU when plumed is not used.

Anyway, for such a simple input file the cost of the plumed part seems a bit too large to me.

Could you share a log file obtained by running the simulation after adding the following line at the beginning of the metad.dat file?

DEBUG DETAILED_TIMERS 

The final part of the log file should show the cost associated with each of the plumed input lines.
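With that line prepended, the metad.dat file would start roughly like this. This is only a sketch: the atom indices and METAD parameters below are placeholders, not the actual input; the labels just mirror the torsions and METAD actions defined in it.

DEBUG DETAILED_TIMERS                  # print per-action timers at the end of the run
phi1: TORSION ATOMS=5,7,9,15           # placeholder atom indices
psi1: TORSION ATOMS=7,9,15,17
# ... phi2, psi2 and metad2 defined similarly ...
metad1: METAD ARG=phi1,psi1 PACE=500 HEIGHT=1.2 SIGMA=0.35,0.35 FILE=HILLS1
PRINT ARG=phi1,psi1 STRIDE=500 FILE=COLVAR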

dipietrantonio commented 3 years ago

Hi Giovanni, here is the log file, md_carb.log

GiovanniBussi commented 3 years ago

Thanks, I paste the relevant part below for reference.

I don't see anything weird (like a very high time spent in METAD, which could perhaps have been optimized with some setting). Even the cost of a single torsion (phi1) is a significant fraction of the total time. I conclude that the simulated system is just too small to be simulated efficiently with GROMACS + PLUMED. So I cannot suggest any solution, besides noticing that this should not happen for systems with a larger number of atoms (which spend more time in the calculation of the GROMACS forces).


PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1 80845.967654 80845.967654 80845.967654 80845.967654
PLUMED: 1 Prepare dependencies                        585442     0.615870     0.000001     0.000000     0.000590
PLUMED: 2 Sharing data                                585442  8145.403068     0.013913     0.000032     0.127919
PLUMED: 3 Waiting for data                            585442     1.476346     0.000003     0.000001     0.005705
PLUMED: 4 Calculating (forward loop)                  585442 62067.920605     0.106019     0.003150     0.205252
PLUMED: 4A 0 @0                                       585442     0.349757     0.000001     0.000000     0.000794
PLUMED: 4A 1 @1                                       585442     1.981848     0.000003     0.000002     0.000986
PLUMED: 4A 2 phi1                                     585442  8522.777175     0.014558     0.000029     0.079167
PLUMED: 4A 3 psi1                                     585442 10786.286090     0.018424     0.000006     0.111998
PLUMED: 4A 4 phi2                                     585442 10641.521903     0.018177     0.000006     0.065557
PLUMED: 4A 5 psi2                                     585442 10731.279967     0.018330     0.000006     0.065554
PLUMED: 4A 6 metad1                                   585442 10756.873912     0.018374     0.000008     0.078734
PLUMED: 4A 7 metad2                                   585442 10606.596536     0.018117     0.000007     0.060864
PLUMED: 4A 8 @8                                         2342     0.028054     0.000012     0.000000     0.009357
PLUMED: 5 Applying (backward loop)                    585442 10564.210273     0.018045     0.000018     0.059477
PLUMED: 5A 0 @8                                         2342     0.009771     0.000004     0.000000     0.004683
PLUMED: 5A 1 metad2                                   585442     1.877526     0.000003     0.000000     0.014044
PLUMED: 5A 2 metad1                                   585442     1.218530     0.000002     0.000000     0.014046
PLUMED: 5A 3 psi2                                     585442    12.586233     0.000021     0.000002     0.018732
PLUMED: 5A 4 phi2                                     585442     2.990273     0.000005     0.000001     0.014047
PLUMED: 5A 5 psi1                                     585442     1.749104     0.000003     0.000001     0.014044
PLUMED: 5A 6 phi1                                     585442     1.127604     0.000002     0.000001     0.009375
PLUMED: 5A 7 @1                                       585442     0.157751     0.000000     0.000000     0.008696
PLUMED: 5A 8 @0                                       585442     0.154341     0.000000     0.000000     0.009359
PLUMED: 5B Update forces                              585442 10530.212192     0.017987     0.000007     0.059465
PLUMED: 6 Update                                      585442    44.106911     0.000075     0.000003     6.582783
dipietrantonio commented 3 years ago

Hi Giovanni, running the same simulation with gromacs/2020.3 + plumed/2.6.1 gives the following result:


On 1 MPI rank, each using 15 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   15      25001       8.775        328.261   1.3
 Launch GPU ops.        1   15    2500001     202.213       7564.559  31.1
 Force                  1   15    2500001     228.468       8546.737  35.1
 Wait PME GPU gather    1   15    2500001       3.912        146.326   0.6
 Reduce GPU PME F       1   15    2500001      11.026        412.461   1.7
 NB X/F buffer ops.     1   15    4975001      34.329       1284.203   5.3
 Write traj.            1   15        501       0.860         32.189   0.1
 Update                 1   15    5000002      52.287       1955.994   8.0
 Constraints            1   15    5000002      77.111       2884.637  11.8
 Rest                                          32.044       1198.740   4.9
-----------------------------------------------------------------------------
 Total                                        651.024      24354.107 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     9765.352      651.024     1500.0
                 (ns/day)    (hour/ns)
Performance:      663.570        0.036
Finished mdrun on rank 0 Tue Sep 21 10:28:39 2021

PLUMED:                                               Cycles        Total      Average      Minumum      Maximum
PLUMED:                                                    1   224.058133   224.058133   224.058133   224.058133
PLUMED: 1 Prepare dependencies                       2500001     1.469931     0.000001     0.000000     0.001052
PLUMED: 2 Sharing data                               2500001    25.894379     0.000010     0.000008     0.013028
PLUMED: 3 Waiting for data                           2500001     3.716172     0.000001     0.000001     0.001440
PLUMED: 4 Calculating (forward loop)                 2500001    98.568239     0.000039     0.000031     0.038046
PLUMED: 4A 0 @0                                      2500001     0.861788     0.000000     0.000000     0.000220
PLUMED: 4A 1 @1                                      2500001     7.074460     0.000003     0.000002     0.000557
PLUMED: 4A 2 phi1                                    2500001    13.335195     0.000005     0.000004     0.022992
PLUMED: 4A 3 psi1                                    2500001    11.686390     0.000005     0.000003     0.017996
PLUMED: 4A 4 phi2                                    2500001    11.629067     0.000005     0.000003     0.020036
PLUMED: 4A 5 psi2                                    2500001    11.834101     0.000005     0.000003     0.001336
PLUMED: 4A 6 metad1                                  2500001    14.236606     0.000006     0.000004     0.022017
PLUMED: 4A 7 metad2                                  2500001    13.023018     0.000005     0.000004     0.038014
PLUMED: 4A 8 @8                                        10001     0.004306     0.000000     0.000000     0.000005
PLUMED: 5 Applying (backward loop)                   2500001    42.420516     0.000017     0.000014     0.003928
PLUMED: 5A 0 @8                                        10001     0.001630     0.000000     0.000000     0.000000
PLUMED: 5A 1 metad2                                  2500001     1.141783     0.000000     0.000000     0.000207
PLUMED: 5A 2 metad1                                  2500001     0.791034     0.000000     0.000000     0.002841
PLUMED: 5A 3 psi2                                    2500001     4.993111     0.000002     0.000002     0.000593
PLUMED: 5A 4 phi2                                    2500001     2.104832     0.000001     0.000001     0.000188
PLUMED: 5A 5 psi1                                    2500001     2.017699     0.000001     0.000001     0.000202
PLUMED: 5A 6 phi1                                    2500001     1.878040     0.000001     0.000001     0.000239
PLUMED: 5A 7 @1                                      2500001     0.382523     0.000000     0.000000     0.000158
PLUMED: 5A 8 @0                                      2500001     0.318751     0.000000     0.000000     0.000146
PLUMED: 5B Update forces                             2500001    16.413717     0.000007     0.000004     0.003916
PLUMED: 6 Update                                     2500001    35.470910     0.000014     0.000003     2.715419

md_carb.log

This makes me think it is a software issue. The compilation process I followed is the same for both packages; you can get an idea by looking at https://github.com/PawseySC/maali-cygnet/blob/master/centos7.6/gromacs.cyg, but it is essentially what is suggested on the gromacs website (see the rough sketch below).
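In outline, the recipe is: patch the GROMACS source with plumed patch, then do a standard cmake build. The exact flags are in the linked script; the ones below are only indicative, and the install prefix is a placeholder.

# patch the GROMACS source with the PLUMED version to be linked
cd gromacs-2021
plumed patch -p -e gromacs-2021
# standard GROMACS build (for the 2020.3 build, -DGMX_GPU=ON instead of CUDA)
mkdir build && cd build
cmake .. -DGMX_GPU=CUDA -DGMX_OPENMP=ON -DCMAKE_INSTALL_PREFIX=/opt/gromacs/2021-plumed
make -j 16 && make install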

GiovanniBussi commented 3 years ago

@dipietrantonio thanks for the detailed report. You are right, there is a problem. Actually, thinking back on it, it is just impossible that the calculation of a single torsional angle takes 0.014558 seconds on average (as in the first example). The second example makes much more sense (0.000005 s).

The only thing I can suggest is to find out whether the problem is the GMX 2020 -> 2021 upgrade or the PLUMED 2.6 -> 2.7 upgrade. Actually, plumed patches are transferable across versions, so you could just copy the files patches/gromacs-2021.config and patches/gromacs-2021.diff into PLUMED 2.6 to enable 2021 + 2.6 (see the sketch below).
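Concretely, something along these lines should work. The source paths below are placeholders, and plumed patch must be the one coming from the PLUMED 2.6 installation.

# copy the GROMACS 2021 patch files from the PLUMED 2.7 sources into PLUMED 2.6
cp /path/to/plumed-2.7.2/patches/gromacs-2021.config /path/to/plumed-2.6.1/patches/
cp /path/to/plumed-2.7.2/patches/gromacs-2021.diff   /path/to/plumed-2.6.1/patches/
# reinstall PLUMED 2.6 (or copy the files into its installed patches/ directory),
# then patch GROMACS 2021 with PLUMED 2.6 in the usual way
cd gromacs-2021
plumed patch -p -e gromacs-2021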

If the problem is in the PLUMED upgrade (i.e., with the same GMX, PLUMED 2.6 is working and PLUMED 2.7 is not), then it is definitely to be solved on our side. I would be surprised, though, as nobody has reported this type of issue.

If the problem is with the GROMACS upgrade, then it is difficult to understand what's happening. Perhaps GROMACS 2021 is doing something to the environment that significantly slows down PLUMED?

In short, this seems a relevant problem and we would be happy to learn more and help, but I am not sure we will be able to.

GiovanniBussi commented 3 years ago

Also notice that if you patch with -m runtime (instead of -m shared) you will be able to pick the PLUMED version at runtime by setting the env var PLUMED_KERNEL. It can make these types of tests faster: you compile GROMACS against PLUMED 2.7 only and then switch back to PLUMED 2.6 at runtime, as sketched below.
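For example (the kernel path below is a placeholder; it depends on where each PLUMED version is installed):

# patch GROMACS once in runtime mode, then build and install it as usual
cd gromacs-2021
plumed patch -p -e gromacs-2021 -m runtime

# at run time, choose which PLUMED library gets loaded
export PLUMED_KERNEL=/opt/plumed/2.6.1/lib/libplumedKernel.so
gmx mdrun -plumed metad.dat    # plus the other mdrun options from the job script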

dipietrantonio commented 3 years ago

Hi Giovanni, I have compiled gromacs/2021 with plumed 2.6.1, following your instructions. It results in a slow simulation too, indicating that there is something in plumed that does not work well with newer versions of gromacs. Here is the debug info:


               Core t (s)   Wall t (s)        (%)
       Time:    97165.044     6477.692     1500.0
                         1h47:57
                 (ns/day)    (hour/ns)
Performance:        1.134       21.168
Finished mdrun on rank 0 Thu Sep 23 15:06:06 2021

PLUMED:                                               Cycles        Total      Average      Minumum      Maximum
PLUMED:                                                    1  6095.349020  6095.349020  6095.349020  6095.349020
PLUMED: 1 Prepare dependencies                         42501     0.036179     0.000001     0.000000     0.000708
PLUMED: 2 Sharing data                                 42501   620.260253     0.014594     0.000034     0.046939
PLUMED: 3 Waiting for data                             42501     0.102702     0.000002     0.000001     0.001097
PLUMED: 4 Calculating (forward loop)                   42501  4671.063844     0.109905     0.054999     0.209329
PLUMED: 4A 0 @0                                        42501     0.025230     0.000001     0.000000     0.000611
PLUMED: 4A 1 @1                                        42501     0.137245     0.000003     0.000002     0.001743
PLUMED: 4A 2 phi1                                      42501   647.809034     0.015242     0.004417     0.047539
PLUMED: 4A 3 psi1                                      42501   810.651214     0.019074     0.004679     0.056561
PLUMED: 4A 4 phi2                                      42501   795.250604     0.018711     0.001431     0.056198
PLUMED: 4A 5 psi2                                      42501   805.875933     0.018961     0.000010     0.060865
PLUMED: 4A 6 metad1                                    42501   812.013233     0.019106     0.000010     0.055463
PLUMED: 4A 7 metad2                                    42501   797.998516     0.018776     0.000008     0.054968
PLUMED: 4A 8 @8                                          171     0.009526     0.000056     0.000000     0.009414
PLUMED: 5 Applying (backward loop)                     42501   800.519477     0.018835     0.000020     0.051540
PLUMED: 5A 0 @8                                          171     0.000033     0.000000     0.000000     0.000003
PLUMED: 5A 1 metad2                                    42501     0.096811     0.000002     0.000000     0.009356
PLUMED: 5A 2 metad1                                    42501     0.053702     0.000001     0.000000     0.009048
PLUMED: 5A 3 psi2                                      42501     0.537368     0.000013     0.000002     0.014036
PLUMED: 5A 4 phi2                                      42501     0.155386     0.000004     0.000001     0.009361
PLUMED: 5A 5 psi1                                      42501     0.076498     0.000002     0.000001     0.004693
PLUMED: 5A 6 phi1                                      42501     0.078808     0.000002     0.000001     0.009359
PLUMED: 5A 7 @1                                        42501     0.013363     0.000000     0.000000     0.004678
PLUMED: 5A 8 @0                                        42501     0.008027     0.000000     0.000000     0.000168
PLUMED: 5B Update forces                               42501   798.837358     0.018796     0.000007     0.051528
PLUMED: 6 Update                                       42501     2.121210     0.000050     0.000003     0.201005

and I attach the complete log for reference.

So, what would be the next step?

md_carb.log

dipietrantonio commented 3 years ago

Hi Giovanni, were you able to reproduce the issue?

Best regards, Cristian Di Pietrantonio

btodac commented 2 years ago

I am seeing a similar slow down when using GROMACS-2021-EasyBuild-4.5.0-PLUMED-2.7.2 (I think it is GROMACS 2021.3). A simulation running without plumed is about 10 times faster when using CPUs only. Yours, Mat

UPDATE: I also compiled Gromacs 2021.5 with plumed 2.6.5 (using the Gromacs 2021.4 patch files from plumed 2.7.3) and saw the same slowdown as before. For reference, my system contains 270000 atoms. This is on an AMD EPYC Rome system using MPI parallelisation. I have tried different versions of GROMACS and Plumed and using GNU and Intel compilers and associated libraries (Scalapack/MKL, openMPI etc)