plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Very slow reading of long hills file when using MetaD with multiple walkers #437

Open carlocamilloni opened 5 years ago

carlocamilloni commented 5 years ago

I think that in the case of MPI walkers it would be worth letting only one walker read the hills and then communicating either just the grid or all the hills after reading.
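A minimal sketch of such a scheme, assuming the walkers share an MPI communicator; the file name, the flat record layout, and the parsing loop below are placeholders, not PLUMED's actual internals:

```cpp
#include <mpi.h>
#include <fstream>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> hills;   // flattened (center, sigma, height, ...) records
  int n = 0;

  if (rank == 0) {
    // Only one walker touches the file system and parses the text.
    std::ifstream in("HILLS");               // placeholder file name
    for (double v; in >> v; ) hills.push_back(v);
    n = static_cast<int>(hills.size());
  }

  // The parsed data is then shipped to the other walkers in one collective,
  // instead of every rank re-reading and re-parsing the whole file.
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (rank != 0) hills.resize(n);
  MPI_Bcast(hills.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  // ... each walker now adds the received hills to its local grid ...

  MPI_Finalize();
  return 0;
}
```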

GiovanniBussi commented 5 years ago

Can you check whether the time is (mostly) spent on reading or on gridding?
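Not PLUMED's own timers, but a toy illustration of how the two phases can be bracketed separately with a steady clock; the parsing and the accumulation below are synthetic stand-ins for the real routines:

```cpp
#include <chrono>
#include <cstdio>
#include <sstream>
#include <vector>

int main() {
  using clock = std::chrono::steady_clock;

  std::ostringstream fake;                  // synthetic "hills file"
  for (int i = 0; i < 1000000; ++i) fake << i * 0.001 << "\n";

  const auto t0 = clock::now();

  std::istringstream in(fake.str());        // phase 1: parse the records
  std::vector<double> values;
  for (double v; in >> v; ) values.push_back(v);

  const auto t1 = clock::now();

  double sum = 0.0;                         // phase 2: stand-in for gridding
  for (double v : values) sum += v;

  const auto t2 = clock::now();

  std::printf("read: %.3f s  grid: %.3f s  (checksum %g)\n",
              std::chrono::duration<double>(t1 - t0).count(),
              std::chrono::duration<double>(t2 - t1).count(), sum);
  return 0;
}
```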

dvdesolve commented 5 years ago

I think we're experiencing a similar problem, but with the plumed driver and the sum_hills operation. I believe the code for reading and summing up Gaussians is the same for the driver and the kernel library.

We're trying to sum up about 3 million Gaussian kernels. Performing this task with a custom grid on our desktop PC takes about 30 minutes. If we try to do this on our cluster, the whole process can take as long as 2 days (there may be problems with the cluster too, but keep reading).

We tried to allocate as many MPI processes for this as we could (one node on our cluster has 64 GB of RAM and an Intel Xeon E5-2697 v3 CPU with 14 cores), but the whole task simply fails if we allocate too many processes (e.g., even 14 MPI processes gives us a crash with code 9, killed). If we allocate only 4 processes, we are able to perform the summation. It seems that all the RAM is exhausted during the summation.
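One possible reason (an assumption, not something measured here): if every MPI rank keeps its own copy of the parsed hills and of the summation grid, the per-node memory footprint grows linearly with the number of ranks per node. A back-of-the-envelope sketch with placeholder sizes:

```cpp
#include <cstdio>

int main() {
  // All sizes below are illustrative placeholders, not measured values.
  const double n_hills        = 3.0e6;    // ~3 million Gaussians (from the report)
  const double bytes_per_hill = 6 * 8.0;  // assume 6 doubles per hill record
  const double grid_points    = 5.0e8;    // hypothetical multi-dimensional grid
  const double bytes_per_pt   = 8.0;      // one double per grid point

  for (int ranks : {1, 4, 14}) {
    const double per_node =
        ranks * (n_hills * bytes_per_hill + grid_points * bytes_per_pt);
    std::printf("%2d ranks/node -> ~%.1f GB\n", ranks, per_node / 1e9);
  }
  return 0;
}
```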

I/O timing does not seem to be the bottleneck of the summation. For example, if we don't provide boundaries for our HILLS.* files, we can observe the time needed for reading the Gaussians in the first stage (boundary detection) of the plumed driver run; this doesn't differ much from our desktop PCs. Then a new stage begins, and so does the summation, and this stage is much slower than just reading the whole file.
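That would match the expected scaling: reading is roughly linear in the number of hills, while the summation touches, for every hill, all grid points within the Gaussian cutoff, so its cost grows like n_hills × points_per_hill. A one-dimensional toy version of that inner loop (sizes and names are made up, and the real grid is usually multi-dimensional):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Illustrative sizes only.
  const int    n_hills = 10000;
  const int    n_grid  = 10000;
  const double sigma = 0.01, height = 1.0;
  const double xmin = 0.0, dx = 1.0 / n_grid;

  std::vector<double> grid(n_grid, 0.0);

  for (int h = 0; h < n_hills; ++h) {
    const double center = (h % n_grid) * dx;   // fake hill centers
    // Each hill touches every grid point within its cutoff (here +/- 4 sigma),
    // so the total cost scales as n_hills * points_per_hill.
    const int lo = std::max(0, static_cast<int>((center - 4 * sigma - xmin) / dx));
    const int hi = std::min(n_grid - 1, static_cast<int>((center + 4 * sigma - xmin) / dx));
    for (int i = lo; i <= hi; ++i) {
      const double d = (xmin + i * dx - center) / sigma;
      grid[i] += height * std::exp(-0.5 * d * d);
    }
  }

  std::printf("grid[n_grid/2] = %g\n", grid[n_grid / 2]);
  return 0;
}
```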

Some details: we have 24 files with 120k-160k Gaussian kernels each. On the cluster, plumed is compiled with the Intel 15.0.3 compilers and OpenMPI 1.8.4 (MXM version). On our PCs, plumed is built with GCC 6 and OpenMPI 3.1.3.