hengma1001 / CVAE_pilot_MD

Clarification on concurrency (MD) #3

Open jdakka opened 5 years ago

jdakka commented 5 years ago

I would like to know the number of simulations that will execute concurrently. @hengma1001 mentioned 18 concurrent simulations (3 Summit nodes) but we both wanted to verify this.

acadev commented 5 years ago

For the simulations we were doing with fs-peptide, the number of concurrent simulations is 18.

The number of concurrent simulations really depends on the system that we are simulating.

When we take a system that has about 100 initial structural configurations (i.e., a protein with 3D coordinates alone), we can run 60 replicas of each configuration, each with slightly different initial conditions obtained by varying the initial velocities for these coordinates. That gives 6000 simulations on 1000 nodes of Summit, assuming each node runs 6 replicas, with each replica occupying a single GPU.
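
The arithmetic above can be sketched as follows. This is only an illustration: it assumes one simulation per GPU and 6 GPUs per Summit node (consistent with the 18 simulations on 3 nodes mentioned earlier), and the function names are hypothetical, not part of any pipeline code.

```python
import math

# Assumption: 6 GPUs per Summit node, one simulation per GPU
# (this matches the 18 concurrent simulations on 3 nodes quoted in this thread).
GPUS_PER_NODE = 6


def total_simulations(n_structures, replicas_per_structure):
    """Total number of MD simulations: every structure gets the same number of replicas."""
    return n_structures * replicas_per_structure


def nodes_needed(n_simulations, gpus_per_node=GPUS_PER_NODE):
    """Nodes required to run all simulations concurrently, one per GPU."""
    return math.ceil(n_simulations / gpus_per_node)


if __name__ == "__main__":
    n_sims = total_simulations(100, 60)
    print(n_sims)                # 100 structures x 60 replicas
    print(nodes_needed(n_sims))  # nodes for the large run
    print(nodes_needed(18))      # nodes for the fs-peptide run
```

With 100 structures and 60 replicas this gives 6000 simulations on 1000 nodes, and 18 simulations fit on 3 nodes.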

jdakka commented 5 years ago

@acadev Thanks! For RCT internal: this means that on the MD pipeline, we are looking at 6000 concurrent OpenMM executables on Summit. @mturilli will shed more light on this, but at the moment my understanding is that RCT on Summit can only support ~300 concurrent executables.

mturilli commented 5 years ago

Currently, we have a limit of 300 concurrent executables. This depends on the availability of jsrun on the work nodes. We opened a ticket with ORNL and are testing the offered solution. If successful, we will be able to run more than 300 concurrent tasks. The progress of this ticket can be followed here: https://github.com/radical-cybertools/radical.pilot/wiki/summit-jsrun#ccs-398324-jsrun-limits-pid-limits-on-batch-nodes
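
To put the limit in perspective, here is a rough sketch of how many sequential batches 6000 tasks would need under a cap of 300 concurrent executables. It assumes, purely for illustration, that tasks run in full waves of equal duration; this thread does not say how RCT would actually schedule them, and `waves_needed` is a hypothetical helper.

```python
import math


def waves_needed(n_tasks, max_concurrent):
    """Number of sequential waves if at most `max_concurrent` tasks run at once.

    Assumption: tasks are launched in waves and each wave finishes before
    the next one starts (a simplification, not RCT's actual scheduler).
    """
    if max_concurrent <= 0:
        raise ValueError("max_concurrent must be positive")
    return math.ceil(n_tasks / max_concurrent)


if __name__ == "__main__":
    print(waves_needed(6000, 300))  # waves under the current limit
```

Under these assumptions, 6000 simulations at 300 concurrent executables would take 20 waves instead of a single concurrent run.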

acadev commented 5 years ago

I saw that the issue with this is more or less "solved" -- is that true?

mturilli commented 5 years ago

@acadev, it is. We tested jsrun on the nodes and it seems to work correctly. We are now ready to start some tests and progressively scale them up.