hengma1001 / CVAE_pilot_MD


Clarification on concurrency (CVAE) #2

Open jdakka opened 5 years ago

jdakka commented 5 years ago

In the email it was indicated that 10-12 CVAE models would be running concurrently. I interpreted this as one hyperparameter optimization (optimizing only the latent dimension) that requires 10-12 CVAE replicas, where each replica differs only in its latent-dimension configuration. I ask because earlier we discussed having multiple hyperparameter configurations (not only the latent dimension but also the convolutional layers, etc.), and I remember the relationship between CVAE models, i.e., HyperSpaces, being 2**H, where H = # of hyperparameters.

acadev commented 5 years ago

I think for the purposes of our paper it is best we keep our understanding as simple as possible. I think the confusion stems from the discussion of what it means to have hyperparameters to be optimized.

According to our current implementation of HyperSpace, if I have 5 hyperparameters, I need about 2^5 == 32 replicas of the CVAE to be trained. Each of these 5 hyperparameters may span a different range of values. When @yngtodd implemented HyperSpace, he was dividing the search space into 32 subspaces, each run as its own replica.

When we are running our simulations coupled to the CVAE, there are two options: (1) run the CVAE with HyperSpace, treating the latent dimension as an additional hyperparameter --> we have 2^6 == 64 models to train simultaneously; or (2) run the CVAE with 5 hyperparameters (32 initial CVAE trainings) and then replicate the CVAE only 10-12 times depending on the latent dimensions.

I think option (1) makes more sense and will also be relevant for larger workloads, not just the cluster. Hope this clarification helps. Now the true challenge will be: let's say I am running 1000s of simulations (which is what we can do with Summit), how can we stream this information for the initial training across multiple Summit nodes?
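For concreteness, here is a minimal sketch (not HyperSpace's actual code) of how the 2^H subspaces would map onto MPI ranks, one CVAE optimization per rank. The hyperparameter names and bounds below are placeholders, not our real search space.

```python
# Illustrative only: distribute 2^H hyperparameter subspaces over MPI ranks,
# one Bayesian-optimization run (and one CVAE training loop) per rank.
# The hyperparameter names/bounds are placeholders, not the real search space.
from itertools import product
from mpi4py import MPI

# Full (low, high) range per hyperparameter; H = 6 here (latent dim included).
bounds = {
    "latent_dim":    (3, 30),
    "conv_filters":  (16, 128),
    "conv_layers":   (2, 6),
    "kernel_size":   (3, 7),
    "dense_units":   (32, 512),
    "learning_rate": (1e-5, 1e-2),
}

def split(lo, hi):
    """Split one range into a lower and an upper half (HyperSpace-style)."""
    mid = (lo + hi) / 2
    return [(lo, mid), (mid, hi)]

# Cartesian product of the half-ranges -> 2^H subspaces (64 for H = 6).
subspaces = list(product(*(split(lo, hi) for lo, hi in bounds.values())))

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == len(subspaces), "launch one MPI rank per subspace"

my_subspace = dict(zip(bounds, subspaces[rank]))
print(f"rank {rank}: optimizing CVAE over {my_subspace}")
# ... run the Bayesian optimization / CVAE training restricted to my_subspace ...
```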

jdakka commented 5 years ago

(1) run the CVAE with HyperSpace, treating the latent dimension as an additional hyperparameter --> we have 2^6 == 64 models to train simultaneously

@acadev thanks for clarifying. Will HyperSpace consist of a single executable that spawns multiple (i.e., 64) MPI jobs, similar to what @yngtodd implemented with mpi4py? Also, does each MPI job run on a single GPU?

Now the true challenge will be: let's say I am running 1000s of simulations (which is what we can do with Summit), how can we stream this information for the initial training across multiple Summit nodes?

@acadev, I'll let @mturilli answer this.

acadev commented 5 years ago

@jdakka -- great first question: I think it makes sense to have HyperSpace run as a single executable -- 64 MPI jobs, since that is the flavor we implemented. Each MPI job does run on a single GPU. We may have issues when the data does not fit into the memory of a single GPU -- in which case we will have to test it out and see what happens. We have initial runs of data-parallel versions of the CVAE -- but have to see how they run on the larger datasets we have.
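To make the one-GPU-per-MPI-job mapping concrete, a common pattern (an assumption here, not taken from the repo) is to pin each rank to a device before the ML framework initializes, e.g. via CUDA_VISIBLE_DEVICES:

```python
# Sketch: pin each MPI rank to one GPU before TensorFlow/Keras (or any CUDA
# framework) initializes. GPUS_PER_NODE is an assumption (6 on Summit).
import os
from mpi4py import MPI

GPUS_PER_NODE = 6
rank = MPI.COMM_WORLD.Get_rank()

# Each rank sees exactly one device; the framework then treats it as GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % GPUS_PER_NODE)

# import tensorflow / keras and build the CVAE only after the device is set
```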

jdakka commented 5 years ago

@acadev thanks! For internal RCT discussion: in the ML pipeline we are looking at a single (giant, MPI, 64-GPU) task for training (stage 1) and another (smaller, MPI or non-MPI, 6 or fewer GPUs) task for inference (stage 2).
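A rough EnTK-style sketch of that two-stage layout might look like the following; the script names, resource counts, and the exact cpu_reqs/gpu_reqs field names are assumptions (they vary across EnTK versions), not an agreed implementation.

```python
# Rough sketch of the two-stage ML pipeline in EnTK terms. Executable names,
# script paths, and the cpu_reqs/gpu_reqs field names are assumptions, not
# the agreed implementation.
from radical.entk import Pipeline, Stage, Task

p = Pipeline()

# Stage 1: one large MPI task -- HyperSpace driving 64 CVAE trainings, 64 GPUs.
train = Stage()
t_train = Task()
t_train.executable = "python"
t_train.arguments = ["hyperspace_cvae_train.py"]          # hypothetical script
t_train.cpu_reqs = {"processes": 64, "process_type": "MPI",
                    "threads_per_process": 1, "thread_type": None}
t_train.gpu_reqs = {"processes": 1, "process_type": None,
                    "threads_per_process": 1, "thread_type": "CUDA"}
train.add_tasks(t_train)

# Stage 2: one smaller task -- best CVAE in inference mode, <= 6 GPUs.
infer = Stage()
t_infer = Task()
t_infer.executable = "python"
t_infer.arguments = ["cvae_inference.py"]                  # hypothetical script
# gpu_reqs for the inference task would be set similarly (6 or fewer GPUs)
infer.add_tasks(t_infer)

p.add_stages([train, infer])
```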

jdakka commented 5 years ago

@hengma1001 also indicated that his workload will require RMQ. On Titan we had RMQ installed in a container in order to make RMQ accessible to the compute nodes. I believe from the INSPIRE call that, on Summit, containers are mainly to be used for transferring data.

mturilli commented 5 years ago

RMQ does not perform simulations, so it should still be within the remit of the container service offered by ORNL. This is why we have been able to run it there until now. We are in the process of testing this and will report back.

acadev commented 5 years ago

@jdakka @mturilli -- just to be on the same page -- once we replace everything with the RADICAL workflow (which also uses RMQ), the part of the workflow that uses RMQ in @hengma1001's code will not need RMQ anymore. The question is when and if RMQ will work on Summit.

mturilli commented 5 years ago

Hi @acadev, we confirm that our deployment of RMQ at OLCF can be accessed and used from Summit.
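For reference, a minimal connectivity check against that deployment from a compute node could look like the sketch below; the host, port, credentials, and queue name are placeholders, not the actual OLCF endpoint.

```python
# Minimal RMQ reachability check using pika; hostname/port/credentials are
# placeholders, not the real OLCF deployment details.
import pika

params = pika.ConnectionParameters(
    host="rmq.example.olcf.ornl.gov",   # placeholder hostname
    port=5672,
    credentials=pika.PlainCredentials("user", "password"),
)

conn = pika.BlockingConnection(params)
channel = conn.channel()
channel.queue_declare(queue="md_pipeline_test")  # hypothetical queue name
print("RMQ reachable from this node")
conn.close()
```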

jdakka commented 5 years ago

@acadev: Given @hengma1001's code, my understanding is that the primary use of RMQ is to spawn multiple MD pipelines. If we can design the simulations such that each OpenMM executable is a single, independent task, we can effectively remove RMQ from the workload.
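As an illustration of that "single, independent task" idea, each simulation could be a self-contained OpenMM script along these lines; the input files, force field, and run length are placeholders.

```python
# Sketch of a self-contained OpenMM MD task: no RMQ, no coordination --
# it reads a structure, runs, and writes a trajectory. File names, force
# field, and step count are placeholders.
from simtk import unit
from simtk.openmm import LangevinIntegrator, Platform
from simtk.openmm import app

pdb = app.PDBFile("input.pdb")                      # placeholder input
forcefield = app.ForceField("amber99sbildn.xml", "tip3p.xml")

system = forcefield.createSystem(pdb.topology,
                                 nonbondedMethod=app.PME,
                                 nonbondedCutoff=1.0 * unit.nanometer,
                                 constraints=app.HBonds)
integrator = LangevinIntegrator(300 * unit.kelvin,
                                1.0 / unit.picosecond,
                                2.0 * unit.femtoseconds)

platform = Platform.getPlatformByName("CUDA")       # one GPU per task
sim = app.Simulation(pdb.topology, system, integrator, platform)
sim.context.setPositions(pdb.positions)
sim.minimizeEnergy()

sim.reporters.append(app.DCDReporter("traj.dcd", 1000))
sim.step(500_000)                                   # placeholder run length
```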

However, I wanted to bring up an earlier point: earlier in this ticket we confirmed that HyperSpace would run as a single executable that spawns multiple MPI processes, where each process runs a CVAE model. @hengma1001's code uses celery/RMQ to spawn the CVAE models. I think I need more granularity on which implementation we'd be aiming for.

acadev commented 5 years ago

Actually -- now I get your specific question. For phase 1 of the hyperparameter optimization, HyperSpace will be used to spawn jobs when needed. We were using @hengma1001's code basically to demonstrate that it is possible to run a workflow that uses the CVAE to start new simulations. This part of the code would essentially be used only in phase 2, where the CVAE runs in inference mode and we spawn new simulations from there (based on which points are identified as novel starting points for new simulations). Hope this clarifies the use of the workflow.
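A minimal sketch of that phase-2 logic, under the assumption that "novel" means sparsely populated regions of the CVAE latent space (flagged here with DBSCAN outliers, which is my own choice of method, and placeholder file names):

```python
# Sketch of phase-2 logic: take latent-space coordinates produced by the
# trained CVAE encoder in inference mode, flag sparse/outlier points, and
# record which frames should seed new simulations. DBSCAN and the file
# names are assumptions, not the code in this repo.
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder file: shape (n_frames, latent_dim), written by the encoder.
latent = np.load("latent.npy")

# DBSCAN labels sparse points as noise (-1); treat those frames as novel.
labels = DBSCAN(eps=0.35, min_samples=10).fit_predict(latent)
novel_frames = np.where(labels == -1)[0]

# These frame indices would provide the restart structures for new MD tasks.
np.save("novel_frame_indices.npy", novel_frames)
```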

jdakka commented 5 years ago

Thanks, and please help me understand the following: by phase 1 do you mean the training phase of the CVAE, run after enough simulation data has been generated to train on? If so, it seems logical to me that phase 1 will train while incorporating the hyperparameter search.

I am including the drawing from earlier, where I labeled the training phase and inference phase, just to be sure that these correspond to your Phase 1 and Phase 2. Drawing

acadev commented 5 years ago

Yes -- you are correct -- the drawing also correctly captures what we are doing.