radical-experiments / deepdriveMD


Suggestions for experiments for the HPDC paper #1

Open mturilli opened 4 years ago

mturilli commented 4 years ago

Experiments

Here are some ideas and notes on the experiments we may want to design, set up, and run for the HPDC paper. Happy to discuss each experiment further if you find this interesting/useful.

Experiment 1

Done, results at https://github.com/radical-experiments/hyperspace_experiments/blob/master/analysis/nonuniform_tasks/nonuniform_tasks.ipynb

Experiment 2

Building on Experiment 1, this experiment shows better resource utilization by keeping TTX constant while reducing the total amount of resources required and controlling the number of execution generations per task duration.

Design

Setup

| Run ID | #T_1000s | #T_100s | #T_10s | #G(T_1000s) | #G(T_100s) | #G(T_10s) | #Cores | TTX ideal | RU |
|--------|----------|---------|--------|-------------|------------|-----------|--------|-----------|----|
| 1 | 40 | 40 | 40 | 1 | 2 | 2 | 120 | 1000s | ? |
| 2 | 40 | 40 | 40 | 1 | 4 | 4 | 80 | 1000s | ? |
| 3 | 40 | 40 | 40 | 1 | 8 | 8 | 60 | 1000s | ? |
| 4 | 40 | 40 | 40 | 1 | 10 | 16 | 47 | 1000s | ? |
| 5 | 40 | 40 | 40 | 1 | 10 | 32 | 46 | 1000s | ? |
| 6 | 40 | 40 | 40 | 1 | 10 | 40 | 45 | 1000s | ? |

Legend

Notes:
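One plausible reading of the table above (an assumption, not stated in the thread): each task type is split into #G sequential generations, so it needs ceil(#T / #G) concurrent cores, and the ideal resource utilization is useful core-seconds over allocated core-seconds. Under this reading, runs 4–6 match the listed core counts exactly; a minimal sketch:

```python
import math

def exp2_metrics(tasks, cores, ttx):
    """tasks: list of (count, duration_s, generations) per task type.

    Assumed reading of the table: each type runs in #G sequential
    generations, so it needs ceil(count / G) concurrent cores; RU is
    useful core-seconds over allocated core-seconds.
    """
    concurrency = sum(math.ceil(n / g) for n, _, g in tasks)
    busy = sum(n * d for n, d, _ in tasks)   # useful core-seconds
    ru = busy / (cores * ttx)                # ideal resource utilization
    return concurrency, ru

# Run 6 of the table: G = (1, 10, 40), 45 cores, TTX 1000s
conc, ru = exp2_metrics([(40, 1000, 1), (40, 100, 10), (40, 10, 40)],
                        cores=45, ttx=1000)
print(conc, round(ru, 3))  # 45 cores needed, RU ≈ 0.987
```

Note that 10 generations of 100s tasks and 40 generations of 10s tasks both fit within the 1000s TTX of the long-running tasks, which is what keeps TTX constant across runs.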

Experiment 3

Shows that the results observed in Experiment 1 apply to real-life workflows with tasks that have a realistic distribution of execution times, and thus shows the insight we can gain about resource utilization for an actual workflow. As Experiment 1, but with the distribution of task execution times measured by executing one of the scientific workflows of the paper (choosing the most interesting one from a scientific point of view).

Experiment 4

Shows that we can maximize resource utilization while keeping the workflow execution time as close as feasible to its ideal total execution time. As Experiment 2, but with the same distribution of task execution times as in Experiment 3, and only with maximal resource utilization, i.e., the run with the minimal number of cores.

Experiment 5

Applies what we learned in the previous experiments to an actual workflow, maximizing its resource utilization while minimizing its execution time on a given resource. We analyze the execution of the scientific workflow used for Experiment 3 and define all the unique ratios between heterogeneous tasks. For example, imagine that across the execution of the whole workflow we have 4 distinct ratios of 3 types of tasks. We would then have 4 cases of Experiment 1, apply the equation derived for Experiment 2, and calculate the optimal resource utilization as done in Experiment 4.
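The "unique ratios" step can be sketched by reducing each stage's task-type counts by their greatest common divisor and collecting the distinct results (the per-stage counts below are illustrative placeholders, not measured values):

```python
from math import gcd
from functools import reduce

def reduced_ratio(counts):
    """Reduce a tuple of task counts to its smallest integer ratio."""
    g = reduce(gcd, counts)
    return tuple(c // g for c in counts)

# Hypothetical per-stage counts of 3 task types across a workflow
stages = [(12, 4, 2), (24, 8, 4), (6, 2, 1), (9, 3, 3)]
unique_ratios = {reduced_ratio(s) for s in stages}
print(unique_ratios)  # {(6, 2, 1), (3, 1, 1)} -> 2 distinct cases
```

Each distinct ratio would then become one case of Experiment 1, as described above.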

lee212 commented 4 years ago

Experiment 2 needs a new table. The CPU counts will be rounded up to whole node counts: for example, 42/84/126/168/210/252 cores, i.e., 1/2/3/4/5/6 nodes, are the possible resource sizes. To maximize scheduling, the task count is bumped up from 40 to 100 per type. A possible table is:

| Run ID | #T_1000s | #T_100s | #T_10s | #G(T_1000s) | #G(T_100s) | #G(T_10s) | #Cores | TTX ideal | RU |
|--------|----------|---------|--------|-------------|------------|-----------|--------|-----------|----|
| 1 | 100 | 100 | 100 | 1 | 1 | 2 | 252 | 1000s | ? |
| 2 | 100 | 100 | 100 | 1 | 1 | 10 | 210 | 1000s | ? |
| 3 | 100 | 100 | 100 | 1 | 2 | 50 | 168 | 1000s | ? |
| 4 | 100 | 100 | 100 | 1 | 4 | 100 | 126 | 1000s | ? |
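The node rounding described above (42 usable cores per node, per the core counts in the comment) can be expressed as a small helper:

```python
import math

def round_up_to_nodes(cores, cores_per_node=42):
    """Round a core request up to whole nodes; 42 cores/node is the
    figure used in the comment above."""
    nodes = math.ceil(cores / cores_per_node)
    return nodes, nodes * cores_per_node

print(round_up_to_nodes(45))   # (2, 84): a 45-core request needs 2 nodes
print(round_up_to_nodes(210))  # (5, 210): already node-aligned
```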
shantenujha commented 4 years ago

I was hoping we could use published results to discuss the rate of launch (which is important for the O(10)-second, if not the O(100)-second, tasks) to convince the reader that, although not super efficient, our rate-of-launch overhead is adequate for the scales proposed. Suggestions of which published plots, if any, we can leverage? If not, we should consider doing some (relatively quick) experiments to capture the relevant performance of task launching.
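A back-of-the-envelope version of the adequacy argument: if the launcher sustains r tasks/s, filling C concurrent slots takes C/r seconds, which should be small relative to the task duration. The launch rate below is an illustrative placeholder, not a measured RCT figure:

```python
def launch_overhead_fraction(concurrency, task_duration_s, launch_rate_per_s):
    """Fraction of a task generation's duration spent just launching."""
    ramp_up = concurrency / launch_rate_per_s   # seconds to fill all slots
    return ramp_up / task_duration_s

# Illustrative numbers: 100 concurrent tasks, launcher at 10 tasks/s
print(launch_overhead_fraction(100, 1000, 10))  # 0.01 for O(1000)s tasks
print(launch_overhead_fraction(100, 10, 10))    # 1.0  for O(10)s tasks
```

This is why the launch rate matters mainly for the short tasks: the same ramp-up that is negligible against a 1000s task fully dominates a 10s one.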

mturilli commented 4 years ago

I think the latest paper we published with ORNL contains that information, albeit indirectly. We can derive the scheduling rate from the experiments that @lee212 just ran, but that would not give us the scheduler's performance upper bound.

lee212 commented 4 years ago

Proposed table for Experiment 3

| system | T_7200s (mdrun) | T_3900s (CVAE) | T_840s (TICA) | T_600s (Inference) | T_5s (RLDock) |
|--------|-----------------|----------------|---------------|--------------------|---------------|
| ntl9 | 12 | 10 | 10 | 1 | 1 |
| ntl9 | 24 | 10 | 10 | 1 | 1 |
| ntl9 | 48 | 10 | 10 | 1 | 1 |
| ntl9 | 96 | 10 | 10 | 1 | 1 |
| ---- | --- | --- | --- | -- | - |
| ntl9 | 60 | 50 | 50 | 5 | 1 |
| ntl9 | 120 | 50 | 50 | 5 | 1 |
| ntl9 | 240 | 50 | 50 | 5 | 1 |
| ntl9 | 480 | 50 | 50 | 5 | 1 |
| ---- | --- | --- | --- | -- | - |
| ntl9 | 120 | 100 | 100 | 10 | 1 |
| ntl9 | 240 | 100 | 100 | 10 | 1 |
| ntl9 | 480 | 100 | 100 | 10 | 1 |
| ntl9 | 960 | 100 | 100 | 10 | 1 |
shantenujha commented 4 years ago

@lee212 are the entries temporal durations, or numbers of concurrent tasks?

lee212 commented 4 years ago

Exp 4

| Run ID | #G_MD | #G_CVAE | #G_TICA | #G_Inference | #G_RLDock | #Cores | TTX Ideal |
|--------|-------|---------|---------|--------------|-----------|--------|-----------|
| 1 | 1 | 1 | 1 | 1 | 1 | 252 | 7200 |
| 2 | 1 | 1 | 2 | 3 | 1 | 210 | 7200 |
| 3 | 1 | 1 | 10 | 10 | 1 | 168 | 7200 |
| 4 | 1 | 2 | 10 | 10 | 1 | 126 | 7200 8400 |
mturilli commented 4 years ago

Hi @lee212, what is the total number of cores you used for Exp 3? In Exp 1, we showed that, given a fixed amount of cores with which we can run all the available tasks, resource utilization increases only when the number of long-running tasks dominates over that of short-running tasks. Now that I am writing the paper, I see this might not need another experiment as Exp 1 is convincing in itself.

Looking at your table Experiment 3, I had initially thought it described the scalability experiment we decided to do, say Exp 6 for lack of a better name. Was this the case or was I wrong? If the latter, do we have a table describing the scalability experiment?

About Exp 4, why did you choose 12:10:10:10:1 for the number of tasks? Is this what Arvind wants to use?

lee212 commented 4 years ago

The table for Exp 3 needs changes; it does not have correct numbers like Exp 1. I also agree it is not necessary, as Exp 1 satisfies its objective.

For Exp 4, the number of tasks was preserved from the real workload, the ntl9 physical system, which starts with 12 MD simulations, 10 CVAE and 10 TICA training tasks, 1 inference task, and 1 reinforcement learning task. Right, I didn't think Exp 4 needed to use the same distribution as Exp 2 (e.g., 100:100:100), but it shows resource utilization as a function of the number of cores while reaching the ideal TTX, like Exp 2.

lee212 commented 4 years ago

New scoping experiments for ICPP paper

  1. Performance characterization
    • Objective: show resource utilization while varying the number of cores and the number of tasks, characterizing the behavior under temporal heterogeneity
    • Configuration: continue working on the current plan
      • 3 sets (1000/100/10-second tasks, N each) run on 3 nodes
    • What to expect: better resource utilization when longer-running tasks dominate the workflow
  2. Scaling
    • Objective: 1) estimate the resources required for processing large problem sizes, and 2) measure the ability of RCT and an application to behave well across different problem sizes. This will inform resource planning, precisely when large applications need to run at full problem size with intensive usage of compute resources.
    • Configuration: to emphasize upscaling, two sets will run over 1024 nodes: 1) non-RCT, 2) RCT-enabled
    • What to expect:
      • non-RCT is incapable of scaling beyond x nodes
      • RCT overhead is negligible at larger problem sizes
      • One is bound to reach the limitations of the system at some point (e.g., 2^16 tasks)
  3. Performance improvement
    • Objective: demonstrate how RCT intelligently distributes available resources/tasks to HPC compute nodes/CPU/GPU cores for performance improvement. The key breakthrough might be 1) new scheduling (no thread lock in Python 3) or 2) ML-enabled adaptive sampling
    • Configuration:
      • RCT stack with Python 2
      • RCT stack with Python 3 and jsrun
      • RCT stack with Python 3 and PRTE
      • Or new scheduling for task assignment/resource allocation?
        • Current scheduling is continuous, first-come-first-served, because we don't have sufficient information about tasks, e.g., their durations (how long each will take)
        • If we had more information about tasks, a better bin-packing calculation might be achievable
  4. With real science applications
    • Objective: validate that our experiments/claims hold
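The bin-packing idea in item 3 can be sketched: with task durations known up front, a longest-processing-time (first-fit-decreasing) greedy assignment typically yields a shorter makespan than assigning tasks in arrival order. The durations below are illustrative, not from the experiments:

```python
import heapq

def makespan(durations, n_cores, sort_desc):
    """Greedily assign each task to the least-loaded core; with
    sort_desc=True this is the LPT/first-fit-decreasing heuristic,
    otherwise plain arrival-order (first-come-first-served)."""
    tasks = sorted(durations, reverse=True) if sort_desc else list(durations)
    loads = [0] * n_cores
    heapq.heapify(loads)
    for d in tasks:
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)

# Illustrative mixed-duration workload on 4 cores
durations = [1000, 10, 10, 100, 1000, 10, 100, 1000, 10, 100]
print(makespan(durations, 4, sort_desc=False))  # FCFS: 1100
print(makespan(durations, 4, sort_desc=True))   # LPT:  1000
```

This illustrates the point in the list above: without duration information the scheduler can only process tasks in arrival order, while with it, sorting long tasks first already closes the gap to the ideal makespan.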
lee212 commented 4 years ago

[plot: resource_utilization_n_cores]

Inverted plot for resource utilization for reduced # of cores

andre-merzky commented 4 years ago

This looks... unexpected...