radical-experiments / deepdriveMD


Suggestions for experiments for the HPDC paper #1

Open mturilli opened 4 years ago

mturilli commented 4 years ago

Experiments

Here are some ideas and notes on the experiments we may want to design, set up, and run for the HPDC paper. Happy to discuss each experiment further if you find this interesting/useful.

Experiment 1

Done, results at https://github.com/radical-experiments/hyperspace_experiments/blob/master/analysis/nonuniform_tasks/nonuniform_tasks.ipynb

Experiment 2

Building on Experiment 1, this experiment shows better resource utilization by keeping TTX constant while reducing the total amount of resources required and controlling the number of execution generations per task duration.

Design

Setup

| Run ID | #T_1000s | #T_100s | #T_10s | #G(T_1000s) | #G(T_100s) | #G(T_10s) | #Cores | TTX ideal | RU |
|--------|----------|---------|--------|-------------|------------|-----------|--------|-----------|----|
| 1 | 40 | 40 | 40 | 1 | 2 | 2 | 120 | 1000s | ? |
| 2 | 40 | 40 | 40 | 1 | 4 | 4 | 80 | 1000s | ? |
| 3 | 40 | 40 | 40 | 1 | 8 | 8 | 60 | 1000s | ? |
| 4 | 40 | 40 | 40 | 1 | 10 | 16 | 47 | 1000s | ? |
| 5 | 40 | 40 | 40 | 1 | 10 | 32 | 46 | 1000s | ? |
| 6 | 40 | 40 | 40 | 1 | 10 | 40 | 45 | 1000s | ? |

Legend

Notes:
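One plausible reading of the table above (an assumption, not stated in the thread): each task type is split into #G sequential generations, so it needs ceil(#T / #G) concurrent cores, and the ideal resource utilization is useful core-seconds over allocated core-seconds. Under this reading, runs 4–6 match the listed core counts exactly; a minimal sketch:

```python
import math

def exp2_metrics(tasks, cores, ttx):
    """tasks: list of (count, duration_s, generations) per task type.

    Assumed reading of the table: each type runs in #G sequential
    generations, so it needs ceil(count / G) concurrent cores; RU is
    useful core-seconds over allocated core-seconds.
    """
    concurrency = sum(math.ceil(n / g) for n, _, g in tasks)
    busy = sum(n * d for n, d, _ in tasks)   # useful core-seconds
    ru = busy / (cores * ttx)                # ideal resource utilization
    return concurrency, ru

# Run 6 of the table: G = (1, 10, 40), 45 cores, TTX 1000s
conc, ru = exp2_metrics([(40, 1000, 1), (40, 100, 10), (40, 10, 40)],
                        cores=45, ttx=1000)
print(conc, round(ru, 3))  # 45 cores needed, RU ≈ 0.987
```

Note that 10 generations of 100s tasks and 40 generations of 10s tasks both fit within the 1000s TTX of the long-running tasks, which is what keeps TTX constant across runs.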

Experiment 3

Shows that the results observed in Experiment 1 apply to real-life workflows with tasks that have a realistic distribution of execution times, and thus shows the insight we can gain about resource utilization for an actual workflow. As Experiment 1, but with the distribution of task execution times measured by executing one of the scientific workflows of the paper (choosing the most interesting one from a scientific point of view).

Experiment 4

Shows that we can maximize resource utilization while keeping the workflow execution time as close as feasible to its ideal total execution time. As Experiment 2, but with the same distribution of task execution times as in Experiment 3, and only with maximal resource utilization, i.e., the run with the minimal number of cores.

Experiment 5

Applies what we learned in the previous experiments to an actual workflow, maximizing its resource utilization while minimizing its execution time on a given resource. We analyze the execution of the scientific workflow used for Experiment 3 and define all the unique ratios between heterogeneous tasks. For example, imagine that across the execution of the whole workflow we have 4 distinct ratios of 3 types of tasks. We would then have 4 cases of Experiment 1, apply the equation derived for Experiment 2, and calculate the optimal resource utilization as done in Experiment 4.
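The "unique ratios" step can be sketched by reducing each stage's task-type counts by their greatest common divisor and collecting the distinct results (the per-stage counts below are illustrative placeholders, not measured values):

```python
from math import gcd
from functools import reduce

def reduced_ratio(counts):
    """Reduce a tuple of task counts to its smallest integer ratio."""
    g = reduce(gcd, counts)
    return tuple(c // g for c in counts)

# Hypothetical per-stage counts of 3 task types across a workflow
stages = [(12, 4, 2), (24, 8, 4), (6, 2, 1), (9, 3, 3)]
unique_ratios = {reduced_ratio(s) for s in stages}
print(unique_ratios)  # {(6, 2, 1), (3, 1, 1)} -> 2 distinct cases
```

Each distinct ratio would then become one case of Experiment 1, as described above.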

lee212 commented 4 years ago

Experiment 2 needs a new table. The CPU counts will be rounded up to whole node counts: for example, 42/84/126/168/210/252 cores, i.e., 1/2/3/4/5/6 nodes, are the possible resource sizes. To maximize scheduling, the task count is bumped up from 40 to 100 per type. A possible table is:

| Run ID | #T_1000s | #T_100s | #T_10s | #G(T_1000s) | #G(T_100s) | #G(T_10s) | #Cores | TTX ideal | RU |
|--------|----------|---------|--------|-------------|------------|-----------|--------|-----------|----|
| 1 | 100 | 100 | 100 | 1 | 1 | 2 | 252 | 1000s | ? |
| 2 | 100 | 100 | 100 | 1 | 1 | 10 | 210 | 1000s | ? |
| 3 | 100 | 100 | 100 | 1 | 2 | 50 | 168 | 1000s | ? |
| 4 | 100 | 100 | 100 | 1 | 4 | 100 | 126 | 1000s | ? |
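The node rounding described above (42 usable cores per node, per the core counts in the comment) can be expressed as a small helper:

```python
import math

def round_up_to_nodes(cores, cores_per_node=42):
    """Round a core request up to whole nodes; 42 cores/node is the
    figure used in the comment above."""
    nodes = math.ceil(cores / cores_per_node)
    return nodes, nodes * cores_per_node

print(round_up_to_nodes(45))   # (2, 84): a 45-core request needs 2 nodes
print(round_up_to_nodes(210))  # (5, 210): already node-aligned
```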
shantenujha commented 4 years ago

I was hoping we could use published results to discuss the rate of launch (which is important for the O(10)-second, if not the O(100)-second, tasks) to convince the reader that, although not super efficient, our rate-of-launch overhead is adequate for the scales proposed. Suggestions of which published plots, if any, we can leverage? If not, we should consider doing some (relatively quick) experiments to capture the relevant performance of task launching.
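A back-of-the-envelope version of the adequacy argument: if the launcher sustains r tasks/s, filling C concurrent slots takes C/r seconds, which should be small relative to the task duration. The launch rate below is an illustrative placeholder, not a measured RCT figure:

```python
def launch_overhead_fraction(concurrency, task_duration_s, launch_rate_per_s):
    """Fraction of a task generation's duration spent just launching."""
    ramp_up = concurrency / launch_rate_per_s   # seconds to fill all slots
    return ramp_up / task_duration_s

# Illustrative numbers: 100 concurrent tasks, launcher at 10 tasks/s
print(launch_overhead_fraction(100, 1000, 10))  # 0.01 for O(1000)s tasks
print(launch_overhead_fraction(100, 10, 10))    # 1.0  for O(10)s tasks
```

This is why the launch rate matters mainly for the short tasks: the same ramp-up that is negligible against a 1000s task fully dominates a 10s one.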

mturilli commented 4 years ago

I think the latest paper we published with ORNL contains that information, albeit indirectly. We can derive the scheduling rate from the experiments that @lee212 just ran, but that would not give us the scheduler's performance upper bound.

lee212 commented 4 years ago

Proposed table for Experiment 3

| system | T_7200s (mdrun) | T_3900s (CVAE) | T_840s (TICA) | T_600s (Inference) | T_5s (RLDock) |
|--------|-----------------|----------------|---------------|--------------------|---------------|
| ntl9 | 12 | 10 | 10 | 1 | 1 |
| ntl9 | 24 | 10 | 10 | 1 | 1 |
| ntl9 | 48 | 10 | 10 | 1 | 1 |
| ntl9 | 96 | 10 | 10 | 1 | 1 |
| ---- | --- | --- | --- | -- | - |
| ntl9 | 60 | 50 | 50 | 5 | 1 |
| ntl9 | 120 | 50 | 50 | 5 | 1 |
| ntl9 | 240 | 50 | 50 | 5 | 1 |
| ntl9 | 480 | 50 | 50 | 5 | 1 |
| ---- | --- | --- | --- | -- | - |
| ntl9 | 120 | 100 | 100 | 10 | 1 |
| ntl9 | 240 | 100 | 100 | 10 | 1 |
| ntl9 | 480 | 100 | 100 | 10 | 1 |
| ntl9 | 960 | 100 | 100 | 10 | 1 |
shantenujha commented 4 years ago

@lee212 are the entries temporal durations, or numbers of concurrent tasks?

lee212 commented 4 years ago

Exp 4

| Run ID | #G_MD | #G_CVAE | #G_TICA | #G_Inference | #G_RLDock | #Cores | TTX Ideal |
|--------|-------|---------|---------|--------------|-----------|--------|-----------|
| 1 | 1 | 1 | 1 | 1 | 1 | 252 | 7200 |
| 2 | 1 | 1 | 2 | 3 | 1 | 210 | 7200 |
| 3 | 1 | 1 | 10 | 10 | 1 | 168 | 7200 |
| 4 | 1 | 2 | 10 | 10 | 1 | 126 | 7200 8400 |
mturilli commented 4 years ago

Hi @lee212, what is the total number of cores you used for Exp 3? In Exp 1, we showed that, given a fixed amount of cores with which we can run all the available tasks, resource utilization increases only when the number of long-running tasks dominates over that of short-running tasks. Now that I am writing the paper, I see this might not need another experiment as Exp 1 is convincing in itself.

Looking at your table Experiment 3, I had initially thought it described the scalability experiment we decided to do, say Exp 6 for lack of a better name. Was this the case or was I wrong? If the latter, do we have a table describing the scalability experiment?

About Exp 4, why did you choose 12:10:10:10:1 for the number of tasks? Is this what Arvind wants to use?

lee212 commented 4 years ago

The table for Exp 3 needs changes; it does not have correct numbers like Exp 1. I also agree it is not necessary, as Exp 1 satisfies its objective.

For Exp 4, the number of tasks was preserved from the real workload, the ntl9 physical system, which starts with 12 MD simulations, 10 CVAE and 10 TICA training tasks, 1 inference task, and 1 reinforcement learning task. Right, I didn't think Exp 4 needed to use the same distribution as Exp 2 (e.g., 100:100:100), but it shows resource utilization as a function of the number of cores while reaching the ideal TTX, like Exp 2.

lee212 commented 4 years ago

New scoping experiments for ICPP paper

  1. Performance characterization
    • Objective: show resource utilization while varying the number of cores and the number of tasks, characterizing the behavior under temporal heterogeneity
    • Configuration: continue working on the current plan
      • 3 sets (1000/100/10-second tasks, N each) run on 3 nodes
    • What to expect: better resource utilization when longer-running tasks dominate the workflow
  2. Scaling
    • Objective: 1) estimate the resources required for processing large problem sizes, and 2) measure the ability of RCT and an application to behave well across different problem sizes. This will inform resource planning, precisely when large applications need to run at full problem size with intensive usage of compute resources.
    • Configuration: to emphasize upscaling, two sets will run over 1024 nodes: 1) non-RCT, 2) RCT-enabled
    • What to expect:
      • non-RCT is incapable of scaling beyond x nodes
      • RCT overhead is negligible at larger problem sizes
      • One is bound to reach the limitations of the system at some point (e.g., 2^16 tasks)
  3. Performance improvement
    • Objective: demonstrate how RCT intelligently distributes available resources/tasks to HPC compute nodes/CPU/GPU cores for performance improvement. The key breakthrough might be 1) new scheduling (no thread lock in Python 3) or 2) ML-enabled adaptive sampling
    • Configuration:
      • RCT stack with Python 2
      • RCT stack with Python 3 and jsrun
      • RCT stack with Python 3 and PRTE
      • Or new scheduling for task assignment/resource allocation?
        • Current scheduling is continuous, first-come-first-served, because we don't have sufficient information about tasks, e.g., their durations (how long each will take)
        • If we had more information about tasks, a better bin-packing calculation might be achievable
  4. With real science applications
    • Objective: validate that our experiments/claims hold
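The bin-packing idea in item 3 can be sketched: with task durations known up front, a longest-processing-time (first-fit-decreasing) greedy assignment typically yields a shorter makespan than assigning tasks in arrival order. The durations below are illustrative, not from the experiments:

```python
import heapq

def makespan(durations, n_cores, sort_desc):
    """Greedily assign each task to the least-loaded core; with
    sort_desc=True this is the LPT/first-fit-decreasing heuristic,
    otherwise plain arrival-order (first-come-first-served)."""
    tasks = sorted(durations, reverse=True) if sort_desc else list(durations)
    loads = [0] * n_cores
    heapq.heapify(loads)
    for d in tasks:
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)

# Illustrative mixed-duration workload on 4 cores
durations = [1000, 10, 10, 100, 1000, 10, 100, 1000, 10, 100]
print(makespan(durations, 4, sort_desc=False))  # FCFS: 1100
print(makespan(durations, 4, sort_desc=True))   # LPT:  1000
```

This illustrates the point in the list above: without duration information the scheduler can only process tasks in arrival order, while with it, sorting long tasks first already closes the gap to the ideal makespan.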
lee212 commented 4 years ago

[plot: resource_utilization_n_cores]

Inverted plot for resource utilization for reduced # of cores

andre-merzky commented 4 years ago

This looks... unexpected...