radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK behavior and performance consideration #132

Closed Weiming-Hu closed 3 years ago

Weiming-Hu commented 3 years ago

This is not a bug report but rather a proposed agenda for our next weekly meeting.

I have been using EnTK to run a standard grid search workflow for parameter optimization. I have about 170 geographic locations and I would like to optimize the weight for each of these locations. The total number of weight combinations is 8,001.
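For context, the size of this search space follows directly from the numbers above; a quick sketch of the arithmetic:

```python
# Grid-search size from the numbers quoted in this issue:
# 170 geographic locations, 8,001 weight combinations per location.
n_locations = 170
n_weight_combinations = 8001

# One (location, weight-combination) run per grid point.
runs = n_locations * n_weight_combinations
print(runs)  # → 1360170
```

This is where the 1,360,170 figure discussed later in the thread comes from.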

A minimal workflow to run 1 weight combination at 1 location is as follows:

  1. Analog Generation: C++ program, less than 1 second
  2. Power simulation: Python program, 10 seconds
  3. Verification: Python program, less than 1 second
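The three-step chain above can be sketched as plain data; in EnTK each entry would become a Task inside its own Stage of a single Pipeline. The executable names below (`analog_generator`, `simulate_power.py`, `verify.py`) are hypothetical placeholders, not the actual programs used here.

```python
def build_pipeline(location, weights):
    """Return the ordered stages for one (location, weight-combination) run.

    A minimal sketch of the workflow described above; command names are
    hypothetical placeholders.
    """
    return [
        {"name": "analog_generation",   # C++ program, < 1 second
         "cmd": ["analog_generator", location] + [str(w) for w in weights]},
        {"name": "power_simulation",    # Python program, ~10 seconds
         "cmd": ["python", "simulate_power.py", location]},
        {"name": "verification",        # Python program, < 1 second
         "cmd": ["python", "verify.py", location]},
    ]

stages = build_pipeline("loc_001", [0.2, 0.8])
print([s["name"] for s in stages])
```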

I experimented with several workflow designs, namely:

  * A large number of pipelines, each with a single light-weight task (1,360,170 pipelines, 1 minute per task)
  * A medium number of pipelines, each with several medium-weight tasks (8,001 pipelines with 2 tasks, 10 minutes per task)

I would like to discuss some interesting observations I had regarding my experiences with EnTK. I would greatly appreciate any advice and suggestions for improving the performance and the design in the future.

Weiming-Hu commented 3 years ago

Questions:

  1. processes vs threads_per_process: Where do processes and threads land?
  2. Status in logging: Scheduled, Submitted, ...?
  3. The correspondence between unit folder ID and task/stage/pipeline?

andre-merzky commented 3 years ago

Questions:

1. `processes` vs `threads_per_process`: Where do processes and threads land?

If `cpu_process_type` is MPI, then process ranks can land on any node: the scheduler will try to place them on the same node, but is free to distribute them, assuming that MPI works across nodes. `threads_per_process` will allocate that number of cores per process on the same node, so that the process can spawn threads which each land on their own core.
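A quick sketch of the arithmetic this implies (the dict mirrors the shape of EnTK's `Task.cpu_reqs`, but the exact key names may differ between EnTK versions):

```python
# Hedged sketch: a cpu_reqs-style description and the core count it implies.
cpu_reqs = {
    "processes":           4,         # MPI ranks; may be spread across nodes
    "process_type":        "MPI",
    "threads_per_process": 2,         # cores reserved per rank, on that rank's node
    "thread_type":         "OpenMP",
}

# Each rank gets threads_per_process cores on a single node, so the task
# occupies processes * threads_per_process cores in total.
total_cores = cpu_reqs["processes"] * cpu_reqs["threads_per_process"]
print(total_cores)  # → 8
```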

2. Status in logging: `Scheduled`, `Submitted`, ...?

`Scheduled`: the task is accepted by RE and waits for its dependencies to get resolved.
`Submitted`: dependencies are resolved, and the task is handed to RP for execution.

3. The correspondence between unit folder ID and task/stage/pipeline?

That correspondence is mostly random at the moment. We are working on a patch to make the IDs uniform across the layers; that should get released soon (or is already released? I should check...)

andre-merzky commented 3 years ago

I experimented with several workflow designs, namely:

* Large number of pipelines each with a single light-weight task (1,360,170 pipelines, 1 minute per task)

  * Long submission time?
  * Memory exhaustion?

* Medium number of pipelines each with several medium-weight tasks (8,001 pipelines with 2 tasks, 10 minutes per task)

I would like to discuss some interesting observations I had regarding my experiences with EnTK. I will greatly appreciate any advice and suggestions for improving the performance and design in the future.

We'll probably discuss this on the call, but some quick feedback anyway:

Having said that, there are limitations to the second point: tasks currently have a certain memory footprint, and the client holds a class instance for each task at all times. At the same time, MongoDB does not cope well with the way we communicate with it, and it also suffers under large task counts. As a rule of thumb, 32k tasks would be the max I'd suggest for now. For larger numbers, we would need to tweak MongoDB settings (poll timeouts etc.) to make this work; that is possible though.
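As a sanity check against that rule of thumb, the task counts of the two designs discussed in this thread work out as follows:

```python
# Task counts for the two designs mentioned above, checked against the
# ~32k-task rule of thumb from this comment.
design_1_tasks = 1_360_170 * 1   # many pipelines, one light-weight task each
design_2_tasks = 8_001 * 2       # 8,001 pipelines with 2 tasks each

RULE_OF_THUMB = 32_000
print(design_1_tasks <= RULE_OF_THUMB)  # → False: far over the suggested max
print(design_2_tasks)                   # → 16002: well under the limit
```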

In the long run, we target about a million tasks and hope to arrive there by summer: we are changing the way tasks are represented in memory (that has partially happened already), and removing MongoDB as the limiting component.

Having said all that: (8,001 pipelines with 2 tasks, 10 minutes per task) sounds like a great setup.

Weiming-Hu commented 3 years ago

Great discussion today, and everything you said makes total sense to me. Thank you very much for the awesome tool. I guess this ticket has pretty much served its purpose. I'm going to close this and open other, more specific tickets for the issues I currently have.