Closed lee212 closed 3 years ago
@benjha -- Thanks. This looks good to me as a well-designed first pass.
@benjha , regarding profiling, the link is fine, but you can find the rendered version here: https://radicalentk.readthedocs.io/en/latest/user_guide/profiling.html Let us know if you need further information on using radical.analytics, or about EnTK.
The current baseline is based on the default resources assigned to the workflow. For simulation (NAMD), 40 MPI processes with 4 threads each are requested. For analysis, 1 MPI process with 4 threads is used.
The MDFF-EnTK workflow was executed five times, and the reported runtime results are averages. Runtimes were obtained from NAMD's WallTime report; for VMD, Linux's time was used and the elapsed (wall-clock) time is reported.
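The averaging step can be sketched as follows. This assumes NAMD's usual end-of-run WallClock: line (the exact format may differ between NAMD builds), and the helper names (namd_walltime, mean_runtime) are illustrative, not part of the workflow code:

```python
import re

def namd_walltime(log_text):
    """Extract the WallClock value (seconds) from NAMD stdout.

    Assumes the usual end-of-run line, e.g.
    'WallClock:  50.300  CPUTime:  49.80  Memory: 210.2 MB'.
    """
    m = re.search(r"WallClock:\s+([\d.]+)", log_text)
    if m is None:
        raise ValueError("no WallClock line found")
    return float(m.group(1))

def mean_runtime(log_texts):
    """Average the wall-clock time over repeated runs."""
    times = [namd_walltime(t) for t in log_texts]
    return sum(times) / len(times)
```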
Software configuration:
python : 3.7.0
pythonpath : /sw/summit/xalt/1.2.0/site:/sw/summit/xalt/1.2.0/libexec
virtualenv : /gpfs/alpine/world-shared/bip115/radical_tools_python
radical.entk : 1.4.0
radical.pilot : 1.4.0
radical.saga : 1.4.0
radical.utils : 1.4.0
NAMD 2.14b1 MPI smp
NAMD 2.14b2 PAMI smp
VMD 1.9.3
Total time: 68.48 s
Simulation runtime: 50.3 s (NAMD 2.14b1 MPI smp)
Analysis runtime: 18.18 s
Longest analysis task: Task 1, 4.77 s
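As a quick sanity check, the baseline numbers above are internally consistent: simulation plus analysis matches the reported total to rounding.

```python
simulation = 50.30   # s, NAMD 2.14b1 MPI smp (reported above)
analysis   = 18.18   # s (reported above)
total      = 68.48   # s, reported total time

# the two components account for the reported total
assert abs((simulation + analysis) - total) < 0.01
```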
Two flavors of NAMD were used in this evaluation: MPI smp and PAMI smp. It is not clear which flavor should be used for the MDFF-EnTK workflow; currently it uses MPI smp.
Following the recommendations from [1], different numbers of MPI processes and threads were tested. However, the only combination that outperformed the current 40 MPI processes with 4 threads each was 20 MPI processes with 8 threads each, which reduced the total simulation runtime from 50.3 s to 42.52 s.
First, we performed an unguided performance evaluation, increasing the number of threads by powers of 2 to verify "natural" speed-ups. Fig. 1 reports timings for 4, 8, and 16 threads; there is a clear trend indicating that adding more threads reduced the simulation runtime. However, using 32 or more threads (still powers of two) sometimes halted the workflow between this task (task 6) and task 7. Notice that with only 16 threads, the simulation runtime is about 2x slower than the best-performing NAMD MPI smp run using 160 cores.
Figure 1. Simulation timings using NAMD MPI smp version
Following the Stage 2 and Stage 3 recommendations [1], we then evaluated the simulation runtime using 4, 7, 14, 28, and 56 threads. Fig. 2 reports timings under these configurations; in particular, 56 threads was the best-performing configuration across the MPI smp and PAMI smp tests.
Figure 2. Simulation timings using NAMD PAMI smp version
As a side note, with 112 threads the average runtime only went from 37.82 s (56 threads) to 31.34 s, far from the ideal 2x speed-up, which indicates we are hitting a scalability limit for the size of this molecular system. In addition, when using 112 threads we also suffered from halting between this task and task 7.
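A quick back-of-the-envelope check of this scaling observation, using the averaged runtimes reported in this thread (a sketch; variable names are illustrative):

```python
def speedup(t_base, t_new):
    """Observed speedup when runtime drops from t_base to t_new (seconds)."""
    return t_base / t_new

t56  = 37.82   # average runtime with 56 threads (reported above)
t112 = 31.34   # average runtime with 112 threads (reported above)

observed   = speedup(t56, t112)   # ~1.2x
ideal      = 112 / 56             # 2.0x under perfect strong scaling
efficiency = observed / ideal     # ~0.6, well below ideal
```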
Preliminary comments:
In its current version (1.4.0), RADICAL-SAGA sets the SMT (Simultaneous Multithreading) level flag to 4, and the CPU IDs used to generate Summit's ERF do not consider a stride different from one. A stride is needed when the SMT level differs from four. Further examination is required to allow more flexibility when allocating resources.
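For illustration only, a strided ERF entry for SMT=1 might look like the sketch below, where only every fourth hardware thread is listed so that ranks land on distinct physical cores. The exact grammar and indexing mode (cpu_index_using) should be verified against the jsrun/ERF documentation before use:

```
cpu_index_using: logical
rank: 0: { host: 1; cpu: {0,4,8,12} }
rank: 1: { host: 1; cpu: {16,20,24,28} }
```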
Further details in:
namd-entk.slack.com
VMD was configured to use 1 MPI process with 4, 8, and 16 threads to evaluate whether the current analysis tasks take advantage of the CPU's cores. In particular, tasks 1 to 5 and tasks 7 to 9 are dedicated to analysis.
Figure 3. Analysis timings
Fig. 3 shows that the tasks' runtimes did not improve when going from 4 to 8 or 16 threads. Moreover, despite explicitly limiting the resources assigned to VMD with the ERF file configuration, e.g.
cpu_index_using: physical
rank: 0: { host: 1; cpu: {0,1,2,3,4,5,6,7}}
VMD was still able to detect 176 available CPU cores:
...
Info) Initializing parallel VMD instances via MPI...
Info) Found 1 VMD MPI node containing a total of 176 CPUs and 0 GPUs:
Info) 0: 176 CPUs, 572.3GB (94%) free mem, 0 GPUs, Name: f11n05
Info) No CUDA accelerator devices available.
Info) Using plugin pdb for structure file 4ake-target.pdb
Info) Using plugin pdb for coordinates from file 4ake-target.pdb
Info) Determining bond structure from distance search ...
...
this could mean that VMD is using all CPU cores and that EnTK's resource allocation is being overridden, which would explain why Fig. 3 shows the same performance in all cases.
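One thing worth testing is whether VMD's CPU-count override changes this behavior. The sketch below assumes the VMDFORCECPUCOUNT environment variable (documented for VMD) is honored by the installed build; the launch command in the comment is illustrative, not the workflow's actual command line:

```python
import os

def vmd_env(ncpus):
    """Environment for launching VMD limited to `ncpus` CPUs.

    Assumes VMD honors the VMDFORCECPUCOUNT override; verify against
    the VMD documentation for the installed version.
    """
    env = os.environ.copy()
    env["VMDFORCECPUCOUNT"] = str(ncpus)
    return env

# Illustrative launch (not the workflow's actual command):
# import subprocess
# subprocess.run(["vmd", "-dispdev", "text", "-e", "analysis.tcl"],
#                env=vmd_env(4), check=True)
```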
Total time: 56.00 s
Simulation runtime: 37.83 s (NAMD 2.14b2 PAMI smp - 56 threads)
Analysis runtime: 18.18 s
Longest analysis task: Task 1, 4.77 s
@benjha , just to share an update on the SMT level, NVMe, and Spectral: we are working on these now. SMT support is now available on the devel branch, with this PR https://github.com/radical-cybertools/radical.pilot/pull/2198 merged.
NVMe will be added as an option in the resource config, like SMT.
For the smt1 and NVMe options, radical.pilot needs a configuration entry such as:
"system_architecture": {"smt"    : 1,
                        "options": ["gpumps", "NVME"]}
I am closing this for now because of inactivity, but I am willing to re-open it.
Performance Evaluation Plan
The performance evaluation plan consists of verifying the behavior of the MDFF-EnTK workflow on Summit. It will include performance evaluation at the application level and at the RADICAL tools level.
The current workflow executes NAMD and VMD on the same node.
Application Level
General Approach
Calculate baseline timings with the current resource allocation used for NAMD and VMD. Timings will include the total execution time of both programs.
NAMD's baseline resource configuration is 40 MPI processes with 4 threads each. VMD's baseline resource configuration is 1 MPI process with 4 threads. However, its output shows:
CPU resource configuration
Modify the current resource allocation for NAMD and VMD to verify performance. NAMD-SMP supports different combinations of MPI processes and threads. This evaluation will consist of:
Evaluate the impact of MPI when there are few MPI processes with many threads vs. many MPI processes with few threads.
Evaluate socket affinity.
A wrong combination of the above will lead to resource contention.
By default, SMPI (Spectrum MPI) is configured for minimum latency. Verify performance when configuring SMPI for maximum bandwidth. (No change in performance is expected since NAMD uses one node.)
https://www.ks.uiuc.edu/Research/vmd/mailing_list/vmd-l/28900.html https://www.ks.uiuc.edu/Research/vmd/current/ug/node144.html
Some functionality uses multithreading. When several MPI processes are used, Tcl's parallel command should be used. Verify that the cpus flag, set to 4 under the analysis section of workflow_cfg.yml, is limiting VMD to 4 threads; according to the stdout of each unit.*, this may not be happening.
I/O
NAMD and VMD communicate with each other via files. At large scale this will be problematic because many files will be written to and read from Alpine (GPFS). The plan is to evaluate Summit's NVMe for temporary storage. Using NVMe paired with OLCF's Spectral will help automatically flush the NVMe contents to Alpine when the job finishes.
Current behavior
Modify your job submission script to include the -alloc_flags NVME bsub option. Then, on each reserved Burst Buffer node, a directory called /mnt/bb/$USER will be available.
https://www.olcf.ornl.gov/spectral-library/
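Put together, a job script fragment might look like the sketch below. Only -alloc_flags NVME and the /mnt/bb/$USER mount point come from the OLCF documentation referenced above; the rest (the SCRATCH variable, the flush behavior note) is illustrative:

```
#BSUB -alloc_flags NVME

# per-node NVMe mount point made available by the allocation flag
SCRATCH=/mnt/bb/$USER

# point NAMD/VMD intermediate files at node-local storage; with
# Spectral enabled, contents are flushed to Alpine when the job ends
```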
Misc
Are we using GPUs?
EnTK Level
It will be based on:
https://github.com/radical-cybertools/radical.entk/blob/5710b11463981244ca454939d2cfd03e2369cf47/docs/user_guide/profiling.rst
Can Radical Analytics be used to run the evaluation described above?