Closed lee212 closed 3 years ago
@benjha -- Thanks. This looks good to me as a well-designed first pass.
@benjha , regarding profiling, the link is fine, but you can find the rendered version here: https://radicalentk.readthedocs.io/en/latest/user_guide/profiling.html Let us know if you need further information on using radical.analytics, or about EnTK.
The current baseline is based on the default resources assigned to the workflow. For simulation (NAMD), 40 MPI processes with 4 threads each are requested. For analysis, 1 MPI process with 4 threads is used.
The MDFF-EnTK workflow was executed five times, and the reported runtime results are averages. Runtimes were obtained from NAMD's WallTime report; for VMD, Linux's time was used and the elapsed (wall-clock) time is reported.
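The averaging step can be sketched as follows. This assumes NAMD's usual end-of-run WallClock: line (the exact format may differ between NAMD builds), and the helper names (namd_walltime, mean_runtime) are illustrative, not part of the workflow code:

```python
import re

def namd_walltime(log_text):
    """Extract the WallClock value (seconds) from NAMD stdout.

    Assumes the usual end-of-run line, e.g.
    'WallClock:  50.300  CPUTime:  49.80  Memory: 210.2 MB'.
    """
    m = re.search(r"WallClock:\s+([\d.]+)", log_text)
    if m is None:
        raise ValueError("no WallClock line found")
    return float(m.group(1))

def mean_runtime(log_texts):
    """Average the wall-clock time over repeated runs."""
    times = [namd_walltime(t) for t in log_texts]
    return sum(times) / len(times)
```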
Software configuration:
python : 3.7.0
pythonpath : /sw/summit/xalt/1.2.0/site:/sw/summit/xalt/1.2.0/libexec
virtualenv : /gpfs/alpine/world-shared/bip115/radical_tools_python
radical.entk : 1.4.0
radical.pilot : 1.4.0
radical.saga : 1.4.0
radical.utils : 1.4.0
NAMD 2.14b1 MPI smp
NAMD 2.14b2 PAMI smp
VMD 1.9.3
Total time: 68.48 s
Simulation runtime: 50.3 s (NAMD 2.14b1 MPI smp)
Analysis runtime: 18.18 s
Longest analysis task: Task 1, 4.77 s
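As a quick sanity check, the baseline numbers above are internally consistent: simulation plus analysis matches the reported total to rounding.

```python
simulation = 50.30   # s, NAMD 2.14b1 MPI smp (reported above)
analysis   = 18.18   # s (reported above)
total      = 68.48   # s, reported total time

# the two components account for the reported total
assert abs((simulation + analysis) - total) < 0.01
```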
Two flavors of NAMD were used in this evaluation: MPI smp and PAMI smp. It is not clear which flavor should be used for the MDFF-EnTK workflow; currently it uses MPI smp.
Following the recommendations from [1], different numbers of MPI processes and threads were tested. However, the only combination that outperformed the current 40 MPI processes with 4 threads each was 20 MPI processes with 8 threads each, which reduced the total simulation runtime from 50.3 s to 42.52 s.
First, we performed an unguided performance evaluation, increasing the number of threads by powers of 2 to verify "natural" speed-ups. Fig. 1 reports timings for 4, 8, and 16 threads; there is a clear trend indicating that adding more threads reduced the simulation runtime. However, using 32 or more threads (still powers of two) sometimes halted the workflow between this task (task 6) and task 7. Notice that with only 16 threads, the simulation runtime is about 2x slower than the best-performing NAMD MPI smp run using 160 cores.
Figure 1. Simulation timings using NAMD MPI smp version
Following the Stage 2 and Stage 3 recommendations [1], we then evaluated the simulation runtime using 4, 7, 14, 28, and 56 threads. Fig. 2 reports timings under these configurations; in particular, 56 threads was the best-performing configuration across the MPI smp and PAMI smp tests.
Figure 2. Simulation timings using NAMD PAMI smp version
As a side note, with 112 threads the average runtime only went from 37.82 s (56 threads) to 31.34 s, far from the ideal 2x speed-up, which indicates we are hitting a scalability limit for the size of this molecular system. In addition, when using 112 threads we also suffered from halting between this task and task 7.
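A quick back-of-the-envelope check of this scaling observation, using the averaged runtimes reported in this thread (a sketch; variable names are illustrative):

```python
def speedup(t_base, t_new):
    """Observed speedup when runtime drops from t_base to t_new (seconds)."""
    return t_base / t_new

t56  = 37.82   # average runtime with 56 threads (reported above)
t112 = 31.34   # average runtime with 112 threads (reported above)

observed   = speedup(t56, t112)   # ~1.2x
ideal      = 112 / 56             # 2.0x under perfect strong scaling
efficiency = observed / ideal     # ~0.6, well below ideal
```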
Preliminary comments:
In its current version (1.4.0), RADICAL-SAGA sets the SMT (Simultaneous Multithreading) level flag to 4, and the CPU IDs used to generate Summit's ERF do not consider a stride different from one. A stride is needed when the SMT level differs from four. Further examination is required to allow more flexibility when allocating resources.
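For illustration only, a strided ERF entry for SMT=1 might look like the sketch below, where only every fourth hardware thread is listed so that ranks land on distinct physical cores. The exact grammar and indexing mode (cpu_index_using) should be verified against the jsrun/ERF documentation before use:

```
cpu_index_using: logical
rank: 0: { host: 1; cpu: {0,4,8,12} }
rank: 1: { host: 1; cpu: {16,20,24,28} }
```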
Further details in:
namd-entk.slack.com
VMD was configured to use 1 MPI process with 4, 8, and 16 threads to evaluate whether the current analysis tasks take advantage of the CPU's cores. In particular, tasks 1 to 5 and tasks 7 to 9 are dedicated to analysis.
Figure 3. Analysis timings
Fig. 3 shows that the tasks' runtimes did not improve when going from 4 to 8 or 16 threads. Moreover, despite explicitly limiting the resources assigned to VMD with the ERF file configuration, e.g.
cpu_index_using: physical
rank: 0: { host: 1; cpu: {0,1,2,3,4,5,6,7}}
VMD was still able to detect 176 available CPU cores:
...
Info) Initializing parallel VMD instances via MPI...
Info) Found 1 VMD MPI node containing a total of 176 CPUs and 0 GPUs:
Info) 0: 176 CPUs, 572.3GB (94%) free mem, 0 GPUs, Name: f11n05
Info) No CUDA accelerator devices available.
Info) Using plugin pdb for structure file 4ake-target.pdb
Info) Using plugin pdb for coordinates from file 4ake-target.pdb
Info) Determining bond structure from distance search ...
...
this could mean that VMD is using all CPU cores and that EnTK's resource allocation is being overridden, which would explain why Fig. 3 shows the same performance in all cases.
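One thing worth testing is whether VMD's CPU-count override changes this behavior. The sketch below assumes the VMDFORCECPUCOUNT environment variable (documented for VMD) is honored by the installed build; the launch command in the comment is illustrative, not the workflow's actual command line:

```python
import os

def vmd_env(ncpus):
    """Environment for launching VMD limited to `ncpus` CPUs.

    Assumes VMD honors the VMDFORCECPUCOUNT override; verify against
    the VMD documentation for the installed version.
    """
    env = os.environ.copy()
    env["VMDFORCECPUCOUNT"] = str(ncpus)
    return env

# Illustrative launch (not the workflow's actual command):
# import subprocess
# subprocess.run(["vmd", "-dispdev", "text", "-e", "analysis.tcl"],
#                env=vmd_env(4), check=True)
```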
Total time: 56.00 s
Simulation runtime: 37.83 s (NAMD 2.14b2 PAMI smp - 56 threads)
Analysis runtime: 18.18 s
Longest analysis task: Task 1, 4.77 s
@benjha , just to share an update on the SMT level, NVMe, and Spectral: we are working on these now. SMT support is now available on the devel branch, with this PR https://github.com/radical-cybertools/radical.pilot/pull/2198 merged.
NVMe will be added as an option in the resource config, like SMT.
For the smt1 and NVMe options, radical.pilot needs a configuration entry such as:
"system_architecture": {"smt"    : 1,
                        "options": ["gpumps", "NVME"]}
I am closing this for now because of inactivity, but I am willing to re-open it.
Performance Evaluation Plan
The performance evaluation plan consists of verifying the behavior of the MDFF-EnTK workflow on Summit. It will include performance evaluation at the application level and at the RADICAL tools level.
The current workflow executes NAMD and VMD on the same node.
Application Level
General Approach
Calculate baseline timings with the current resource allocation used for NAMD and VMD. Timings will include the total execution time of both programs.
NAMD's baseline resource configuration is 40 MPI processes with 4 threads each. VMD's baseline resource configuration is 1 MPI process with 4 threads. However, its output shows:
CPU resource configuration
Modify the current resource allocation for NAMD and VMD to verify performance. NAMD-SMP supports different combinations of MPI processes and threads. This evaluation will consist of:
Evaluate the impact of MPI when there are few MPI processes with many threads vs. many MPI processes with few threads.
Evaluate socket affinity.
A wrong combination of the above will lead to resource contention.
By default, SMPI (Spectrum MPI) is configured for minimum latency. Verify performance when configuring SMPI for maximum bandwidth. (No change in performance is expected since NAMD uses one node.)
https://www.ks.uiuc.edu/Research/vmd/mailing_list/vmd-l/28900.html https://www.ks.uiuc.edu/Research/vmd/current/ug/node144.html
Some functionality uses multithreading. When several MPI processes are used, Tcl's parallel command should be used. Verify that the cpus flag, set to 4 under the analysis section of workflow_cfg.yml, is limiting VMD to 4 threads; according to the stdout of each unit.*, this may not be happening.
I/O
NAMD and VMD communicate with each other via files. At large scale this will be problematic because many files will be written to and read from Alpine (GPFS). The plan is to evaluate Summit's NVMe for temporary storage. Using NVMe paired with OLCF's Spectral will help automatically flush the NVMe contents to Alpine when the job finishes.
Current behavior
Modify your job submission script to include the -alloc_flags NVME bsub option. Then, on each reserved Burst Buffer node, a directory called /mnt/bb/$USER will be available.
https://www.olcf.ornl.gov/spectral-library/
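Put together, a job script fragment might look like the sketch below. Only -alloc_flags NVME and the /mnt/bb/$USER mount point come from the OLCF documentation referenced above; the rest (the SCRATCH variable, the flush behavior note) is illustrative:

```
#BSUB -alloc_flags NVME

# per-node NVMe mount point made available by the allocation flag
SCRATCH=/mnt/bb/$USER

# point NAMD/VMD intermediate files at node-local storage; with
# Spectral enabled, contents are flushed to Alpine when the job ends
```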
Misc
Are we using GPUs?
EnTK Level
It will be based on:
https://github.com/radical-cybertools/radical.entk/blob/5710b11463981244ca454939d2cfd03e2369cf47/docs/user_guide/profiling.rst
Can Radical Analytics be used to run the evaluation described above?