radical-collaboration / MDFF-EnTK

MDFF-EnTK: Scalable Adaptive Protein Ensemble Refinement Integrating Flexible Fitting

Update on NAMD-EnTK pipeline; no `mdff check` in analysis, and N times simulation #11

Closed: lee212 closed this issue 4 years ago

lee212 commented 4 years ago
benjha commented 4 years ago

Hi @lee212

Any progress on these tasks?

daipayans commented 4 years ago

Hi @benjha @lee212, task 1 is independent of task 2, so the 5-times run should work regardless.

daipayans commented 4 years ago

Hi @lee212 @benjha: I made changes to simple_mdff.py and removed all instances of the 'mdff check' command.

@lee212: Please repeat the run that we did earlier on XSEDE Bridges, and then we can perform task 2 (running N=5 times). Let me know if there are any issues.

lee212 commented 4 years ago

@daipayans, thanks for the change. I submitted the job to XSEDE Bridges.

[hrlee@login018 ~]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8111005        RM pilot.00    hrlee PD       0:00      2 (Priority)

I will increase N once I have the first result.

@benjha, I am about to make a copy of Daipayan's version for Summit; would you like to run it? I will push the copy ASAP.

benjha commented 4 years ago

@lee212 Yes, Summit has been too busy these two weeks, but we can give it a try.

benjha commented 4 years ago

@lee212

Can you push the Summit version? Thanks.

daipayans commented 4 years ago

@lee212 Is there anything we can do here apart from the submission scripts for Summit? Let me know.

lee212 commented 4 years ago

@benjha, I pushed the Summit version here: https://github.com/radical-collaboration/MDFF-Error/blob/master/simple_mdff.summit.py. It contains `module load fftw` and the recent changes that Daipayan made to the analysis steps.

@daipayans , I pushed output files from Bridges to here: https://github.com/radical-collaboration/MDFF-Error/tree/master/experiments/bridges/simple_mdff_1st, including the last task run.
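
For context, a minimal sketch of how a Summit-specific setup such as `module load fftw` is typically attached to an EnTK task through its pre_exec list; everything besides the fftw module is an assumption, and the authoritative version is simple_mdff.summit.py in the repo.

# Hedged sketch (not the repo script): attaching Summit environment setup to an EnTK task.
from radical.entk import Task

sim_task = Task()
sim_task.name = 'mdff-namd-simulation'
sim_task.pre_exec = [
    'module load fftw',              # the Summit-specific addition mentioned above
    # any other Summit modules (e.g. a NAMD or VMD module) would be loaded here too
]
sim_task.executable = 'namd2'
sim_task.arguments = ['adk-step1.namd']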

daipayans commented 4 years ago

@lee212 I checked the STDOUT for the tenth stage and there is an issue. The EnTK task is asking VMD to load the incorrect file (for instance, the stdout shows "/usr/tmp/mdff_sim8071626726571297.dx" instead of "4ake-target_autopsf-grid.dx", which is generated in the second stage). There are many such instances when you check the stdout for this stage.

Since the griddx file is incorrect, it cannot calculate the cross-correlation coefficient between the density map and the structure. The Tcl commands are fine; I checked them. This looks to me like an EnTK issue, and perhaps there is an easy workaround.

As usual, let's troubleshoot this and then run the experiment again. Thanks!

lee212 commented 4 years ago

@daipayans, I manually tested the Tcl script with VMD, and the temporary file name seems correct; it is likely generated as a copy of 4ake-target_autopsf-grid.dx. See my interactive session here:

$ vmd
Info) VMD for LINUXAMD64, version 1.9.2 (December 29, 2014)
Info) http://www.ks.uiuc.edu/Research/vmd/
Info) Email questions and bug reports to vmd@ks.uiuc.edu
Info) Please include this reference in published work using VMD:
Info)    Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
Info)    Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
Info) -------------------------------------------------------------
Info) Multithreading available, 28 CPUs detected.
Info) Free system memory: 88238MB (68%)
Warning) Detected a mismatch between CUDA runtime and GPU driver
Warning) Check to make sure that GPU drivers are up to date.
Info) No CUDA accelerator devices available.
Info) Dynamically loaded 2 plugins in directory:
Info) /opt/packages/vmd1.9.2/lib/plugins/LINUXAMD64/molfile
vmd > mol new 1ake-initial_autopsf.psf
psfplugin) Detected a Charmm PSF file
Info) Using plugin psf for structure file 1ake-initial_autopsf.psf
Info) Analyzing structure ...
Info)    Atoms: 3341
Info)    Bonds: 3365
Info)    Angles: 6123  Dihedrals: 8921  Impropers: 541  Cross-terms: 212
Info)    Bondtypes: 0  Angletypes: 0  Dihedraltypes: 0  Impropertypes: 0
Info)    Residues: 214
Info)    Waters: 0
Info)    Segments: 1
Info)    Fragments: 1   Protein: 1   Nucleic: 0
0
vmd > mol addfile adk-step1.dcd waitfor all
dcdplugin) detected standard 32-bit DCD file of native endianness
dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later)
Info) Using plugin dcd for coordinates from file adk-step1.dcd
Info) Finished with coordinate file adk-step1.dcd.
0
vmd > package require mdff
0.4
vmd > set selall [atomselect 0 "all"]
atomselect0
vmd > $selall frame 0
vmd > mdff ccc $selall -i 4ake-target_autopsf-grid.dx -res 5
Warning: guessing atomic number for atoms with unknown element...
Info) volmap: Computing bounding box coordinates
Info) volmap: grid minmax = {-1 22 0} {52 71 52}
Info) volmap: grid size = 53x49x52 (0.5 MB)
Info) volmap: writing file "/usr/tmp/mdff_sim026787686174170898.dx".
MAP <- "/usr/tmp/mdff_sim026787686174170898.dx"
MAP :: gaussian blur (sigma = 2.5, regular)
MAP :: pad by x:9 9 y:9 9 z:9 9
MAP -> "/usr/tmp/mdff_corr9391486108951032.dx"

Stats for MAP:
  WEIGHT:    1
  AVERAGE:   0.0546307
  STDEV:     0.161184
  MIN:       0
  MAX:       1.13487
  PMF_AVG:   0.0436423

-nan
vmd > ^Z
[1]+  Stopped                 vmd

I believe this is normal behavior and I assume the griddx file is loaded correctly. Can you confirm?

daipayans commented 4 years ago

@lee212 thanks for looking into this in detail. I will review the individual files on my end over the weekend. The issue is the generation of '-nan', which is occurring because the structure is not docked inside the density. Also, the root mean square deviation (RMSD) value is very poor (which is another indication). This happens because the rigid docking step was commented out/removed.

There are several ways to fix this issue (from easy to hard):

  1. The easiest in terms of the EnTK pipeline would be to pre-dock the structure locally.

  2. Alternatively, we could define the following VMD Tcl command as an intermediate task stage: 'voltool fit $sel -res 5 -i 4ake-target_autopsf.dx' (see the sketch at the end of this comment). This would initially dock the structure inside the density. The rest of the pipeline and scripts remain the same. Note, however, that for this you will need to upgrade the VMD module on Bridges.

  3. Finally, to follow a traditional approach, we would need to install rigid-body docking software (not developed by the NAMD/VMD team). Here, a naive user could run into problems installing the packages correctly.

I propose option 1, as it is fair to expect this stage from future users of this integrated pipeline. Let me know what everyone thinks, and then we can proceed accordingly.
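
Following up on option 2, here is a minimal sketch of what such an intermediate docking stage could look like in the EnTK pipeline. It assumes the VMD build on the target machine provides voltool, that an initial PDB (1ake-initial_autopsf.pdb) exists alongside the PSF, and that dock.tcl is a small hypothetical wrapper script staged with the task; the file names are taken from this thread.

# Hedged sketch of option 2: an intermediate EnTK stage that rigid-docks the
# structure into the density before the flexible fitting.
#
# dock.tcl (hypothetical wrapper, staged alongside the task) would contain:
#   mol new 1ake-initial_autopsf.psf
#   mol addfile 1ake-initial_autopsf.pdb
#   set sel [atomselect top "all"]
#   voltool fit $sel -res 5 -i 4ake-target_autopsf.dx
#   $sel writepdb 1ake-initial_autopsf-docked.pdb
#   quit
from radical.entk import Stage, Task

dock_stage = Stage()
dock_stage.name = 'Rigid docking into the target density'

dock_task = Task()
dock_task.executable = 'vmd'
dock_task.arguments = ['-dispdev', 'text', '-e', 'dock.tcl']
dock_task.upload_input_data = ['dock.tcl']   # assumes dock.tcl sits next to the pipeline script
dock_stage.add_tasks(dock_task)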

benjha commented 4 years ago

Hi @daipayans @lee212, this is the output I got from running the simple_mdff.summit.py script on Summit. Note that tasks 6 to 9 failed.

$ python simple_mdff.summit.py  --resource ornl_summit
simple_mdff.summit.py:348: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  resource_cfg = yaml.load(fp)
simple_mdff.summit.py:352: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  workflow_cfg = yaml.load(fp)
EnTK session: re.session.login4.benjha.018355.0000
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login4.benjha.018355.0000]                            \
database   : [mongodb://rct:rct_test@two.radical-project.org/rct_test]        ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [ornl.summit:168]
                                                                              ok
All components created
Update: simple-mdff state: SCHEDULING
Update: simple-mdff.Generating a simulated density map state: SCHEDULING
Update: simple-mdff.Generating a simulated density map.Starting to load the target PDB state: SCHEDULING
Update: simple-mdff.Generating a simulated density map.Starting to load the target PDB state: SCHEDULED
Update: simple-mdff.Generating a simulated density map state: SCHEDULED
create unit manager/gpfs/alpine/world-shared/bip115/radical_tools_python/lib/python3.7/site-packages/pymongo/topology.py:155: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
                                                           ok
add 1 pilot(s)                                                                ok
Update: simple-mdff.Generating a simulated density map.Starting to load the target PDB state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Generating a simulated density map.Starting to load the target PDB state: EXECUTED
Update: simple-mdff.Generating a simulated density map.Starting to load the target PDB state: DONE
Update: simple-mdff.Generating a simulated density map state: DONE
Update: simple-mdff.Converting the density map to an MDFF potential state: SCHEDULING
Update: simple-mdff.Converting the density map to an MDFF potential.generate dx file state: SCHEDULING
Update: simple-mdff.Converting the density map to an MDFF potential.generate dx file state: SCHEDULED
Update: simple-mdff.Converting the density map to an MDFF potential state: SCHEDULED
Update: simple-mdff.Converting the density map to an MDFF potential.generate dx file state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Converting the density map to an MDFF potential.generate dx file state: EXECUTED
Update: simple-mdff.Converting the density map to an MDFF potential.generate dx file state: DONE
Update: simple-mdff.Converting the density map to an MDFF potential state: DONE
Update: simple-mdff.Preparing the initial structure state: SCHEDULING
Update: simple-mdff.Preparing the initial structure.Starting to load the initial structure state: SCHEDULING
Update: simple-mdff.Preparing the initial structure.Starting to load the initial structure state: SCHEDULED
Update: simple-mdff.Preparing the initial structure state: SCHEDULED
Update: simple-mdff.Preparing the initial structure.Starting to load the initial structure state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Preparing the initial structure.Starting to load the initial structure state: EXECUTED
Update: simple-mdff.Preparing the initial structure.Starting to load the initial structure state: DONE
Update: simple-mdff.Preparing the initial structure state: DONE
Update: simple-mdff.Defining secondary structure restraints state: SCHEDULING
Update: simple-mdff.Defining secondary structure restraints.task.0003 state: SCHEDULING
Update: simple-mdff.Defining secondary structure restraints.task.0003 state: SCHEDULED
Update: simple-mdff.Defining secondary structure restraints state: SCHEDULED
Update: simple-mdff.Defining secondary structure restraints.task.0003 state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Defining secondary structure restraints.task.0003 state: EXECUTED
Update: simple-mdff.Defining secondary structure restraints.task.0003 state: DONE
Update: simple-mdff.Defining secondary structure restraints state: DONE
Update: simple-mdff.cispeptide and chirality restraints state: SCHEDULING
Update: simple-mdff.cispeptide and chirality restraints.task.0004 state: SCHEDULING
Update: simple-mdff.cispeptide and chirality restraints.task.0004 state: SCHEDULED
Update: simple-mdff.cispeptide and chirality restraints state: SCHEDULED
Update: simple-mdff.cispeptide and chirality restraints.task.0004 state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.cispeptide and chirality restraints.task.0004 state: EXECUTED
Update: simple-mdff.cispeptide and chirality restraints.task.0004 state: DONE
Update: simple-mdff.cispeptide and chirality restraints state: DONE
Update: simple-mdff.Running the MDFF simulation with NAMD state: SCHEDULING
Update: simple-mdff.Running the MDFF simulation with NAMD.task.0005 state: SCHEDULING
Update: simple-mdff.Running the MDFF simulation with NAMD.task.0005 state: SCHEDULED
Update: simple-mdff.Running the MDFF simulation with NAMD state: SCHEDULED
Update: simple-mdff.Running the MDFF simulation with NAMD.task.0005 state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Running the MDFF simulation with NAMD.task.0005 state: EXECUTED
Update: simple-mdff.Running the MDFF simulation with NAMD.task.0005 state: DONE
Update: simple-mdff.Running the MDFF simulation with NAMD state: DONE
Update: simple-mdff.NAMD simulation state: SCHEDULING
Update: simple-mdff.NAMD simulation.task.0006 state: SCHEDULING
Update: simple-mdff.NAMD simulation.task.0006 state: SCHEDULED
Update: simple-mdff.NAMD simulation state: SCHEDULED
Update: simple-mdff.NAMD simulation.task.0006 state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.NAMD simulation.task.0006 state: EXECUTED
Update: simple-mdff.NAMD simulation.task.0006 state: FAILED
Update: simple-mdff.NAMD simulation state: DONE
Update: simple-mdff.Calculating the root mean square deviation state: SCHEDULING
Update: simple-mdff.Calculating the root mean square deviation.task.0007 state: SCHEDULING
Update: simple-mdff.Calculating the root mean square deviation.task.0007 state: SCHEDULED
Update: simple-mdff.Calculating the root mean square deviation state: SCHEDULED
Update: simple-mdff.Calculating the root mean square deviation.task.0007 state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Calculating the root mean square deviation.task.0007 state: EXECUTED
Update: simple-mdff.Calculating the root mean square deviation.task.0007 state: FAILED
Update: simple-mdff.Calculating the root mean square deviation state: DONE
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms state: SCHEDULING
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms.task.0008 state: SCHEDULING
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms.task.0008 state: SCHEDULED
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms state: SCHEDULED
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms.task.0008 state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms.task.0008 state: EXECUTED
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms.task.0008 state: FAILED
Update: simple-mdff.Calculating the root mean square deviation for backbone atoms state: DONE
Update: simple-mdff.Calculating the cross-correlation coefficient state: SCHEDULING
Update: simple-mdff.Calculating the cross-correlation coefficient.task.0009 state: SCHEDULING
Update: simple-mdff.Calculating the cross-correlation coefficient.task.0009 state: SCHEDULED
Update: simple-mdff.Calculating the cross-correlation coefficient state: SCHEDULED
Update: simple-mdff.Calculating the cross-correlation coefficient.task.0009 state: SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Update: simple-mdff.Calculating the cross-correlation coefficient.task.0009 state: EXECUTED
Update: simple-mdff.Calculating the cross-correlation coefficient.task.0009 state: FAILED
Update: simple-mdff.Calculating the cross-correlation coefficient state: DONE
Update: simple-mdff state: DONE
close unit manager                                                            ok
wait for 1 pilot(s)
              0                                                               ok
closing session re.session.login4.benjha.018355.0000                           \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                          timeout
                                                                              ok
session lifetime: 373.8s                                                      ok
All components terminated

Feel free to consult outputs in

/gpfs/alpine/world-shared/bip115/enTK_tests/mdff_hyungro/3April2020

lee212 commented 4 years ago

@benjha, thanks for the report. I will try to replicate it and check the directory you pointed to. Could you also make a copy of the sandbox of this run? For example:

mkdir /gpfs/alpine/world-shared/bip115/enTK_tests/mdff_hyungro/3April2020/radical.pilot.sandbox
cp -pr $MEMBERWORK/bip115/radical.pilot.sandbox/re.session.login4.benjha.018355.0000 /gpfs/alpine/world-shared/bip115/enTK_tests/mdff_hyungro/3April2020/radical.pilot.sandbox

benjha commented 4 years ago

@lee212 Sandbox ready

lee212 commented 4 years ago

@benjha, the job failed at stage 6 because an input file that depends on the previous stage is missing: the expected file name was par_all27_prot_lipid_na.inp but par_all36_prot.prm was received. I echo the exact line here: https://github.com/radical-collaboration/MDFF-Error/blob/736ebea7ade65705dc62f7a2bb84ddc49c73bb77/simple_mdff.summit.py#L220

@daipayans, I would simply replace the filename so that all stages complete without failing, but if you find any incorrect Tcl command lines, please revise them accordingly. I can run a test as a sanity check for any changes.
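
To make the proposed fix concrete, a minimal sketch of the filename swap on the stage-6 NAMD task. Whether the script stages the parameter file via link_input_data from a $SHARED location (and how the NAMD config references it) is an assumption on my part, so check the linked line in simple_mdff.summit.py for the real mechanism.

# Hedged sketch of the filename swap (illustrative only; the real line is in simple_mdff.summit.py).
from radical.entk import Task

task6 = Task()                          # the stage-6 NAMD simulation task
task6.executable = 'namd2'
task6.arguments = ['adk-step1.namd']    # the NAMD config must reference the same parameter file name
# Before: the task expected the older CHARMM parameter file
#   task6.link_input_data = ['$SHARED/par_all27_prot_lipid_na.inp']
# After: stage the file that the newer VMD/autopsf actually produces
task6.link_input_data = ['$SHARED/par_all36_prot.prm']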

lee212 commented 4 years ago

@daipayans , ping

daipayans commented 4 years ago

@benjha @lee212 apologies for the late response.

@benjha Regarding "par_all27_prot_lipid_na.inp expected but par_all36_prot.prm received": good point, and this is important for everyone. If you are using the latest build of VMD, it will produce the latest version of the CHARMM force field, par_all36_prot.prm. Initially, @lee212 used the input files distributed with the simple MDFF tutorial, which has not been updated; that explains the error you are receiving. In the tutorial, par_all27_prot_lipid_na.inp is the older version of the CHARMM force field.

@lee212 @benjha Overall, let's update all modules to the latest build of NAMD and VMD. Let me know.

benjha commented 4 years ago

@daipayans For Summit, the team decided to use VMD 1.9.3 because of the voltool problem in the 1.9.4 beta.

So, should simple_mdff.summit.py use par_all27_prot_lipid_na.inp?

daipayans commented 4 years ago

@benjha I recommend updating par_all27_prot_lipid_na.inp to par_all36_prot.prm in the simple_mdff.summit.py script.

benjha commented 4 years ago

@lee212 With the filename update, par_all36_prot.prm does exist in unit.00005 and unit.00006, but now unit.00006 fails with the following stdout:

$ pwd
/gpfs/alpine/world-shared/bip115/enTK_tests/mdff_hyungro/3April2020/radical.pilot.sandbox/re.session.login4.benjha.018360.0000/pilot.0000/unit.000006
(radical_tools_python) [benjha@login4.summit unit.000006]$ cat STDOUT 
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 40 processes, 160 worker threads (PEs) + 1 comm threads per process, 6400 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.1-0-gcc60a79-namd-charm-6.10.1-build-2020-Mar-05-18422
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 1 cores x 4 PUs = 4-way SMP)
Charm++> cpu topology info is gathered in 0.890 seconds.

Charm++> Warning: the number of SMP threads (6440) is greater than the number of physical cores (176), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.

[0] Stack Traceback:
  [0:0] namd2 0x112f61e4 CmiAbort
  [0:1] namd2 0x11325b70 CmiCheckAffinity()
  [0:2] namd2 0x1113729c _initCharm(int, char**)
  [0:3] namd2 0x10316ea4 master_init(int, char**)
  [0:4] namd2 0x10316968 slave_init(int, char**)
  [0:5] namd2 0x112fbe18 
  [0:6] namd2 0x112fe7dc 
  [0:7] libpthread.so.0 0x2000001f8b94 
  [0:8] libc.so.6 0x2000009785f4 clone

and stderr:

cat STDERR 
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Multiple PEs assigned to same core. Set affinity options to correct or lower the number of threads, or pass +setcpuaffinity to ignore.

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Somehow EnTK sets 6440 SMP threads for unit.000006.

The sandbox of this execution is in

/gpfs/alpine/world-shared/bip115/enTK_tests/mdff_hyungro/3April2020/radical.pilot.sandbox/re.session.login4.benjha.018360.0000

To fix the problem, line 209 of simple_mdff.summit.py should be

task7.arguments = ['+ppn', summit_hw_thread_cnt, 'adk-step1.namd']

instead of

task7.arguments = ['+ppn', sim_cpus, 'adk-step1.namd']
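
For reference, a rough reading of where the 6440 figure in the Charm++ warning comes from; this is my interpretation of the log above, assuming sim_cpus evaluated to 160.

# Hedged arithmetic behind the "6440 SMP threads" warning:
ranks            = 40    # processes reported by Charm++ for unit.000006
ppn              = 160   # worker threads per process when sim_cpus was passed to +ppn
comm_per_process = 1     # Charm++ adds one communication thread per process in SMP mode
total_threads = ranks * (ppn + comm_per_process)
print(total_threads)     # 6440, far more than the 176 hardware threads Charm++ detected
# Passing the per-rank hardware-thread count (summit_hw_thread_cnt) to +ppn instead
# keeps the worker-thread count aligned with the cores listed in the ERF.
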
benjha commented 4 years ago

@daipayans @lee212 The script is now running on Summit. We are ready to start the second task of this ticket.

The working version of simple_mdff.summit.py can be found in

/gpfs/alpine/world-shared/bip115/enTK_tests/mdff_hyungro/3April2020/simple_mdff.summit.py

Please update the repo accordingly.

daipayans commented 4 years ago

@benjha Sounds good! The error you posted earlier is related to how the NAMD module is built on Summit. But I assume that it has been fixed now.

lee212 commented 4 years ago

@benjha, thanks for the comment on the ppn counts. I updated the repo with the change. One question: how do I assign N cores to NAMD with jsrun? If it is similar to mpirun, then I would specify N like jsrun -np N namd +ppn 4 adk-step1.namd, and for 1 node, N would be 42 on Summit. Am I missing something?

benjha commented 4 years ago

Hi @lee212,

jsrun is not similar to mpirun, and a regular jsrun is different from a jsrun using ERF files. Let's see an example:

Let NODES be 64; then this call will allocate 32 tasks per node across 64 nodes using mpirun (note that Summit nodes have 2 sockets):

mpirun -n $(($NODES*32)) --npersocket 16 --bind-to core

which is equivalent to:

jsrun --nrs $(($NODES*2)) --rs_per_host 2 --tasks_per_rs 16 --cpu_per_rs 21

For Summit, I noted that EnTK uses ERF files, so the ERF file we have for unit.000006,

/gpfs/alpine/world-shared/bip115/enTK_tests/mdff/bak/3April2020/radical.pilot.sandbox/re.session.login4.benjha.018360.0000/pilot.0000/unit.000006/unit.000006.rs 

implies N is 40 and +ppn is 4.

A different ERF may allow +ppn to be larger; in the following case, N will be 21 and +ppn will be 8:

rank: 0: { host: 1; cpu: {0,1,2,3,4,5,6,7}}
rank: 1: { host: 1; cpu: {8,9,10,11,12,13,14,15}}
...
rank 20: {host: 1; cpu:{156,157,158,159,160,161,162,163}}
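
In case it helps when scripting different layouts, a minimal sketch that generates an ERF in the format shown above (21 ranks of 8 hardware threads on one host). The output file name and the straightforward thread enumeration are assumptions; the real core map (and which threads are reserved) should follow the Summit/jsrun documentation and the ERF that the pilot actually writes.

# Hedged sketch: writing an ERF like the example above (21 ranks x 8 HW threads, one host).
ranks, cpus_per_rank = 21, 8
lines = []
for rank in range(ranks):
    first = rank * cpus_per_rank
    cpus = ','.join(str(c) for c in range(first, first + cpus_per_rank))
    lines.append('rank: %d: { host: 1; cpu: {%s}}' % (rank, cpus))
with open('unit.erf', 'w') as fh:        # file name is illustrative
    fh.write('\n'.join(lines) + '\n')
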
lee212 commented 4 years ago

Thanks @benjha, this is great. Now I have a better understanding of jsrun with NAMD. I made a commit with your updated script, so the script will have +ppn 4, as the ERF file defines 4 threads per rank.

I am submitting a couple of test runs with 4, 8, and 16 nodes...

lee212 commented 4 years ago

We completed the simulations with 5 replicas; the experiment results are available at https://github.com/radical-collaboration/MDFF-EnTK/tree/master/experiments/summit/simple_mdff_final_5_replica. I consider the goals of this ticket complete and propose to close it.