Closed mturilli closed 10 years ago
This seems to say a different story but I confirm that looking within the sandbox, all the CUs appear to have been executed successfully.
From the plot, it seems that the cancellation of the pilot and the failed units are unrelated: units get completed successfully after some fail. Also, the units don't enter the TRANSFERRING_OUTPUT
state, so I don't think its a staging problem -- they fail before.
Would you mind sending the STDOUT/STDERR for unit-54012dc520a641558d89aa07
? Also, could you do a grep -C 5 54012dc520a641558d89aa07 AGENT*
to see what happened in the vicinity of that unit?
Thanks!
$ cat radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/STDERR
:-) G R O M A C S (-:
Grunge ROck MAChoS
:-) VERSION 4.6.5 (-:
Contributions from Mark Abraham, Emile Apol, Rossen Apostolov,
Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,
Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans,
Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,
Berk Hess, David van der Spoel, and Erik Lindahl.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2012,2013, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.
:-) mdrun_mpi (-:
Option Filename Type Description
------------------------------------------------------------
-s topol.tpr Input Run input file: tpr tpb tpa
-o traj.trr Output Full precision trajectory: trr trj cpt
-x traj.xtc Output, Opt. Compressed trajectory (portable xdr format)
-cpi state.cpt Input, Opt. Checkpoint file
-cpo state.cpt Output, Opt. Checkpoint file
-c confout.gro Output Structure file: gro g96 pdb etc.
-e ener.edr Output Energy file
-g md.log Output Log file
-dhdl dhdl.xvg Output, Opt. xvgr/xmgr file
-field field.xvg Output, Opt. xvgr/xmgr file
-table table.xvg Input, Opt. xvgr/xmgr file
-tabletf tabletf.xvg Input, Opt. xvgr/xmgr file
-tablep tablep.xvg Input, Opt. xvgr/xmgr file
-tableb table.xvg Input, Opt. xvgr/xmgr file
-rerun rerun.xtc Input, Opt. Trajectory: xtc trr trj gro g96 pdb cpt
-tpi tpi.xvg Output, Opt. xvgr/xmgr file
-tpid tpidist.xvg Output, Opt. xvgr/xmgr file
-ei sam.edi Input, Opt. ED sampling input
-eo edsam.xvg Output, Opt. xvgr/xmgr file
-j wham.gct Input, Opt. General coupling stuff
-jo bam.gct Output, Opt. General coupling stuff
-ffout gct.xvg Output, Opt. xvgr/xmgr file
-devout deviatie.xvg Output, Opt. xvgr/xmgr file
-runav runaver.xvg Output, Opt. xvgr/xmgr file
-px pullx.xvg Output, Opt. xvgr/xmgr file
-pf pullf.xvg Output, Opt. xvgr/xmgr file
-ro rotation.xvg Output, Opt. xvgr/xmgr file
-ra rotangles.log Output, Opt. Log file
-rs rotslabs.log Output, Opt. Log file
-rt rottorque.log Output, Opt. Log file
-mtx nm.mtx Output, Opt. Hessian matrix
-dn dipole.ndx Output, Opt. Index file
-multidir rundir Input, Opt., Mult. Run directory
-membed membed.dat Input, Opt. Generic data file
-mp membed.top Input, Opt. Topology file
-mn membed.ndx Input, Opt. Index file
Option Type Value Description
------------------------------------------------------
-[no]h bool no Print help info and quit
-[no]version bool no Print version info and quit
-nice int 0 Set the nicelevel
-deffnm string Set the default filename for all file options
-xvg enum xmgrace xvg plot formatting: xmgrace, xmgr or none
-[no]pd bool no Use particle decompostion
-dd vector 0 0 0 Domain decomposition grid, 0 is optimize
-ddorder enum interleave DD node order: interleave, pp_pme or cartesian
-npme int -1 Number of separate nodes to be used for PME, -1
is guess
-nt int 0 Total number of threads to start (0 is guess)
-ntmpi int 0 Number of thread-MPI threads to start (0 is guess)
-ntomp int 0 Number of OpenMP threads per MPI process/thread
to start (0 is guess)
-ntomp_pme int 0 Number of OpenMP threads per MPI process/thread
to start (0 is -ntomp)
-pin enum auto Fix threads (or processes) to specific cores:
auto, on or off
-pinoffset int 0 The starting logical core number for pinning to
cores; used to avoid pinning threads from
different mdrun instances to the same core
-pinstride int 0 Pinning distance in logical cores for threads,
use 0 to minimize the number of threads per
physical core
-gpu_id string List of GPU device id-s to use, specifies the
per-node PP rank to GPU mapping
-[no]ddcheck bool yes Check for all bonded interactions with DD
-rdd real 0 The maximum distance for bonded interactions with
DD (nm), 0 is determine from initial coordinates
-rcon real 0 Maximum distance for P-LINCS (nm), 0 is estimate
-dlb enum auto Dynamic load balancing (with DD): auto, no or yes
-dds real 0.8 Minimum allowed dlb scaling of the DD cell size
-gcom int -1 Global communication frequency
-nb enum auto Calculate non-bonded interactions on: auto, cpu,
gpu or gpu_cpu
-[no]tunepme bool yes Optimize PME load between PP/PME nodes or GPU/CPU
-[no]testverlet bool no Test the Verlet non-bonded scheme
-[no]v bool no Be loud and noisy
-[no]compact bool yes Write a compact log file
-[no]seppot bool no Write separate V and dVdl terms for each
interaction type and node to the log file(s)
-pforce real -1 Print all forces larger than this (kJ/mol nm)
-[no]reprod bool no Try to avoid optimizations that affect binary
reproducibility
-cpt real 15 Checkpoint interval (minutes)
-[no]cpnum bool no Keep and number checkpoint files
-[no]append bool yes Append to previous output files when continuing
from checkpoint instead of adding the simulation
part number to all file names
-nsteps step -2 Run this number of steps, overrides .mdp file
option
-maxh real -1 Terminate after 0.99 times this time (hours)
-multi int 0 Do multiple simulations in parallel
-replex int 0 Attempt replica exchange periodically with this
period (steps)
-nex int 0 Number of random exchanges to carry out each
exchange interval (N^3 is one suggestion). -nex
zero or not specified gives neighbor replica
exchange.
-reseed int -1 Seed for replica exchange, -1 is generate a seed
-[no]ionize bool no Do a simulation including the effect of an X-Ray
bombardment on your system
Reading file topol.tpr, VERSION 4.6.5 (single precision)
Using 1 MPI process
Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
starting mdrun 'Protein t= 0.00000'
100000 steps, 100.0 ps.
Writing final coordinates.
Core t (s) Wall t (s) (%)
Time: 5.100 635.951 0.8
(ns/day) (hour/ns)
Performance: 13.586 1.767
gcq#254: "Why, how now, Claudio ! Whence Comes this Restraint ?" (Lucio in Measure for measure, Act 1, Scene 4, William Shakespeare)
[mpiexec@gcn-7-54.sdsc.edu] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
[mpiexec@gcn-7-54.sdsc.edu] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
$ cat radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/STDOUT
$
$ grep -C 5 54012dc520a641558d89aa07 radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT*
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,446 - radical.pilot.agent - DEBUG - Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,446 - radical.pilot.agent - INFO - Node gcn-7-54 satisfies 1 cores at offset 14
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,448 - radical.pilot.agent - DEBUG - Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,449 - radical.pilot.agent - INFO - Launching task 54012dc520a641558d89aa31 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,491 - radical.pilot.agent - INFO - Found new tasks in pilot queue: 54012dc420a641558d89a9e4
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:22,627 - radical.pilot.agent - INFO - Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc520a641558d89aa07 to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,627 - radical.pilot.agent - INFO - Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:22,628 - radical.pilot.agent - INFO - Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,719 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-54 : X X X X X X X X X X X X X X X -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-55 : - - - - - - - - - - - - - - - -
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,537 - radical.pilot.agent - DEBUG - Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,538 - radical.pilot.agent - INFO - Launching task 54012dc420a641558d89a9e3 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,545 - radical.pilot.agent - INFO - Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc420a641558d89a9ee to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,546 - radical.pilot.agent - INFO - Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,546 - radical.pilot.agent - INFO - Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:23,670 - radical.pilot.agent - INFO - Found new tasks in pilot queue: 54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,806 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-54 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-55 : X X X - - - - - - - - - - - - -
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-62 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-61 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,277 - radical.pilot.agent - DEBUG - Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,278 - radical.pilot.agent - INFO - Node gcn-7-55 satisfies 1 cores at offset 12
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:26,279 - radical.pilot.agent - DEBUG - Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:26,280 - radical.pilot.agent - INFO - Launching task 54012dc520a641558d89aa07 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,499 - radical.pilot.agent - INFO - Found new tasks in pilot queue: 54012dc620a641558d89aa3d
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,531 - radical.pilot.agent - INFO - Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc320a641558d89a9c3/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,550 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57 : X X X X X X X X X X X X X X X X
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-62 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-61 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 19:07:22,014 - radical.pilot.agent - INFO - Task 54012dc420a641558d89a9ec terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 19:07:22,016 - radical.pilot.agent - INFO - Task 54012dc520a641558d89aa27 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 19:07:22,018 - radical.pilot.agent - INFO - Task 54012dc520a641558d89aa07 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 19:07:22,334 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-54 : - - - - - - - - - - - - - - X -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-55 : X - - - - - - - - - - - - X X X
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread radical.pilot.agent : [DEBUG ] Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread radical.pilot.agent : [INFO ] Node gcn-7-54 satisfies 1 cores at offset 14
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread radical.pilot.agent : [DEBUG ] Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread radical.pilot.agent : [INFO ] Launching task 54012dc520a641558d89aa31 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117347 Thread-1 radical.pilot.agent : [INFO ] Found new tasks in pilot queue: 54012dc420a641558d89a9e4
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:22 117353 MainThread radical.pilot.agent : [INFO ] Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc520a641558d89aa07 to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117353 MainThread radical.pilot.agent : [INFO ] Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:22 117353 MainThread radical.pilot.agent : [INFO ] Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread radical.pilot.agent : [DEBUG ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-54 : X X X X X X X X X X X X X X X -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-55 : - - - - - - - - - - - - - - - -
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117352 MainThread radical.pilot.agent : [DEBUG ] Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117352 MainThread radical.pilot.agent : [INFO ] Launching task 54012dc420a641558d89a9e3 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117353 MainThread radical.pilot.agent : [INFO ] Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc420a641558d89a9ee to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117353 MainThread radical.pilot.agent : [INFO ] Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117353 MainThread radical.pilot.agent : [INFO ] Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:23 117347 Thread-1 radical.pilot.agent : [INFO ] Found new tasks in pilot queue: 54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117352 MainThread radical.pilot.agent : [DEBUG ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-54 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-55 : X X X - - - - - - - - - - - - -
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-62 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-61 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117352 MainThread radical.pilot.agent : [DEBUG ] Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117352 MainThread radical.pilot.agent : [INFO ] Node gcn-7-55 satisfies 1 cores at offset 12
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:26 117352 MainThread radical.pilot.agent : [DEBUG ] Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:26 117352 MainThread radical.pilot.agent : [INFO ] Launching task 54012dc520a641558d89aa07 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117347 Thread-1 radical.pilot.agent : [INFO ] Found new tasks in pilot queue: 54012dc620a641558d89aa3d
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117353 MainThread radical.pilot.agent : [INFO ] Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc320a641558d89a9c3/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117352 MainThread radical.pilot.agent : [DEBUG ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57 : X X X X X X X X X X X X X X X X
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-62 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-61 : X X X X X X X X X X X X X X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 19:07:22 117352 MainThread radical.pilot.agent : [INFO ] Task 54012dc420a641558d89a9ec terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 19:07:22 117352 MainThread radical.pilot.agent : [INFO ] Task 54012dc520a641558d89aa27 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 19:07:22 117352 MainThread radical.pilot.agent : [INFO ] Task 54012dc520a641558d89aa07 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 19:07:22 117352 MainThread radical.pilot.agent : [DEBUG ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57 : - - - - - - - - - - - - - - - -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-54 : - - - - - - - - - - - - - - X -
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-55 : X - - - - - - - - - - - - X X X
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 19:07:22 117352 MainThread radical.pilot.agent : [INFO ] Task 54012dc520a641558d89aa07 terminated with return code 255.
Well, the unit in fact did fail, as far as the agent knows. The return code is the only information the agent has about success or failure.
As you observed though, the unit stdout/err seems to indicate that gromacs itself finished nicely -- but its mpirun (in which gromacs is contained) which returns the error code. At least that is what I would assume from these messages:
[mpiexec@gcn-7-54.sdsc.edu] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
[mpiexec@gcn-7-54.sdsc.edu] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Now, what causes that error I have no idea :( Google returns a number of hits for the error messages above -- but I don't immediately see how they are related to your gromacs execution. Most replies seem to indicate a problem with the MPI layer though.
Mark, Ole, any insight?
Does this mean that I need to run the test again? Or should I try to fix the issue?
Sorry, I misassigned the trestles and gordon tickets -- those were supposed to go to Mark. Fixed now.
There were a couple of issues at play here:
mpirun_rsh
, which is actually a distinct method (from mpirun
), which I now added to the agent (7bf24ae)mpirun_rsh
doesn't look for binaries on the path, so we need to specify the full path (added to kernels in https://github.com/radical-cybertools/radical.ensemblemd.mdkernels/commit/7610169f7ccbe5df6d0802906a89cf80386c59d9)Great, thank you! Successfully tested.
Great!
Good, I'll close this.
As a FYI: I noticed on Gordon that gromacs is pinning threads to cores and that seriously pessimizes the runtime of non-whole-node executions through RP.
Should we open a feature ticket for that?
Done in #334
Running 128 BoT with shared input and 5 output files for each CU with the backfilling scheduler but only on gordon.
In the verbose execution logs I see:
I am not sure what is the temporal line of the logging - if any - but looking at it linearly it would seem that the pilot has been teared down before all the output was staged out from most of the units. Consistently, upon inspection of the unit sanboxes on gordon, all the units completely successfully but only few of them (14) had their output files staged out.