radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html
Other
54 stars 23 forks source link

gromacs / mpi failing on gordon #309

Closed mturilli closed 10 years ago

mturilli commented 10 years ago

Running 128 BoT with shared input and 5 output files for each CU with the backfilling scheduler but only on gordon.

In the verbose execution logs I see:

2014:08:30 02:08:46 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc420a641558d89a9ef' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:46 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc520a641558d89aa02' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc520a641558d89aa15' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc420a641558d89a9c5' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc420a641558d89a9f0' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc320a641558d89a9c1' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc420a641558d89a9f2' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc420a641558d89a9cb' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc520a641558d89aa20' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:47 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc620a641558d89aa33' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc620a641558d89aa32' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc420a641558d89a9ed' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc520a641558d89a9fe' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '54012dc520a641558d89aa1b' state changed from 'StagingOutput' to 'Done'.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] Sent 'COMMAND_CANCEL_PILOT' command to all pilots.
2014:08:30 02:08:48 radical.pilot.MainProcess: [DEBUG   ] PilotManager.close(): PilotLauncherWorker-1 terminated.
2014:08:30 02:08:48 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-1[140428384925440]) for PilotManager 54012dbb20a641558d89a9be stopped.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] Closed PilotManager 54012dbb20a641558d89a9be.
2014:08:30 02:08:48 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): InputFileTransferWorker-1 terminated.
2014:08:30 02:08:48 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): InputFileTransferWorker-2 terminated.
2014:08:30 02:08:48 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): OutputFileTransferWorker-1 terminated.
2014:08:30 02:08:48 radical.pilot.MainProcess: [DEBUG   ] UnitManager.close(): OutputFileTransferWorker-2 terminated.
2014:08:30 02:08:48 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-3[140428344870656]) for UnitManager 54012dc320a641558d89a9c0 stopped.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] Closed UnitManager 54012dc320a641558d89a9c0.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] Closed Session 54012dba20a641558d89a9bd.
2014:08:30 02:08:48 radical.pilot.MainProcess: [WARNING ] PilotManager object already closed.
2014:08:30 02:08:48 radical.pilot.MainProcess: [WARNING ] UnitManager object already closed.
2014:08:30 02:08:48 radical.pilot.MainProcess: [INFO    ] Closed Session 54012dba20a641558d89a9bd.
Compute Unit '54012dc520a641558d89aa07' failed...
Compute Unit '54012dc520a641558d89aa1f' failed...
Compute Unit '54012dc420a641558d89a9e2' failed...
Compute Unit '54012dc420a641558d89a9ea' failed...
Compute Unit '54012dc420a641558d89a9eb' failed...
Compute Unit '54012dc520a641558d89aa12' failed...
Compute Unit '54012dc620a641558d89aa3d' failed...
Compute Unit '54012dc520a641558d89aa13' failed...
Compute Unit '54012dc620a641558d89aa39' failed...
Compute Unit '54012dc420a641558d89a9d1' failed...
Compute Unit '54012dc520a641558d89aa31' failed...
Compute Unit '54012dc520a641558d89aa08' failed...
Compute Unit '54012dc620a641558d89aa3c' failed...
Compute Unit '54012dc420a641558d89a9f3' failed...
Compute Unit '54012dc420a641558d89a9c6' failed...
Compute Unit '54012dc420a641558d89a9c9' failed...
Compute Unit '54012dc420a641558d89a9e0' failed...
Compute Unit '54012dc520a641558d89aa11' failed...
Compute Unit '54012dc520a641558d89aa22' failed...
Compute Unit '54012dc420a641558d89a9d0' failed...
Compute Unit '54012dc420a641558d89a9fc' failed...
Compute Unit '54012dc620a641558d89aa3a' failed...
Compute Unit '54012dc420a641558d89a9f1' failed...
Compute Unit '54012dc420a641558d89a9ee' failed...
Compute Unit '54012dc520a641558d89aa28' failed...
Compute Unit '54012dc420a641558d89a9cc' failed...
Compute Unit '54012dc420a641558d89a9cd' failed...
Compute Unit '54012dc420a641558d89a9c8' failed...
Compute Unit '54012dc520a641558d89aa19' failed...
Compute Unit '54012dc520a641558d89a9fd' failed...
Compute Unit '54012dc420a641558d89a9d6' failed...
Compute Unit '54012dc420a641558d89a9ca' failed...
Compute Unit '54012dc520a641558d89aa14' failed...
Compute Unit '54012dc520a641558d89aa2b' failed...
Compute Unit '54012dc420a641558d89a9ce' failed...
Compute Unit '54012dc420a641558d89a9d9' failed...
Compute Unit '54012dc520a641558d89aa30' failed...
Compute Unit '54012dc420a641558d89a9f8' failed...
Compute Unit '54012dc420a641558d89a9d7' failed...
Compute Unit '54012dc420a641558d89a9d8' failed...
Compute Unit '54012dc520a641558d89aa2f' failed...
Compute Unit '54012dc420a641558d89a9dd' finished with output:
Compute Unit '54012dc620a641558d89aa3e' finished with output:
Compute Unit '54012dc520a641558d89aa2c' finished with output:
Compute Unit '54012dc520a641558d89aa05' finished with output:
Compute Unit '54012dc520a641558d89aa00' finished with output:
Compute Unit '54012dc520a641558d89a9ff' finished with output:
Compute Unit '54012dc520a641558d89aa24' finished with output:
Compute Unit '54012dc520a641558d89aa1e' finished with output:
Compute Unit '54012dc520a641558d89aa17' finished with output:
Compute Unit '54012dc420a641558d89a9da' finished with output:
Compute Unit '54012dc520a641558d89aa18' finished with output:
Compute Unit '54012dc520a641558d89aa01' finished with output:
Compute Unit '54012dc420a641558d89a9dc' finished with output:
Compute Unit '54012dc420a641558d89a9ef' finished with output:
Compute Unit '54012dc520a641558d89aa02' finished with output:
Compute Unit '54012dc520a641558d89aa15' finished with output:
Compute Unit '54012dc420a641558d89a9c5' finished with output:
Compute Unit '54012dc420a641558d89a9f0' finished with output:
Compute Unit '54012dc320a641558d89a9c1' finished with output:
Compute Unit '54012dc420a641558d89a9f2' finished with output:
Compute Unit '54012dc420a641558d89a9cb' finished with output:
Compute Unit '54012dc520a641558d89aa20' finished with output:
Compute Unit '54012dc620a641558d89aa33' finished with output:
Compute Unit '54012dc620a641558d89aa32' finished with output:
Compute Unit '54012dc420a641558d89a9ed' finished with output:
Compute Unit '54012dc520a641558d89a9fe' finished with output:
Compute Unit '54012dc520a641558d89aa1b' finished with output:
CU execution done.
session cleanup on exit

I am not sure what is the temporal line of the logging - if any - but looking at it linearly it would seem that the pilot has been teared down before all the output was staged out from most of the units. Consistently, upon inspection of the unit sanboxes on gordon, all the units completely successfully but only few of them (14) had their output files staged out.

mturilli commented 10 years ago

This seems to say a different story but I confirm that looking within the sandbox, all the CUs appear to have been executed successfully.

54012dba20a641558d89a9bd

andre-merzky commented 10 years ago

From the plot, it seems that the cancellation of the pilot and the failed units are unrelated: units get completed successfully after some fail. Also, the units don't enter the TRANSFERRING_OUTPUT state, so I don't think its a staging problem -- they fail before.

Would you mind sending the STDOUT/STDERR for unit-54012dc520a641558d89aa07? Also, could you do a grep -C 5 54012dc520a641558d89aa07 AGENT* to see what happened in the vicinity of that unit?

Thanks!

mturilli commented 10 years ago
$ cat radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/STDERR 
                         :-)  G  R  O  M  A  C  S  (-:

                               Grunge ROck MAChoS

                            :-)  VERSION 4.6.5  (-:

        Contributions from Mark Abraham, Emile Apol, Rossen Apostolov, 
           Herman J.C. Berendsen, Aldert van Buuren, Pär Bjelkmar,  
     Rudi van Drunen, Anton Feenstra, Gerrit Groenhof, Christoph Junghans, 
        Peter Kasson, Carsten Kutzner, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
         Copyright (c) 2001-2012,2013, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.

         This program is free software; you can redistribute it and/or
       modify it under the terms of the GNU Lesser General Public License
        as published by the Free Software Foundation; either version 2.1
             of the License, or (at your option) any later version.

                              :-)  mdrun_mpi  (-:

Option     Filename  Type         Description
------------------------------------------------------------
  -s      topol.tpr  Input        Run input file: tpr tpb tpa
  -o       traj.trr  Output       Full precision trajectory: trr trj cpt
  -x       traj.xtc  Output, Opt. Compressed trajectory (portable xdr format)
-cpi      state.cpt  Input, Opt.  Checkpoint file
-cpo      state.cpt  Output, Opt. Checkpoint file
  -c    confout.gro  Output       Structure file: gro g96 pdb etc.
  -e       ener.edr  Output       Energy file
  -g         md.log  Output       Log file
-dhdl      dhdl.xvg  Output, Opt. xvgr/xmgr file
-field    field.xvg  Output, Opt. xvgr/xmgr file
-table    table.xvg  Input, Opt.  xvgr/xmgr file
-tabletf    tabletf.xvg  Input, Opt.  xvgr/xmgr file
-tablep  tablep.xvg  Input, Opt.  xvgr/xmgr file
-tableb   table.xvg  Input, Opt.  xvgr/xmgr file
-rerun    rerun.xtc  Input, Opt.  Trajectory: xtc trr trj gro g96 pdb cpt
-tpi        tpi.xvg  Output, Opt. xvgr/xmgr file
-tpid   tpidist.xvg  Output, Opt. xvgr/xmgr file
 -ei        sam.edi  Input, Opt.  ED sampling input
 -eo      edsam.xvg  Output, Opt. xvgr/xmgr file
  -j       wham.gct  Input, Opt.  General coupling stuff
 -jo        bam.gct  Output, Opt. General coupling stuff
-ffout      gct.xvg  Output, Opt. xvgr/xmgr file
-devout   deviatie.xvg  Output, Opt. xvgr/xmgr file
-runav  runaver.xvg  Output, Opt. xvgr/xmgr file
 -px      pullx.xvg  Output, Opt. xvgr/xmgr file
 -pf      pullf.xvg  Output, Opt. xvgr/xmgr file
 -ro   rotation.xvg  Output, Opt. xvgr/xmgr file
 -ra  rotangles.log  Output, Opt. Log file
 -rs   rotslabs.log  Output, Opt. Log file
 -rt  rottorque.log  Output, Opt. Log file
-mtx         nm.mtx  Output, Opt. Hessian matrix
 -dn     dipole.ndx  Output, Opt. Index file
-multidir    rundir  Input, Opt., Mult. Run directory
-membed  membed.dat  Input, Opt.  Generic data file
 -mp     membed.top  Input, Opt.  Topology file
 -mn     membed.ndx  Input, Opt.  Index file

Option       Type   Value   Description
------------------------------------------------------
-[no]h       bool   no      Print help info and quit
-[no]version bool   no      Print version info and quit
-nice        int    0       Set the nicelevel
-deffnm      string         Set the default filename for all file options
-xvg         enum   xmgrace  xvg plot formatting: xmgrace, xmgr or none
-[no]pd      bool   no      Use particle decompostion
-dd          vector 0 0 0   Domain decomposition grid, 0 is optimize
-ddorder     enum   interleave  DD node order: interleave, pp_pme or cartesian
-npme        int    -1      Number of separate nodes to be used for PME, -1
                            is guess
-nt          int    0       Total number of threads to start (0 is guess)
-ntmpi       int    0       Number of thread-MPI threads to start (0 is guess)
-ntomp       int    0       Number of OpenMP threads per MPI process/thread
                            to start (0 is guess)
-ntomp_pme   int    0       Number of OpenMP threads per MPI process/thread
                            to start (0 is -ntomp)
-pin         enum   auto    Fix threads (or processes) to specific cores:
                            auto, on or off
-pinoffset   int    0       The starting logical core number for pinning to
                            cores; used to avoid pinning threads from
                            different mdrun instances to the same core
-pinstride   int    0       Pinning distance in logical cores for threads,
                            use 0 to minimize the number of threads per
                            physical core
-gpu_id      string         List of GPU device id-s to use, specifies the
                            per-node PP rank to GPU mapping
-[no]ddcheck bool   yes     Check for all bonded interactions with DD
-rdd         real   0       The maximum distance for bonded interactions with
                            DD (nm), 0 is determine from initial coordinates
-rcon        real   0       Maximum distance for P-LINCS (nm), 0 is estimate
-dlb         enum   auto    Dynamic load balancing (with DD): auto, no or yes
-dds         real   0.8     Minimum allowed dlb scaling of the DD cell size
-gcom        int    -1      Global communication frequency
-nb          enum   auto    Calculate non-bonded interactions on: auto, cpu,
                            gpu or gpu_cpu
-[no]tunepme bool   yes     Optimize PME load between PP/PME nodes or GPU/CPU
-[no]testverlet bool   no      Test the Verlet non-bonded scheme
-[no]v       bool   no      Be loud and noisy
-[no]compact bool   yes     Write a compact log file
-[no]seppot  bool   no      Write separate V and dVdl terms for each
                            interaction type and node to the log file(s)
-pforce      real   -1      Print all forces larger than this (kJ/mol nm)
-[no]reprod  bool   no      Try to avoid optimizations that affect binary
                            reproducibility
-cpt         real   15      Checkpoint interval (minutes)
-[no]cpnum   bool   no      Keep and number checkpoint files
-[no]append  bool   yes     Append to previous output files when continuing
                            from checkpoint instead of adding the simulation
                            part number to all file names
-nsteps      step   -2      Run this number of steps, overrides .mdp file
                            option
-maxh        real   -1      Terminate after 0.99 times this time (hours)
-multi       int    0       Do multiple simulations in parallel
-replex      int    0       Attempt replica exchange periodically with this
                            period (steps)
-nex         int    0       Number of random exchanges to carry out each
                            exchange interval (N^3 is one suggestion).  -nex
                            zero or not specified gives neighbor replica
                            exchange.
-reseed      int    -1      Seed for replica exchange, -1 is generate a seed
-[no]ionize  bool   no      Do a simulation including the effect of an X-Ray
                            bombardment on your system

Reading file topol.tpr, VERSION 4.6.5 (single precision)
Using 1 MPI process

Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
starting mdrun 'Protein t=   0.00000'
100000 steps,    100.0 ps.

Writing final coordinates.

               Core t (s)   Wall t (s)        (%)
       Time:        5.100      635.951        0.8
                 (ns/day)    (hour/ns)
Performance:       13.586        1.767

gcq#254: "Why, how now, Claudio ! Whence Comes this Restraint ?" (Lucio in Measure for measure, Act 1, Scene 4, William Shakespeare)

[mpiexec@gcn-7-54.sdsc.edu] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
[mpiexec@gcn-7-54.sdsc.edu] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
$ cat radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/STDOUT 
$
$ grep -C 5 54012dc520a641558d89aa07 radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT*
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,446 - radical.pilot.agent - DEBUG - Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,446 - radical.pilot.agent - INFO - Node gcn-7-54 satisfies 1 cores at offset 14
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,448 - radical.pilot.agent - DEBUG - Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,449 - radical.pilot.agent - INFO - Launching task 54012dc520a641558d89aa31 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,491 - radical.pilot.agent - INFO - Found new tasks in pilot queue: 54012dc420a641558d89a9e4
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:22,627 - radical.pilot.agent - INFO - Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc520a641558d89aa07 to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,627 - radical.pilot.agent - INFO - Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:22,628 - radical.pilot.agent - INFO - Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:22,719 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-54                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-55                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,537 - radical.pilot.agent - DEBUG - Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,538 - radical.pilot.agent - INFO - Launching task 54012dc420a641558d89a9e3 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,545 - radical.pilot.agent - INFO - Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc420a641558d89a9ee to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,546 - radical.pilot.agent - INFO - Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,546 - radical.pilot.agent - INFO - Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:23,670 - radical.pilot.agent - INFO - Found new tasks in pilot queue: 54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:23,806 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-54                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-55                :  X  X  X  -  -  -  -  -  -  -  -  -  -  -  -  - 
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-62                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-61                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,277 - radical.pilot.agent - DEBUG - Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,278 - radical.pilot.agent - INFO - Node gcn-7-55 satisfies 1 cores at offset 12
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:26,279 - radical.pilot.agent - DEBUG - Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 18:56:26,280 - radical.pilot.agent - INFO - Launching task 54012dc520a641558d89aa07 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,499 - radical.pilot.agent - INFO - Found new tasks in pilot queue: 54012dc620a641558d89aa3d
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,531 - radical.pilot.agent - INFO - Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc320a641558d89a9c3/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 18:56:26,550 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-62                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-61                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 19:07:22,014 - radical.pilot.agent - INFO - Task 54012dc420a641558d89a9ec terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 19:07:22,016 - radical.pilot.agent - INFO - Task 54012dc520a641558d89aa27 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG:2014-08-29 19:07:22,018 - radical.pilot.agent - INFO - Task 54012dc520a641558d89aa07 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-2014-08-29 19:07:22,334 - radical.pilot.agent - DEBUG - Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-56                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-57                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-54                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  X  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.LOG-gcn-7-55                :  X  -  -  -  -  -  -  -  -  -  -  -  -  X  X  X 
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread   radical.pilot.agent   : [INFO    ] Node gcn-7-54 satisfies 1 cores at offset 14
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread   radical.pilot.agent   : [INFO    ] Launching task 54012dc520a641558d89aa31 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31/radical_pilot_cu_launch_script-AY58pk.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa31
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117347 Thread-1     radical.pilot.agent   : [INFO    ] Found new tasks in pilot queue: 54012dc420a641558d89a9e4
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:22 117353 MainThread   radical.pilot.agent   : [INFO    ] Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc520a641558d89aa07 to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117353 MainThread   radical.pilot.agent   : [INFO    ] Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:22 117353 MainThread   radical.pilot.agent   : [INFO    ] Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:22 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-54                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-55                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117352 MainThread   radical.pilot.agent   : [INFO    ] Launching task 54012dc420a641558d89a9e3 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3/radical_pilot_cu_launch_script-0LhrDe.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9e3
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117353 MainThread   radical.pilot.agent   : [INFO    ] Task input staging directives {u'target': u'topol.tpr', u'priority': 0, u'source': u'staging:///topol.tpr', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Link'} for cu: 54012dc420a641558d89a9ee to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117353 MainThread   radical.pilot.agent   : [INFO    ] Operating from staging
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117353 MainThread   radical.pilot.agent   : [INFO    ] Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc420a641558d89a9ee/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:23 117347 Thread-1     radical.pilot.agent   : [INFO    ] Found new tasks in pilot queue: 54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:23 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-54                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-55                :  X  X  X  -  -  -  -  -  -  -  -  -  -  -  -  - 
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-62                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-61                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Launching task with MPIRUN (/opt/mvapich2/intel/ib/bin/mpirun).
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117352 MainThread   radical.pilot.agent   : [INFO    ] Node gcn-7-55 satisfies 1 cores at offset 12
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:26 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 18:56:26 117352 MainThread   radical.pilot.agent   : [INFO    ] Launching task 54012dc520a641558d89aa07 via /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07/radical_pilot_cu_launch_script-WVD3eu.sh in /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc520a641558d89aa07
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117347 Thread-1     radical.pilot.agent   : [INFO    ] Found new tasks in pilot queue: 54012dc620a641558d89aa3d
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117353 MainThread   radical.pilot.agent   : [INFO    ] Linking /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/staging_area/topol.tpr to /home/mturilli/radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/unit-54012dc320a641558d89a9c3/topol.tpr succeeded.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 18:56:26 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
--
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-62                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-61                :  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 19:07:22 117352 MainThread   radical.pilot.agent   : [INFO    ] Task 54012dc420a641558d89a9ec terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 19:07:22 117352 MainThread   radical.pilot.agent   : [INFO    ] Task 54012dc520a641558d89aa27 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 19:07:22 117352 MainThread   radical.pilot.agent   : [INFO    ] Task 54012dc520a641558d89aa07 terminated with return code 255.
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-2014:08:29 19:07:22 117352 MainThread   radical.pilot.agent   : [DEBUG   ] Slot status:
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-56                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-57                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-54                :  -  -  -  -  -  -  -  -  -  -  -  -  -  -  X  - 
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR-gcn-7-55                :  X  -  -  -  -  -  -  -  -  -  -  -  -  X  X  X 
andre-merzky commented 10 years ago
radical.pilot.sandbox/pilot-54012dbb20a641558d89a9bf/AGENT.STDERR:2014:08:29 19:07:22 117352 MainThread radical.pilot.agent : [INFO ] Task 54012dc520a641558d89aa07 terminated with return code 255.

Well, the unit in fact did fail, as far as the agent knows. The return code is the only information the agent has about success or failure.

As you observed though, the unit stdout/err seems to indicate that gromacs itself finished nicely -- but its mpirun (in which gromacs is contained) which returns the error code. At least that is what I would assume from these messages:

[mpiexec@gcn-7-54.sdsc.edu] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
[mpiexec@gcn-7-54.sdsc.edu] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@gcn-7-54.sdsc.edu] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

Now, what causes that error I have no idea :( Google returns a number of hits for the error messages above -- but I don't immediately see how they are related to your gromacs execution. Most replies seem to indicate a problem with the MPI layer though.

Mark, Ole, any insight?

mturilli commented 10 years ago

Does this mean that I need to run the test again? Or should I try to fix the issue?

andre-merzky commented 10 years ago

Sorry, I misassigned the trestles and gordon tickets -- those were supposed to go to Mark. Fixed now.

marksantcroos commented 10 years ago

There were a couple of issues at play here:

  1. MPI programs on Gordon need to be executed through mpirun_rsh, which is actually a distinct method (from mpirun), which I now added to the agent (7bf24ae)
  2. mpirun_rsh doesn't look for binaries on the path, so we need to specify the full path (added to kernels in https://github.com/radical-cybertools/radical.ensemblemd.mdkernels/commit/7610169f7ccbe5df6d0802906a89cf80386c59d9)
  3. Added a FAQ entry on the latter (6ae7ce1)
mturilli commented 10 years ago

Great, thank you! Successfully tested.

540e855320a64150c43038a0

andre-merzky commented 10 years ago

Great!

marksantcroos commented 10 years ago

Good, I'll close this.

As a FYI: I noticed on Gordon that gromacs is pinning threads to cores and that seriously pessimizes the runtime of non-whole-node executions through RP.

mturilli commented 10 years ago

Should we open a feature ticket for that?

andre-merzky commented 10 years ago

Done in #334