radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

fix command and detect MPICH mpiexec lm #3064

Closed AymenFJA closed 11 months ago

AymenFJA commented 11 months ago

This PR should fix #3036 and #3061, and potentially #2810. Summary of changes:

  1. Fix the command that detects whether mpiexec uses -rf (rankfile) or -f (host file).
  2. Add an additional check for the host file option for MPICH mpiexec.
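
For illustration only, the two checks above could be sketched as a pure function over captured mpiexec output. This is not the code in this PR: the helper name `pick_placement_flag` and the sample strings are hypothetical, and the only facts taken from this thread are the `-rf` / `-f` distinction and the `HYDRA build details` banner printed by `mpiexec -info`.

```python
import re


def pick_placement_flag(help_text: str, info_text: str = "") -> str:
    """Return '-rf', '-f', or '' for captured mpiexec output.

    Hypothetical helper: a sketch of the detection idea, not RP's code.
    """
    # Extra check: a Hydra banner (as in `mpiexec -info` on MPICH)
    # means the launcher takes a host file via `-f`.
    if "HYDRA build details" in info_text:
        return "-f"

    # Otherwise look for a standalone `-rf` token in the help text;
    # the lookarounds avoid matching inside longer options.
    if re.search(r"(?<![\w-])-rf(?![\w-])", help_text):
        return "-rf"

    # Fall back to a standalone `-f` token (host file).
    if re.search(r"(?<![\w-])-f(?![\w-])", help_text):
        return "-f"

    # Neither option was found.
    return ""


# Illustrative inputs (made up for this sketch, not real help output).
print(pick_placement_flag("  -rf <file>   provide a rankfile"))
print(pick_placement_flag("", "HYDRA build details:\n    Version: 3.3.2"))
```

Keeping the decision separate from the shell call makes it testable without an MPI installation; the caller would feed in whatever output the installed mpiexec produces.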

This was tested on a machine with the MPICH flavor of mpiexec:

(test_rct) controlplane $ mpiexec -info
HYDRA build details:
    Version:                                 3.3.2
    Release Date:                            Tue Nov 12 21:23:16 CST 2019
    CC:                              gcc   -Wl,-Bsymbolic-functions -Wl,-z,relro 
    CXX:                             g++   -Wl,-Bsymbolic-functions -Wl,-z,relro 
    F77:                             f77  -Wl,-Bsymbolic-functions -Wl,-z,relro 
    F90:                             f95  -Wl,-Bsymbolic-functions -Wl,-z,relro 
    Configure options:                       '--disable-option-checking' '--prefix=/usr' '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--runstatedir=/run' '--disable-maintainer-mode' '--disable-dependency-tracking' '--with-libfabric' '--enable-shared' '--enable-fortran=all' '--disable-rpath' '--disable-wrapper-rpath' '--sysconfdir=/etc/mpich' '--libdir=/usr/lib/x86_64-linux-gnu' '--includedir=/usr/include/x86_64-linux-gnu/mpich' '--docdir=/usr/share/doc/mpich' 'CPPFLAGS= -Wdate-time -D_FORTIFY_SOURCE=2 -I/build/mpich-VeuB8Z/mpich-3.3.2/src/mpl/include -I/build/mpich-VeuB8Z/mpich-3.3.2/src/mpl/include -I/build/mpich-VeuB8Z/mpich-3.3.2/src/openpa/src -I/build/mpich-VeuB8Z/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/build/mpich-VeuB8Z/mpich-3.3.2/src/mpi/romio/include' 'CFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. -fstack-protector-strong -Wformat -Werror=format-security -O2' 'CXXFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. -fstack-protector-strong -Wformat -Werror=format-security -O2' 'FFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. -fstack-protector-strong -O2' 'FCFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. -fstack-protector-strong -cpp -O2' 'BASH_SHELL=/bin/bash' 'build_alias=x86_64-linux-gnu' 'MPICHLIB_CFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'MPICHLIB_CXXFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_FFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. -fstack-protector-strong' 'MPICHLIB_FCFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-VeuB8Z/mpich-3.3.2=. 
-fstack-protector-strong -cpp' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'FC=f95' 'F77=f77' 'MPILIBNAME=mpich' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'LIBS=' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:       
    Demux engines available:                 poll select

RP test:

(test_rct) controlplane $ python 09_mpi_tasks.py 

================================================================================
 Getting Started (RP version 1.40.0)                                            
================================================================================

new session: [rp.session.controlplane.kc-internal.019642.0002]                 \
zmq proxy  : [tcp://172.30.1.2:10001]                                         ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots                                                                   

create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   local.localhost          32 cores       0 gpus           ok

--------------------------------------------------------------------------------
submit tasks                                                                    

create task manager                                                           ok
create 2 task description(s)
        ..                                                                    ok
submit: ########################################################################

--------------------------------------------------------------------------------
gather results                                                                  

wait  : ########################################################################
        DONE      :     2
                                                                              ok

  * task.000000: DONE, exit:   0, ranks: controlplane 1:1/2 @ 0/1 : 0/1
controlplane 1:0/2 @ 0/1 : 0/1
controlplane 0:1/2 @ 0/1 : 0/1
controlplane 0:0/2 @ 0/1 : 0/1
controlplane 2:1/2 @ 0/1 : 0/1
controlplane 2:0/2 @ 0/1 : 0/1

  * task.000001: DONE, exit:   0, ranks: controlplane 2:1/2 @ 0/1 : 0/1
controlplane 2:0/2 @ 0/1 : 0/1
controlplane 1:1/2 @ 0/1 : 0/1
controlplane 1:0/2 @ 0/1 : 0/1
controlplane 0:1/2 @ 0/1 : 0/1
controlplane 0:0/2 @ 0/1 : 0/1

--------------------------------------------------------------------------------
finalize                                                                        

closing session rp.session.controlplane.kc-internal.019642.0002                \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 35.4s                                                       ok

--------------------------------------------------------------------------------

This PR still requires the tests to be updated accordingly (work in progress).

codecov[bot] commented 11 months ago

Codecov Report

Merging #3064 (c36d00a) into devel (b42778f) will increase coverage by 0.02%. The diff coverage is 72.72%.

@@            Coverage Diff             @@
##            devel    #3064      +/-   ##
==========================================
+ Coverage   43.78%   43.81%   +0.02%     
==========================================
  Files          96       96              
  Lines       10576    10586      +10     
==========================================
+ Hits         4631     4638       +7     
- Misses       5945     5948       +3     

Files                                               Coverage Δ
src/radical/pilot/agent/launch_method/mpiexec.py    88.40% <72.72%> (-1.44%) ↓


AymenFJA commented 11 months ago

Note: https://github.com/radical-cybertools/radical.pilot/blob/c36d00ad5f75d252cb431da26d844f51820104b1/src/radical/pilot/agent/launch_method/mpiexec.py#L98

The double backslash in this line is intentional: with a single backslash, pylint reports an anomalous-backslash-in-string warning:

https://pylint.readthedocs.io/en/latest/user_guide/messages/warning/anomalous-backslash-in-string.html
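
A minimal sketch of what that warning is about. The pattern below is hypothetical and not copied from the linked line. In a normal string literal a sequence such as `\-` is not a recognised escape, so pylint flags it. Doubling the backslash or using a raw string both avoid the warning and yield the same text.

```python
import re

# Writing '\-rf' in a normal literal would trigger pylint's
# anomalous-backslash-in-string, because '\-' is not a valid escape.
# Two equivalent spellings that avoid the warning:
escaped = '\\-rf'   # double backslash in a normal literal
raw = r'\-rf'       # raw string literal

# Both hold the same four characters: backslash, '-', 'r', 'f'.
assert escaped == raw
assert len(escaped) == 4

# As a regular expression, the escaped hyphen matches a literal '-'.
assert re.search(escaped, 'mpiexec -rf rankfile') is not None
```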