hpcugent / vsc-mympirun

mympirun is a tool to facilitate running MPI programs on an HPC cluster
GNU General Public License v2.0

mympirun fails with intel/2020a toolchain #168

Closed: smoors closed this issue 4 years ago

smoors commented 4 years ago

When I run it in a multi-node job, I get the following error:

mympirun ./mpi_intel2020a 
Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136).......: 
MPID_Init(904)..............: 
MPIDI_OFI_mpi_init_hook(986): OFI addrinfo() failed (ofi_init.c:986:MPIDI_OFI_mpi_init_hook:No data available)
2020-05-15 17:40:13,864 ERROR      mympirun.RunAsyncMPI MainThread  _post_exitcode: problem occured with cmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']: (shellcmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']) output Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136).......: 
MPID_Init(904)..............: 
MPIDI_OFI_mpi_init_hook(986): OFI addrinfo() failed (ofi_init.c:986:MPIDI_OFI_mpi_init_hook:No data available)

2020-05-15 17:40:13,866 WARNING    mympirun.IntelHydraMPIPbsdsh_PBS MainThread  main: exitcode 143 > 0; cmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']
2020-05-15 17:40:13,872 ERROR      mympirun        MainThread  Main failed
Traceback (most recent call last):
  File "/theia/home/apps/CO7/skylake/software/vsc-mympirun/4.1.9/lib/python2.7/site-packages/vsc_mympirun-4.1.9-py2.7.egg/EGG-INFO/scripts/mympirun", line 125, in main
    instance.main()
  File "/theia/home/apps/CO7/skylake/software/vsc-mympirun/4.1.9/lib/python2.7/site-packages/vsc_mympirun-4.1.9-py2.7.egg/vsc/mympirun/mpi/mpi.py", line 453, in main
    self.log.raiseException("main: exitcode %s > 0; cmd %s" % (exitcode, self.mpirun_cmd))
  File "/theia/home/apps/CO7/skylake/software/vsc-mympirun/4.1.9/lib/python2.7/site-packages/vsc_base-2.9.6-py2.7.egg/vsc/utils/fancylogger.py", line 333, in raiseException
    raise err
Exception: main: exitcode 143 > 0; cmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']
boegel commented 4 years ago

@smoors Can you run again with mympirun --debug and provide the full output?

Does it work fine when you use mpirun -np directly?
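In practice, those two checks could look roughly like this inside the same PBS job (the binary name and rank count are taken from the failing command above; redirecting the debug output to a file is only a suggestion):

# re-run through mympirun with debug logging enabled
mympirun --debug ./mpi_intel2020a 2>&1 | tee mympirun_debug.log

# bypass mympirun and call Intel MPI's mpirun directly, with the same 4 ranks as in the failing command
mpirun -np 4 ./mpi_intel2020a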

smoors commented 4 years ago

@boegel The error is gone with the latest version, 5.1.0.
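For reference, switching to the fixed release on an Lmod-based cluster like the one in these logs might look as follows (the exact module name vsc-mympirun/5.1.0 is an assumption, based on the vsc-mympirun/4.1.9 module visible in LOADEDMODULES above):

# assumed module version string; verify with 'module avail vsc-mympirun'
module swap vsc-mympirun vsc-mympirun/5.1.0
mympirun ./mpi_intel2020a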