When I run it on multiple nodes, I get the following error:
```
$ mympirun ./mpi_intel2020a
Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136).......:
MPID_Init(904)..............:
MPIDI_OFI_mpi_init_hook(986): OFI addrinfo() failed (ofi_init.c:986:MPIDI_OFI_mpi_init_hook:No data available)
2020-05-15 17:40:13,864 ERROR mympirun.RunAsyncMPI MainThread _post_exitcode: problem occured with cmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']: (shellcmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']) output Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136).......:
MPID_Init(904)..............:
MPIDI_OFI_mpi_init_hook(986): OFI addrinfo() failed (ofi_init.c:986:MPIDI_OFI_mpi_init_hook:No data available)
2020-05-15 17:40:13,866 WARNING mympirun.IntelHydraMPIPbsdsh_PBS MainThread main: exitcode 143 > 0; cmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']
2020-05-15 17:40:13,872 ERROR mympirun MainThread Main failed
Traceback (most recent call last):
  File "/theia/home/apps/CO7/skylake/software/vsc-mympirun/4.1.9/lib/python2.7/site-packages/vsc_mympirun-4.1.9-py2.7.egg/EGG-INFO/scripts/mympirun", line 125, in main
    instance.main()
  File "/theia/home/apps/CO7/skylake/software/vsc-mympirun/4.1.9/lib/python2.7/site-packages/vsc_mympirun-4.1.9-py2.7.egg/vsc/mympirun/mpi/mpi.py", line 453, in main
    self.log.raiseException("main: exitcode %s > 0; cmd %s" % (exitcode, self.mpirun_cmd))
  File "/theia/home/apps/CO7/skylake/software/vsc-mympirun/4.1.9/lib/python2.7/site-packages/vsc_base-2.9.6-py2.7.egg/vsc/utils/fancylogger.py", line 333, in raiseException
    raise err
Exception: main: exitcode 143 > 0; cmd ['mpirun', '--file=/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/mpdboot', '--machinefile', '/user/brussel/100/vsc10009/.mympirun_oc8qf9/3114049.master01.hydra.brussel.vsc_20200515_174012/nodes', '-rmk', 'pbs', '-bootstrap', 'pbsdsh', '-genv', 'MKL_NUM_THREADS', '1', '-genv', 'I_MPI_PIN', '1', '-genv', 'I_MPI_DAT_LIBRARY', 'libdat2.so', '-genv', 'I_MPI_FABRICS', 'shm:dapl', '-genv', 'LOADEDMODULES', 'GCCcore/9.3.0:zlib/1.2.11-GCCcore-9.3.0:binutils/2.34-GCCcore-9.3.0:iccifort/2020.1.217:numactl/2.0.13-GCCcore-9.3.0:UCX/1.8.0-GCCcore-9.3.0:impi/2019.7.217-iccifort-2020.1.217:iimpi/2020a:imkl/2020.1.217-iimpi-2020a:intel/2020a:vsc-mympirun/4.1.9', '-genv', 'I_MPI_FALLBACK_DEVICE', '0', '-genv', 'I_MPI_FALLBACK', 'disable', '-genv', 'I_MPI_NETMASK', '10.143.0.0/255.255.0.0', '-genv', 'I_MPI_DAPL_SCALABLE_PROGRESS', '0', '-genv', 'MODULESHOME', '/usr/share/lmod/lmod', '-genv', 'MODULEPATH', '/apps/brussel/CO7/skylake/modules/2020a/all:/apps/brussel/CO7/skylake/modules/2019b/all:/apps/brussel/CO7/skylake/modules/2019a/all:/apps/brussel/CO7/skylake/modules/2018b/all:/apps/brussel/CO7/skylake/modules/2018a/all:/apps/brussel/CO7/skylake/modules/2017b/all:/apps/brussel/CO7/skylake/modules/2017a/all:/apps/brussel/CO7/skylake/modules/2016b/all:/apps/brussel/CO7/skylake/modules/2016a/all:/apps/brussel/CO7/skylake/modules/2015b/all::', '-np', '4', '-envlist', 'LD_LIBRARY_PATH,PATH,PYTHONPATH,I_MPI_MPD_TMPDIR,I_MPI_HYDRA_TOPOLIB,I_MPI_ROOT,OMP_NUM_THREADS,MKL_EXAMPLES', './mpi_intel2020a']
```
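For context (a hedged note, not from the original report): `impi/2019.7.217` in `LOADEDMODULES` is an Intel MPI 2019 release, which talks to the network exclusively through libfabric (OFI) and no longer supports DAPL, yet the generated command still injects DAPL-era options (`-genv I_MPI_FABRICS shm:dapl`, `I_MPI_DAT_LIBRARY libdat2.so`, `I_MPI_DAPL_SCALABLE_PROGRESS 0`). That mismatch is consistent with `OFI addrinfo() failed ... No data available`, which libfabric raises when no provider matches the request. A quick diagnostic sketch, assuming `fi_info` (shipped with libfabric) is available on the compute nodes:

```
# Diagnostic sketch (assumed tooling, not taken from this thread).
# fi_info lists every provider libfabric can discover on this node; an empty
# or mismatched list makes OFI's addrinfo() fail with "No data available",
# exactly as in the log above.
fi_info

# Make Intel MPI log its libfabric provider selection during MPI_Init,
# and pin the provider explicitly instead of relying on auto-detection.
export I_MPI_DEBUG=6
export FI_PROVIDER=verbs   # assumed value for an InfiniBand cluster; adjust to what fi_info reports
```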
@smoors Can you run again with `mympirun --debug` and provide the full output?
Does it work fine when you use `mpirun -np` directly?
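For example (a sketch: the process count and binary name are taken from the failing command above; run it inside the same PBS job so the node allocation is identical):

```
# Launch with Intel MPI's own mpirun, bypassing all mympirun-generated options:
mpirun -np 4 ./mpi_intel2020a
```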
@boegel the error is gone with the latest version, 5.1.0.
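For anyone hitting the same error, a minimal sketch of the confirmation, assuming the fixed version is installed as a module next to the `vsc-mympirun/4.1.9` visible in `LOADEDMODULES` above (the exact module name is an assumption):

```
module load intel/2020a
module load vsc-mympirun/5.1.0   # assumed module name; replaces the 4.1.9 that injected the DAPL-era options
mympirun ./mpi_intel2020a        # should now start on multiple nodes without the OFI addrinfo() failure
```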