open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Insufficient slots problem when installing CUDA-aware Open MPI using `spack` #11708

Closed tbhaxor closed 1 year ago

tbhaxor commented 1 year ago

I am on our Slurm HPC platform and want to use CUDA-aware openmpi@4.1.4. The version the admins provided also works, but only for some problems (not all), and the code owner specifically said to use openmpi@4.1.4.

Each host has at most 40 cores, so $SLURM_NTASKS is at least 40. Basic problems run fine with the admin-provided Open MPI (information below) on multiple nodes where $SLURM_NTASKS > 40; in my case I requested 4 nodes, so $SLURM_NTASKS = 160.

The command I want to run:

```
mpirun -n $SLURM_NTASKS ./code
```
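For context, this is roughly the batch script shape that produces the allocation described above; the directives are an assumption reconstructed from the job information further down (partition `gpumultinode`, 4 nodes, 40 tasks per node), not the exact script I submit:

```
#!/bin/bash
#SBATCH --job-name=code_multinode
#SBATCH --partition=gpumultinode   # partition taken from the SLURM job info below
#SBATCH --nodes=4                  # 4 nodes x 40 tasks/node -> SLURM_NTASKS=160
#SBATCH --ntasks-per-node=40

# mpirun is expected to pick up all 160 slots from the Slurm allocation
mpirun -n $SLURM_NTASKS ./code
```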
ompi_info of the admin compiled version:

```
Package: Open MPI root@login04 Distribution Open MPI: 4.0.5 Open MPI repo revision: v4.0.5 Open MPI release date: Aug 26, 2020 Open RTE: 4.0.5 Open RTE repo revision: v4.0.5 Open RTE release date: Aug 26, 2020 OPAL: 4.0.5 OPAL repo revision: v4.0.5 OPAL release date: Aug 26, 2020 MPI API: 3.1.0 Ident string: 4.0.5 Prefix: /home/ext/apps/client_apps/openmpi/build Configured architecture: x86_64-unknown-linux-gnu Configure host: login04 Configured by: root Configured on: Thu May 18 12:13:36 IST 2023 Configure host: login04 Configure command line: '--prefix=/home/ext/apps/client_apps/openmpi/build' '--enable-mpirun-prefix-by-default' '--with-cuda=/opt/ohpc/pub/cuda/cuda-11.2' '--with-ucx=/home/ext/apps/client_apps/openmpi/build' '--with-ucx-libdir=/home/ext/apps/client_apps/openmpi/build/lib' '--enable-mca-no-build=btl-uct' '--with-pmix=internal' '--with-zlib=/home/ext/apps/client_apps/openmpi/build' Built by: root Built on: Thu May 18 12:24:14 IST 2023 Built host: login04 C bindings: yes C++ bindings: no Fort mpif.h: yes (all) Fort use mpi: yes (full: ignore TKR) Fort use mpi size: deprecated-ompi-info-value Fort use mpi_f08: yes Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality Fort mpi_f08 subarrays: no Java bindings: no Wrapper compiler rpath: runpath C compiler: gcc C compiler absolute: /opt/ohpc/pub/compiler/gcc/8.3.0/bin/gcc C compiler family name: GNU C compiler version: 8.3.0 C++ compiler: g++ C++ compiler absolute: /opt/ohpc/pub/compiler/gcc/8.3.0/bin/g++ Fort compiler: gfortran Fort compiler abs: /opt/ohpc/pub/compiler/gcc/8.3.0/bin/gfortran Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::) Fort 08 assumed shape: yes Fort optional args: yes Fort INTERFACE: yes Fort ISO_FORTRAN_ENV: yes Fort STORAGE_SIZE: yes Fort BIND(C) (all): yes Fort ISO_C_BINDING: yes Fort SUBROUTINE BIND(C): yes Fort TYPE,BIND(C): yes Fort T,BIND(C,name="a"): yes Fort PRIVATE: yes Fort PROTECTED: yes Fort ABSTRACT: yes Fort ASYNCHRONOUS: yes Fort PROCEDURE: yes Fort USE...ONLY: yes Fort C_FUNLOC: yes Fort f08 using wrappers: yes Fort MPI_SIZEOF: yes C profiling: yes C++ profiling: no Fort mpif.h profiling: yes Fort use mpi profiling: yes Fort use mpi_f08 prof: yes C++ exceptions: no Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes) Sparse Groups: no Internal debug support: no MPI interface warnings: yes MPI parameter check: runtime Memory profiling support: no Memory debugging support: no dl support: yes Heterogeneous support: no mpirun default --prefix: yes MPI_WTIME support: native Symbol vis.
support: yes Host topology support: yes IPv6 support: no MPI1 compatibility: no MPI extensions: affinity, cuda, pcollreq FT Checkpoint support: no (checkpoint thread: no) C/R Enabled Debugging: no MPI_MAX_PROCESSOR_NAME: 256 MPI_MAX_ERROR_STRING: 256 MPI_MAX_OBJECT_NAME: 64 MPI_MAX_INFO_KEY: 36 MPI_MAX_INFO_VAL: 256 MPI_MAX_PORT_NAME: 1024 MPI_MAX_DATAREP_STRING: 128 MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.5) MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.0.5) MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.5) MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.5) MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.5) MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.0.5) MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.0.5) MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.0.5) MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA plm: 
rsh (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.0.5) MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.0.5) MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, 
Component v4.0.5) MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.0.5) MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.0.5) MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.0.5) MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.0.5)
```
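(As a side note, a quick way to compare the admin build with another build is to check which Slurm- and PMIx-related components each `ompi_info` reports; the output above does list the `ras`/`plm` slurm components, so this is only a sanity check:)

```
# show the Slurm/PMIx-aware components of whichever build is currently in PATH
ompi_info | grep -i -E 'slurm|pmix'
```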

When I load openmpi v4.0.5 (with lmod, not spack, because it is provided by the admins) it works on both a single node and multiple nodes. But when I use my spack version it doesn't find all the hosts, so the maximum -np I can use is 40 even though the allocation gives me SLURM_NTASKS = 160.

The commands I used to run:

```
export OMPI_MCA_btl="^openib"
time mpirun -n $SLURM_NTASKS ./code
```
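One workaround I can think of when mpirun does not see the Slurm allocation is to pass the node list explicitly; a sketch, assuming 40 slots per node as in this job (`hostfile.txt` is just an illustrative name):

```
# expand gpu[006-007,015,034] to one hostname per line and tag each with 40 slots
scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/$/ slots=40/' > hostfile.txt

export OMPI_MCA_btl="^openib"
time mpirun -n $SLURM_NTASKS --hostfile hostfile.txt ./code
```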

And the command I used to install Open MPI:

```
spack install -v -j `nproc` openmpi@4.1.4 %gcc@8.3.0 +cuda fabrics=ucx schedulers=auto ^cuda@11.2.0%gcc@8.3.0 ^ucx@1.12.1%gcc@8.3.0
```
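For comparison, a spec that asks Spack to build Slurm support in explicitly instead of relying on `schedulers=auto`; the variants shown (`schedulers=slurm`, `+pmi`, `+legacylaunchers`) exist in recent versions of the Spack `openmpi` package, but treat this as a sketch since variant names can change between Spack releases:

```
spack install -v -j `nproc` openmpi@4.1.4 %gcc@8.3.0 +cuda +pmi +legacylaunchers \
    fabrics=ucx schedulers=slurm ^cuda@11.2.0%gcc@8.3.0 ^ucx@1.12.1%gcc@8.3.0
```

As far as I understand, with `schedulers=slurm` Spack drops `mpirun`/`mpiexec` unless `+legacylaunchers` is also set, which is why both appear here.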
ompi_info from my spack installation:

```
Package: Open MPI tbhaxor@login07 Distribution Open MPI: 4.1.4 Open MPI repo revision: v4.1.4 Open MPI release date: May 26, 2022 Open RTE: 4.1.4 Open RTE repo revision: v4.1.4 Open RTE release date: May 26, 2022 OPAL: 4.1.4 OPAL repo revision: v4.1.4 OPAL release date: May 26, 2022 MPI API: 3.1.0 Ident string: 4.1.4 Prefix: /scratch/tbhaxor/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/openmpi-4.1.4-zdq733otnffxgfgjpjqiimigf2yeeukp Configured architecture: x86_64-pc-linux-gnu Configure host: login07 Configured by: tbhaxor Configured on: Sun May 21 06:42:27 UTC 2023 Configure host: login07 Configure command line: '--prefix=/scratch/tbhaxor/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/openmpi-4.1.4-zdq733otnffxgfgjpjqiimigf2yeeukp' '--enable-shared' '--disable-silent-rules' '--disable-builtin-atomics' '--enable-static' '--enable-mpi1-compatibility' '--without-verbs' '--without-hcoll' '--without-knem' '--without-fca' '--without-xpmem' '--without-ofi' '--without-mxm' '--without-psm' '--without-psm2' '--without-cma' '--with-ucx=/scratch/phyprtek/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/ucx-1.12.1-verxzyfqjcglhlmr4obphrcf4hmgywwl' '--without-cray-xpmem' '--disable-memchecker' '--with-libevent=/home/ext/apps/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/libevent-2.1.12-o3rmktubqxmb2ivpw4m77wb7rig7vpbp' '--with-pmix=/scratch/phyprtek/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/pmix-4.2.3-g2w6wlloj2atbqtdzjsck5zbe5hq34mb' '--with-zlib=/home/ext/apps/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/zlib-1.2.11-kpcoxtpwrosnfihvjpijrbdzj4wkthv6' '--with-hwloc=/scratch/phyprtek/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/hwloc-2.9.1-myyazf5322iitzxnooufxw6nwk4xpmqw' '--disable-java' '--disable-mpi-java' '--with-gpfs=no' '--enable-dlopen' '--with-cuda=/scratch/phyprtek/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8.3.0/cuda-11.2.0-pfrllbd7e2hi5hxnx7vq33irwwqikjkc' '--enable-wrapper-rpath' '--disable-wrapper-runpath' '--disable-mpi-cxx' '--disable-cxx-exceptions' '--with-wrapper-ldflags=-Wl,-rpath,/home/ext/pub/compiler/gcc/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0 -Wl,-rpath,/home/ext/pub/compiler/gcc/8.3.0/lib64' Built by: tbhaxor Built on: Sun May 21 06:51:48 UTC 2023 Built host: login07 C bindings: yes C++ bindings: no Fort mpif.h: yes (all) Fort use mpi: yes (full: ignore TKR) Fort use mpi size: deprecated-ompi-info-value Fort use mpi_f08: yes Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the /scratch/tbhaxor/spack/lib/spack/env/gcc/gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality Fort mpi_f08 subarrays: no Java bindings: no Wrapper compiler rpath: rpath C compiler: /scratch/tbhaxor/spack/lib/spack/env/gcc/gcc C compiler absolute: C compiler family name: GNU C compiler version: 8.3.0 C++ compiler: /scratch/tbhaxor/spack/lib/spack/env/gcc/g++ C++ compiler absolute: none Fort compiler: /scratch/tbhaxor/spack/lib/spack/env/gcc/gfortran Fort compiler abs: Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::) Fort 08 assumed shape: yes Fort optional args: yes Fort INTERFACE: yes Fort ISO_FORTRAN_ENV: yes Fort STORAGE_SIZE: yes Fort BIND(C) (all): yes Fort ISO_C_BINDING: yes Fort SUBROUTINE BIND(C): yes Fort TYPE,BIND(C): yes Fort T,BIND(C,name="a"): yes Fort PRIVATE: yes Fort PROTECTED: yes Fort ABSTRACT: yes Fort ASYNCHRONOUS: yes Fort
PROCEDURE: yes Fort USE...ONLY: yes Fort C_FUNLOC: yes Fort f08 using wrappers: yes Fort MPI_SIZEOF: yes C profiling: yes C++ profiling: no Fort mpif.h profiling: yes Fort use mpi profiling: yes Fort use mpi_f08 prof: yes C++ exceptions: no Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes) Sparse Groups: no Internal debug support: no MPI interface warnings: yes MPI parameter check: runtime Memory profiling support: no Memory debugging support: no dl support: yes Heterogeneous support: no mpirun default --prefix: no MPI_WTIME support: native Symbol vis. support: yes Host topology support: yes IPv6 support: no MPI1 compatibility: yes MPI extensions: affinity, cuda, pcollreq FT Checkpoint support: no (checkpoint thread: no) C/R Enabled Debugging: no MPI_MAX_PROCESSOR_NAME: 256 MPI_MAX_ERROR_STRING: 256 MPI_MAX_OBJECT_NAME: 64 MPI_MAX_INFO_KEY: 36 MPI_MAX_INFO_VAL: 256 MPI_MAX_PORT_NAME: 1024 MPI_MAX_DATAREP_STRING: 128 MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.4) MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.4) MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.4) MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.4) MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.4) MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4) MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4) MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA filem: raw (MCA v2.1.0, 
API v2.0.0, Component v4.1.4) MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.4) MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA osc: pt2pt 
(MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.4) MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4) MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.4) MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.4) MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
```
Error message from mpirun:

```
There are not enough slots available in the system to satisfy the 160
slots that were requested by the application:

  ./code

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
```
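To see which nodes and slots this mpirun actually detected, Open MPI 4.x can print the allocation it discovered; a quick diagnostic rather than a fix:

```
# prints the node list and per-node slot counts that mpirun believes it has
mpirun --display-allocation -np 1 hostname
```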
SLURM job runtime information:

```
==========================================
SLURM_CLUSTER_NAME     = our_slurm
SLURM_ARRAY_JOB_ID     =
SLURM_ARRAY_TASK_ID    =
SLURM_ARRAY_TASK_COUNT =
SLURM_ARRAY_TASK_MAX   =
SLURM_ARRAY_TASK_MIN   =
SLURM_JOB_ACCOUNT      = our_slurm
SLURM_JOB_ID           = 180158
SLURM_JOB_NAME         = code_multinode
SLURM_JOB_NODELIST     = gpu[006-007,015,034]
SLURM_JOB_USER         = tbhaxor
SLURM_JOB_UID          = 15220
SLURM_JOB_PARTITION    = gpumultinode
SLURM_TASK_PID         = 28226
SLURM_SUBMIT_DIR       = /scratch/tbhaxor/code
SLURM_CPUS_ON_NODE     = 40
SLURM_NTASKS           = 160
SLURM_TASK_PID         = 28226
==========================================
```

Since it is not detecting the other hosts in the allocation, I cannot use --oversubscribe, as that floods the memory on a single node and everything fails.
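Another route, since this build is linked against an external PMIx, might be to let Slurm launch the ranks instead of mpirun; a sketch, assuming the cluster's Slurm was built with PMIx support (`srun --mpi=list` shows what is available):

```
# launch directly under Slurm; needs a PMIx-enabled slurmd on the compute nodes
srun --mpi=pmix -n $SLURM_NTASKS ./code
```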

tbhaxor commented 1 year ago

This is fixed by using a custom configuration; the stock spack installation still does not work well.