openpmix / prrte

PMIx Reference RunTime Environment (PRRTE)
https://pmix.org

PMI application runs with Slurm pmi2, not with prrte #1635

Closed anderbubble closed 1 year ago

anderbubble commented 1 year ago

Background information

What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)

$ rpm -q prrte
prrte-3.0.0-1.el8.x86_64

What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)

$ rpm -q pmix
pmix-4.2.2-1.el8.x86_64

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I am building OSU Micro-Benchmarks with spack containerize using the following environment:

spack:
  specs:
  - osu-micro-benchmarks
  - openmpi fabrics=ofi +pmi
  - libfabric fabrics=sockets,tcp,udp,psm2,verbs

  container:
    format: singularity

    images:
      os: centos:stream
      spack: v0.19.0

    strip: true

    os_packages:
      final:
        - libgfortran

    labels:
      app: "osu-micro-benchmarks"
      mpi: "openmpi"

I build the image with:

$ apptainer build --fakeroot osu-micro-benchmarks.sif <(spack containerize)

I can run this with Slurm.

$ srun --mpi=pmi2 --ntasks 2 --ntasks-per-node 1 --partition opa ./osu-micro-benchmarks.sif osu_init
# OSU MPI Init Test v7.0
nprocs: 2, min: 293 ms, max: 295 ms, avg: 294 ms

But I get an error when I attempt to run it with prrte.

$ prterun -n 2 --map-by=ppr:1:node --hostfile ~/janderson/workflows/util/prrte/hostfile.txt ./osu-micro-benchmarks.sif osu_init
--------------------------------------------------------------------------
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
1851085648 or higher.
--------------------------------------------------------------------------
c5.190935map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b0000) size 262144 failed: Resource temporarily unavailable
c5.190935osu_init: An unrecoverable error occurred while communicating with the driver
[c5:190935] *** Process received signal ***
[c5:190935] Signal: Aborted (6)
[c5:190935] Signal code:  (-6)
[c5:190935] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f8c6ec62cf0]
[c5:190935] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f8c6e8d9acf]
[c5:190935] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f8c6e8acea5]
[c5:190935] [ 3] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7f8c6c5af804]
[c5:190935] [ 4] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7f8c6c575e3e]
[c5:190935] [ 5] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7f8c6c576cdb]
[c5:190935] [ 6] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7f8c6c579353]
[c5:190935] [ 7] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7f8c6c57aa49]
[c5:190935] [ 8] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7f8c6dfdfb14]
[c5:190935] [ 9] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7f8c6dfe92be]
[c5:190935] [10] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7f8c6e2d0d2d]
[c5:190935] [11] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7f8c6e2c0b83]
[c5:190935] [12] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7f8c6ef47f42]
[c5:190935] [13] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f8c6ef46084]
[c5:190935] [14] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7f8c6f1105cc]
[c5:190935] [15] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f8c6ef1fa4e]
[c5:190935] [16] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c5:190935] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f8c6e8c5d85]
[c5:190935] [18] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c5:190935] *** End of error message ***
--------------------------------------------------------------------------
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
-1891646640 or higher.
--------------------------------------------------------------------------
c6.191679map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b0000) size 262144 failed: Resource temporarily unavailable
c6.191679osu_init: An unrecoverable error occurred while communicating with the driver
[c6:191679] *** Process received signal ***
[c6:191679] Signal: Aborted (6)
[c6:191679] Signal code:  (-6)
[c6:191679] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f518fb09cf0]
[c6:191679] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f518f780acf]
[c6:191679] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f518f753ea5]
[c6:191679] [ 3] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7f518d456804]
[c6:191679] [ 4] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7f518d41ce3e]
[c6:191679] [ 5] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7f518d41dcdb]
[c6:191679] [ 6] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7f518d420353]
[c6:191679] [ 7] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7f518d421a49]
[c6:191679] [ 8] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7f518ee86b14]
[c6:191679] [ 9] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7f518ee902be]
[c6:191679] [10] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7f518f177d2d]
[c6:191679] [11] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7f518f167b83]
[c6:191679] [12] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7f518fdeef42]
[c6:191679] [13] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f518fded084]
[c6:191679] [14] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7f518ffb75cc]
[c6:191679] [15] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f518fdc6a4e]
[c6:191679] [16] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c6:191679] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f518f76cd85]
[c6:191679] [18] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c6:191679] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node c5 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
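To make the backtrace above easier to read, here is a minimal Python sketch that condenses each process's frames to the library name and, where one was resolved, the symbol. It assumes the frame layout shown in this output (`[host:pid] [ n] path(symbol+offset)[address]`); the function name `summarize_backtrace` is hypothetical and not part of PRRTE or Open MPI.

```python
import os
import re

# Matches Open MPI backtrace frames such as:
#   [c5:190935] [ 7] /path/libpsm2.so.2(psm2_ep_open+0x209)[0x7f8c6c57aa49]
# The "(symbol+offset)" part is optional (the osu_init frames lack it).
FRAME_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] \[\s*(?P<index>\d+)\] "
    r"(?P<path>[^(\[]+)(?:\((?P<symbol>[^)]*)\))?\[0x[0-9a-f]+\]$"
)


def summarize_backtrace(log_text):
    """Return {(host, pid): [(frame_index, library, symbol_or_None), ...]}."""
    frames = {}
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a backtrace frame (banner, warning text, ...)
        symbol = match.group("symbol") or ""
        # "+0x47804" is only an offset; keep a name when one was resolved.
        name = symbol.split("+", 1)[0] or None
        key = (match.group("host"), int(match.group("pid")))
        frames.setdefault(key, []).append(
            (int(match.group("index")), os.path.basename(match.group("path")), name)
        )
    return frames
```

Applied to the output above, the frames for both processes run from libmpi.so.40 (MPI_Init, ompi_mpi_init, mca_bml_base_init, mca_bml_r2_component_init) through libopen-pal.so.40 (mca_btl_base_select) and libfabric.so.1 into libpsm2.so.2, where psm2_ep_open is the last named symbol before abort.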

anderbubble commented 1 year ago

@rhc54 can you advise? Am I missing something obvious here?

rhc54 commented 1 year ago

I have no earthly idea and am unfamiliar with the OFI code in OMPI - probably best to ask them over there.

anderbubble commented 1 year ago

For the sake of my demo, what MPI stack is most tested with prrte that I could use for now?

rhc54 commented 1 year ago

OMPI main seems to be working pretty well, but I don't know what transports are employed in their nightly regression testing. Note that the issue flagged is unlikely to be caused by anything in PRRTE; it is more likely a bug in OMPI's OFI integration code.

hppritcha commented 1 year ago

I'd suggest this experiment. Do the run using srun, but exporting the following environment variable to the container: export OMPI_MCA_mtl_base_verbose=100 and likewise for running with prterun.

The warning from Open MPI shows that it's trying to use the OFI MTL/BTL, but that in and of itself is not the direct cause of the failure.

Also, could you check your spack install to see which pmix was built into the container?

And one other thing, could you report the output from

srun --mpi=list

?

anderbubble commented 1 year ago

Thanks for the suggestions, @hppritcha!

Do the run using srun, but exporting the following environment variable to the container: export OMPI_MCA_mtl_base_verbose=100

$ env OMPI_MCA_mtl_base_verbose=100 srun --mpi=pmi2 --ntasks=2 --ntasks-per-node=1 --partition=opa ./osu-micro-benchmarks.sif osu_init
[c5:289002] mca: base: components_register: registering framework mtl components
[c5:289002] mca: base: components_register: found loaded component ofi
[c5:289002] mca: base: components_register: component ofi register function successful
[c5:289002] mca: base: components_open: opening mtl components
[c5:289002] mca: base: components_open: found loaded component ofi
[c5:289002] mca: base: components_open: component ofi open function successful
[c6:195799] mca: base: components_register: registering framework mtl components
[c6:195799] mca: base: components_register: found loaded component ofi
[c6:195799] mca: base: components_register: component ofi register function successful
[c6:195799] mca: base: components_open: opening mtl components
[c6:195799] mca: base: components_open: found loaded component ofi
[c6:195799] mca: base: components_open: component ofi open function successful
[c5:289002] mca:base:select: Auto-selecting mtl components
[c5:289002] mca:base:select:(  mtl) Querying component [ofi]
[c5:289002] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
[c5:289002] mca:base:select:(  mtl) Selected component [ofi]
[c5:289002] select: initializing mtl component ofi
[c5:289002] mtl_ofi_component.c:365: mtl:ofi:provider: hfi1_0
[c6:195799] mca:base:select: Auto-selecting mtl components
[c6:195799] mca:base:select:(  mtl) Querying component [ofi]
[c6:195799] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
[c6:195799] mca:base:select:(  mtl) Selected component [ofi]
[c6:195799] select: initializing mtl component ofi
[c6:195799] mtl_ofi_component.c:365: mtl:ofi:provider: hfi1_0
[c5:289002] select: init returned success
[c5:289002] select: component ofi selected
[c6:195799] select: init returned success
[c6:195799] select: component ofi selected
# OSU MPI Init Test v7.0
nprocs: 2, min: 288 ms, max: 292 ms, avg: 290 ms
[c6:195799] mca: base: close: component ofi closed
[c6:195799] mca: base: close: unloading component ofi
[c5:289002] mca: base: close: component ofi closed
[c5:289002] mca: base: close: unloading component ofi

and likewise for running with prterun.

$ prterun -x OMPI_MCA_mtl_base_verbose=100 -n 2 --map-by=ppr:1:node --hostfile ~/janderson/workflows/util/prrte/hostfile.txt ./osu-micro-benchmarks.sif osu_init
--------------------------------------------------------------------------
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
264684368 or higher.
--------------------------------------------------------------------------
c5.289266map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b0000) size 262144 failed: Resource temporarily unavailable
c5.289266osu_init: An unrecoverable error occurred while communicating with the driver
[c5:289266] *** Process received signal ***
[c5:289266] Signal: Aborted (6)
[c5:289266] Signal code:  (-6)
[c5:289266] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7ff310379cf0]
[c5:289266] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7ff30fff0acf]
[c5:289266] [ 2] /lib64/libc.so.6(abort+0x127)[0x7ff30ffc3ea5]
[c5:289266] [ 3] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7ff30dcc6804]
[c5:289266] [ 4] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7ff30dc8ce3e]
[c5:289266] [ 5] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7ff30dc8dcdb]
[c5:289266] [ 6] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7ff30dc90353]
[c5:289266] [ 7] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7ff30dc91a49]
[c5:289266] [ 8] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7ff30f6f6b14]
[c5:289266] [ 9] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7ff30f7002be]
[c5:289266] [10] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7ff30f9e7d2d]
[c5:289266] [11] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7ff30f9d7b83]
[c5:289266] [12] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7ff31065ef42]
[c5:289266] [13] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7ff31065d084]
[c5:289266] [14] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7ff3108275cc]
[c5:289266] [15] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7ff310636a4e]
[c5:289266] [16] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c5:289266] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7ff30ffdcd85]
[c5:289266] [18] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c5:289266] *** End of error message ***
--------------------------------------------------------------------------
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
392926032 or higher.
--------------------------------------------------------------------------
c6.196069map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b0000) size 262144 failed: Resource temporarily unavailable
c6.196069osu_init: An unrecoverable error occurred while communicating with the driver
[c6:196069] *** Process received signal ***
[c6:196069] Signal: Aborted (6)
[c6:196069] Signal code:  (-6)
[c6:196069] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7fa817dc6cf0]
[c6:196069] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fa817a3dacf]
[c6:196069] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fa817a10ea5]
[c6:196069] [ 3] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7fa815713804]
[c6:196069] [ 4] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7fa8156d9e3e]
[c6:196069] [ 5] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7fa8156dacdb]
[c6:196069] [ 6] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7fa8156dd353]
[c6:196069] [ 7] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7fa8156dea49]
[c6:196069] [ 8] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7fa817143b14]
[c6:196069] [ 9] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7fa81714d2be]
[c6:196069] [10] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7fa817434d2d]
[c6:196069] [11] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7fa817424b83]
[c6:196069] [12] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7fa8180abf42]
[c6:196069] [13] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fa8180aa084]
[c6:196069] [14] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7fa8182745cc]
[c6:196069] [15] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7fa818083a4e]
[c6:196069] [16] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c6:196069] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7fa817a29d85]
[c6:196069] [18] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c6:196069] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node c5 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

(The variable doesn't appear to have produced any additional output when exported via prterun.)
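One way to check that observation on saved output is to look for the framework registration lines that appear in the srun run. This is a minimal Python sketch, assuming the line format seen in the srun output above; `frameworks_with_verbose_output` is a hypothetical helper name.

```python
import re

# Matches the registration line seen in the srun output above, e.g.
#   [c5:289002] mca: base: components_register: registering framework mtl components
REGISTER_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] mca: base: components_register: "
    r"registering framework (?P<framework>\S+) components$"
)


def frameworks_with_verbose_output(log_text):
    """Return {host: sorted list of frameworks that logged a registration line}."""
    seen = {}
    for line in log_text.splitlines():
        match = REGISTER_RE.match(line.strip())
        if match:
            seen.setdefault(match.group("host"), set()).add(match.group("framework"))
    return {host: sorted(names) for host, names in seen.items()}
```

On the srun output this reports the mtl framework for both c5 and c6; on the prterun output it returns an empty dictionary, because none of those lines are present.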

could you check your spack install to see which pmix was built into the container

This should be the full spec that it's using (looks like pmix@4.1.2):

$ spack spec osu-micro-benchmarks ^openmpi fabrics=ofi +pmi ^libfabric fabrics=sockets,tcp,udp,psm2,verbs
Input spec
--------------------------------
osu-micro-benchmarks
    ^libfabric fabrics=psm2,sockets,tcp,udp,verbs
    ^openmpi+pmi fabrics=ofi

Concretized
--------------------------------
osu-micro-benchmarks@7.0%gcc@8.5.0~cuda~rocm build_system=autotools arch=linux-rocky8-zen
    ^openmpi@4.1.4%gcc@8.5.0~atomics~cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi+romio+rsh~singularity+static+vt+wrapper-rpath build_system=autotools fabrics=ofi schedulers=slurm arch=linux-rocky8-zen
        ^hwloc@2.8.0%gcc@8.5.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~oneapi-level-zero~opencl+pci~rocm build_system=autotools libs=shared,static arch=linux-rocky8-zen
            ^libpciaccess@0.16%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^util-macros@1.19.3%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^libxml2@2.10.3%gcc@8.5.0~python build_system=autotools arch=linux-rocky8-zen
                ^libiconv@1.16%gcc@8.5.0 build_system=autotools libs=shared,static arch=linux-rocky8-zen
                ^xz@5.2.7%gcc@8.5.0~pic build_system=autotools libs=shared,static arch=linux-rocky8-zen
            ^ncurses@6.3%gcc@8.5.0~symlinks+termlib abi=none build_system=autotools arch=linux-rocky8-zen
        ^libfabric@1.16.1%gcc@8.5.0~debug~kdreg build_system=autotools fabrics=psm2,sockets,tcp,udp,verbs arch=linux-rocky8-zen
            ^opa-psm2@11.2.230%gcc@8.5.0+avx2 build_system=makefile arch=linux-rocky8-zen
            ^rdma-core@41.0%gcc@8.5.0~ipo build_system=cmake build_type=RelWithDebInfo arch=linux-rocky8-zen
                ^cmake@3.25.0%gcc@8.5.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-rocky8-zen
                ^libnl@3.3.0%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                    ^flex@2.6.3%gcc@8.5.0+lex~nls build_system=autotools arch=linux-rocky8-zen
                        ^findutils@4.9.0%gcc@8.5.0 build_system=autotools patches=440b954 arch=linux-rocky8-zen
                ^py-docutils@0.19%gcc@8.5.0 build_system=python_pip arch=linux-rocky8-zen
                    ^py-pip@22.2.2%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                    ^py-setuptools@65.5.0%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                    ^py-wheel@0.37.1%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
        ^numactl@2.0.14%gcc@8.5.0 build_system=autotools patches=4e1d78c,62fc8a8,ff37630 arch=linux-rocky8-zen
            ^autoconf@2.69%gcc@8.5.0 build_system=autotools patches=35c4492,7793209,a49dd5b arch=linux-rocky8-zen
            ^automake@1.16.5%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^libtool@2.4.7%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^m4@1.4.19%gcc@8.5.0+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-rocky8-zen
                ^diffutils@3.8%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^libsigsegv@2.13%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
        ^openssh@9.1p1%gcc@8.5.0+gssapi build_system=autotools arch=linux-rocky8-zen
            ^krb5@1.20.1%gcc@8.5.0+shared build_system=autotools arch=linux-rocky8-zen
                ^bison@3.8.2%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^gettext@0.21.1%gcc@8.5.0+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools arch=linux-rocky8-zen
                    ^tar@1.34%gcc@8.5.0 build_system=autotools zip=pigz arch=linux-rocky8-zen
                        ^pigz@2.7%gcc@8.5.0 build_system=makefile arch=linux-rocky8-zen
                        ^zstd@1.5.2%gcc@8.5.0+programs build_system=makefile compression=none libs=shared,static arch=linux-rocky8-zen
            ^libedit@3.1-20210216%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^libxcrypt@4.4.33%gcc@8.5.0~obsolete_api build_system=autotools arch=linux-rocky8-zen
            ^openssl@1.1.1s%gcc@8.5.0~docs~shared build_system=generic certs=mozilla arch=linux-rocky8-zen
                ^ca-certificates-mozilla@2022-10-11%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
        ^perl@5.36.0%gcc@8.5.0+cpanm+shared+threads build_system=generic arch=linux-rocky8-zen
            ^berkeley-db@18.1.40%gcc@8.5.0+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc arch=linux-rocky8-zen
            ^bzip2@1.0.8%gcc@8.5.0~debug~pic+shared build_system=generic arch=linux-rocky8-zen
            ^gdbm@1.23%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
        ^pkgconf@1.8.0%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
        ^pmix@4.1.2%gcc@8.5.0~docs+pmi_backwards_compatibility~python~restful build_system=autotools arch=linux-rocky8-zen
            ^libevent@2.1.12%gcc@8.5.0+openssl build_system=autotools arch=linux-rocky8-zen
        ^slurm@21-08-8-2%gcc@8.5.0~gtk~hdf5~hwloc~mariadb~pmix+readline~restd build_system=autotools sysconfdir=PREFIX/etc arch=linux-rocky8-zen
            ^curl@7.85.0%gcc@8.5.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-rocky8-zen
            ^glib@2.74.1%gcc@8.5.0~libmount build_system=generic tracing=none arch=linux-rocky8-zen
                ^elfutils@0.188%gcc@8.5.0~bzip2~debuginfod+nls~xz~zstd build_system=autotools arch=linux-rocky8-zen
                ^libffi@3.4.2%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^meson@0.64.0%gcc@8.5.0 build_system=python_pip patches=0f0b1bd arch=linux-rocky8-zen
                ^ninja@1.11.1%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                    ^re2c@2.2%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                ^pcre2@10.39%gcc@8.5.0~jit+multibyte build_system=autotools arch=linux-rocky8-zen
                ^python@3.10.8%gcc@8.5.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060 arch=linux-rocky8-zen
                    ^expat@2.4.8%gcc@8.5.0+libbsd build_system=autotools arch=linux-rocky8-zen
                        ^libbsd@0.11.5%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                            ^libmd@1.0.4%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                    ^sqlite@3.40.0%gcc@8.5.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-rocky8-zen
                    ^util-linux-uuid@2.38.1%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^json-c@0.16%gcc@8.5.0~ipo build_system=cmake build_type=RelWithDebInfo arch=linux-rocky8-zen
            ^lz4@1.9.4%gcc@8.5.0 build_system=makefile libs=shared,static arch=linux-rocky8-zen
            ^munge@0.5.15%gcc@8.5.0 build_system=autotools localstatedir=PREFIX/var arch=linux-rocky8-zen
                ^libgcrypt@1.10.1%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                    ^libgpg-error@1.46%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                        ^gawk@5.1.1%gcc@8.5.0~nls build_system=autotools arch=linux-rocky8-zen
                            ^gmp@6.2.1%gcc@8.5.0 build_system=autotools libs=shared,static arch=linux-rocky8-zen
                            ^mpfr@4.1.0%gcc@8.5.0 build_system=autotools libs=shared,static arch=linux-rocky8-zen
                                ^autoconf-archive@2022.02.11%gcc@8.5.0 build_system=autotools patches=139214f arch=linux-rocky8-zen
                                ^texinfo@7.0%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^readline@8.1.2%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
        ^zlib@1.2.13%gcc@8.5.0+optimize+pic+shared build_system=makefile arch=linux-rocky8-zen
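Since the concretized spec is long, a small script can pull out the package versions of interest. This is a minimal Python sketch that assumes each concretized line has the form `^name@version%compiler...` as in the output above; `spec_versions` is a hypothetical helper, not a Spack API.

```python
import re

# A concretized spec line looks like:
#   ^pmix@4.1.2%gcc@8.5.0~docs+pmi_backwards_compatibility~python~restful ...
# Only the leading "name@version" is captured; the compiler version that
# follows "%" is ignored because the match is anchored at the line start.
SPEC_RE = re.compile(r"^\s*\^?(?P<name>[A-Za-z0-9_.-]+)@(?P<version>[^%~+\s]+)")


def spec_versions(spec_text):
    """Return {package_name: version} for each concretized spec line."""
    versions = {}
    for line in spec_text.splitlines():
        match = SPEC_RE.match(line)
        if match:
            # Keep the first occurrence if a package is listed more than once.
            versions.setdefault(match.group("name"), match.group("version"))
    return versions
```

On the output above this gives pmix 4.1.2 inside the container, whereas the host RPM reported at the top of this issue is pmix-4.2.2. Whether that difference matters here is not established in this thread.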

srun --mpi=list

$ srun --mpi=list
MPI plugin types are...
    cray_shasta
    none
    pmi2
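To check this list from a script, the plugin names can be read from the indented lines. This is a minimal Python sketch assuming the layout shown above (a header line followed by indented plugin names); other Slurm versions may format the list differently, and `parse_mpi_plugins` is a hypothetical helper.

```python
def parse_mpi_plugins(output):
    """Return the plugin names from `srun --mpi=list` style output."""
    plugins = []
    for line in output.splitlines():
        # Plugin names are the indented, non-empty lines below the header.
        if line.startswith((" ", "\t")) and line.strip():
            plugins.append(line.strip())
    return plugins
```

For the output above this returns cray_shasta, none and pmi2, so a membership test for "pmix" is false on this cluster.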

hppritcha commented 1 year ago

Hmm.. this is getting somewhat bizarre. Seems to me that somehow the HFI device isn't being initialized properly when running under prterun.

Could you try turning on FI level debug?

export FI_LOG_LEVEL=debug

or if that doesn't say much

export FI_LOG_LEVEL=info

and then run with srun and prterun again?

Is this some HPE EX system using HFI/OPX rather than slingshot?
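For the FI_LOG_LEVEL comparison suggested above, the two logs are easier to diff once the per-run process ID and timestamp are stripped. This is a minimal Python sketch, assuming libfabric lines begin with `libfabric:<pid>:<timestamp>::` as in the debug output posted in this thread; `normalize` and `only_in_first` are hypothetical helper names.

```python
import re

# libfabric log lines start with "libfabric:<pid>:<timestamp>::", e.g.
#   libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable hook=<not set>
PREFIX_RE = re.compile(r"^libfabric:\d+:\d+::")


def normalize(log_text):
    """Keep only libfabric lines and drop their pid and timestamp fields."""
    lines = []
    for line in log_text.splitlines():
        if PREFIX_RE.match(line):
            lines.append(PREFIX_RE.sub("libfabric::", line))
    return lines


def only_in_first(first_log, second_log):
    """Return normalized lines of first_log absent from second_log, in order."""
    second = set(normalize(second_log))
    seen = set()
    result = []
    for line in normalize(first_log):
        if line not in second and line not in seen:
            seen.add(line)
            result.append(line)
    return result
```

Calling `only_in_first(srun_log, prterun_log)` and then swapping the arguments lists the messages unique to each launcher. Lines that carry per-process values, such as the default cache size, will also show up as differences.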

anderbubble commented 1 year ago

Is this some HPE EX system using HFI/OPX rather than slingshot?

This is just a bog-standard 4-node (+FM) OPA cluster.

[ciq@admin1 omb-openmpi-ofi]$ env FI_LOG_LEVEL=debug srun --mpi=pmi2 --ntasks=2 --ntasks-per-node=1 --partition=opa ./osu-micro-benchmarks.sif osu_init
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable perf_cntr=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable hook=<not set>
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable mr_cache_max_size=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable mr_cache_max_count=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable mr_cache_monitor=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:289469:1673478625::core:mr:ofi_default_cache_size():78<info> default cache size=1054213584
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable provider=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable universe_size=<not set>
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable provider_path=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable perf_cntr=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable hook=<not set>
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable mr_cache_max_size=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable mr_cache_max_count=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable mr_cache_monitor=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:196260:1673478625::core:mr:ofi_default_cache_size():78<info> default cache size=1054213600
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable provider=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable universe_size=<not set>
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable provider_path=<not set>
libfabric:289469:1673478625::psm2:core:fi_psm2_ini():691<info> build options: HAVE_PSM2_SRC=0, HAVE_PSM2_AM_REGISTER_HANDLERS_2=1, HAVE_PSM2_MQ_FP_MSG=0, PSMX2_USE_REQ_CONTEXT=0
libfabric:289469:1673478625::psm2:core:psmx2_init_env():88<info> Open MPI job key: 000001020000001a-0000001a0000001a.
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable name_server=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable tagged_rma=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable uuid=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable delay=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable timeout=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable conn_timeout=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable prog_interval=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable prog_affinity=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable inject_size=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable lazy_conn=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable disconnect=<not set>
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable tag_layout=<not set>
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: psm2 (116.10)
libfabric:196260:1673478625::psm2:core:fi_psm2_ini():691<info> build options: HAVE_PSM2_SRC=0, HAVE_PSM2_AM_REGISTER_HANDLERS_2=1, HAVE_PSM2_MQ_FP_MSG=0, PSMX2_USE_REQ_CONTEXT=0
libfabric:196260:1673478625::psm2:core:psmx2_init_env():88<info> Open MPI job key: 000001020000001a-0000001a0000001a.
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable name_server=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable tagged_rma=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable uuid=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable delay=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable timeout=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable conn_timeout=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable prog_interval=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable prog_affinity=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable inject_size=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable lazy_conn=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable disconnect=<not set>
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable tag_layout=<not set>
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: psm2 (116.10)
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable tx_iov_limit=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable rx_iov_limit=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable inline_size=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable min_rnr_timer=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable use_odp=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable prefer_xrc=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable xrcd_filename=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable cqread_bunch_size=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable gid_idx=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable device_name=<not set>
libfabric:289469:1673478625::verbs:core:vrb_read_params():717<info> dmabuf support is disabled
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable dgram_use_name_server=<not set>
libfabric:289469:1673478625::verbs:core:fi_param_get_():279<info> variable dgram_name_server_port=<not set>
libfabric:289469:1673478625::verbs:fabric:verbs_devs_print():888<info> list of verbs devices found for FI_EP_MSG:
libfabric:289469:1673478625::verbs:fabric:verbs_devs_print():892<info> #1 hfi1_0 - IPoIB addresses:
libfabric:289469:1673478625::verbs:fabric:verbs_devs_print():902<info>  10.10.12.41
libfabric:289469:1673478625::verbs:fabric:verbs_devs_print():902<info>  fe80::211:7501:178:d9a5
libfabric:289469:1673478625::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:289469:1673478625::verbs:fabric:vrb_get_device_attrs():565<info> XRC support unavailable in device: hfi1_0
libfabric:289469:1673478625::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: verbs (116.10)
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: udp (116.10)
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: sockets (116.10)
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:289469:1673478625::tcp:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: tcp (116.10)
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable prov_name=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable poll_fairness=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable poll_cooldown=<not set>
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable disable_auto_progress=<not set>
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: net (116.10)
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_perf (116.10)
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_debug (116.10)
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:289469:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:289469:1673478625::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_hmem (116.10)
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_dmabuf_peer_mem (116.10)
libfabric:289469:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_noop (116.10)
libfabric:289469:1673478625::psm2:core:psmx2_getinfo():523<info> 
libfabric:289469:1673478625::psm2:core:psmx2_init_prov_info():254<info> RMA only instance included
libfabric:289469:1673478625::psm2:core:psmx2_init_prov_info():268<info> TAG60 instance included
libfabric:289469:1673478625::psm2:core:psmx2_init_prov_info():281<info> TAG64 instance included
libfabric:289469:1673478625::psm2:core:psmx2_init_lib():257<info> PSM2 header version = (2, 2)
libfabric:289469:1673478625::psm2:core:psmx2_init_lib():259<info> PSM2 library version = (2, 2)
libfabric:289469:1673478625::psm2:core:psmx2_init_lib():262<info> PSM2 multi-ep feature enabled.
libfabric:289469:1673478625::psm2:core:psmx2_update_hfi_info():427<info> hfi1 units: total 1, active 1; hfi1 contexts: total 128, free 128
libfabric:289469:1673478625::psm2:core:psmx2_update_hfi_info():439<info> Tx/Rx contexts: 128 in total, 128 available.
libfabric:289469:1673478625::psm2:core:psmx2_alter_prov_info():449<info> 3 instances available, 2 with CQ data flag set
libfabric:289469:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #1 hfi1_0
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #2 hfi1_0-dgram
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:289469:1673478625::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289469:1673478625::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::sockets:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.41, iface name: eth0, speed: 25000
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.41, iface name: ib0, speed: 100000
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:fed7:3bf0, iface name: eth0, speed: 25000
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:d9a5, iface name: ib0, speed: 100000
libfabric:289469:1673478625::sockets:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:289469:1673478625::sockets:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:289469:1673478625::sockets:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.41, speed 100000
libfabric:289469:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.41, iface name: eth0, speed: 25000
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.41, iface name: ib0, speed: 100000
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:fed7:3bf0, iface name: eth0, speed: 25000
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:d9a5, iface name: ib0, speed: 100000
libfabric:289469:1673478625::net:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:289469:1673478625::net:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:289469:1673478625::net:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.41, speed 100000
libfabric:289469:1673478625::core:mr:ofi_monitor_import():845<info> setting imported memory monitor as default
libfabric:289469:1673478625::psm2:core:psmx2_fabric():90<info> 
libfabric:289469:1673478625::core:core:fi_fabric_():1341<info> Opened fabric: psm2
libfabric:289469:1673478625::psm2:domain:psmx2_domain_open():356<info> 
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:289469:1673478625::psm2:core:psmx2_init_tag_layout():171<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:289469:1673478625::psm2:av:psmx2_av_open():1060<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_alloc():298<info> uuid: 1A000000-0201-0000-1A00-00001A000000
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_alloc():303<info> ep_open_opts: unit=0 port=0
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable tx_iov_limit=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable rx_iov_limit=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable inline_size=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable min_rnr_timer=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable use_odp=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable prefer_xrc=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable xrcd_filename=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable cqread_bunch_size=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable gid_idx=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable device_name=<not set>
libfabric:196260:1673478625::verbs:core:vrb_read_params():717<info> dmabuf support is disabled
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable dgram_use_name_server=<not set>
libfabric:196260:1673478625::verbs:core:fi_param_get_():279<info> variable dgram_name_server_port=<not set>
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_alloc():333<info> epid: 0000000000060b02 (tx+rx)
libfabric:289469:1673478625::psm2:core:psmx2_am_init():116<info> epid 0000000000060b02
libfabric:289469:1673478625::core:core:ofi_ns_add_local_name():372<warn> Cannot add local name - name server uninitialized
libfabric:289469:1673478625::psm2:core:psmx2_am_init():116<info> epid 0000000000060b02
libfabric:196260:1673478625::verbs:fabric:verbs_devs_print():888<info> list of verbs devices found for FI_EP_MSG:
libfabric:196260:1673478625::verbs:fabric:verbs_devs_print():892<info> #1 hfi1_0 - IPoIB addresses:
libfabric:196260:1673478625::verbs:fabric:verbs_devs_print():902<info>  10.10.12.42
libfabric:196260:1673478625::verbs:fabric:verbs_devs_print():902<info>  fe80::211:7501:178:1c8b
libfabric:196260:1673478625::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:196260:1673478625::verbs:fabric:vrb_get_device_attrs():565<info> XRC support unavailable in device: hfi1_0
libfabric:196260:1673478625::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: verbs (116.10)
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: udp (116.10)
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: sockets (116.10)
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:196260:1673478625::tcp:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: tcp (116.10)
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable prov_name=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable poll_fairness=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable poll_cooldown=<not set>
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable disable_auto_progress=<not set>
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: net (116.10)
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_perf (116.10)
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_debug (116.10)
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:196260:1673478625::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:196260:1673478625::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_hmem (116.10)
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_dmabuf_peer_mem (116.10)
libfabric:196260:1673478625::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_noop (116.10)
libfabric:196260:1673478625::psm2:core:psmx2_getinfo():523<info> 
libfabric:196260:1673478625::psm2:core:psmx2_init_prov_info():254<info> RMA only instance included
libfabric:196260:1673478625::psm2:core:psmx2_init_prov_info():268<info> TAG60 instance included
libfabric:196260:1673478625::psm2:core:psmx2_init_prov_info():281<info> TAG64 instance included
libfabric:196260:1673478625::psm2:core:psmx2_init_lib():257<info> PSM2 header version = (2, 2)
libfabric:196260:1673478625::psm2:core:psmx2_init_lib():259<info> PSM2 library version = (2, 2)
libfabric:196260:1673478625::psm2:core:psmx2_init_lib():262<info> PSM2 multi-ep feature enabled.
libfabric:196260:1673478625::psm2:core:psmx2_update_hfi_info():427<info> hfi1 units: total 1, active 1; hfi1 contexts: total 128, free 128
libfabric:196260:1673478625::psm2:core:psmx2_update_hfi_info():439<info> Tx/Rx contexts: 128 in total, 128 available.
libfabric:196260:1673478625::psm2:core:psmx2_alter_prov_info():449<info> 3 instances available, 2 with CQ data flag set
libfabric:196260:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #1 hfi1_0
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #2 hfi1_0-dgram
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:196260:1673478625::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196260:1673478625::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::sockets:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.42, iface name: eth0, speed: 25000
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.42, iface name: ib0, speed: 100000
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:feb2:f0da, iface name: eth0, speed: 25000
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:1c8b, iface name: ib0, speed: 100000
libfabric:196260:1673478625::sockets:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:196260:1673478625::sockets:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:196260:1673478625::sockets:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.42, speed 100000
libfabric:196260:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.42, iface name: eth0, speed: 25000
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.42, iface name: ib0, speed: 100000
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:feb2:f0da, iface name: eth0, speed: 25000
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:1c8b, iface name: ib0, speed: 100000
libfabric:196260:1673478625::net:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:196260:1673478625::net:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:196260:1673478625::net:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.42, speed 100000
libfabric:196260:1673478625::core:mr:ofi_monitor_import():845<info> setting imported memory monitor as default
libfabric:196260:1673478625::psm2:core:psmx2_fabric():90<info> 
libfabric:196260:1673478625::core:core:fi_fabric_():1341<info> Opened fabric: psm2
libfabric:196260:1673478625::psm2:domain:psmx2_domain_open():356<info> 
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:196260:1673478625::psm2:core:psmx2_init_tag_layout():171<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:196260:1673478625::psm2:av:psmx2_av_open():1060<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_alloc():298<info> uuid: 1A000000-0201-0000-1A00-00001A000000
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_alloc():303<info> ep_open_opts: unit=0 port=0
libfabric:289469:1673478625::psm2:core:psmx2_getinfo():523<info> 
libfabric:289469:1673478625::psm2:core:psmx2_init_prov_info():268<info> TAG60 instance included
libfabric:289469:1673478625::psm2:core:psmx2_alter_prov_info():449<info> 1 instances available, 1 with CQ data flag set
libfabric:289469:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #1 hfi1_0
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #2 hfi1_0-dgram
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289469:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:289469:1673478625::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289469:1673478625::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::sockets:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.41, iface name: eth0, speed: 25000
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.41, iface name: ib0, speed: 100000
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:fed7:3bf0, iface name: eth0, speed: 25000
libfabric:289469:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:d9a5, iface name: ib0, speed: 100000
libfabric:289469:1673478625::sockets:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:289469:1673478625::sockets:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:289469:1673478625::sockets:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.41, speed 100000
libfabric:289469:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289469:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289469:1673478625::net:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.41, iface name: eth0, speed: 25000
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.41, iface name: ib0, speed: 100000
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:fed7:3bf0, iface name: eth0, speed: 25000
libfabric:289469:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:d9a5, iface name: ib0, speed: 100000
libfabric:289469:1673478625::net:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:289469:1673478625::net:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:289469:1673478625::net:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.41, speed 100000
libfabric:289469:1673478625::psm2:core:psmx2_fabric():90<info> 
libfabric:289469:1673478625::core:core:fi_fabric_():1341<info> Opened fabric: psm2
libfabric:289469:1673478625::psm2:domain:psmx2_domain_open():356<info> 
libfabric:289469:1673478625::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:289469:1673478625::psm2:core:psmx2_init_tag_layout():122<info> tag layout already set opened domain.
libfabric:289469:1673478625::psm2:core:psmx2_init_tag_layout():171<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_alloc():298<info> uuid: 1A000000-0201-0000-1A00-00001A000000
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_alloc():303<info> ep_open_opts: unit=0 port=0
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_alloc():333<info> epid: 0000000000050b02 (tx+rx)
libfabric:196260:1673478625::psm2:core:psmx2_am_init():116<info> epid 0000000000050b02
libfabric:196260:1673478625::core:core:ofi_ns_add_local_name():372<warn> Cannot add local name - name server uninitialized
libfabric:196260:1673478625::psm2:core:psmx2_am_init():116<info> epid 0000000000050b02
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_alloc():333<info> epid: 0000000000060c02 (tx+rx)
libfabric:289469:1673478625::psm2:ep_data:psmx2_ep_optimize_ops():92<info> tagged ops optimized for op_flags=0 and directed receive
libfabric:289469:1673478625::core:core:ofi_ns_add_local_name():372<warn> Cannot add local name - name server uninitialized
libfabric:289469:1673478625::psm2:av:psmx2_av_open():1060<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:289469:1673478625::psm2:ep_data:psmx2_ep_optimize_ops():92<info> tagged ops optimized for op_flags=0 and directed receive
libfabric:289469:1673478625::psm2:ep_data:psmx2_ep_optimize_ops():92<info> tagged ops optimized for op_flags=0 and directed receive
libfabric:196260:1673478625::psm2:core:psmx2_getinfo():523<info> 
libfabric:196260:1673478625::psm2:core:psmx2_init_prov_info():268<info> TAG60 instance included
libfabric:196260:1673478625::psm2:core:psmx2_alter_prov_info():449<info> 1 instances available, 1 with CQ data flag set
libfabric:196260:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #1 hfi1_0
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #2 hfi1_0-dgram
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196260:1673478625::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:196260:1673478625::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196260:1673478625::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::core:core:fi_getinfo_():1143<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::sockets:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.42, iface name: eth0, speed: 25000
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.42, iface name: ib0, speed: 100000
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:feb2:f0da, iface name: eth0, speed: 25000
libfabric:196260:1673478625::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:1c8b, iface name: ib0, speed: 100000
libfabric:196260:1673478625::sockets:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:196260:1673478625::sockets:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:196260:1673478625::sockets:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.42, speed 100000
libfabric:196260:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196260:1673478625::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196260:1673478625::net:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.42, iface name: eth0, speed: 25000
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.42, iface name: ib0, speed: 100000
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:feb2:f0da, iface name: eth0, speed: 25000
libfabric:196260:1673478625::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:1c8b, iface name: ib0, speed: 100000
libfabric:196260:1673478625::net:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:196260:1673478625::net:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:196260:1673478625::net:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.42, speed 100000
libfabric:196260:1673478625::psm2:core:psmx2_fabric():90<info> 
libfabric:196260:1673478625::core:core:fi_fabric_():1341<info> Opened fabric: psm2
libfabric:196260:1673478625::psm2:domain:psmx2_domain_open():356<info> 
libfabric:196260:1673478625::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:196260:1673478625::psm2:core:psmx2_init_tag_layout():122<info> tag layout already set opened domain.
libfabric:196260:1673478625::psm2:core:psmx2_init_tag_layout():171<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_alloc():298<info> uuid: 1A000000-0201-0000-1A00-00001A000000
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_alloc():303<info> ep_open_opts: unit=0 port=0
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_alloc():333<info> epid: 0000000000050c02 (tx+rx)
libfabric:196260:1673478625::psm2:ep_data:psmx2_ep_optimize_ops():92<info> tagged ops optimized for op_flags=0 and directed receive
libfabric:196260:1673478625::core:core:ofi_ns_add_local_name():372<warn> Cannot add local name - name server uninitialized
libfabric:196260:1673478625::psm2:av:psmx2_av_open():1060<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:196260:1673478625::psm2:ep_data:psmx2_ep_optimize_ops():92<info> tagged ops optimized for op_flags=0 and directed receive
libfabric:196260:1673478625::psm2:ep_data:psmx2_ep_optimize_ops():92<info> tagged ops optimized for op_flags=0 and directed receive
# OSU MPI Init Test v7.0
nprocs: 2, min: 294 ms, max: 295 ms, avg: 294 ms
libfabric:196260:1673478625::psm2:av:psmx2_av_disconnect_addr():641<info> trx_ctxt_id 1 epid 60c02 epaddr 0x1b87890
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_free():190<info> epid: 0000000000050c02 (tx+rx)
libfabric:289469:1673478625::psm2:av:psmx2_av_disconnect_addr():641<info> trx_ctxt_id 1 epid 50c02 epaddr 0xeba570
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_free():190<info> epid: 0000000000060c02 (tx+rx)
libfabric:196260:1673478625::psm2:domain:psmx2_domain_close():185<info> refcnt=0
libfabric:196260:1673478625::psm2:core:psmx2_fabric_close():48<info> refcnt=3
libfabric:196260:1673478625::psm2:core:psmx2_trx_ctxt_free():190<info> epid: 0000000000050b02 (tx+rx)
libfabric:289469:1673478625::psm2:domain:psmx2_domain_close():185<info> refcnt=0
libfabric:289469:1673478625::psm2:core:psmx2_fabric_close():48<info> refcnt=3
libfabric:289469:1673478625::psm2:core:psmx2_trx_ctxt_free():190<info> epid: 0000000000060b02 (tx+rx)
libfabric:196260:1673478625::psm2:domain:psmx2_domain_close():185<info> refcnt=0
libfabric:196260:1673478625::psm2:core:psmx2_fabric_close():48<info> refcnt=0
libfabric:289469:1673478625::psm2:domain:psmx2_domain_close():185<info> refcnt=0
libfabric:289469:1673478625::psm2:core:psmx2_fabric_close():48<info> refcnt=0
libfabric:196260:1673478626::psm2:core:psmx2_fini():656<info> 
libfabric:289469:1673478626::psm2:core:psmx2_fini():656<info>

$ prterun -x FI_LOG_LEVEL=debug -n 2 --map-by=ppr:1:node --hostfile ~/janderson/workflows/util/prrte/hostfile.txt ./osu-micro-benchmarks.sif osu_init
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable perf_cntr=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable hook=<not set>
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable mr_cache_max_size=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable mr_cache_max_count=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable mr_cache_monitor=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:289593:1673478705::core:mr:ofi_default_cache_size():78<info> default cache size=1054213584
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable provider=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable universe_size=<not set>
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable provider_path=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable perf_cntr=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable hook=<not set>
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable mr_cache_max_size=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable mr_cache_max_count=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable mr_cache_monitor=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable mr_ze_cache_monitor_enabled=<not set>
libfabric:196390:1673478705::core:mr:ofi_default_cache_size():78<info> default cache size=1054213600
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable provider=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable universe_size=<not set>
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable provider_path=<not set>
libfabric:289593:1673478705::psm2:core:fi_psm2_ini():691<info> build options: HAVE_PSM2_SRC=0, HAVE_PSM2_AM_REGISTER_HANDLERS_2=1, HAVE_PSM2_MQ_FP_MSG=0, PSMX2_USE_REQ_CONTEXT=0
libfabric:289593:1673478705::psm2:core:psmx2_init_env():88<info> Open MPI job key: 7596790b1eea50ef-922e890c0e284de9.
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable name_server=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable tagged_rma=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable uuid=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable delay=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable timeout=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable conn_timeout=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable prog_interval=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable prog_affinity=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable inject_size=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable lazy_conn=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable disconnect=<not set>
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable tag_layout=<not set>
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: psm2 (116.10)
libfabric:196390:1673478705::psm2:core:fi_psm2_ini():691<info> build options: HAVE_PSM2_SRC=0, HAVE_PSM2_AM_REGISTER_HANDLERS_2=1, HAVE_PSM2_MQ_FP_MSG=0, PSMX2_USE_REQ_CONTEXT=0
libfabric:196390:1673478705::psm2:core:psmx2_init_env():88<info> Open MPI job key: 7596790b1eea50ef-922e890c0e284de9.
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable name_server=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable tagged_rma=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable uuid=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable delay=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable timeout=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable conn_timeout=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable prog_interval=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable prog_affinity=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable inject_size=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable lazy_conn=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable disconnect=<not set>
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable tag_layout=<not set>
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: psm2 (116.10)
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable tx_iov_limit=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable rx_iov_limit=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable inline_size=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable min_rnr_timer=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable use_odp=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable prefer_xrc=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable xrcd_filename=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable cqread_bunch_size=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable gid_idx=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable device_name=<not set>
libfabric:289593:1673478705::verbs:core:vrb_read_params():717<info> dmabuf support is disabled
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable dgram_use_name_server=<not set>
libfabric:289593:1673478705::verbs:core:fi_param_get_():279<info> variable dgram_name_server_port=<not set>
libfabric:289593:1673478705::verbs:fabric:verbs_devs_print():888<info> list of verbs devices found for FI_EP_MSG:
libfabric:289593:1673478705::verbs:fabric:verbs_devs_print():892<info> #1 hfi1_0 - IPoIB addresses:
libfabric:289593:1673478705::verbs:fabric:verbs_devs_print():902<info>  10.10.12.41
libfabric:289593:1673478705::verbs:fabric:verbs_devs_print():902<info>  fe80::211:7501:178:d9a5
libfabric:289593:1673478705::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:289593:1673478705::verbs:fabric:vrb_get_device_attrs():565<info> XRC support unavailable in device: hfi1_0
libfabric:289593:1673478705::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: verbs (116.10)
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: udp (116.10)
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: sockets (116.10)
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:289593:1673478705::tcp:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: tcp (116.10)
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable prov_name=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable poll_fairness=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable poll_cooldown=<not set>
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable disable_auto_progress=<not set>
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: net (116.10)
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_perf (116.10)
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_debug (116.10)
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:289593:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:289593:1673478705::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_hmem (116.10)
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_dmabuf_peer_mem (116.10)
libfabric:289593:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_noop (116.10)
libfabric:289593:1673478705::psm2:core:psmx2_getinfo():523<info> 
libfabric:289593:1673478705::psm2:core:psmx2_init_prov_info():254<info> RMA only instance included
libfabric:289593:1673478705::psm2:core:psmx2_init_prov_info():268<info> TAG60 instance included
libfabric:289593:1673478705::psm2:core:psmx2_init_prov_info():281<info> TAG64 instance included
libfabric:289593:1673478705::psm2:core:psmx2_init_lib():257<info> PSM2 header version = (2, 2)
libfabric:289593:1673478705::psm2:core:psmx2_init_lib():259<info> PSM2 library version = (2, 2)
libfabric:289593:1673478705::psm2:core:psmx2_init_lib():262<info> PSM2 multi-ep feature enabled.
libfabric:289593:1673478705::psm2:core:psmx2_update_hfi_info():427<info> hfi1 units: total 1, active 1; hfi1 contexts: total 128, free 128
libfabric:289593:1673478705::psm2:core:psmx2_update_hfi_info():439<info> Tx/Rx contexts: 128 in total, 128 available.
libfabric:289593:1673478705::psm2:core:psmx2_alter_prov_info():449<info> 3 instances available, 2 with CQ data flag set
libfabric:289593:1673478705::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #1 hfi1_0
libfabric:289593:1673478705::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289593:1673478705::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #2 hfi1_0-dgram
libfabric:289593:1673478705::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289593:1673478705::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::core:core:fi_getinfo_():1143<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:289593:1673478705::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289593:1673478705::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::core:core:fi_getinfo_():1143<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:289593:1673478705::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289593:1673478705::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289593:1673478705::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::core:core:fi_getinfo_():1143<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:289593:1673478705::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:289593:1673478705::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289593:1673478705::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::sockets:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289593:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.41, iface name: eth0, speed: 25000
libfabric:289593:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.41, iface name: ib0, speed: 100000
libfabric:289593:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:fed7:3bf0, iface name: eth0, speed: 25000
libfabric:289593:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:d9a5, iface name: ib0, speed: 100000
libfabric:289593:1673478705::sockets:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:289593:1673478705::sockets:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:289593:1673478705::sockets:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.41, speed 100000
libfabric:289593:1673478705::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289593:1673478705::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:289593:1673478705::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:289593:1673478705::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:289593:1673478705::net:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:289593:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.41, iface name: eth0, speed: 25000
libfabric:289593:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.41, iface name: ib0, speed: 100000
libfabric:289593:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:fed7:3bf0, iface name: eth0, speed: 25000
libfabric:289593:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:d9a5, iface name: ib0, speed: 100000
libfabric:289593:1673478705::net:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:289593:1673478705::net:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:289593:1673478705::net:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.41, speed 100000
--------------------------------------------------------------------------
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
1402151760 or higher.
--------------------------------------------------------------------------
libfabric:289593:1673478705::core:mr:ofi_monitor_import():845<info> setting imported memory monitor as default
libfabric:289593:1673478705::psm2:core:psmx2_fabric():90<info> 
libfabric:289593:1673478705::core:core:fi_fabric_():1341<info> Opened fabric: psm2
libfabric:289593:1673478705::psm2:domain:psmx2_domain_open():356<info> 
libfabric:289593:1673478705::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:289593:1673478705::psm2:core:psmx2_init_tag_layout():171<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:289593:1673478705::psm2:av:psmx2_av_open():1060<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:289593:1673478705::psm2:core:psmx2_trx_ctxt_alloc():298<info> uuid: EF50EA1E-0B79-9675-E94D-280E0C892E92
libfabric:289593:1673478705::psm2:core:psmx2_trx_ctxt_alloc():303<info> ep_open_opts: unit=0 port=0
c5.289593map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b0000) size 262144 failed: Resource temporarily unavailable
c5.289593osu_init: An unrecoverable error occurred while communicating with the driver
[c5:289593] *** Process received signal ***
[c5:289593] Signal: Aborted (6)
[c5:289593] Signal code:  (-6)
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable tx_iov_limit=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable rx_iov_limit=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable inline_size=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable min_rnr_timer=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable use_odp=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable prefer_xrc=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable xrcd_filename=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable cqread_bunch_size=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable gid_idx=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable device_name=<not set>
libfabric:196390:1673478705::verbs:core:vrb_read_params():717<info> dmabuf support is disabled
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable dgram_use_name_server=<not set>
libfabric:196390:1673478705::verbs:core:fi_param_get_():279<info> variable dgram_name_server_port=<not set>
[c5:289593] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f575403fcf0]
[c5:289593] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f5753cb6acf]
[c5:289593] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f5753c89ea5]
[c5:289593] [ 3] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7f575198c804]
[c5:289593] [ 4] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7f5751952e3e]
[c5:289593] [ 5] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7f5751953cdb]
[c5:289593] [ 6] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7f5751956353]
[c5:289593] [ 7] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7f5751957a49]
[c5:289593] [ 8] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7f57533bcb14]
[c5:289593] [ 9] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7f57533c62be]
[c5:289593] [10] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7f57536add2d]
[c5:289593] [11] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7f575369db83]
[c5:289593] [12] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7f5754324f42]
[c5:289593] [13] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f5754323084]
[c5:289593] [14] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7f57544ed5cc]
[c5:289593] [15] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f57542fca4e]
[c5:289593] [16] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c5:289593] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f5753ca2d85]
[c5:289593] [18] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c5:289593] *** End of error message ***
libfabric:196390:1673478705::verbs:fabric:verbs_devs_print():888<info> list of verbs devices found for FI_EP_MSG:
libfabric:196390:1673478705::verbs:fabric:verbs_devs_print():892<info> #1 hfi1_0 - IPoIB addresses:
libfabric:196390:1673478705::verbs:fabric:verbs_devs_print():902<info>  10.10.12.42
libfabric:196390:1673478705::verbs:fabric:verbs_devs_print():902<info>  fe80::211:7501:178:1c8b
libfabric:196390:1673478705::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:196390:1673478705::verbs:fabric:vrb_get_device_attrs():565<info> XRC support unavailable in device: hfi1_0
libfabric:196390:1673478705::verbs:fabric:vrb_get_device_attrs():619<info> device hfi1_0: first found active port is 1
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: verbs (116.10)
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: udp (116.10)
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: sockets (116.10)
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:196390:1673478705::tcp:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: tcp (116.10)
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable prov_name=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable port_high_range=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable port_low_range=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable nodelay=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable staging_sbuf_size=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable prefetch_rbuf_size=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable zerocopy_size=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable poll_fairness=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable poll_cooldown=<not set>
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable disable_auto_progress=<not set>
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: net (116.10)
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_perf (116.10)
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_debug (116.10)
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_ZE not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_NEURON not supported
libfabric:196390:1673478705::core:core:ofi_hmem_init():249<info> Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:196390:1673478705::core:core:fi_param_get_():279<info> variable hmem_disable_p2p=<not set>
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_hmem (116.10)
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_dmabuf_peer_mem (116.10)
libfabric:196390:1673478705::core:core:ofi_register_provider():468<info> registering provider: ofi_hook_noop (116.10)
libfabric:196390:1673478705::psm2:core:psmx2_getinfo():523<info> 
libfabric:196390:1673478705::psm2:core:psmx2_init_prov_info():254<info> RMA only instance included
libfabric:196390:1673478705::psm2:core:psmx2_init_prov_info():268<info> TAG60 instance included
libfabric:196390:1673478705::psm2:core:psmx2_init_prov_info():281<info> TAG64 instance included
libfabric:196390:1673478705::psm2:core:psmx2_init_lib():257<info> PSM2 header version = (2, 2)
libfabric:196390:1673478705::psm2:core:psmx2_init_lib():259<info> PSM2 library version = (2, 2)
libfabric:196390:1673478705::psm2:core:psmx2_init_lib():262<info> PSM2 multi-ep feature enabled.
libfabric:196390:1673478705::psm2:core:psmx2_update_hfi_info():427<info> hfi1 units: total 1, active 1; hfi1 contexts: total 128, free 128
libfabric:196390:1673478705::psm2:core:psmx2_update_hfi_info():439<info> Tx/Rx contexts: 128 in total, 128 available.
libfabric:196390:1673478705::psm2:core:psmx2_alter_prov_info():449<info> 3 instances available, 2 with CQ data flag set
libfabric:196390:1673478705::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #1 hfi1_0
libfabric:196390:1673478705::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196390:1673478705::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::verbs:fabric:vrb_get_matching_info():1522<info> checking domain: #2 hfi1_0-dgram
libfabric:196390:1673478705::verbs:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::verbs:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196390:1673478705::verbs:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::core:core:fi_getinfo_():1143<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:196390:1673478705::udp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::udp:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196390:1673478705::udp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::core:core:fi_getinfo_():1143<info> fi_getinfo: provider udp returned -61 (No data available)
libfabric:196390:1673478705::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196390:1673478705::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::tcp:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::tcp:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196390:1673478705::tcp:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::core:core:fi_getinfo_():1143<info> fi_getinfo: provider tcp returned -61 (No data available)
libfabric:196390:1673478705::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_DGRAM
libfabric:196390:1673478705::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::sockets:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::sockets:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196390:1673478705::sockets:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::sockets:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196390:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.42, iface name: eth0, speed: 25000
libfabric:196390:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.42, iface name: ib0, speed: 100000
libfabric:196390:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:feb2:f0da, iface name: eth0, speed: 25000
libfabric:196390:1673478705::sockets:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:1c8b, iface name: ib0, speed: 100000
libfabric:196390:1673478705::sockets:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:196390:1673478705::sockets:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:196390:1673478705::sockets:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.42, speed 100000
libfabric:196390:1673478705::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196390:1673478705::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::net:core:ofi_check_ep_type():667<info> unsupported endpoint type
libfabric:196390:1673478705::net:core:ofi_check_ep_type():668<info> Supported: FI_EP_MSG
libfabric:196390:1673478705::net:core:ofi_check_ep_type():668<info> Requested: FI_EP_RDM
libfabric:196390:1673478705::net:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:196390:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.10.42, iface name: eth0, speed: 25000
libfabric:196390:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: 10.10.12.42, iface name: ib0, speed: 100000
libfabric:196390:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::b226:28ff:feb2:f0da, iface name: eth0, speed: 25000
libfabric:196390:1673478705::net:core:ofi_get_list_of_addr():2053<info> Available addr: fe80::211:7501:178:1c8b, iface name: ib0, speed: 100000
libfabric:196390:1673478705::net:core:ofi_insert_loopback_addr():1884<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:196390:1673478705::net:core:ofi_insert_loopback_addr():1899<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:196390:1673478705::net:core:util_getinfo_ifs():334<info> Chosen addr for using: 10.10.12.42, speed 100000
--------------------------------------------------------------------------
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
595116880 or higher.
--------------------------------------------------------------------------
libfabric:196390:1673478705::core:mr:ofi_monitor_import():845<info> setting imported memory monitor as default
libfabric:196390:1673478705::psm2:core:psmx2_fabric():90<info> 
libfabric:196390:1673478705::core:core:fi_fabric_():1341<info> Opened fabric: psm2
libfabric:196390:1673478705::psm2:domain:psmx2_domain_open():356<info> 
libfabric:196390:1673478705::psm2:core:fi_param_get_():279<info> variable lock_level=<not set>
libfabric:196390:1673478705::psm2:core:psmx2_init_tag_layout():171<info> use tag64: tag_mask: FFFFFFFFFFFFFFFF, data_mask: 0FFFFFFF
libfabric:196390:1673478705::psm2:av:psmx2_av_open():1060<info> FI_AV_MAP asked, but force FI_AV_TABLE for multi-EP support
libfabric:196390:1673478705::psm2:core:psmx2_trx_ctxt_alloc():298<info> uuid: EF50EA1E-0B79-9675-E94D-280E0C892E92
libfabric:196390:1673478705::psm2:core:psmx2_trx_ctxt_alloc():303<info> ep_open_opts: unit=0 port=0
c6.196390map_hfi_mem: mmap of rcvhdr_bufbase (0xdabbad00040b0000) size 262144 failed: Resource temporarily unavailable
c6.196390osu_init: An unrecoverable error occurred while communicating with the driver
[c6:196390] *** Process received signal ***
[c6:196390] Signal: Aborted (6)
[c6:196390] Signal code:  (-6)
[c6:196390] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f9123e99cf0]
[c6:196390] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f9123b10acf]
[c6:196390] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f9123ae3ea5]
[c6:196390] [ 3] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x47804)[0x7f91217e6804]
[c6:196390] [ 4] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xde3e)[0x7f91217ace3e]
[c6:196390] [ 5] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0xecdb)[0x7f91217adcdb]
[c6:196390] [ 6] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(+0x11353)[0x7f91217b0353]
[c6:196390] [ 7] /opt/software/linux-centos8-zen/gcc-8.5.0/opa-psm2-11.2.230-k66aykcpei5ijztxoafbzaqmplh3pu42/lib/libpsm2.so.2(psm2_ep_open+0x209)[0x7f91217b1a49]
[c6:196390] [ 8] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0x9cb14)[0x7f9123216b14]
[c6:196390] [ 9] /opt/software/linux-centos8-zen/gcc-8.5.0/libfabric-1.16.1-apf5ltuppxfa5sbg4vjtv7xv3gpj6gpj/lib/libfabric.so.1(+0xa62be)[0x7f91232202be]
[c6:196390] [10] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(+0x8cd2d)[0x7f9123507d2d]
[c6:196390] [11] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libopen-pal.so.40(mca_btl_base_select+0xe3)[0x7f91234f7b83]
[c6:196390] [12] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x7f912417ef42]
[c6:196390] [13] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f912417d084]
[c6:196390] [14] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(ompi_mpi_init+0x64c)[0x7f91243475cc]
[c6:196390] [15] /opt/software/linux-centos8-zen/gcc-8.5.0/openmpi-4.1.4-u2e2bpyhubhxg7tq5j3tctorf4ep4xiv/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f9124156a4e]
[c6:196390] [16] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x4015be]
[c6:196390] [17] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f9123afcd85]
[c6:196390] [18] /opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init[0x40176e]
[c6:196390] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node c5 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
anderbubble commented 1 year ago

I just realized that I didn't do anything specific to build prrte (or pmix) against libfabric or psm2. Do I need to?

I installed from RPMs that I built like this:

mock --chain pmix-4.2.2-1.src.rpm prrte-3.0.0-1.src.rpm --isolation=simple --localrepo=repo
hppritcha commented 1 year ago

Can you force spack to use a newer pmix than 4.1.2?

rhc54 commented 1 year ago

Any updates on this? I'm not sure what there is for prrte to do?

anderbubble commented 1 year ago

Sorry, I thought I responded before: I tried forcing spack to use the current master branch of pmix, but then it failed to compile.

I'm left wondering why Slurm can start a given PMI application but PRRTE can't. Maybe there's simply something I don't understand about how PRRTE / PMIx works, but I expected to be able to use PRRTE as a drop-in replacement for starting PMI applications without Slurm.

If this is my misunderstanding, then I'll stop trying to get PRRTE to work in this case; but if this is meant to work, I'm happy to provide whatever data I can that would be useful in determining why it's not working. I just don't know what else to provide.

rhc54 commented 1 year ago

> Sorry, I thought I responded before: I tried forcing spack to use the current master branch of pmix, but then it failed to compile.

I'm sorry - what failed to compile? Your app? OMPI?

> I expected to be able to use PRRTE as a drop-in replacement for starting PMI applications without Slurm.

People are doing precisely that every day, so this isn't a core problem. Quite frankly, this is the first we've heard of any problem, and the problem appears to be something to do with PSM2 (as opposed to PRRTE or PMIx). My guess is that there is something in PSM2 that is the culprit here.

Have you tried raising this with the OMPI community so someone from the PSM2 community might see it?

rhc54 commented 1 year ago

Couple of things occurred to me - apologies if you have already tried them:

anderbubble commented 1 year ago

I'm not trying to be difficult; I just wanted to check my assumptions against actual expectations for PRRTE, to make sure I wasn't working against intent. I'm glad to know that's not the case.

One thing I might still be wrong about, though: I was under the impression that PMIx is backwards-compatible with PMI-1 and PMI-2. I had further interpreted this to mean that I should be able to start a PMI-1 or PMI-2 application with PRRTE. But perhaps I have that backwards: perhaps it's only that a PMI-1 or PMI-2 resource manager / server should be able to start a PMIx application? If you could confirm the expected behavior in these directions, that would be helpful for my understanding.

I'll go back and try to replicate my experience again, and provide an updated account of my experience. Thanks for your help.

rhc54 commented 1 year ago

> I'm not trying to be difficult; I just wanted to check my assumptions against actual expectations for PRRTE, to make sure I wasn't working against intent. I'm glad to know that's not the case.

No worries - I didn't take it that way. Just clarifying that we know this basically works, so trying to understand the difference here.

> One thing I might still be wrong about, though: I was under the impression that PMIx is backwards-compatible with PMI-1 and PMI-2. I had further interpreted this to mean that I should be able to start a PMI-1 or PMI-2 application with PRRTE.

Ah - no, that's not true I'm afraid. First off, there is no standard wire protocol - so if you build something against (for example) the Slurm PMI-1 library, it won't speak to a PMIx server in PRRTE (or anywhere else for that matter). Similarly, if you try to use the PMI-1 wrapper we used to provide with PMIx (so you can call PMI-1 functions in your library), then that won't speak to a PMI-1 server - it would only talk to a PMIx server.

So bottom line is that it really is best to match chickens with chickens - trying to match chickens with turkeys isn't likely to end well. PRRTE is a PMIx-based system and has no knowledge of PMI-1/2, so it cannot start an application based on PMI-1/2.
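
One rough way to see which PMI client an MPI build will try to use is to look at the shared libraries it links against. The sketch below is only an illustration: `classify_pmi` is a hypothetical helper name, and it assumes the usual library names (`libpmix.so` for PMIx, `libpmi.so` and `libpmi2.so` for the Slurm PMI-1/PMI-2 clients). An MPI that loads its PMI support as plugins may show nothing here, so "none" is not conclusive.

```shell
#!/bin/sh
# Hypothetical helper: classify the PMI client libraries named in
# `ldd`-style output read from stdin.
# Prints a space-separated subset of "pmix pmi2 pmi1", or "none".
classify_pmi() {
    awk '
        /libpmix\.so/ { pmix = 1 }   # PMIx client library
        /libpmi2\.so/ { pmi2 = 1 }   # PMI-2 client library (assumed name)
        /libpmi\.so/  { pmi1 = 1 }   # PMI-1 client library (assumed name)
        END {
            out = ""
            if (pmix) out = out " pmix"
            if (pmi2) out = out " pmi2"
            if (pmi1) out = out " pmi1"
            if (out == "") out = " none"
            print substr(out, 2)     # drop the leading space
        }'
}
```

It could be used as `ldd /path/to/libmpi.so | classify_pmi`, with the path replaced by the MPI library actually in use. If the result contains only `pmi1` or `pmi2`, the build is unlikely to talk to the PMIx server inside PRRTE.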

anderbubble commented 1 year ago

@rhc54 I've returned to this and tried to simplify my use case:

So here's my current spack environment:

spack:
  specs:
  - osu-micro-benchmarks
  - mpich pmi=pmix

  container:
    format: singularity

    images:
      os: centos:stream
      spack: v0.19.0

    strip: true

    os_packages:
      final:
        - libgfortran

    labels:
      app: "osu-micro-benchmarks"
      mpi: "mpich"

And how I'm building the container:

apptainer build --fakeroot omb-mpich-pmix.sif <(spack containerize)

But trying to run gives an assertion error:

[janderson@admin1 omb-mpich]$ prterun --host c5,c6,c7 --np 2  ./omb-mpich-pmix.sif osu_init
Assertion failed in file src/util/mpir_localproc.c at line 181: node_id < num_nodes
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x5337d4) [0x7f4d3c3677d4]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x43d454) [0x7f4d3c271454]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x47bc98) [0x7f4d3c2afc98]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x3c734e) [0x7f4d3c1fb34e]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x3c7958) [0x7f4d3c1fb958]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x3c3280) [0x7f4d3c1f7280]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x3f35c5) [0x7f4d3c2275c5]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(+0x3f3c45) [0x7f4d3c227c45]
/opt/software/linux-centos8-zen/gcc-8.5.0/mpich-4.0.2-lnfsvy4tcggpygg6uu35scnzv55xhfgu/lib/libmpi.so.12(MPI_Init+0x1e) [0x7f4d3bf0947e]
/opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init() [0x400dce]
/lib64/libc.so.6(__libc_start_main+0xe5) [0x7f4d3baa9d85]
/opt/view/libexec/osu-micro-benchmarks/mpi/startup/osu_init() [0x400f6e]
Abort(1) on node 1: Internal error
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [prterun-admin1-2823137@1,1]
  Exit code:    1
--------------------------------------------------------------------------

You said

> People are doing precisely that every day, so this isn't a core problem.

Can you confirm whether this is commonly done with apptainer or singularity? I'm thinking the issue might be some part of the state not making its way to the containerized mpi/application. But I'm not an expert in how the internals work.
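
One way to test the "state not reaching the container" idea might be to check whether the launcher's PMIx environment is visible to the launched process. This is a minimal sketch, not a definitive diagnostic: `check_pmix_env` is a hypothetical helper name, and it only looks at `PMIX_RANK` and `PMIX_NAMESPACE`, two of the variables a PMIx server exports to the clients it starts.

```shell
#!/bin/sh
# Hypothetical helper: report whether the PMIx launcher's environment
# reached this process. Only PMIX_RANK and PMIX_NAMESPACE are checked.
check_pmix_env() {
    if [ -n "${PMIX_RANK:-}" ] && [ -n "${PMIX_NAMESPACE:-}" ]; then
        echo "pmix rank=${PMIX_RANK} nspace=${PMIX_NAMESPACE}"
    else
        echo "no PMIx environment"
    fi
}
```

Launching a script that calls `check_pmix_env` under `prterun` in place of `osu_init`, assuming the container's runscript can execute it, would show whether each rank sees its own rank and namespace. If the variables are present but the application still misbehaves, the environment is probably not the missing piece.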

rhc54 commented 1 year ago

There are people running containers underneath PRRTE, but I have no visibility into their container technology or how they set them up. I am likewise ignorant of MPICH, though I'd be suspicious of that "mpir_localproc" assertion.

If I were you, I'd start by trying to run that application outside of a container to ensure you have things wired up correctly before adding container complexity to the equation.

hppritcha commented 1 year ago

I've used ompi 5.0.x with embedded prrte/openpmix in a charliecloud container, and used an external prrte/openpmix of the same version to launch the container. We also built slurm against the same openpmix and used srun --mpi=pmix to launch the container as well.

anderbubble commented 1 year ago

@rhc54 I had the same realization, so I built the application (still with spack, but outside of any container) and tried again. I also switched to openmpi.

I was able to run osu_hello with OpenMPI's mpirun/mpiexec, but if I try to run it with prterun I get multiple independent single-rank runs, rather than a single coordinated MPI application with multiple ranks.

I understand this has got to be something I'm doing wrong; but I don't understand what it could be.

[janderson@admin1 omb-openmpi]$ spack install osu-micro-benchmarks ^openmpi +pmi +legacylaunchers
[...]
[+] /home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/slurm-22-05-7-1-xtg6cyt2iykghrfb6p25cxngou3h7h36
==> Installing openmpi-4.1.4-nqb3647lbifcrcp346rfacnnvfhhfbeg
==> No binary for openmpi-4.1.4-nqb3647lbifcrcp346rfacnnvfhhfbeg found: installing from source
==> Fetching https://mirror.spack.io/_source-cache/archive/92/92912e175fd1234368c8730c03f4996fe5942e7479bb1d10059405e7f2b3930d.tar.bz2
==> No patches needed for openmpi  
==> openmpi: Executing phase: 'autoreconf'     
==> openmpi: Executing phase: 'configure'
==> openmpi: Executing phase: 'build'                          
==> openmpi: Executing phase: 'install'
==> openmpi: Successfully installed openmpi-4.1.4-nqb3647lbifcrcp346rfacnnvfhhfbeg
  Stage: 3.32s.  Autoreconf: 0.00s.  Configure: 1m 34.04s.  Build: 2m 16.44s.  Install: 14.85s.  Total: 4m 8.93s
[+] /home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/openmpi-4.1.4-nqb3647lbifcrcp346rfacnnvfhhfbeg
==> Installing osu-micro-benchmarks-7.0.1-emqw3ggw6p4wq5cif2fyybtp4wwomcki   
==> No binary for osu-micro-benchmarks-7.0.1-emqw3ggw6p4wq5cif2fyybtp4wwomcki found: installing from source
==> Using cached archive: /home/janderson/spack/var/spack/cache/_source-cache/archive/04/04954aea082ba1b90a461ffab82a3cee43fe2d5a60fed99f5cb4585ac7da8c66.tar.gz
==> No patches needed for osu-micro-benchmarks        
==> osu-micro-benchmarks: Executing phase: 'autoreconf'
==> osu-micro-benchmarks: Executing phase: 'configure'
==> osu-micro-benchmarks: Executing phase: 'build'
==> osu-micro-benchmarks: Executing phase: 'install'
==> osu-micro-benchmarks: Successfully installed osu-micro-benchmarks-7.0.1-emqw3ggw6p4wq5cif2fyybtp4wwomcki

[janderson@admin1 omb-openmpi]$ for host in c{5..7}; do echo $host; done >hostfile.txt
[janderson@admin1 omb-openmpi]$ /home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/openmpi-4.1.4-nqb3647lbifcrcp346rfacnnvfhhfbeg/bin/mpiexec -n 2 --hostfile hostfile.txt /home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/osu-micro-benchmarks-7.0.1-emqw3ggw6p4wq5cif2fyybtp4wwomcki/libexec/osu-micro-benchmarks/mpi/startup/osu_hello 
# OSU MPI Hello World Test v7.0
This is a test with 2 processes

[janderson@admin1 omb-openmpi]$ prterun --host c5,c6,c7 --np 2  /home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/osu-micro-benchmarks-7.0.1-emqw3ggw6p4wq5cif2fyybtp4wwomcki/libexec/osu-micro-benchmarks/mpi/startup/osu_hello
# OSU MPI Hello World Test v7.0
This is a test with 1 processes
# OSU MPI Hello World Test v7.0
This is a test with 1 processes
rhc54 commented 1 year ago

The problem is here: spack install osu-micro-benchmarks ^openmpi +pmi +legacylaunchers

I suspect that means you installed an OMPI build with the old PMI-1/2 support instead of PMIx. I'm not familiar with spack - is there some rationale for why it must be installed from there? Some reason not to just download and build OMPI yourself, where you have a little more direct control over how it gets built?

anderbubble commented 1 year ago

I'm using spack to try to make sure that my builds are consistent and repeatable, including all their dependencies, and to try to show that what I'm building in the container is the same as what I'm building outside the container. The spack package for openmpi only has a pmi boolean, so I inferred that it covers all PMI flavors, including PMIx, given that this is the concrete spec (note that it pulled in pmix):

[janderson@admin1 omb-openmpi]$ spack spec osu-micro-benchmarks ^openmpi +pmi +legacylaunchers
Input spec
--------------------------------
osu-micro-benchmarks
    ^openmpi+legacylaunchers+pmi

Concretized
--------------------------------
osu-micro-benchmarks@7.0.1%gcc@8.5.0~cuda~graphing~papi~rocm build_system=autotools arch=linux-rocky8-zen
    ^openmpi@4.1.4%gcc@8.5.0~atomics~cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java+legacylaunchers~lustre~memchecker~orterunprefix+pmi+romio+rsh~singularity+static+vt+wrapper-rpath build_system=autotools fabrics=none schedulers=slurm arch=linux-rocky8-zen                                                                      
        ^hwloc@2.9.0%gcc@8.5.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~oneapi-level-zero~opencl+pci~rocm build_system=autotools libs=shared,static arch=linux-rocky8-zen
            ^libpciaccess@0.16%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^util-macros@1.19.3%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^libxml2@2.10.3%gcc@8.5.0~python build_system=autotools arch=linux-rocky8-zen
                ^libiconv@1.16%gcc@8.5.0 build_system=autotools libs=shared,static arch=linux-rocky8-zen
                ^xz@5.2.7%gcc@8.5.0~pic build_system=autotools libs=shared,static arch=linux-rocky8-zen
            ^ncurses@6.3%gcc@8.5.0~symlinks+termlib abi=none build_system=autotools arch=linux-rocky8-zen
        ^numactl@2.0.14%gcc@8.5.0 build_system=autotools patches=4e1d78c,62fc8a8,ff37630 arch=linux-rocky8-zen
            ^autoconf@2.69%gcc@8.5.0 build_system=autotools patches=35c4492,7793209,a49dd5b arch=linux-rocky8-zen
            ^automake@1.16.5%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^libtool@2.4.7%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^m4@1.4.19%gcc@8.5.0+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-rocky8-zen
                ^diffutils@3.8%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^libsigsegv@2.13%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
        ^openssh@9.1p1%gcc@8.5.0+gssapi build_system=autotools arch=linux-rocky8-zen
            ^krb5@1.20.1%gcc@8.5.0+shared build_system=autotools arch=linux-rocky8-zen
                ^bison@3.8.2%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^gettext@0.21.1%gcc@8.5.0+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools arch=linux-rocky8-zen
                    ^tar@1.34%gcc@8.5.0 build_system=autotools zip=pigz arch=linux-rocky8-zen
                        ^pigz@2.7%gcc@8.5.0 build_system=makefile arch=linux-rocky8-zen
                        ^zstd@1.5.2%gcc@8.5.0+programs build_system=makefile compression=none libs=shared,static arch=linux-rocky8-zen
            ^libedit@3.1-20210216%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^libxcrypt@4.4.33%gcc@8.5.0~obsolete_api build_system=autotools arch=linux-rocky8-zen
            ^openssl@1.1.1s%gcc@8.5.0~docs~shared build_system=generic certs=mozilla arch=linux-rocky8-zen
                ^ca-certificates-mozilla@2022-10-11%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
        ^perl@5.36.0%gcc@8.5.0+cpanm+open+shared+threads build_system=generic arch=linux-rocky8-zen
            ^berkeley-db@18.1.40%gcc@8.5.0+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc arch=linux-rocky8-zen
            ^bzip2@1.0.8%gcc@8.5.0~debug~pic+shared build_system=generic arch=linux-rocky8-zen
            ^gdbm@1.23%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
        ^pkgconf@1.8.0%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
        ^pmix@4.1.2%gcc@8.5.0~docs+pmi_backwards_compatibility~python~restful build_system=autotools arch=linux-rocky8-zen
            ^libevent@2.1.12%gcc@8.5.0+openssl build_system=autotools arch=linux-rocky8-zen
        ^slurm@22-05-7-1%gcc@8.5.0~gtk~hdf5~hwloc~mariadb~pmix+readline~restd build_system=autotools sysconfdir=PREFIX/etc arch=linux-rocky8-zen
            ^curl@7.85.0%gcc@8.5.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-rocky8-zen
            ^glib@2.74.3%gcc@8.5.0~libmount build_system=generic tracing=none arch=linux-rocky8-zen
                ^elfutils@0.188%gcc@8.5.0~bzip2~debuginfod+nls~xz~zstd build_system=autotools arch=linux-rocky8-zen
                ^libffi@3.4.2%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                ^meson@1.0.0%gcc@8.5.0 build_system=python_pip patches=0f0b1bd arch=linux-rocky8-zen
                    ^py-pip@22.2.2%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                    ^py-setuptools@65.5.0%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                    ^py-wheel@0.37.1%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                ^ninja@1.11.1%gcc@8.5.0+re2c build_system=generic arch=linux-rocky8-zen
                    ^re2c@2.2%gcc@8.5.0 build_system=generic arch=linux-rocky8-zen
                ^pcre2@10.42%gcc@8.5.0~jit+multibyte build_system=autotools arch=linux-rocky8-zen
                ^python@3.10.8%gcc@8.5.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060 arch=linux-rocky8-zen                                                                                       
                    ^expat@2.5.0%gcc@8.5.0+libbsd build_system=autotools arch=linux-rocky8-zen
                        ^libbsd@0.11.5%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                            ^libmd@1.0.4%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                    ^sqlite@3.40.0%gcc@8.5.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-rocky8-zen
                    ^util-linux-uuid@2.38.1%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^json-c@0.16%gcc@8.5.0~ipo build_system=cmake build_type=RelWithDebInfo arch=linux-rocky8-zen
                ^cmake@3.25.1%gcc@8.5.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-rocky8-zen
            ^lz4@1.9.4%gcc@8.5.0 build_system=makefile libs=shared,static arch=linux-rocky8-zen
            ^munge@0.5.15%gcc@8.5.0 build_system=autotools localstatedir=PREFIX/var arch=linux-rocky8-zen
                ^libgcrypt@1.10.1%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                    ^libgpg-error@1.46%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
                        ^gawk@5.1.1%gcc@8.5.0~nls build_system=autotools arch=linux-rocky8-zen
                            ^gmp@6.2.1%gcc@8.5.0+cxx build_system=autotools libs=shared,static arch=linux-rocky8-zen
                            ^mpfr@4.1.0%gcc@8.5.0 build_system=autotools libs=shared,static arch=linux-rocky8-zen
                                ^autoconf-archive@2022.02.11%gcc@8.5.0 build_system=autotools patches=139214f arch=linux-rocky8-zen
                                ^texinfo@7.0%gcc@8.5.0 build_system=autotools arch=linux-rocky8-zen
            ^readline@8.2%gcc@8.5.0 build_system=autotools patches=bbf97f1 arch=linux-rocky8-zen
        ^zlib@1.2.13%gcc@8.5.0+optimize+pic+shared build_system=makefile arch=linux-rocky8-zen

I can try to build it without spack; it just is likely to take me longer to get to it.

rhc54 commented 1 year ago

I can't advise you there - I can only report what I see. prterun appears to be behaving correctly, and there is no reason to believe there is a problem there. Your last observation is consistent with OMPI not being built with PMIx support, which is why I suggest doing it manually to ensure you really know how it was built.
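For reference, the symptom in question (each rank reporting a world size of 1 instead of 2) can be checked mechanically. The following is only a sketch: `check_world_size` is a hypothetical helper, not part of PRRTE or OMPI, and it assumes the `osu_hello` output format shown earlier in this thread.

```shell
# Hypothetical helper (not part of PRRTE or OMPI): read osu_hello output on
# stdin and compare each reported world size against the expected rank count.
check_world_size() {
  expected="$1"
  # Pull N out of every "This is a test with N processes" line.
  sizes=$(sed -n 's/^This is a test with \([0-9][0-9]*\) processes$/\1/p')
  if [ -z "$sizes" ]; then
    echo "no world-size lines found"
    return 2
  fi
  for n in $sizes; do
    if [ "$n" -ne "$expected" ]; then
      echo "singleton-like: a rank reported world size $n, expected $expected"
      return 1
    fi
  done
  echo "ok: every reporting rank saw world size $expected"
}

# Sample input copied from the prterun run earlier in this thread.
printf '%s\n' \
  '# OSU MPI Hello World Test v7.0' \
  'This is a test with 1 processes' \
  '# OSU MPI Hello World Test v7.0' \
  'This is a test with 1 processes' | check_world_size 2 || true
```

A mismatch here only confirms that the ranks did not wire up into one job; it does not say why.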

I would suggest a couple of possible steps forward:

I'm not sure we can be of any further help to you, though I wish we had a solution.

anderbubble commented 1 year ago

I re-ran the spack install with additional logging output to show the ./configure command.

==> [2023-02-07-15:14:56.920648] '/tmp/janderson/spack-stage/spack-stage-openmpi-4.1.4-nqb3647lbifcrcp346rfacnnvfhhfbeg/spack-src/configure' '--prefix=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/openmpi-4.1.4-nqb3647lbifcrcp346rfacnnvfhhfbeg' '--enable-shared' '--disable-silent-rules' '--disable-builtin-atomics' '--with-pmi=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/slurm-22-05-7-1-xtg6cyt2iykghrfb6p25cxngou3h7h36' '--enable-static' '--enable-mpi1-compatibility' '--without-mxm' '--without-xpmem' '--without-psm2' '--without-verbs' '--without-psm' '--without-hcoll' '--without-knem' '--without-fca' '--without-ucx' '--without-ofi' '--without-cma' '--without-cray-xpmem' '--without-lsf' '--without-sge' '--without-tm' '--without-loadleveler' '--without-alps' '--with-slurm' '--disable-memchecker' '--with-libevent=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/libevent-2.1.12-euakxsu44ntuszldvih2o4b36h7nc4mc' '--with-pmix=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/pmix-4.1.2-32ueni2oqpubkbuvq2j73zlw4tzdbznu' '--with-zlib=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/zlib-1.2.13-jragaowg5v42ht6srly4x5ylooj3zc2w' '--with-hwloc=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/hwloc-2.9.0-qaaazekwc2jbttur52z7lkvdvovtyurm' '--disable-java' '--disable-mpi-java' '--with-gpfs=no' '--without-cuda' '--enable-wrapper-rpath' '--disable-wrapper-runpath' '--disable-mpi-cxx' '--disable-cxx-exceptions'

Note the presence of --with-pmix=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0/pmix-4.1.2-32ueni2oqpubkbuvq2j73zlw4tzdbznu in that command. That seems to indicate, to me, that it is building PMIx support, so I don't think that's the issue.
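To make the PMI-related options easier to see in a long configure line, they can be filtered out on their own. This is a minimal sketch: `pmi_flags` is a hypothetical helper, and the sample input reuses only the two PMI options plus `--with-slurm` from the logged command above.

```shell
# Hypothetical helper: read a quoted configure command line on stdin and
# print only the --with-pmi / --with-pmix options, one per line.
pmi_flags() {
  tr ' ' '\n' | tr -d "'" | grep -E '^--with-pmix?(=|$)'
}

# Install prefix as it appears in the log above.
prefix=/home/janderson/spack/opt/spack/linux-rocky8-zen/gcc-8.5.0

echo "'--with-pmi=$prefix/slurm-22-05-7-1-xtg6cyt2iykghrfb6p25cxngou3h7h36' '--with-slurm' '--with-pmix=$prefix/pmix-4.1.2-32ueni2oqpubkbuvq2j73zlw4tzdbznu'" | pmi_flags
```

Filtered this way, the logged command carries both a `--with-pmi=` option pointing at the Slurm install and a `--with-pmix=` option pointing at the PMIx install. The filter only shows what was requested at configure time, not which one the resulting build uses at run time.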

I can try to ask OpenMPI about this, but since I'm not able to get it to work even outside of a container, and it does work with the OpenMPI launcher mpirun/mpiexec, I expect they'll just point back to prrte.

rhc54 commented 1 year ago

I'm honestly somewhat at a loss to help. PRRTE runs OMPI and MPICH and PGAS and a number of other programming models just fine on a daily basis, under Slurm as well as other environments. I can only look at how you are building/installing things and see that this isn't the way others I'm familiar with do it, and they have functioning installations.

Now is that the reason for the problems you are seeing? I don't know - I can only see it is different. The only way I can try to help decipher why you are hitting the problems is to try and remove those differences until we find the one that is causing the problem.

This means going back to basics, downloading and building from source (not via rpm or spack or some other distro), and testing it outside of a container. Somewhere along the line, your method is generating a non-functional build - only way I know to debug that is to follow conventional methods and see where things go awry.

I can understand that this might not fit with your eventual goals, and I empathize that it takes some effort...so it may not be worth it to you. I just don't see how I can help any other way.

rhc54 commented 1 year ago

Just to further clarify my "it works" comments. Our CI actually runs PMIx-based apps using prterun as well as prte + prun in containers for every PR - we use Docker, but there is nothing fundamental about the container. Developers such as myself use Docker containers on our desktop computers every day to simulate larger clusters, all running MPI and other PMIx-based apps using PRRTE.

Every night, OMPI runs thousands of tests exercising a very large range of MPI functions, and using PRRTE to do so (prterun is the mpirun for OMPI main and v5 branches). Not all of these are run inside containers, but quite a few are (all using Docker), and the tests span a wide range of compilers and environments. Howard described his use with charliecloud and Slurm.

PRRTE is used to run MPI applications (both OMPI and MPICH) in production under PBSPro at a number of locations. Most of those are probably bare metal, but I suspect there is a sprinkling of containers as well.

We also have quite a few sites using PRRTE running under Cray ALPS, acting as a "shim" that provides support for dynamic spawn and workflow-based applications. This was actually the first use-case that drove PRRTE development. We have since seen that expand to cover Slurm installations for similar reasons. I don't have knowledge of the bare metal vs container split.

So we have a great deal of experience with PRRTE running OMPI applications, both bare metal and Docker containers. As far as I know, most of those uses are built from downloaded source code. I suspect some of the production uses are installed from distros (you'll find PRRTE and PMIx are available from the OS packagers as well as OpenHPC, Spack, and others), especially in the last couple of years as distro support has grown.

I therefore have to believe there is something wrong with your PRRTE (and possibly OMPI) installation, though I can't immediately point to the source of the trouble. Hence my suggestion to build it directly from source, preferably starting with a git clone of the PRRTE repository and using the master branch so we have more direct control over the results.

Optionally, you could clone the OMPI main branch or the v5.0.x branch (or take a nightly tarball from either of them). These contain not only the MPI layer, but also both PMIx and PRRTE. Just build the default configuration that uses the internal copy of those two code bases to ensure that everything is in sync. As I said above, mpirun for those two branches is just prterun, so you will at least know that everything has a common root.
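As a rough sketch of that second option: the install prefix, the job count, and the `run` dry-run wrapper are all assumptions for illustration. No `--with-pmix` or `--with-prrte` options are passed, on the assumption that the default configuration picks up the bundled copies as described above.

```shell
# Sketch: build OMPI main from a git clone with its bundled PMIx and PRRTE.
# PREFIX and the -j value are assumptions; adjust for your site.
PREFIX="${PREFIX:-$HOME/ompi-main}"
DRY_RUN="${DRY_RUN:-1}"   # prints the commands unless set to 0

# Hypothetical wrapper: echo each command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@" || exit 1; fi
}

# --recursive pulls in the submodules that hold the bundled PMIx and PRRTE.
run git clone --recursive https://github.com/open-mpi/ompi.git
run cd ompi
run ./autogen.pl
run ./configure --prefix="$PREFIX"
run make -j4 install
```

Run it with `DRY_RUN=0` to actually build. Afterwards, put `$PREFIX/bin` first in `PATH` so that the `mpirun` and compiler wrappers used for the test come from this one install.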

rhc54 commented 1 year ago

Feel free to re-open this, but I'm going to close this for now as I'm honestly not sure there is anything we can think of to do.