open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Received msg header indicates a size that is too large - ptl_base_max_msg_size #8278

Status: Open. fm-dewal opened this issue 3 years ago

fm-dewal commented 3 years ago

Background information

I am successfully able to run a 60k-replica hello_c example across 1024 hosts with 60 slots per host.

In the same environment, I am attempting to run a ~100k-replica hello_c example across 1024 hosts with 128 slots per host.

What version of Open MPI are you using? Describe how Open MPI was installed

I have Open MPI version 4.0.5 installed in my Docker images. Installation was done using the openmpi-4.0.5.tar.gz tarball (release date: Aug 26, 2020).

Please describe the system on which you are running

Operating system/version: native CentOS 7, kernel 3.10.0-957.el7.x86_64; Docker 19.03.9, build 9d988398e7
Computer hardware: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70 GHz
Network type: Docker Swarm, user-defined overlay network


Details of the problem

After setting up the cluster, the following command is executed:

mpirun -n 131072 \
    --hostfile ./hostfile \
    --mca mpi_yield_when_idle 1 \
    --mca hwloc_base_binding_policy none \
    --mca mpi_oversubscribe true \
    --mca btl tcp,self,vader \
    --mca btl_tcp_if_include x.x.0.0/16 \
    --mca oob_tcp_if_include x.x.0.0/16 \
    --mca opal_net_private_ipv4 x.x.0.0/16 \
    --mca orte_tmpdir_base /openmpi/tmp \
    --mca opal_event_include epoll \
    --mca event_libevent2022_event_include epoll \
    --mca opal_set_max_sys_limits 1 \
    --mca oob_tcp_listen_mode listen_thread \
    --mca pmix_base_async_modex true \
    --mca orte_keep_fqdn_hostnames true \
    --mca orte_hostname_cutoff 2000 \
    --mca orte_enable_recovery true \
    /opt/exec/hello_c.o

The following error message is received:

A received msg header indicates a size that is too large:
  Requested size: 25836785
  Size limit:     16777216
If you believe this msg is legitimate, please increase the max msg size via the ptl_base_max_msg_size parameter.

The ptl framework has been marked as deprecated/outdated on the Open MPI FAQ page here: https://www.open-mpi.org/faq/?category=tuning#frameworks

Please suggest how I can increase the maximum message size, as I do believe the message is legitimate. I am happy to provide any additional information as required. Thank you.

rhc54 commented 3 years ago

The "ptl" referenced here is inside the PMIx code - the one referenced in the OMPI FAQ is a very old thing long gone from OMPI code. The message more than likely indicates a mismatch between OMPI versions somewhere - it is highly unlikely that PMIx would pass a message of more than a few kilobytes. I'd check to ensure you aren't picking up some incorrect library version.

fm-dewal commented 3 years ago

I am using Docker-based containers as hosts for my Open MPI jobs. The images are shared via a local registry and validated to be identical across all servers. I hope this helps ensure the OMPI versions match.

As stated in the initial description, the 60k ranks job works successfully.

Further, the following observation was also made: the "Requested size" reported in the error message varies as we reduce the size of the Open MPI job. For example:
For a 1024x128 job: Requested size: 25836785
For a 1024x100 job: Requested size: 20198646

Could you please provide more details on how to validate the library versions, and which ones to check? Thanks.

rhc54 commented 3 years ago

Try the following with just one instance - i.e., don't launch 1024 nodes with 128 ppn, just launch 1 node with 1 ppn. Set PMIX_MCA_gds_base_verbose=10 so we can see what storage mechanism PMIx is using on your setup.

fm-dewal commented 3 years ago

Hi Ralph, thank you for the suggestion. Here is the output received when running just one instance:

$ echo $PMIX_MCA_gds_base_verbose
10
$ mpirun -n 1 ./hello_c.o
[docker_cntnr:00931] mca: base: component_find: searching (null) for gds components
[docker_cntnr:00931] mca: base: find_dyn_components: checking (null) for gds components
[docker_cntnr:00931] pmix:mca: base: components_register: registering framework gds components
[docker_cntnr:00931] pmix:mca: base: components_register: found loaded component hash
[docker_cntnr:00931] pmix:mca: base: components_register: component hash has no register or open function
[docker_cntnr:00931] pmix:mca: base: components_register: found loaded component ds21
[docker_cntnr:00931] pmix:mca: base: components_register: component ds21 has no register or open function
[docker_cntnr:00931] pmix:mca: base: components_register: found loaded component ds12
[docker_cntnr:00931] pmix:mca: base: components_register: component ds12 has no register or open function
[docker_cntnr:00931] mca: base: components_open: opening gds components
[docker_cntnr:00931] mca: base: components_open: found loaded component hash
[docker_cntnr:00931] mca: base: components_open: component hash open function successful
[docker_cntnr:00931] mca: base: components_open: found loaded component ds21
[docker_cntnr:00931] mca: base: components_open: component ds21 open function successful
[docker_cntnr:00931] mca: base: components_open: found loaded component ds12
[docker_cntnr:00931] mca: base: components_open: component ds12 open function successful
[docker_cntnr:00931] mca:gds:select: checking available component hash
[docker_cntnr:00931] mca:gds:select: Querying component [hash]
[docker_cntnr:00931] gds: hash init
[docker_cntnr:00931] mca:gds:select: checking available component ds21
[docker_cntnr:00931] mca:gds:select: Querying component [ds21]
[docker_cntnr:00931] pmix:gds:dstore init
[docker_cntnr:00931] mca:gds:select: checking available component ds12
[docker_cntnr:00931] mca:gds:select: Querying component [ds12]
[docker_cntnr:00931] pmix:gds:dstore init
[docker_cntnr:00931] Final gds priorities
[docker_cntnr:00931]       gds: ds21 Priority: 30
[docker_cntnr:00931]       gds: ds12 Priority: 20
[docker_cntnr:00931]       gds: hash Priority: 10
[docker_cntnr:00931] [ptl_tcp_component.c:665] GDS STORE KV WITH hash
[docker_cntnr:00931] [4041080832:0] gds:hash:hash_store for proc [4041080832:0] key pmix.srvr.uri type PMIX_STRING scope STORE INTERNALLY
[docker_cntnr:00931] [server/pmix_server.c:1635] GDS STORE KV WITH hash
[docker_cntnr:00931] [4041080832:0] gds:hash:hash_store for proc [4041080832:0] key opal.puri type PMIX_STRING scope STORE INTERNALLY
[docker_cntnr:00931] [server/pmix_server.c:1635] GDS STORE KV WITH hash
[docker_cntnr:00931] [4041080832:0] gds:hash:hash_store for proc [4041080832:0] key opal.puri type PMIX_STRING scope STORE INTERNALLY
[docker_cntnr:00931] [server/pmix_server.c:1635] GDS STORE KV WITH hash
[docker_cntnr:00931] [4041080832:0] gds:hash:hash_store for proc [4041080832:0] key opal.puri type PMIX_STRING scope STORE INTERNALLY
[docker_cntnr:00931] [server/pmix_server.c:587] GDS ADD NSPACE 4041080833
[docker_cntnr:00931] gds: dstore add nspace
[docker_cntnr:00931] gds: dstore add nspace
[docker_cntnr:00931] [server/pmix_server.c:597] GDS CACHE JOB INFO WITH hash
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info for nspace 4041080833 with 22 info
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info proc data for [4041080833:0]: key pmix.locstr
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info proc data for [4041080833:0]: key pmix.grank
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info proc data for [4041080833:0]: key pmix.lrank
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info proc data for [4041080833:0]: key pmix.nrank
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info proc data for [4041080833:0]: key pmix.nodeid
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info proc data for [4041080833:0]: key pmix.hname
[docker_cntnr:00931] [4041080832:0] gds:hash:store_map
[docker_cntnr:00931] [4041080832:0] gds:hash:store_map for [4041080833:0]: key pmix.hname
[docker_cntnr:00931] gds: dstore setup fork
[docker_cntnr:00931] gds: dstore setup fork
[docker_cntnr:00935] mca: base: component_find: searching (null) for gds components
[docker_cntnr:00935] mca: base: find_dyn_components: checking (null) for gds components
[docker_cntnr:00935] pmix:mca: base: components_register: registering framework gds components
[docker_cntnr:00935] pmix:mca: base: components_register: found loaded component hash
[docker_cntnr:00935] pmix:mca: base: components_register: component hash has no register or open function
[docker_cntnr:00935] pmix:mca: base: components_register: found loaded component ds21
[docker_cntnr:00935] pmix:mca: base: components_register: component ds21 has no register or open function
[docker_cntnr:00935] pmix:mca: base: components_register: found loaded component ds12
[docker_cntnr:00935] pmix:mca: base: components_register: component ds12 has no register or open function
[docker_cntnr:00935] mca: base: components_open: opening gds components
[docker_cntnr:00935] mca: base: components_open: found loaded component hash
[docker_cntnr:00935] mca: base: components_open: component hash open function successful
[docker_cntnr:00935] mca: base: components_open: found loaded component ds21
[docker_cntnr:00935] mca: base: components_open: component ds21 open function successful
[docker_cntnr:00935] mca: base: components_open: found loaded component ds12
[docker_cntnr:00935] mca: base: components_open: component ds12 open function successful
[docker_cntnr:00935] mca:gds:select: checking available component hash
[docker_cntnr:00935] mca:gds:select: Querying component [hash]
[docker_cntnr:00935] gds: hash init
[docker_cntnr:00935] mca:gds:select: checking available component ds21
[docker_cntnr:00935] mca:gds:select: Querying component [ds21]
[docker_cntnr:00935] pmix:gds:dstore init
[docker_cntnr:00935] mca:gds:select: checking available component ds12
[docker_cntnr:00935] mca:gds:select: Querying component [ds12]
[docker_cntnr:00935] pmix:gds:dstore init
[docker_cntnr:00935] Final gds priorities
[docker_cntnr:00935]       gds: ds21 Priority: 30
[docker_cntnr:00935]       gds: ds12 Priority: 20
[docker_cntnr:00935]       gds: hash Priority: 10
[docker_cntnr:00931] [ptl_tcp_component.c:1666] GDS CACHE JOB INFO WITH hash
[docker_cntnr:00931] [4041080832:0] gds:hash:cache_job_info for nspace 4041080833 with 1 info
[docker_cntnr:00935] [ptl_tcp.c:764] GDS STORE KV WITH hash
[docker_cntnr:00935] [4041080833:0] gds:hash:hash_store for proc [4041080833:0] key pmix.srvr.uri type PMIX_STRING scope STORE INTERNALLY
[docker_cntnr:00931] [server/pmix_server.c:3406] GDS REG JOB INFO WITH ds21
[docker_cntnr:00931] [4041080832:0] gds:dstore:register_job_info for peer [4041080833:0]
[docker_cntnr:00931] [dstore_base.c:2738] GDS FETCH KV WITH hash
[docker_cntnr:00931] [4041080832:0] pmix:gds:hash fetch NULL for proc [4041080833:WILDCARD] on scope STORE INTERNALLY
[docker_cntnr:00931] FETCHING NODE INFO
[docker_cntnr:00931] FETCHING APP INFO
[docker_cntnr:00931] pmix: unpacked key pmix.srv.nspace
[docker_cntnr:00931] pmix: unpacked key pmix.srv.rank
[docker_cntnr:00931] pmix: unpacked key pmix.jobid
[docker_cntnr:00931] pmix: unpacked key pmix.offset
[docker_cntnr:00931] pmix: unpacked key pmix.nmap
[docker_cntnr:00931] pmix: unpacked key pmix.nodeid
[docker_cntnr:00931] pmix: unpacked key pmix.node.size
[docker_cntnr:00931] pmix: unpacked key pmix.num.nodes
[docker_cntnr:00931] pmix: unpacked key pmix.univ.size
[docker_cntnr:00931] pmix: unpacked key pmix.job.size
[docker_cntnr:00931] pmix: unpacked key pmix.job.napps
[docker_cntnr:00931] pmix: unpacked key pmix.max.size
[docker_cntnr:00931] pmix: unpacked key pmix.toposig
[docker_cntnr:00931] pmix: unpacked key pmix.pmem
[docker_cntnr:00931] pmix: unpacked key pmix.mapby
[docker_cntnr:00931] pmix: unpacked key pmix.rankby
[docker_cntnr:00931] pmix: unpacked key pmix.bindto
[docker_cntnr:00931] pmix: unpacked key pmix.lldr
[docker_cntnr:00931] pmix: unpacked key pmix.srvr.tmpdir
[docker_cntnr:00931] pmix: unpacked key pmix.sing.listnr
[docker_cntnr:00931] pmix: unpacked key pmix.srv.monitor
[docker_cntnr:00931] pmix: unpacked key pmix.nlist
[docker_cntnr:00931] pmix: unpacked key pmix.bfrops.mod
[docker_cntnr:00931] pmix: unpacked key pmix.pdata
[docker_cntnr:00931] [dstore_base.c:2738] GDS FETCH KV WITH hash
[docker_cntnr:00931] [4041080832:0] pmix:gds:hash fetch NULL for proc [4041080833:0] on scope STORE INTERNALLY
[docker_cntnr:00931] pmix: unpacked key pmix.locstr
[docker_cntnr:00931] pmix: unpacked key pmix.grank
[docker_cntnr:00931] pmix: unpacked key pmix.lrank
[docker_cntnr:00931] pmix: unpacked key pmix.nrank
[docker_cntnr:00931] pmix: unpacked key pmix.nodeid
[docker_cntnr:00931] pmix: unpacked key pmix.hname
[docker_cntnr:00935] [client/pmix_client.c:241] GDS STORE JOB INFO WITH ds21
[docker_cntnr:00935] [4041080833:0] pmix:gds:dstore store job info for nspace 4041080833
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.dbg.init`
[docker_cntnr:00935] [client/pmix_client_get.c:696] GDS FETCH IS THREAD SAFE WITH hash
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.dbg.init for proc [4041080833:WILDCARD] on scope UNDEFINED
[docker_cntnr:00935] [client/pmix_client_get.c:793] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.dbg.init`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.lrank`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.nrank`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.max.size`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.job.size`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.appnum`
[docker_cntnr:00935] [client/pmix_client_get.c:696] GDS FETCH IS THREAD SAFE WITH hash
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.appnum for proc [4041080833:0] on scope UNDEFINED
[docker_cntnr:00935] [client/pmix_client_get.c:793] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.appnum`
[docker_cntnr:00931] [server/pmix_server_get.c:822] GDS FETCH KV WITH hash
[docker_cntnr:00931] [4041080832:0] pmix:gds:hash fetch NULL for proc [4041080833:UNDEF] on scope SHARE ACROSS ALL NODES
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.local.size for proc [4041080833:UNDEF] on scope UNDEFINED
[docker_cntnr:00935] FETCHING NODE INFO
[docker_cntnr:00931] FETCHING NODE INFO
[docker_cntnr:00931] [server/pmix_server_get.c:830] GDS ASSEMBLE REQ WITH hash
[docker_cntnr:00935] [client/pmix_client_get.c:490] GDS ACCEPT RESP WITH hash
[docker_cntnr:00935] PROCESSING NODE ARRAY
[docker_cntnr:00935] [client/pmix_client_get.c:517] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.local.size for proc [4041080833:UNDEF] on scope UNDEFINED
[docker_cntnr:00935] FETCHING NODE INFO
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.num.nodes`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.tmpdir`
[docker_cntnr:00935] [client/pmix_client_get.c:696] GDS FETCH IS THREAD SAFE WITH hash
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.tmpdir for proc [4041080833:WILDCARD] on scope UNDEFINED
[docker_cntnr:00935] [client/pmix_client_get.c:793] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.tmpdir`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.nsdir`
[docker_cntnr:00935] [client/pmix_client_get.c:696] GDS FETCH IS THREAD SAFE WITH hash
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.nsdir for proc [4041080833:WILDCARD] on scope UNDEFINED
[docker_cntnr:00935] [client/pmix_client_get.c:793] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.nsdir`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.pdir`
[docker_cntnr:00935] [client/pmix_client_get.c:696] GDS FETCH IS THREAD SAFE WITH hash
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.pdir for proc [4041080833:WILDCARD] on scope UNDEFINED
[docker_cntnr:00935] [client/pmix_client_get.c:793] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.pdir`
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.tdir.rmclean`
[docker_cntnr:00935] [client/pmix_client_get.c:696] GDS FETCH IS THREAD SAFE WITH hash
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.tdir.rmclean for proc [4041080833:WILDCARD] on scope UNDEFINED
[docker_cntnr:00935] [client/pmix_client_get.c:793] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.tdir.rmclean`
[docker_cntnr:00935] [server/pmix_server.c:1635] GDS STORE KV WITH hash
[docker_cntnr:00935] [4041080833:0] gds:hash:hash_store for proc [4041080832:0] key opal.puri type PMIX_STRING scope STORE INTERNALLY
[docker_cntnr:00935] [client/pmix_client.c:1081] GDS STORE KV WITH hash
[docker_cntnr:00935] [4041080833:0] gds:hash:hash_store for proc [4041080833:0] key btl.tcp.4.0 type PMIX_BYTE_OBJECT scope SHARE ACROSS ALL NODES
[docker_cntnr:00935] [client/pmix_client.c:1167] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch NULL for proc [4041080833:0] on scope SHARE ON LOCAL NODE ONLY
[docker_cntnr:00935] [client/pmix_client.c:1208] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch NULL for proc [4041080833:0] on scope SHARE ON REMOTE NODES ONLY
[docker_cntnr:00931] [server/pmix_server_ops.c:192] GDS STORE KV WITH ds21
[docker_cntnr:00931] [4041080833:0] gds: dstore store for key 'btl.tcp.4.0' scope 1
[docker_cntnr:00931] pmix: unpacked key btl.tcp.4.0
[docker_cntnr:00931] [server/pmix_server_ops.c:201] GDS STORE KV WITH hash
[docker_cntnr:00931] [4041080832:0] gds:hash:hash_store for proc [4041080833:0] key btl.tcp.4.0 type PMIX_BYTE_OBJECT scope SHARE ON REMOTE NODES ONLY
[docker_cntnr:00931] [server/pmix_server_ops.c:793] GDS FETCH KV WITH hash
[docker_cntnr:00931] [4041080832:0] pmix:gds:hash fetch NULL for proc [4041080833:0] on scope SHARE ON REMOTE NODES ONLY
[docker_cntnr:00931] [server/pmix_server.c:2460] GDS STORE MODEX WITH ds21
[docker_cntnr:00931] [4041080832:0] gds:dstore:store_modex for nspace 4041080833
[docker_cntnr:00935] [client/pmix_client_get.c:689] GDS FETCH IS THREAD SAFE WITH ds21
[docker_cntnr:00935] [client/pmix_client_get.c:691] GDS FETCH KV WITH ds21
[docker_cntnr:00935] gds: dstore fetch `pmix.mapby`
[docker_cntnr:00935] [client/pmix_client_get.c:773] GDS FETCH KV WITH hash
[docker_cntnr:00935] [4041080833:0] pmix:gds:hash fetch pmix.lpeers for proc [4041080833:UNDEF] on scope UNDEFINED
[docker_cntnr:00935] FETCHING NODE INFO
Init:Thu Dec 10 17:42:13 2020
0 of 1
[docker_cntnr:00935] gds: hash finalize
[docker_cntnr:00935] mca: base: close: component hash closed
[docker_cntnr:00935] mca: base: close: unloading component hash
[docker_cntnr:00935] mca: base: close: component ds21 closed
[docker_cntnr:00935] mca: base: close: unloading component ds21
[docker_cntnr:00935] mca: base: close: component ds12 closed
[docker_cntnr:00935] mca: base: close: unloading component ds12
[docker_cntnr:00931] [server/pmix_server.c:839] GDS DEL NSPACE 4041080833
[docker_cntnr:00931] gds: hash finalize
[docker_cntnr:00931] mca: base: close: component hash closed
[docker_cntnr:00931] mca: base: close: unloading component hash
[docker_cntnr:00931] mca: base: close: component ds21 closed
[docker_cntnr:00931] mca: base: close: unloading component ds21
[docker_cntnr:00931] mca: base: close: component ds12 closed
[docker_cntnr:00931] mca: base: close: unloading component ds12
rhc54 commented 3 years ago

Hmmm....it all looks okay. I'm struggling to understand why a PMIx message would get so large. We store the data in a shared memory region, so the message from the server to any client is only a few kilobytes to tell it where the shmem region sits. You don't appear to be using the "hash" support which would entail the server sending all the data to each client.

You are welcome to try with a larger max message size - the worst that can happen is that something will crash. You can adjust the value by setting PMIX_MCA_ptl_base_max_msg_size=1000 (or whatever number you like), where the number is in Mbytes.
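A minimal sketch of how that might be applied here, assuming the variable is visible in the environment on every host (for example, baked into the container image or exported before launch); the value of 30 is the one used in the next comment:

# hedged sketch: make the PMIx override visible to the runtime on every host
export PMIX_MCA_ptl_base_max_msg_size=30
mpirun -n 131072 --hostfile ./hostfile /opt/exec/hello_c.o   # plus the --mca options from the original command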

fm-dewal commented 3 years ago

Since setting PMIX_MCA_ptl_base_max_msg_size=30, I no longer encounter the ptl_base_max_msg_size error. However, the job still fails. To help digest the error messages, I have divided them into five parts. Please let me know if you have any suggestions on how to tackle these errors. Thank you for your time and expertise.

Part 1:

[root@server:./logDir]# cat stderr.log | grep -v -E 'PMIX|\*\*\*|OPAL|not able to aggregate error messages'
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[22910,1],5999]) is on host: <dockerHost1>
  Process 2 ([[22910,1],5887]) is on host: <IPv4_dockerHost1>
  BTLs attempted: self tcp vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  <dockerHost2>
  System call: open(2)
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
[dockerHost3:188157] 19266 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[dockerHost3:188157] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[dockerHost3:188157] 19266 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:pml-add-procs-fail
[dockerHost3:188157] 19266 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[dockerHost3:188157] 71 more processes have sent help message help-opal-shmem-mmap.txt / sys call fail

Part 2:

[root@server:./logDir]# cat stderr.log | grep 'PMIX' | head -n 10
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1714
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1714
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1714
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1714
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1714
>>[dockerHostXY:123297] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1758
[root@server:./logDir]# cat stderr.log | grep 'PMIX' | tail -n 10
>>[dockerHostXY:130859] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:147725] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:185888] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:130859] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:147725] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:185888] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:130859] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:147725] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:185888] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
>>[dockerHostXY:130859] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
[root@server:./logDir]# cat stderr.log | grep 'PMIX' | grep -v -E 'PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923|PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 17' | head -n 10
>>////NO_LOGS_FOUND////

Part 3:

[root@server:./logDir]# cat stderr.log | grep '\*\*\*' | head -n 10
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator

Part 4:

[root@server:./logDir]# cat stderr.log | grep 'OPAL' | head -n 10
[dockerHostXY:186232] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:164100] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:165790] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:164104] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:165793] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:164109] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:165796] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:164111] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:165804] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
[dockerHostXY:164106] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112

Part 5:

[root@server:./logDir]# cat stderr.log | grep 'not able to aggregate error messages' | head -n 10
[dockerHostXY:186232] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:164100] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:165790] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:164104] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:165793] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:164109] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:165796] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:164111] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:165804] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[dockerHostXY:164106] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
jsquyres commented 3 years ago

Also see the discussion on #8282.

rhc54 commented 3 years ago

@fm-dewal Any further progress on this or can it be closed?