open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Hang during MPI_Init #11566

Closed · lrbison closed this 1 year ago

lrbison commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source: ./configure --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --enable-debug --prefix=/home/ec2-user/ompi

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Before and after https://github.com/open-mpi/ompi/pull/11563

I'm looking at the stack traces and hangs from after the PR, but I suspect it's the same issue as before the PR, just happening less often for some reason.

Please describe the system on which you are running


Details of the problem

Hang during startup of a very simple all-gather test (ibm/collective/allgatherv). It seems to be more common with larger runs. The test case is 512 ranks on 8 hosts.

By inserting prints at the front of the program I can see that allgatherv is launched and can print "hello", but sometimes it hangs forever during MPI_Init().

Stack trace:

#0  0x0000ffff7ee07f7c in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff7ee2fd40 in usleep () from /lib64/libc.so.6
#2  0x0000ffff7efecb10 in ompi_mpi_instance_init_common (argc=1, argv=0xfffff5404768) at instance/instance.c:739
#3  0x0000ffff7efece9c in ompi_mpi_instance_init (ts_level=0, info=0xffff7f4891e8 <ompi_mpi_info_null>, errhandler=0xffff7f481228 <ompi_mpi_errors_are_fatal>,
    instance=0xffff7f491cc8 <ompi_mpi_instance_default>, argc=1, argv=0xfffff5404768) at instance/instance.c:814
#4  0x0000ffff7efdb1b0 in ompi_mpi_init (argc=1, argv=0xfffff5404768, requested=0, provided=0xfffff54045ac, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#5  0x0000ffff7f0512a4 in PMPI_Init (argc=0xfffff54045dc, argv=0xfffff54045d0) at init.c:67

Launch command:

$PREFIX/bin/mpirun --prefix=$PREFIX -np 512 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allgatherv

lrbison commented 1 year ago

I find that of all 512 processes, all but 5 have a stack trace like:

#0  0x0000ffffba8d5f7c in nanosleep () from /lib64/libc.so.6
#1  0x0000ffffba8fdd40 in usleep () from /lib64/libc.so.6
#2  0x0000ffffbaabab10 in ompi_mpi_instance_init_common (argc=1, argv=0xffffd105cc38) at instance/instance.c:739
#3  0x0000ffffbaabae9c in ompi_mpi_instance_init (ts_level=0, info=0xffffbaf571e8 <ompi_mpi_info_null>, errhandler=0xffffbaf4f228 <ompi_mpi_errors_are_fatal>, instance=0xffffbaf5fcc8 <ompi_mpi_instance_default>, argc=1, argv=0xffffd105cc38) at instance/instance.c:814
#4  0x0000ffffbaaa91b0 in ompi_mpi_init (argc=1, argv=0xffffd105cc38, requested=0, provided=0xffffd105ca7c, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#5  0x0000ffffbab1f2a4 in PMPI_Init (argc=0xffffd105caac, argv=0xffffd105caa0) at init.c:67

But the other 5 have one that looks like:

#0  0x0000ffff86741be4 in epoll_pwait () from /lib64/libc.so.6
#1  0x0000ffff85ed2fbc in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x0000ffff85ebc3a8 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x0000ffff865208f4 in opal_progress_events () at runtime/opal_progress.c:188
#4  0x0000ffff86520a0c in opal_progress () at runtime/opal_progress.c:238
#5  0x0000ffff868f7b08 in ompi_mpi_instance_init_common (argc=1, argv=0xffffec8e22b8) at instance/instance.c:739
#6  0x0000ffff868f7e9c in ompi_mpi_instance_init (ts_level=0, info=0xffff86d941e8 <ompi_mpi_info_null>, errhandler=0xffff86d8c228 <ompi_mpi_errors_are_fatal>, instance=0xffff86d9ccc8 <ompi_mpi_instance_default>, argc=1, argv=0xffffec8e22b8) at instance/instance.c:814
#7  0x0000ffff868e61b0 in ompi_mpi_init (argc=1, argv=0xffffec8e22b8, requested=0, provided=0xffffec8e20fc, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#8  0x0000ffff8695c2a4 in PMPI_Init (argc=0xffffec8e212c, argv=0xffffec8e2120) at init.c:67

I thought that perhaps I happened to catch these 5 outside of the sleep, but when I issue another stack trace I find the same processes in that state, so it seems they are actually waiting on something.
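
For context on the two traces: the usleep at instance.c:739 comes from OMPI_LAZY_WAIT_FOR_COMPLETION, which is essentially a polling loop that alternates between driving opal_progress() (which dispatches the libevent base, hence the epoll_pwait frames in the second trace) and a short sleep. A simplified sketch, not the exact macro definition:

    /* simplified sketch -- not the exact macro definition */
    #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                              \
        while ((flg)) {                                                     \
            opal_progress();   /* drives the libevent base -> epoll_pwait */ \
            usleep(100);       /* -> the nanosleep/usleep frames above */    \
        }

Both groups of processes are at instance.c:739 in their traces, so they appear to be in the same wait, just caught in different halves of the loop, all waiting for something to clear the flag they are polling.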

lrbison commented 1 year ago

Some statistics:

With https://github.com/open-mpi/ompi/pull/11563 I see only 71 failures in 2370 tests (3%); without the PR, I see 533 failures in 2370 tests (22%).

hppritcha commented 1 year ago

Do you observe this if not using SGE?

mjkoop commented 1 year ago

Yes, this still occurs without compiling with SGE support (at least specifically)

wckzhang commented 1 year ago

I've tried to reproduce this with the v5.0.x branch and cannot; I am able to reproduce it on the main branch. I think it's highly probable that the latest pmix/prrte submodules have introduced a bug. It's good that this isn't a 5.0.x blocker, at least.

wckzhang commented 1 year ago

Judging based on where it is hanging:

    if (!opal_process_info.is_singleton) {                                                                              
        /* if we executed the above fence in the background, then                                                       
         * we have to wait here for it to complete. However, there                                                      
         * is no reason to do two barriers! */                                                                          
        if (background_fence) {                                                                                         
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);                                                                      
        } else if (!ompi_async_mpi_init) {                                                                              
            /* wait for everyone to reach this point - this is a hard                                                   
             * barrier requirement at this time, though we hope to relax                                                
             * it at a later point */                                                                                   
            bool flag = false;                                                                                          
            active = true;                                                                                              
            OPAL_POST_OBJECT(&active);                                                                                  
            PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &flag, PMIX_BOOL);                                              
            if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, info, 1,                                                   
                                                    fence_release, (void*)&active))) {                                  
                ret = opal_pmix_convert_status(rc);                                                                     
                return ompi_instance_print_error ("PMIx_Fence_nb() failed", ret);                                       
            }                                                                                                           
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);   // <---- Here                                                                   
        }                                                                                                               
    }

I'd guess that something has gone wrong in PMIx_Fence_nb

@rhc54 do you think this may be a PMIx issue?
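
For reference, the fence_release callback handed to PMIx_Fence_nb just clears the flag that OMPI_LAZY_WAIT_FOR_COMPLETION is polling; roughly (paraphrasing the callback in instance.c):

    /* roughly what the completion callback in instance.c does */
    static void fence_release(pmix_status_t status, void *cbdata)
    {
        volatile bool *active = (volatile bool *) cbdata;
        OPAL_ACQUIRE_OBJECT(active);
        *active = false;   /* lets OMPI_LAZY_WAIT_FOR_COMPLETION(active) fall through */
        OPAL_POST_OBJECT(active);
    }

So a permanent hang at that line means the callback never fired, i.e., the fence never completed for that process.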

rhc54 commented 1 year ago

Could be? Could also be in PRRTE, but I don't know. I suggest you try to chase it down as the PRRTE submodule in the v5.0 branch will be catching up to what is in main very soon. You could try adding --prtemca pmix_server_verbose 5 to the cmd line and see if the local daemon is getting the "fence" upcall from the PMIx server.
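
For example, tacked onto the launch command from the original report, that would look something like:

$PREFIX/bin/mpirun --prefix=$PREFIX --prtemca pmix_server_verbose 5 -np 512 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allgatherv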

wckzhang commented 1 year ago

Just to see, I updated the PRRTE submodule to match OMPI main and haven't hit the issue, though I don't know for sure that it isn't some PRRTE issue. I will see if I can get some time to investigate.

rhc54 commented 1 year ago

Sounds like it is probably a PMIx problem. We did recently make a change to the client fence code - I cannot reproduce this problem, but I'll take a look at the code.

rhc54 commented 1 year ago

The only thing I could find between PMIx v4.2 (what is in OMPI v5) and PMIx master (what is in OMPI main) is that the latter wasn't sorting the procs in the fence operation. I'm not sure why that would manifest as the problem stated here, so I expect this isn't related. I fixed it anyway so we have consistency between the PMIx branches.

I'm afraid I'll have to rely on others to provide more debug info - like I said, I'm unable to reproduce the problem here.

rhc54 commented 1 year ago

One thing occurs to me. The biggest difference between PMIx v4.x and PMIx master is the new shared memory subsystem. Could you folks please add PMIX_MCA_gds=hash to your environment and run your tests again? I'd like to see if that works.
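
For example, with the launch command from the original report, that would be something like:

export PMIX_MCA_gds=hash
$PREFIX/bin/mpirun --prefix=$PREFIX -np 512 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allgatherv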

lrbison commented 1 year ago

PMIX_MCA_gds=hash

This fixes it on main! 0/178 runs failed, where previously about 1 in 5 runs failed. I double-confirmed by re-running without the environment variable, and the failures returned.

rhc54 commented 1 year ago

@samuelkgutierrez Can you please take a look into this? It appears that the modex information storage is having a problem. Might be we need the failover code after all, or it could be a race condition in the code where you store it.

samuelkgutierrez commented 1 year ago

I wonder if this is related to https://github.com/openpmix/openpmix/issues/2967?

This certainly looks like a race condition. My understanding, though, is that the modex code in gds is single-threaded as far as the component is concerned. Perhaps there is an interaction that is causing this in gds/shmem related to differences exhibited by hash and shmem?

I've spent some time trying to track this hang down, but haven't found anything obvious on my end. So any help here is greatly appreciated. If memory serves me, I was able to replicate this hang with hash, but it was certainly much less frequent.

For those that are able to help take a look, this is the modex code in gds/shmem in question:

samuelkgutierrez commented 1 year ago

Judging based on where it is hanging:

    if (!opal_process_info.is_singleton) {                                                                              
        /* if we executed the above fence in the background, then                                                       
         * we have to wait here for it to complete. However, there                                                      
         * is no reason to do two barriers! */                                                                          
        if (background_fence) {                                                                                         
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);                                                                      
        } else if (!ompi_async_mpi_init) {                                                                              
            /* wait for everyone to reach this point - this is a hard                                                   
             * barrier requirement at this time, though we hope to relax                                                
             * it at a later point */                                                                                   
            bool flag = false;                                                                                          
            active = true;                                                                                              
            OPAL_POST_OBJECT(&active);                                                                                  
            PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &flag, PMIX_BOOL);                                              
            if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, info, 1,                                                   
                                                    fence_release, (void*)&active))) {                                  
                ret = opal_pmix_convert_status(rc);                                                                     
                return ompi_instance_print_error ("PMIx_Fence_nb() failed", ret);                                       
            }                                                                                                           
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);   // <---- Here                                                                   
        }                                                                                                               
    }

I'd guess that something has gone wrong in PMIx_Fence_nb

@rhc54 do you think this may be a PMIx issue?

@wckzhang would it be possible to get a stack trace from all the threads to see where in PMIx this is happening? Thank you.

wckzhang commented 1 year ago

I think it should be possible; I'll try to get that, though I haven't tried reproducing with fewer than 144 ranks.

rhc54 commented 1 year ago

I think what @samuelkgutierrez is trying to determine is whether the client has recvd the job info and is hung trying to process it, or whether the client has not recvd the job info at all. The referenced PMIx issue indicates that it may just be one or two ranks that are actually stuck.

There is another fence at the end of MPI_Init, so it is possible that all the other ranks are sitting in that fence waiting for the "stuck" ranks to catch up. Hence, you won't see the "hello" output from anyone.

What I would do is add some output right after that referenced code that simply states "rank N has completed modex" so you can see which ranks are stuck, and then give us the full stack trace for just one of those.
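
A minimal sketch of such a print (a hypothetical debug line; it assumes opal_process_info.myprocid, the pmix_proc_t filled in by PMIx_Init, is already populated at this point):

    /* hypothetical debug print, placed right after OMPI_LAZY_WAIT_FOR_COMPLETION(active);
     * uses the PMIx proc id filled in by PMIx_Init rather than MPI_COMM_WORLD */
    printf("rank %u has completed modex\n",
           (unsigned int) opal_process_info.myprocid.rank);
    fflush(stdout);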

rhc54 commented 1 year ago

If memory serves me, I was able to replicate this hang with hash, but it was certainly much less frequent.

We iterated off-list. What Sam was remembering was something completely unrelated to the MPI_Init hang issue. This seems to be something specific to the gds/shmem component as disabling it resolves the problem. I suspect the problem is in the client side, and that the shmem component is hitting an issue that causes the client to not complete the fence operation.

I'll start looking thru the code, but the results of the above test from someone who can reproduce the problem would help a lot.

wckzhang commented 1 year ago

I put the print right after the referenced code; it does not print anything during the hang, so it doesn't seem like that was the case:

    if (!opal_process_info.is_singleton) {                                                                              
        /* if we executed the above fence in the background, then                                                       
         * we have to wait here for it to complete. However, there                                                      
         * is no reason to do two barriers! */                                                                          
        if (background_fence) {                                                                                         
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);                                                                      
        } else if (!ompi_async_mpi_init) {                                                                              
            /* wait for everyone to reach this point - this is a hard                                                   
             * barrier requirement at this time, though we hope to relax                                                
             * it at a later point */                                                                                   
            bool flag = false;                                                                                          
            active = true;                                                                                              
            OPAL_POST_OBJECT(&active);                                                                                  
            PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &flag, PMIX_BOOL);                                              
            if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, info, 1,                                                   
                                                    fence_release, (void*)&active))) {                                  
                ret = opal_pmix_convert_status(rc);                                                                     
                return ompi_instance_print_error ("PMIx_Fence_nb() failed", ret);                                       
            }                                                                                                           
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);   // <---- Hangs Here                                                                   
        }                                                                                                               
    }
/* Added prints here */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("Rank %d has completed the fence.\n", rank);

wckzhang commented 1 year ago

I added --prtemca pmix_server_verbose 5, but that doesn't seem to give any useful information:

[ip-172-31-20-101:10470] [prterun-ip-172-31-20-101-10470@0,0] spawn upcalled on behalf of proc prterun-ip-172-31-20-101-10470@0:0 with 8 job infos
[ip-172-31-20-101:10470] [prterun-ip-172-31-20-101-10470@0,0] spawn called from proc [prterun-ip-172-31-20-101-10470@0,0] with 1 apps
[ip-172-31-20-101:10470] [prterun-ip-172-31-20-101-10470@0,0] register nspace for prterun-ip-172-31-20-101-10470@1
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11

/* Hangs here */

rhc54 commented 1 year ago

That is....totally unexpected. However, I went and checked and there is sadly no debug output in the "fence" upcall. Sigh. I'll fix that for next time.

I'm also going to add some output in the shmem component to bark when it runs into trouble. Will update when it is ready.

rhc54 commented 1 year ago

Okay, I would deeply appreciate it if you could git pull the master branches of both the PMIx and PRRTE submodules:

$ cd 3rd-party/openpmix
$ git checkout master
$ git pull
$ cd ../prrte
$ git checkout master
$ git pull

You may have to do a git submodule update after each git pull as the submodules each have their own submodules that were updated recently. You'll then have to rerun autogen.pl, reconfigure, and rebuild.
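
In other words, inside each submodule after pulling, something like (exact invocation may vary):

$ cd 3rd-party/openpmix && git submodule update --init --recursive
$ cd ../prrte && git submodule update --init --recursive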

Once you have that done, please add the following to your respective mpirun cmd line:

--prtemca pmix_server_verbose 5 --pmixmca pmix_client_force_debug_verbose 5 --leave-session-attached

On a "good" run, I expect you will see something like the following output (amongst other things):

From each daemon:

[rhc-node01:61350] [prterun-rhc-node01-61350@0,0] FENCE UPCALLED ON NODE <hostname>

From each client:

[rhc-node02:00168] client:unpack fence received status SUCCESS
[rhc-node02:00168] gds:shmem recv modex complete called
[rhc-node02:00168] gds:shmem connection complete with status SUCCESS
[rhc-node02:00168] client:unpack fence modex complete with status SUCCESS

On a "bad" run, I expect you will see the same output from the daemons. However, on at least one node, you should see at least one proc that hangs after emitting:

[rhc-node02:00168] client:unpack fence received status SUCCESS
[rhc-node02:00168] gds:shmem recv modex complete called
...no further output

This will confirm that the hang is due to some problem with attaching to the shmem region after completing the modex. If you could then attach to one of those processes, it would help to get a full stack trace from it showing all threads.
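
For the full per-thread trace, attaching gdb to one of the stuck pids and dumping all threads should be enough, e.g.:

$ gdb -p <pid> -batch -ex "set pagination off" -ex "thread apply all bt"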

wckzhang commented 1 year ago

Doing it, but it is proving to take a while to reproduce. I suspect that's because, from what Luke said about https://github.com/open-mpi/ompi/pull/11563, the latest pmix/prrte pointers only have a 3% failure rate compared to the 22% failure rate on the OMPI main branch.

wckzhang commented 1 year ago

This was the (tail) output of one hang:

[queue-c5n18xlarge-st-c5n18xlarge-3:08575] UUID: ipv4://06:9f:0a:ba:3c:47 OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[queue-c5n18xlarge-st-c5n18xlarge-3:08575] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[queue-c5n18xlarge-st-c5n18xlarge-4:09131] [prterun-ip-172-31-20-101-19515@0,4] FENCE UPCALLED ON NODE queue-c5n18xlarge-st-c5n18xlarge-4
[queue-c5n18xlarge-st-c5n18xlarge-3:08575] [prterun-ip-172-31-20-101-19515@0,3] FENCE UPCALLED ON NODE queue-c5n18xlarge-st-c5n18xlarge-3
[queue-c5n18xlarge-st-c5n18xlarge-1:02052] [prterun-ip-172-31-20-101-19515@0,1] FENCE UPCALLED ON NODE queue-c5n18xlarge-st-c5n18xlarge-1
/* Hangs */
/* Output when I enter newline in cmd */
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1

wckzhang commented 1 year ago

I just noticed that only 3 nodes are reporting the fence upcall; one node is missing.

wckzhang commented 1 year ago

Adding some more prints; when I checked GDB, the hang was in this section of code this time:

    if (!opal_process_info.is_singleton) {                                                                              
        if (opal_pmix_base_async_modex) {                                                                               
            /* if we are doing an async modex, but we are collecting all                                                
             * data, then execute the non-blocking modex in the background.                                             
             * All calls to modex_recv will be cached until the background                                              
             * modex completes. If collect_all_data is false, then we skip                                              
             * the fence completely and retrieve data on-demand from the                                                
             * source node.                                                                                             
             */                                                                                                         
            if (opal_pmix_collect_all_data) {                                                                           
                /* execute the fence_nb in the background to collect                                                    
                 * the data */                                                                                          
                background_fence = true;                                                                                
                active = true;                                                                                          
                OPAL_POST_OBJECT(&active);                                                                              
                PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &opal_pmix_collect_all_data, PMIX_BOOL);                    
                if( PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, NULL, 0,                                               
                                                        fence_release,                                                  
                                                        (void*)&active))) {                                             
                    ret = opal_pmix_convert_status(rc);                                                                 
                    error = "PMIx_Fence_nb() failed";                                                                   
                    goto error;                                                                                         
                }                                                                                                       
            }                                                                                                           
        } else {                                                                                                        
            /* we want to do the modex - we block at this point, but we must                                            
             * do so in a manner that allows us to call opal_progress so our                                            
             * event library can be cycled as we have tied PMIx to that                                                 
             * event base */                                                                                            
            active = true;                                                                                              
            OPAL_POST_OBJECT(&active);                                                                                  
            PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &opal_pmix_collect_all_data, PMIX_BOOL);                        
            rc = PMIx_Fence_nb(NULL, 0, info, 1, fence_release, (void*)&active);                                        
            if( PMIX_SUCCESS != rc) {                                                                                   
                ret = opal_pmix_convert_status(rc);                                                                     
                error = "PMIx_Fence() failed";                                                                          
                goto error;                                                                                             
            }                                                                                                           
            /* cannot just wait on thread as we need to call opal_progress */                                           
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);                                <--- hangs here                                     
        }                                                                                                               
    }                                               

I added some prints around this fence as well, before and after the hang point; now trying to reproduce again.

wckzhang commented 1 year ago

I guess the rank is not instantiated at this point in instance.c, as getting the rank from MPI_COMM_WORLD only returns 0.

I was, however, able to check that on a hang one fewer process calls PMIx_Fence at this location (143 procs) than on a successful run (144 procs). So one proc is missing when it comes time to call the fence.

rhc54 commented 1 year ago

Let me see if I understand what you are saying. You put some prints around the call to PMIx_Fence_nb. When the app hangs, you are seeing that at least one rank never actually calls PMIx_Fence_nb - it hangs somewhere before that point.

If you set PMIX_MCA_gds=hash (thereby disabling the shmem component), the problem disappears.

Is that an accurate summary? If so, then it sounds like the next step would be to put some print statements above this point and see if we can localize where the hang occurs.

I guess rank is not instantiated at this point in instance.c as getting rank from comm world only returns 0 for the rank

That is highly suspicious as the rank is set when you init the RTE, which must be long before you get to a Fence.

wckzhang commented 1 year ago

You put some prints around the call to PMIx_Fence_nb.

Yes

When the app hangs, you are seeing that at least one rank never actually calls PMIx_Fence_nb

Yes

it hangs somewhere before that point.

Not sure if it hangs or something else is occurring.

If you set PMIX_MCA_gds=hash (thereby disabling the shmem component), the problem disappears.

Yes (or --mca gds ^shmem)

That is highly suspicious as the rank is set when you init the RTE, which must be long before you get to a Fence.

All the prints look like:

Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence

This is the diff for the prints:

+++ b/ompi/instance/instance.c
@@ -537,7 +537,8 @@ static int ompi_mpi_instance_init_common (int argc, char **argv)
         OMPI_TIMING_NEXT("pmix-barrier-2");
     }
 #endif
-
+    int rank;
+   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (! opal_process_info.is_singleton) {
         if (opal_pmix_base_async_modex) {
             /* if we are doing an async modex, but we are collecting all
@@ -569,13 +570,16 @@ static int ompi_mpi_instance_init_common (int argc, char **argv)
             active = true;
             OPAL_POST_OBJECT(&active);
             PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &opal_pmix_collect_all_data, PMIX_BOOL);
+           printf("Rank %d calling fence\n", rank);
             rc = PMIx_Fence_nb(NULL, 0, info, 1, fence_release, (void*)&active);
             if( PMIX_SUCCESS != rc) {
                 ret = opal_pmix_convert_status(rc);
                 return ompi_instance_print_error ("PMIx_Fence() failed", ret);
             }
             /* cannot just wait on thread as we need to call opal_progress */
+           printf("Rank %d waiting on fence\n", rank);
             OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+           printf("Rank % completed fence\n", rank);
         }
     }

rhc54 commented 1 year ago

Very odd that you get rank=0 on every proc - it sounds like it thinks you are starting a bunch of singletons? I don't know why else the rank would be set to 0.

Is the rank=0 on runs that don't hang? I'm wondering if that might be a clue to what is happening here.

wckzhang commented 1 year ago

rank=0 shows up on all runs, so I don't think it's related; plus, these prints occur inside this if statement:

    if (! opal_process_info.is_singleton) {

It's only in this function that the rank is 0; when I put prints in the ompi_instance_init functions, they properly print out a correct rank. I assumed the rank wasn't set at that point.

rhc54 commented 1 year ago

Okay, makes no sense to me, but let's set it aside for now. Can you add print statements in the code to get an idea of where it hangs?

bosilca commented 1 year ago

It makes sense once you understand what is going on. The OMPI object for MPI_COMM_WORLD is static and therefore initialized with rank = 0 on all nodes. As the printf happens very early in the process, it is normal that all processes show rank 0, simply because the predefined objects are not yet initialized.

I noticed this a few days ago when I was looking into why add_procs added a bunch of rank 0 procs (in MPI_COMM_WORLD). The root cause was that in the sessions code, the PML/BML initialization is triggered very early in the process, before the predefined communicators are correctly set up.
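
A hypothetical illustration of that pattern (invented names, not the real OMPI types): because the predefined communicator is a static object, any rank query made before the setup code has filled it in just returns the zero-initialized default.

    /* hypothetical sketch only -- invented names, not the real OMPI types */
    struct toy_comm { int my_rank; };
    static struct toy_comm toy_comm_world;   /* static storage: my_rank starts at 0 */

    int toy_comm_rank(struct toy_comm *comm)
    {
        return comm->my_rank;                /* returns 0 until setup fills it in */
    }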

wckzhang commented 1 year ago

The good news is that a statement I print at the start of the function shows up 144 times, so the error should be contained in this one function. I should be able to narrow it down from here.

wckzhang commented 1 year ago

I think my prints are affecting the timing of the hang, so I might have to reduce the number of prints to reproduce the issue.

rhc54 commented 1 year ago

It makes sense once you understand what is going on. The OMPI object for MPI_COMM_WORLD is static and therefore initialized with rank = 0 on all nodes. As the printf happens very early in the process, it is normal that all processes show rank 0, simply because the predefined objects are not yet initialized.

I noticed this a few days ago when I was looking into why add_procs added a bunch of rank 0 procs (in MPI_COMM_WORLD). The root cause was that in the sessions code, the PML/BML initialization is triggered very early in the process, before the predefined communicators are correctly set up.

I see - thanks for the explanation! The comm_world rank is provided by PMIx_Init, which has to occur before anything in this file can take place. Would it make sense and alleviate confusion to assign the comm_world rank at that point?

hppritcha commented 1 year ago

predefined communicators aren't set up if an application is only using sessions.

hppritcha commented 1 year ago

May be time to look more at https://github.com/open-mpi/ompi/pull/11451 ?

rhc54 commented 1 year ago

Are those fences occurring before this one? We need to ensure that we are getting thru all of the fences. If we block on one, then we won't get to the next.

wckzhang commented 1 year ago

It looks like it's hanging somewhere in this code block. Any prints I put in here seem to make it much harder to reproduce. I suspect it's stuck in ompi_rte_init:

    ret = mca_base_var_find("opal", "event", "*", "event_include");
    if (ret >= 0) {
        char *allvalue = "all";
        /* We have to explicitly "set" the MCA param value here
           because libevent initialization will re-register the MCA
           param and therefore override the default. Setting the value
           here puts the desired value ("all") in different storage
           that is not overwritten if/when the MCA param is
           re-registered. This is unless the user has specified a different
           value for this MCA parameter. Make sure we check to see if the
           default is specified before forcing "all" in case that is not what
           the user desires. Note that we do *NOT* set this value as an
           environment variable, just so that it won't be inherited by
           any spawned processes and potentially cause unintended
           side-effects with launching RTE tools... */
        mca_base_var_set_value(ret, allvalue, 4, MCA_BASE_VAR_SOURCE_DEFAULT, NULL);
    }

    OMPI_TIMING_NEXT("initialization");

    /* Setup RTE */
    if (OMPI_SUCCESS != (ret = ompi_rte_init (&argc, &argv))) {
            printf("EXTREME ERROR!\n");
        return ompi_instance_print_error ("ompi_mpi_init: ompi_rte_init failed", ret);
    }

Briefly looking into the function, it does PMIx init and various other operations, so it looks suspicious. However, considering that adding more prints seems to make reproducing harder, I'm going to just manually attach to all 144 processes and see if one of them is doing something different.
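
One way to grab a backtrace from every local process on a node without attaching by hand (assuming the test binary is named allgatherv, as in the launch command above):

    for pid in $(pgrep -f allgatherv); do
        gdb -p "$pid" -batch -ex "thread apply all bt" > "bt.$pid" 2>&1
    done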

rhc54 commented 1 year ago

If you can find where in that function it blocks, it would help a lot. I'm betting it's in PMIx_Init, and the problem is that the shmem component cannot attach, which is why disabling that component makes the problem go away. However, it could be that you get a little further and we hang while trying to retrieve some piece of info.

wckzhang commented 1 year ago

I couldn't find any different stack trace. There should be 36 procs per node, but it looks like on the hanging node (the one without a fence upcall) there are only 35 procs. Could something be happening that kills a process in PMIx?

rhc54 commented 1 year ago

It shouldn't happen, but I cannot say that it is impossible. What I'd do is put print statements around the PMIx_Init call and see if you always exit it cleanly. If you see the print going in, but don't see a print coming out and the proc is gone - that's a sure sign we are somehow calling exit or causing you to segfault. Puzzling thing: PRRTE should see either of those situations and abort the job.

rhc54 commented 1 year ago

I suppose it is possible we are exiting with zero status - PRRTE would think that is normal and do nothing about it. I'll see if we can add some debug to help detect that situation should you find we are not returning from PMIx_Init and the proc is gone. If the proc is still there, then we are likely hung and PRRTE won't know about it.

wckzhang commented 1 year ago

Yes, I'm pretty sure PMIx_Init is killing a proc. It doesn't seem to be just hanging; there's a proc missing on one node each time, and I confirmed it with these prints:

 /* initialize the selected module */
    printf("Entering pmix init\n");
    if (!PMIx_Initialized() && (PMIX_SUCCESS != (ret = PMIx_Init(&opal_process_info.myprocid, NULL, 0)))) {
        printf("PMIx_Init errored\n");
    .
    .
    .
    else {
        printf("Exiting pmix init successfully\n");
    }

The "Enter" print prints out 144 times, the error print printed out 0 times, and the exit print printed out 143 times. There is also a missing proc when I look at "top", only for the node without a fence upcall.

rhc54 commented 1 year ago

Okay, that helps a bunch - thanks! I'm going to disable the gds/shmem component until someone has time to address this problem. I'll post a submodule update to pick up the change and report it here, as I lack commit privileges for OMPI.

rhc54 commented 1 year ago

Okay, the update is here

samuelkgutierrez commented 1 year ago

I suppose it is possible we are exiting with zero status - PRRTE would think that is normal and do nothing about it.

A straightforward way to test this might be to register an atexit() function, before PMIx_Init() is called, that emits useful debugging information.
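
A minimal sketch of that idea (hypothetical handler name; it catches any normal exit() path, though not _exit() or a fatal signal):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* hypothetical handler -- fires on any normal exit(), even one buried in a library */
    static void report_exit(void)
    {
        fprintf(stderr, "pid %d: atexit handler fired\n", (int) getpid());
    }

    /* register it before PMIx_Init() is reached, e.g. at the top of main():
     *     atexit(report_exit);
     */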

samuelkgutierrez commented 1 year ago

I've been thinking about this some. Here are some general thoughts and questions:

rhc54 commented 1 year ago

The lack of any indicators is indeed puzzling. If the process was terminating abnormally (e.g., SIGBUS), then PRRTE would (a) bark loudly, and (b) terminate the entire job. It wouldn't be allowed to hang. Likewise, if at least one local process calls PMIx_Init, then PRRTE marks the job as requiring formal finalization - i.e., all processes are required to call PMIx_Finalize prior to exiting. If any process fails to do so, then PRRTE will declare the job to be abnormally terminated, bark loudly, and kill the entire job.

I see a few possibilities:

The fact that the problem goes away with gds/shmem disabled would seem to indicate that it is somehow related to that component being present. Just not clear precisely how it is involved.