open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.17k stars 861 forks source link

MPI_T_pvar_handle_alloc() notify callback to unloaded code #4433

Closed markalle closed 4 years ago

markalle commented 7 years ago

I'm hitting segv when the pvar system tries to trigger a callback into code from an MCA that has been dlclosed(). I'm able to hit the issue with master on x86

Testcase: https://gist.github.com/markalle/9dbc086af688afdc2632228ca2c91356 % mpicc -o x pvars_from_unloaded_mca.c % mpirun -np 1 ./x

The sequence of events is

I'm wondering what is the preferred way to fix this. I'm inclined to say if an MCA gets unloaded then any pvars it created need to be removed/invalidated. I found an mca_base_pvar_mark_invalid() function, and if I add a few calls to that inside mca_pml_ob1_component_close() matching what was registered earlier the problem does go away, eg:

    int i, n;
    int nameLen, descLen, verbosity, var_class;
    int binding, is_readonly, is_continuous, is_atomic;
    char name[128], desc[1024];
    MPI_T_enum enumtype;
    MPI_Datatype datatype;

    MPI_T_pvar_get_num(&n);
    for (i=0; i<n; ++i) {
        nameLen = sizeof(name);
        descLen = sizeof(desc);
        MPI_T_pvar_get_info(i, name, &nameLen, &verbosity, &var_class,
                        &datatype, &enumtype, desc, &descLen, &binding,
                        &is_readonly, &is_continuous, &is_atomic);
        if (0 == strcmp(name, "pml_ob1_unexpected_msgq_length") ||
            0 == strcmp(name, "pml_ob1_posted_recvq_length"))
        {
            mca_base_pvar_mark_invalid(i);
        }
    }

But I really don't like that solution. Is there a more automated way built into the pvar system to make this happen? And I haven't examined the cvar or any other of the var_register systems, do they also have callbacks that will cause the same problem?

bosilca commented 7 years ago

@markalle the PML selection calls mca_base_components_close on all not-selected components. If we follow the function call chain, we see a call to mca_base_component_close and finally the expected mca_base_component_unload which removes all vars related to the component. So the problem seems to be coming from some other place ...

I could reproduce the segfault and I looked into it. The problem is not the MCA variables, but the way MPI forces us to declare the callback triggers. At the API level we call MPI_T_pvar_handle_alloc with &comm, but this gets translated simply into a void. However, from OMPI perspective this should be a MPI_Comm or more precisely a ompi_communicator_t**.

There are 2 ways to fix this. Either we modify all PVAR callbacks to do the extra translation (aka. instead of (ompi_communicator_t*)obj_handle use *(ompi_communicator_t**)obj_handle), or we modify the mca_base_pvar_handle_alloc to convert from the void* into the correct type void**.

Personally I prefer the second approach (patch below), as it would allow us to defined sessions that can persist outside the C code block boundaries. @hjelmn what do you think ?

diff --git a/opal/mca/base/mca_base_pvar.c b/opal/mca/base/mca_base_pvar.c
index ca1528278f..c6b1ca4d7d 100644
--- a/opal/mca/base/mca_base_pvar.c
+++ b/opal/mca/base/mca_base_pvar.c
@@ -463,7 +463,7 @@ int mca_base_pvar_handle_alloc (mca_base_pvar_session_t *session, int index, voi
             break;
         }

-        pvar_handle->obj_handle = obj_handle;
+        pvar_handle->obj_handle = (NULL == obj_handle ? NULL : *(void**)obj_handle);
         pvar_handle->pvar = pvar;

         *handle = pvar_handle;
markalle commented 7 years ago

I think we're hitting two different segv then. I think I agree with your fix for the one you're hitting.

I have to admit my original segv was on ppc and I assumed the x86 segv was the same. But I'm still able to demonstrate an issue on x86 using the below blob of code that looks up a given function pointer in /proc/<pid>/maps

  1. I put that code in mca_base_pvar_notify() so any time it triggers a notification function it prints the %p of it and looks up where that address lives
  2. I put a similar blob in mca_pml_ob1_component_register() just with a different address to lookup: addr = (void*)mca_pml_ob1_comm_size_notify and a different print statement

The result is a pvar gets created in mca_pml_ob1_component_register() with callback mca_pml_ob1_comm_size_notify which at that time is in /proc/pid/maps at /path/to/ompi/lib/openmpi/mca_pml_ob1.so. Then when the testcase walks the list of pvars calling MPI_T_pvar_handle_alloc(), it triggers the same function pointer which is now completely different code: /path/to/ompi/lib/libopen-pal.so.0.0.0. And you also add a print statement in the callbcak mca_pml_ob1_comm_size_notify() to confirm that that function is not being triggered.

I haven't traced through the mca_base_component_unload code you described to see where that might be going wrong, but the print statements at var-register time and var-notify time are showing the callback being set and then being used even though the var should have been unloaded.

{
    void *addr, *left, *right;
    char filename[256];
    char *line;
    size_t len;
    char str1[48], str2[48], *pdash;
    FILE *fp;

    addr = *(void**)&(handle->pvar->notify); // sidestep warning about data ptr vs fn ptr
    printf("- looking up %p notificaiton function in /proc/%d/maps\n", addr, getpid());
    sprintf(filename, "/proc/%d/maps", getpid());
    fp = fopen(filename, "r");
    line = NULL;
    len = 0;
    // assume the addr format is hex w/o leading 0x, so add 0x here
    strcpy(str1, "0x");
    strcpy(str2, "0x");
    while (getline(&line, &len, fp) >= 0) {
        pdash = line;
        while (*pdash && *pdash != '-') { ++pdash; }
        if (*pdash) { *pdash = ' '; }
        if (*pdash && 2 == sscanf(line, "%s %s ", &str1[2], &str2[2])) {
            *pdash = '-';
            left = (void*) strtoll(str1, NULL, 0);
            right = (void*) strtoll(str2, NULL, 0);
            if (left <= addr && addr < right) {
                printf("[contains %p]: %s", addr, line);
            }
        }
    }
    if (line) { free(line); }
    close(fp);
    fflush(stdout);
}
bosilca commented 7 years ago

I figured out what was going on. In my environment the default PML was OB1, it was therefore never unloaded, so the 2 pvars were always valid. Enabling another PML via MCA allows to reproduce your segfault.

4442 has the fix you are looking for. The short story is that the pvars were incorrectly registered without the MCA_BASE_PVAR_FLAG_IWG flag. Without this flag the PVAR remains valid even after the associated component has been unloaded (dlclosed), which is totally weird to me.

However, while looking at this I noticed few questionable things:

  1. If a PVAR can be associated with a group that pertains to a component, why it is not gaining the MCA_BASE_PVAR_FLAG_IWG automatically ?

  2. register_components is a little to eager to load all components (completely ignoring the selection arguments provided by the user). As an example, even if I set pml=XXX, all PML components are loaded and registered, and later on they are all unregistered with the exception of the selected one. This increases the traffic as all component need to be open.

jsquyres commented 7 years ago

Whoa -- that doesn't sound like correct register_components() behavior.

@hjelmn Was that intended?

bosilca commented 7 years ago

It might be a mistake, as by the time we call register_component we already have the user selection in the framework structure. I would easily add this to the PR, but I not sure if this will not break something else.

hjelmn commented 6 years ago

@bosilca You fixed this already in master, correct? Did you open a PR to all the affected branches?

jsquyres commented 6 years ago

@hjelmn George did a first fix in #4442; I think he's asking 2 followup questions (see https://github.com/open-mpi/ompi/issues/4433#issuecomment-341625270). Especially the 2nd question -- I don't think that's correct register_components() behavior. George's #4442 doesn't do anything about that.

hjelmn commented 6 years ago

@bosilca The decision was made to only register using the DWG flag with cvars when specifically requested due to some variable use cases in orte. pvars inherited this from cvars.

As to registering all components. We have to. A user can call MPI_T_Init () then change the pml so we need to know about all the pmls and their variables. It should be fine as long as the correct pvar registration function is used.

bosilca commented 6 years ago

@hjelmn if the user requests a specific component, no tools should be allowed to overwrite this !

hjelmn commented 6 years ago

We cannot assume a variable set by mpit is set by a tool. The interface can and will be used by apps.

bosilca commented 6 years ago

I'm specifically referring to your second comment regarding registering all components. If the PML is set when we reach the call to MPI_T_init it means it has been set either via a configuration file or via the command line. In both cases, allowing the selection to be changed via MPI_T sounds unreasonable to me. To be more precise, I would understand that one can limit the components via the corresponding MCA variable to a subset of the existing components and then select one of them via MPI_T. However, once a component has been selected via an MCA var the choice should be final.

hjelmn commented 6 years ago

Not sure I agree. The way I see it file < environment < command line < MPI_T. The setup reflects that choice. I don't really care either way but this order was set after some discussion during the implementation of MPI_T in Open MPI. This order only has an affect pre-MPI_Init. After MPI_Init all MCA selection choices are final and can not be changed. The only exception are lazy-load frameworks like io.

jsquyres commented 4 years ago

@markalle @bosilca The reproducer on this issue no longer causes a segv. Do you remember if this is still an issue? If not, I'm inclined to close this issue.

bosilca commented 4 years ago

It has been fixed in a reasonable way in #4442.