open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

No protection for opening plugins from wrong OMPI version #475

Open opoplawski opened 9 years ago

opoplawski commented 9 years ago

I'm trying to build openmpi master from openmpi-dev-1330-g7640507 as the Fedora package for testing. I'm getting:

$ ./dlopen_test 
[barry:08971] *** Process received signal ***
[barry:08971] Signal: Segmentation fault (11)
[barry:08971] Signal code: Address not mapped (1)
[barry:08971] Failing at address: 0x1
[barry:08971] [ 0] /lib64/libpthread.so.0(+0x100d0)[0x7f6a37c2c0d0]
[barry:08971] [ 1] /lib64/libc.so.6(strlen+0x2a)[0x7f6a378e9c8a]
[barry:08971] [ 2] /lib64/libc.so.6(__strdup+0xe)[0x7f6a378e99ae]
[barry:08971] [ 3] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/opal/.libs/libopen-pal.so.0(+0x46204)[0x7f6a38e15204]
[barry:08971] [ 4] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/opal/.libs/libopen-pal.so.0(mca_base_component_var_register+0x3c)[0x7f6a38e1549c]
[barry:08971] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_shmem_mmap.so(+0x1264)[0x7f6a36e40264]
[barry:08971] [ 6] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/opal/.libs/libopen-pal.so.0(mca_base_framework_components_register+0x159)[0x7f6a38e196e9]
[barry:08971] [ 7] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/opal/.libs/libopen-pal.so.0(mca_base_framework_register+0x166)[0x7f6a38e19a46]
[barry:08971] [ 8] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/opal/.libs/libopen-pal.so.0(mca_base_framework_open+0x31)[0x7f6a38e19ac1]
[barry:08971] [ 9] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/opal/.libs/libopen-pal.so.0(opal_init+0x18c)[0x7f6a38df690c]
[barry:08971] [10] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/ompi/debuggers/.libs/lt-dlopen_test(+0x121f)[0x7f6a3981021f]
[barry:08971] [11] /lib64/libc.so.6(__libc_start_main+0xf0)[0x7f6a3787efe0]
[barry:08971] [12] /export/home/orion/fedora/openmpi/openmpi-dev-1330-g7640507/ompi/debuggers/.libs/lt-dlopen_test(+0xd59)[0x7f6a3980fd59]
[barry:08971] *** End of error message ***

(gdb) bt
#0  strlen () at ../sysdeps/x86_64/strlen.S:106
#1  0x00007fe55ad4c9ae in __GI___strdup (s=0x1 <error: Cannot access memory at address 0x1>)
    at strdup.c:41
#2  0x00007fe55c278204 in register_variable (framework_name=<optimized out>,
    component_name=0x7fe55a4a5178 <mca_shmem_mmap_component+56> "mmap",
    variable_name=<optimized out>, description=<optimized out>,
    type=MCA_BASE_VAR_TYPE_VERSION_STRING, enumerator=<optimized out>, bind=0,
    flags=(MCA_BASE_VAR_FLAG_SETTABLE | MCA_BASE_VAR_FLAG_DWG), info_lvl=OPAL_INFO_LVL_9,
    scope=MCA_BASE_VAR_SCOPE_LOCAL, synonym_for=-1,
    storage=0x7fe55a4a5100 <opal_shmem_mmap_nfs_warning>, project_name=0x0)
    at mca_base_var.c:1417
#3  0x00007fe55c278457 in mca_base_var_register (project_name=project_name@entry=0x0,
    framework_name=<optimized out>, component_name=<optimized out>,
    variable_name=<optimized out>, description=<optimized out>, type=<optimized out>,
    enumerator=<optimized out>, bind=<optimized out>, flags=<optimized out>,
    info_lvl=<optimized out>, scope=<optimized out>,
    storage=0x7fe55a4a5100 <opal_shmem_mmap_nfs_warning>) at mca_base_var.c:1444
#4  0x00007fe55c27849c in mca_base_component_var_register (component=<optimized out>,
    variable_name=<optimized out>, description=<optimized out>, type=<optimized out>,
    enumerator=<optimized out>, bind=<optimized out>, flags=MCA_BASE_VAR_FLAG_SETTABLE,
    info_lvl=OPAL_INFO_LVL_9, scope=MCA_BASE_VAR_SCOPE_LOCAL,
    storage=0x7fe55a4a5100 <opal_shmem_mmap_nfs_warning>) at mca_base_var.c:1457
#5  0x00007fe55a2a3264 in mmap_register ()
   from /usr/lib64/openmpi/lib/openmpi/mca_shmem_mmap.so
#6  0x00007fe55c27c6e9 in register_components (project_name=<optimized out>,
    dest=0x7fe55c4c8550 <opal_shmem_base_framework+80>, src=0x7fff2a23da20, output_id=-1,
    type_name=<optimized out>) at mca_base_components_register.c:116
#7  mca_base_framework_components_register (
    framework=framework@entry=0x7fe55c4c8500 <opal_shmem_base_framework>,
    flags=flags@entry=MCA_BASE_REGISTER_DEFAULT) at mca_base_components_register.c:67
#8  0x00007fe55c27ca46 in mca_base_framework_register (
    framework=0x7fe55c4c8500 <opal_shmem_base_framework>,
    flags=flags@entry=MCA_BASE_REGISTER_DEFAULT) at mca_base_framework.c:112
#9  0x00007fe55c27cac1 in mca_base_framework_open (
    framework=0x7fe55c4c8500 <opal_shmem_base_framework>,
    flags=flags@entry=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:136
#10 0x00007fe55c25990c in opal_init (pargc=<optimized out>, pargv=<optimized out>)
    at runtime/opal_init.c:471
#11 0x00007fe55cc7321f in main (argc=1, argv=0x7fff2a23dc28) at dlopen_test.c:133

(gdb) up
#2  0x00007fe55c278204 in register_variable (framework_name=<optimized out>,
    component_name=0x7fe55a4a5178 <mca_shmem_mmap_component+56> "mmap",
    variable_name=<optimized out>, description=<optimized out>,
    type=MCA_BASE_VAR_TYPE_VERSION_STRING, enumerator=<optimized out>, bind=0,
    flags=(MCA_BASE_VAR_FLAG_SETTABLE | MCA_BASE_VAR_FLAG_DWG), info_lvl=OPAL_INFO_LVL_9,
    scope=MCA_BASE_VAR_SCOPE_LOCAL, synonym_for=-1,
    storage=0x7fe55a4a5100 <opal_shmem_mmap_nfs_warning>, project_name=0x0)
    at mca_base_var.c:1417
1417                ((char **)storage)[0] = strdup (((char **)storage)[0]);
(gdb) print storage
$1 = (void *) 0x7fe55a4a5100 <opal_shmem_mmap_nfs_warning>
(gdb) print ((char **)storage)[0]
$2 = 0x1 <error: Cannot access memory at address 0x1>

./configure --prefix=/usr/lib64/openmpi \
    --mandir=/usr/share/man/openmpi-x86_64 \
    --includedir=/usr/include/openmpi-x86_64 \
    --sysconfdir=/etc/openmpi-x86_64 \
    --disable-silent-rules --enable-mpi-java \
    --with-libevent=/usr --with-verbs=/usr --with-sge --with-valgrind \
    --enable-memchecker --with-hwloc=/usr --with-libltdl=/usr \
    CC=gcc CXX=g++ \
    'LDFLAGS=-Wl,-z,relro -specs=/usr/lib/rpm/redhat/redhat-hardened-ld' \
    'CFLAGS= -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic' \
    'CXXFLAGS= -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic' \
    FC=gfortran \
    'FCFLAGS= -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic'

opoplawski commented 9 years ago

Okay, this turned out to be caused by having an already installed openmpi on the system. Apparently the dlopen() code picks up the installed plugins in preference to the ones in the build tree. This seems like a bug as well, but I'm not sure whether it is new.

hjelmn commented 9 years ago

That is a long-standing bug in Open MPI. We are discussing ways to correct this for 1.9.

opoplawski commented 9 years ago

Okay, I'll leave this to you to close/reassign/etc. as you see fit then.

hjelmn commented 9 years ago

Thanks Howard.

To begin the discussion we could probably use the following scheme for plugin paths:

<prefix>/lib[64]/openmpi/<project>/<PROJECT_VERSION>/<framework>/<ABI_VERSION>/

This is the most flexible naming scheme. So for example a 1.9/2.0 btl 3.0 plugin might be found in:

<prefix>/lib/openmpi/opal/2.0/btl/3.0

jsquyres commented 9 years ago

I mostly agree, but will be slightly more pedantic:

<libdir>/openmpi/<project>/<PROJECT_VERSION>/<framework>/<ABI_VERSION>/

That being said, this is probably a little overkill -- do we need both PROJECT_VERSION and ABI_VERSION? I.e., won't those 2 be chained together?

jsquyres commented 9 years ago

Actually, I wasn't pedantic enough. :-)

This is more correct:

<pkglibdir>/<project>/<PROJECT_VERSION>/<framework>/<ABI_VERSION>/
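
A minimal sketch of how a loader might assemble a search directory under this scheme. The helper name `plugin_dir` and its parameters are hypothetical and are not part of Open MPI; only the path layout comes from the comment above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not an Open MPI API): formats
 *   <pkglibdir>/<project>/<PROJECT_VERSION>/<framework>/<ABI_VERSION>/
 * into buf. Returns the path length on success, or -1 if the path
 * does not fit in buf or formatting fails. */
static int plugin_dir(char *buf, size_t len, const char *pkglibdir,
                      const char *project, const char *project_version,
                      const char *framework, const char *abi_version)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%s/%s/%s/%s/%s/", pkglibdir, project,
                     project_version, framework, abi_version);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Called with `"/usr/lib64/openmpi"`, `"opal"`, `"2.0"`, `"btl"`, `"3.0"`, this would produce `/usr/lib64/openmpi/opal/2.0/btl/3.0/`, which matches the earlier btl example with `<pkglibdir>` filled in.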

bosilca commented 9 years ago

Our modules have a version number. Why isn't simply discarding, right after dlopen, all modules with the wrong version number an adequate solution?

hjelmn commented 9 years ago

Hmm, that may be the way to go. We now have the mca version, project version, and type version in the mca component.
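
A rough sketch of the "discard after dlopen" idea, assuming each loaded component exposes the three versions mentioned above. The types `version_t` and `component_info_t` and the function `discard_mismatched` are simplified, hypothetical stand-ins and do not reflect the real `mca_base_component_t` layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified version triple. */
typedef struct {
    int major, minor, release;
} version_t;

/* Hypothetical summary of what a freshly dlopen'ed component reports. */
typedef struct {
    const char *name;
    version_t mca;      /* MCA base version */
    version_t project;  /* project version (e.g. opal) */
    version_t type;     /* framework ("type") version */
} component_info_t;

static int version_equal(version_t a, version_t b)
{
    return a.major == b.major && a.minor == b.minor &&
           a.release == b.release;
}

/* Compacts comps[] in place, keeping only the entries whose three
 * versions all equal those in `want`. Returns the number kept.
 * A real loader would also dlclose() each rejected component. */
static size_t discard_mismatched(component_info_t *comps, size_t n,
                                 const component_info_t *want)
{
    assert(comps != NULL || n == 0);
    assert(want != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (version_equal(comps[i].mca, want->mca) &&
            version_equal(comps[i].project, want->project) &&
            version_equal(comps[i].type, want->type)) {
            comps[kept++] = comps[i];
        }
    }
    return kept;
}
```

One caveat with this approach: the check reads fields out of a structure defined by the plugin's own build, so it only works if the version fields sit at a stable offset across releases.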

hjelmn commented 9 years ago

To make this work well I should probably version the frameworks themselves. Will investigate.

hppritcha commented 9 years ago

George, do you mean the shared library version number?

Couldn't we use some versioned symbols similar to the way libfabric does it, and then check for the presence of a particular versioned symbol in a *.so using dlvsym? Are there any other projects that need this kind of subdirectory structure for shared libraries they use internally? It seems a little weird.

Is the goal to be able to install multiple versions of Open MPI in the same location, or just to make sure that an incompatible *.so in the openmpi dir is not dlopen'd, with subsequent badness as reported above?

hjelmn commented 9 years ago

Howard, the primary goal is to not open incompatible .so's, but it would be a useful feature to be able to have multiple versions of a project (opal for example) installed in the same tree.

jsquyres commented 9 years ago

Don't forget that there are other reasons why we can't install two versions of OMPI into the same tree, such as:

The goal is to prevent a scenario like this:

  1. User installs version A.B.C into $prefix
  2. User later installs version D.E.F into the same $prefix
  3. User runs new D.E.F OMPI and Bad Things happen because some old A.B.C components were still in $prefix/lib/openmpi

That being said, perhaps just checking the version numbers in the .so is good enough -- perhaps a new directory structure is not worth it (since, even if you do that, you can't install multiple versions of OMPI into the same tree).

@hppritcha I think the symbol versioning stuff is a slightly different use case than what we're trying to protect against here...?

opoplawski commented 9 years ago

Just to be explicit - the problem I was running into was having openmpi 1.8.2 installed in /usr, then building newer versions in my home directories and running in-tree tests.

hppritcha commented 9 years ago

I thought the *.so's in the openmpi directory lack version numbers, hence the -avoid-version in the *_la_LDFLAGS in all the mca/*/*/Makefile.am's. I guess we'd have to pay attention to VERSION then and not just fill in 0.0.0? I'd be fine with that. As long as we kept true to the current/rev/age formula and have C-A really mean something, this would take care of the problem, including Opoplawski's problem.
It would get complicated if the version numbers for the different *.so's could vary.
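
For readers unfamiliar with the current/rev/age formula, here is a small sketch of the arithmetic behind "C-A". The type and function names are hypothetical; the rule itself is libtool's `-version-info current:revision:age` convention, under which a library supports interface numbers from `current - age` through `current`, and on Linux the file is named with the suffix `.so.(C-A).A.R`.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for a libtool current:revision:age triple. */
typedef struct {
    unsigned current, revision, age;
} lt_version_t;

/* Oldest interface number the library still supports: "C-A". */
static unsigned lt_oldest_interface(lt_version_t v)
{
    assert(v.age <= v.current);
    return v.current - v.age;
}

/* Nonzero if a library with version `lib` can serve a client that
 * was linked against interface number `needed`. */
static int lt_supports(lt_version_t lib, unsigned needed)
{
    return needed >= lt_oldest_interface(lib) && needed <= lib.current;
}

/* Writes the Linux-style file suffix ".so.(C-A).A.R" into buf.
 * Returns its length, or -1 if it does not fit. */
static int lt_linux_suffix(char *buf, size_t len, lt_version_t v)
{
    int n = snprintf(buf, len, ".so.%u.%u.%u",
                     lt_oldest_interface(v), v.age, v.revision);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For example, `3:2:1` supports interfaces 2 and 3 and yields the suffix `.so.2.1.2`. With `-avoid-version` none of this is encoded in the file name, which is why the plugins cannot be told apart that way today.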

hjelmn commented 9 years ago

Yes. The .so files have no version number in the filename. What George is referring to is the mca_base_component_t inside the .so. That contains version information for the plugin.

The only problem with using that structure is we may change it from release to release. We just did this by adding the project name and version.

hppritcha commented 9 years ago

sounds like a problem of introducing standard shared library versioning. just say no to -avoid-version and really use so versioning.

hppritcha commented 9 years ago

@hjelmn is #449 good enough to close this bug?

jsquyres commented 9 years ago

We talked about this in person at the dev meeting in June 2015. We concluded:

  1. It is not sufficient to pass the framework (and/or framework version) to the framework open function, and only open components that match that version, because you could run into a scenario like this:
    • OMPI vA.B.C is installed
    • OMPI vA.B.(C+1) is installed
    • Component X in both of these has the same framework version.
    • But component X for vA.B.(C+1) uses a symbol in the framework base that exists in A.B.(C+1), but not in A.B.C.
  2. Hence, we really need to check the project version of components to decide whether we should open them or not.

@hjelmn says that he will get to this some time in the v2.x series.

hppritcha commented 7 years ago

Moving to 3.x