open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Solaris does not work correctly with event port polling #36

Closed ompiteam closed 7 years ago

ompiteam commented 10 years ago

We have observed hangs when running applications on Solaris. It appears that this is because of the use of event ports.

Here is an example of the stack trace when it hangs.

alamodome 43 =>pstack 1964 1966
1964:    IMB-MPI1.trunk barrier
fe6c060c lwp_yield (0, 1, fe25d134, fe25ce58, 4, 0) + 8
fef9e210 opal_progress (ff06f680, 0, ff06f688, 0, ff06f67c, 1) + 12c
fe5150f4 barrier  (0, fe52ce9c, fe52e9b9, fe51ab60, fe51aaa0, ff252c10) + 394
fe887ac0 ompi_mpi_init (1b4, fe2a7568, 0, 408, fee7ca4c, fed18d28) + 7e8
fea19ad4 MPI_Init (ffbff82c, ffbff830, fee8072d, b38, fee7ca4c, 35450) + 160
00012830 main     (2, ffbff84c, ffbff858, 2a800, ff3a0100, ff3a0140) + 10
000123f8 _start   (0, 0, 0, 0, 0, 0) + 108

Here it is running with the EVENT_SHOW_METHOD environment variable set, so we can see which type of polling is being used.

burl-ct-v440-2 140 =>mpirun -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 -mca btl self,sm,tcp bcast
[msg] libevent using: poll
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports
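
With more processes, the `[msg] libevent using:` lines get tedious to read by eye. The following is a minimal sketch, not part of Open MPI or libevent, that tallies the reported backends from captured output. The helper name `tally_backends` is hypothetical, and the sample text is copied from the run above.

```python
from collections import Counter

# Prefix printed when EVENT_SHOW_METHOD is set (taken from the output above).
PREFIX = "[msg] libevent using: "


def tally_backends(output: str) -> Counter:
    """Count how many times each libevent backend is reported in `output`."""
    counts = Counter()
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            counts[line[len(PREFIX):].strip()] += 1
    return counts


sample = """\
[msg] libevent using: poll
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports
"""

for backend, count in tally_backends(sample).items():
    print(f"{backend}: {count}")
```

Lines that do not start with the prefix (for example the application's own output) are ignored, so the whole mpirun output can be passed in unfiltered.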

And if we change it to use devpoll, poll, or select, it works.

burl-ct-v440-2 141 =>mpirun -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 -mca opal_event_include poll bcast
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
Starting MPI_Bcast...
All done.
All done.
All done.
All done. 
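
The effect of `-mca opal_event_include poll` can be pictured as a filter over the backends that libevent would otherwise try in order. The sketch below is only a model of that idea under stated assumptions: the candidate order (event ports, then devpoll, poll, select) is assumed from the behavior seen in this ticket, the treatment of `all` as "no restriction" is an assumption, and `pick_included_backend` is a hypothetical name, not an OPAL function.

```python
# Assumed order in which backends are tried on Solaris (an assumption based
# on the output in this ticket, not taken from the Open MPI source).
SOLARIS_CANDIDATES = ["event ports", "devpoll", "poll", "select"]


def pick_included_backend(candidates, include="all"):
    """Return the first candidate permitted by a comma-separated include list.

    `include` models the value given to opal_event_include; "all" is assumed
    to mean that no restriction applies. Returns None if nothing matches.
    """
    allowed = {name.strip() for name in include.split(",") if name.strip()}
    for name in candidates:
        if "all" in allowed or name in allowed:
            return name
    return None


print(pick_included_backend(SOLARIS_CANDIDATES))            # no restriction
print(pick_included_backend(SOLARIS_CANDIDATES, "poll"))    # as in the run above
```

Under this model, restricting the list to `poll` skips event ports entirely, which matches the working run shown above.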

And here is the case of disabling event ports and letting the library pick the next available method.

burl-ct-v440-2 147 =>setenv EVENT_NOEVPORT
burl-ct-v440-2 148 =>mpirun -x EVENT_NOEVPORT -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 bcast
[msg] libevent using: poll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
Starting MPI_Bcast...
All done.
All done.
All done.
All done.
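
The fallback seen here, where devpoll is chosen once event ports are disabled, can be modeled as "take the first backend whose disable variable is not set". This is a rough sketch of that selection rule, not libevent's actual code. `EVENT_NOEVPORT` appears in this ticket; the other variable names and the backend order are assumptions based on libevent's `EVENT_NO<METHOD>` naming, and `pick_enabled_backend` is a hypothetical helper.

```python
import os

# (backend name, environment variable that disables it), in the assumed
# order of preference. Only EVENT_NOEVPORT is confirmed by this ticket.
BACKENDS = [
    ("event ports", "EVENT_NOEVPORT"),
    ("devpoll", "EVENT_NODEVPOLL"),
    ("poll", "EVENT_NOPOLL"),
    ("select", "EVENT_NOSELECT"),
]


def pick_enabled_backend(env=None):
    """Return the first backend whose disable variable is absent from `env`.

    A variable counts as set even when its value is empty, which mirrors
    `setenv EVENT_NOEVPORT` with no value. Returns None if all are disabled.
    """
    env = os.environ if env is None else env
    for name, disable_var in BACKENDS:
        if disable_var in env:
            continue  # backend explicitly disabled, try the next one
        return name
    return None


print(pick_enabled_backend({}))                        # nothing disabled
print(pick_enabled_backend({"EVENT_NOEVPORT": ""}))    # event ports disabled
```

Passing a plain dictionary instead of the real environment keeps the sketch easy to experiment with.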

We saw this only with our debuggable builds, not with our optimized builds. It is not clear which difference in the configure options triggers it.

Here is the configure line that triggers the problem.

../configure --with-sge --disable-io-romio --enable-orterun-prefix-by-default --enable-heterogeneous --enable-trace --enable-debug --enable-shared --enable-mpi-f90 --with-mpi-f90-size=trivial --without-threads --disable-mpi-threads --disable-progress-threads CFLAGS="-g" FFLAGS="-g" --prefix=/workspace/rolfv/ompi/sparc/trunk/release --libdir=/workspace/rolfv/ompi/sparc/trunk/release/lib --includedir=/workspace/rolfv/ompi/sparc/trunk/release/include --with-wrapper-ldflags="-R/workspace/rolfv/ompi/sparc/trunk/release/lib -R/workspace/rolfv/ompi/sparc/trunk/release/lib/sparcv9" CC=cc CXX=CC F77=f77 F90=f90 --enable-cxx-exceptions

ompiteam commented 10 years ago

Imported from trac issue 1273. Created by rolfv on 2008-04-17T15:26:54, last modified: 2008-07-07T17:29:56

ompiteam commented 10 years ago

Trac comment by jsquyres on 2008-04-17 16:08:57:

Per some off-ticket discussion, this is possibly a problem with libevent itself (but it could also be a problem with Solaris event ports).

Unfortunately, we don't really have time to look into this at the moment, so we're going to just temporarily disable Solaris event ports in libevent. We're leaving this ticket here as future documentation so that we can come back to it someday (post 1.3). Note: if it really is a problem in libevent, we should get it fixed upstream.

Disabling Solaris event ports requires a configure change, so I'll hold off committing until tonight.

ompiteam commented 10 years ago

Trac comment by jsquyres on 2008-04-17 19:14:44:

(In [18199]) Temporarily disable Solaris ports support in libevent. Refs https://svn.open-mpi.org/trac/ompi/ticket/1273

ompiteam commented 10 years ago

Trac comment by rolfv on 2008-07-07 17:29:56:

Here are a few links that have some information about the performance of /dev/poll in Solaris and the use of event ports in libevent.

http://blogs.sun.com/dap/entry/libevent_and_solaris_event_ports

http://blogs.sun.com/dap/entry/event_ports_and_performance_take

rhc54 commented 7 years ago

We've fixed this to the extent possible.