pmodels / mpich

Official MPICH Repository
http://www.mpich.org

mpich-3.0.4 mpiexec breaks on freebsd #1820

Closed mpichbot closed 8 years ago

mpichbot commented 8 years ago

Originally by balay on 2013-04-26 16:11:50 -0500


mpich-3.0.4 mpiexec breaks on freebsd - whereas an identical build with mpich-3.0.4 works.

[balay@wii ~/petsc.clone-3/src/ksp/ksp/examples/tutorials]$ ./ex2
Norm of error 0.000156044 iterations 6
[balay@wii ~/petsc.clone-3/src/ksp/ksp/examples/tutorials]$ ~/petsc.clone-3/arch-freebsd-cxx-cmplx-pkgs-dbg/bin/mpiexec -n 1 ./ex2
Norm of error 0.000156044 iterations 6
[balay@wii ~/petsc.clone-3/src/ksp/ksp/examples/tutorials]$ ~/petsc.clone-3/arch-freebsd-cxx-cmplx-pkgs-dbg/bin/mpiexec -n 2 ./ex2
[proxy:0:0@wii] stdoe_cb (./pm/pmiserv/pmip_cb.c:51): assert (i < HYD_pmcd_pmip.local.proxy_process_count) failed
[proxy:0:0@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@wii] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@wii] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@wii] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@wii] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[balay@wii ~/petsc.clone-3/src/ksp/ksp/examples/tutorials]$ 

mpichbot commented 8 years ago

Originally by balay on 2013-04-26 16:12:17 -0500


Attachment added: config.log (575.7 KiB)

mpichbot commented 8 years ago

Originally by balaji on 2013-04-26 16:14:10 -0500


Satish, can you give me access to a FreeBSD machine?

mpichbot commented 8 years ago

Originally by balay on 2013-04-26 16:18:48 -0500


Replying to balay:

mpich-3.0.4 mpiexec breaks on freebsd - whereas an identical build with mpich-3.0.4 works.

I meant to say 'an identical build with mpich-3.0.3 works'.

mpichbot commented 8 years ago

Originally by balay on 2013-04-26 16:26:19 -0500


ok - e-mailed you the details.

BTW: I did not get an e-mail update for the comment you made. Does the bug tracker not automatically add me to the cc: list?

mpichbot commented 8 years ago

Originally by balaji on 2013-04-28 05:14:47 -0500


Thanks. I think I have a handle on the error. Can you try the two attached patches? They are out for code-review by Dave.

mpichbot commented 8 years ago

Originally by balaji on 2013-04-28 05:15:04 -0500


Attachment added: 0001-Replace-weak-symbols-check-with-ROMIO-with-the-confd.patch (5.6 KiB)

mpichbot commented 8 years ago

Originally by balaji on 2013-04-28 05:15:10 -0500


Attachment added: 0002-Cleanup-confdb-macros.patch (45.7 KiB)

mpichbot commented 8 years ago

Originally by balay on 2013-04-28 16:43:27 -0500


Replying to balaji:

Thanks. I think I have a handle on the error. Can you try the two attached patches? They are out for code-review by Dave.

I still get errors.

[balay@wii ~/petsc.test/src/ksp/ksp/examples/tutorials]$ /usr/home/balay/petsc.test/arch-freebsd9-c-debug/bin/mpiexec -n 2 ./ex2
[proxy:0:0@wii] stdoe_cb (./pm/pmiserv/pmip_cb.c:51): assert (i < HYD_pmcd_pmip.local.proxy_process_count) failed
[proxy:0:0@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@wii] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@wii] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@wii] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@wii] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[balay@wii ~/petsc.test/src/ksp/ksp/examples/tutorials]$ 

mpichbot commented 8 years ago

Originally by balaji on 2013-04-28 22:20:13 -0500


I just tried a fresh copy and it works fine for me on wii.mcs.anl.gov. There is still some problem with (1) VPATH builds and (2) strict builds, but I'm assuming you are doing neither, since you are able to build fine and are seeing an error at runtime.

What's the exact configure line you are using?

mpichbot commented 8 years ago

Originally by balaji on 2013-04-28 22:28:10 -0500


Oh, you'll need to run autogen.sh after applying the patches I provided. Sorry, I didn't mention that.

mpichbot commented 8 years ago

Originally by balay on 2013-04-29 10:06:26 -0500


Replying to balaji:

What's the exact configure line you are using?

from config.log

  $ ./configure --prefix=/usr/home/balay/petsc.test/arch-freebsd9-c-debug CC=gcc CFLAGS= -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g3 -fno-inline -O0  CXX=g++ CXXFLAGS= -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g   -fPIC   FC=gfortran FCFLAGS= -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g   F77=gfortran FFLAGS= -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g   --enable-shared --enable-sharedlibs=gcc --with-device=ch3:sock --without-mpe --with-pm=hydra --enable-g=meminit --enable-fast

mpichbot commented 8 years ago

Originally by balay on 2013-04-29 10:08:21 -0500


Replying to balaji:

Oh, you'll need to run autogen.sh after applying the patches I provided. Sorry, I didn't mention that.

I started off with the 3.0.4 tarball and ran autogen.sh on Linux, and then copied everything over to this test box.

You can look at my buildfiles at /home/balay/petsc.test/externalpackages/mpich-3.0.4

If you have a prebuilt tarball with these fixes - I can try using that.

mpichbot commented 8 years ago

Originally by balaji on 2013-04-29 21:45:39 -0500


Try /home/balaji/mpich-master-v3.0.4-104-g6487b2b7.tar.gz on wii.

mpichbot commented 8 years ago

Originally by balay on 2013-04-29 23:34:43 -0500


Replying to balaji:

Try /home/balaji/mpich-master-v3.0.4-104-g6487b2b7.tar.gz on wii.

The tarball also gives errors:


[balay@wii ~/petsc.test/src/ksp/ksp/examples/tutorials]$ /usr/home/balay/petsc.test/arch-freebsd9-c-debug/bin/mpiexec -n 2 ./ex2
[proxy:0:0@wii] stdoe_cb (./pm/pmiserv/pmip_cb.c:51): assert (i < HYD_pmcd_pmip.local.proxy_process_count) failed
[proxy:0:0@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@wii] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@wii] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@wii] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@wii] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[balay@wii ~/petsc.test/src/ksp/ksp/examples/tutorials]$ 

mpichbot commented 8 years ago

Originally by balaji on 2013-04-30 07:45:13 -0500


The error seems to be coming from Hydra, so I tried using your mpiexec with my "cpi" example, and it worked fine:

/usr/home/balay/petsc.test/arch-freebsd9-c-debug/bin/mpiexec -n 4 ./examples/cpi
Process 2 of 4 is on wii
Process 0 of 4 is on wii
Process 1 of 4 is on wii
Process 3 of 4 is on wii
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.071336

I couldn't find where your build was. You can also look into my build here: /home/balaji/software/mpich/build
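
For reference, each line of the trace above names a process, a function, and a source location, and the paths (pm/pmiserv, tools/demux, ui/mpich) appear to be relative to the Hydra source tree. A minimal Python sketch for pulling those fields out of a pasted trace follows; the names `FRAME` and `parse_trace` are made up for this illustration, and the pattern is only an assumption based on the lines seen in this ticket, not a documented Hydra log format.

```python
import re

# One trace line looks like:
# [proxy:0:0@wii] stdoe_cb (./pm/pmiserv/pmip_cb.c:51): assert (...) failed
# This pattern is inferred from the output in this ticket only.
FRAME = re.compile(
    r"^\[(?P<proc>[^\]]+)\] (?P<func>\w+) "
    r"\((?P<file>[^:]+):(?P<line>\d+)\): (?P<msg>.*)$"
)


def parse_trace(text):
    """Return one dict per line that matches FRAME; other lines are skipped."""
    frames = []
    for raw in text.splitlines():
        match = FRAME.match(raw.strip())
        if match:
            frame = match.groupdict()
            frame["line"] = int(frame["line"])
            frames.append(frame)
    return frames


# Trace copied from the failing '-n 2' run reported in this ticket.
TRACE = """\
[proxy:0:0@wii] stdoe_cb (./pm/pmiserv/pmip_cb.c:51): assert (i < HYD_pmcd_pmip.local.proxy_process_count) failed
[proxy:0:0@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@wii] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@wii] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@wii] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@wii] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
"""

for frame in parse_trace(TRACE):
    print(frame["proc"], frame["func"], frame["file"], frame["line"])
```

The first frame reported for each process is the failing assert (stdoe_cb in the proxy, control_cb in mpiexec); the remaining frames are the callers reporting the error upward.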

mpichbot commented 8 years ago

Originally by balay on 2013-04-30 08:01:27 -0500


My build files are at /home/balay/petsc.test/externalpackages/mpich-master-v3.0.4-104-g6487b2b7 (configure was run in place, within the source tree).

I don't see breakage with your build, but I do see it with my build (for '-n 2', but not for '-n 4').

[balay@wii ~/junk]$ /home/balaji/software/mpich/build/install/bin/mpicc cpi.c
[balay@wii ~/junk]$ /home/balaji/software/mpich/build/install/bin/mpiexec -n 4 ./a.out 
Process 2 of 4 is on wii
Process 0 of 4 is on wii
Process 1 of 4 is on wii
Process 3 of 4 is on wii
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.002709
[balay@wii ~/junk]$ /home/balaji/software/mpich/build/install/bin/mpiexec -n 2 ./a.out 
Process 0 of 2 is on wii
Process 1 of 2 is on wii
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.001182
[balay@wii ~/junk]$ ~/petsc.test/arch-freebsd9-c-debug/bin/mpiexec -n 4 ./a.out
Process 2 of 4 is on wii
Process 0 of 4 is on wii
Process 1 of 4 is on wii
Process 3 of 4 is on wii
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.002318
[balay@wii ~/junk]$ ~/petsc.test/arch-freebsd9-c-debug/bin/mpiexec -n 2 ./a.out
[proxy:0:0@wii] stdoe_cb (./pm/pmiserv/pmip_cb.c:51): assert (i < HYD_pmcd_pmip.local.proxy_process_count) failed
[proxy:0:0@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@wii] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@wii] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@wii] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@wii] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@wii] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[balay@wii ~/junk]$ 

I see a couple of differences in our builds: yours uses ch3:nemesis where mine uses ch3:sock, and yours adds "--enable-spawn --enable-maintainer-mode --with-pmi=simple --disable-strict" (a bunch of other options differ as well).

[balay@wii ~/junk]$ /home/balaji/software/mpich/build/install/bin/mpichversion 
MPICH Version:      3.0.4
MPICH Release date: unreleased development copy
MPICH Device:       ch3:nemesis
MPICH configure:    --prefix=/home/balaji/software/mpich/build/install --disable-mpe --disable-romio --enable-g=all --enable-spawn --enable-maintainer-mode --with-pm=hydra --with-pmi=simple --enable-cxx --enable-f77 --enable-fc --disable-strict --disable-fast CC=gcc CXX=g++ F77=gfortran FC=gfortran
MPICH CC:   gcc    -g
MPICH CXX:  g++   -g
MPICH F77:  gfortran   -g
MPICH FC:   gfortran   -g
[balay@wii ~/junk]$ ~/petsc.test/arch-freebsd9-c-debug/bin/mpichversion        
MPICH Version:      3.0.4
MPICH Release date: Mon Apr 29 21:00:42 CDT 2013
MPICH Device:       ch3:sock
MPICH configure:    --prefix=/usr/home/balay/petsc.test/arch-freebsd9-c-debug CC=gcc CFLAGS= -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g3 -fno-inline -O0 CXX=g++ CXXFLAGS= -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g -fPIC FC=gfortran FCFLAGS= -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g F77=gfortran FFLAGS= -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g --enable-shared --enable-sharedlibs=gcc --with-device=ch3:sock --without-mpe --with-pm=hydra --enable-g=meminit --enable-fast
MPICH CC:   gcc  -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g3 -fno-inline -O0    -DNDEBUG -DNVALGRIND -O2
MPICH CXX:  g++  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g   -fPIC    -DNDEBUG -DNVALGRIND -O2
MPICH F77:  gfortran  -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g    -O2
MPICH FC:   gfortran  -fPIC -Wall -Wno-unused-variable -Wno-unused-dummy-argument -g    -O2
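
Comparing the two "MPICH configure:" lines by eye is error-prone, so here is a minimal Python sketch that diffs the `--option` tokens of two mpichversion outputs. The helper name `configure_options` and the abbreviated sample strings are made up for this illustration (the samples are shortened from the output above), and the sketch only looks at tokens starting with `--`, so compiler flags such as CFLAGS are ignored.

```python
def configure_options(mpichversion_output):
    """Return the set of '--option[=value]' tokens on the 'MPICH configure:' line."""
    for line in mpichversion_output.splitlines():
        if line.startswith("MPICH configure:"):
            # Split on the first colon only: values such as ch3:sock contain colons.
            args = line.split(":", 1)[1].split()
            return {arg for arg in args if arg.startswith("--")}
    return set()


# Abbreviated samples based on the two builds compared in this ticket.
WORKING = """\
MPICH Version:      3.0.4
MPICH Device:       ch3:nemesis
MPICH configure:    --disable-mpe --disable-romio --enable-g=all --enable-spawn --with-pm=hydra --disable-fast CC=gcc
"""

FAILING = """\
MPICH Version:      3.0.4
MPICH Device:       ch3:sock
MPICH configure:    CC=gcc CFLAGS= -fPIC -Wall -g3 -O0 --enable-shared --with-device=ch3:sock --with-pm=hydra --enable-g=meminit --enable-fast
"""

only_working = sorted(configure_options(WORKING) - configure_options(FAILING))
only_failing = sorted(configure_options(FAILING) - configure_options(WORKING))

print("only in working build:", only_working)
print("only in failing build:", only_failing)
```

In practice the two strings would come from running each install's bin/mpichversion and capturing its output. Options shared by both builds (here `--with-pm=hydra`) drop out, which leaves a short list of candidates to toggle one at a time.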

mpichbot commented 8 years ago

Originally by balaji on 2013-04-30 08:10:29 -0500


I had tried your configure arguments yesterday. I can try them again later today.

mpichbot commented 8 years ago

Originally by balay on 2013-04-30 13:46:17 -0500


Replying to balaji:

I had tried your configure arguments yesterday. I can try them again later today.

A basic configure [with only '--prefix' option] is able to reproduce this problem.

I now see the difference is due to the option '--enable-g=all'. If I build mpich with it, I don't see the problem.

mpichbot commented 8 years ago

Originally by balaji on 2013-04-30 21:34:22 -0500


Thanks. I'm able to reproduce it now. I've tracked down the bug; it had nothing to do with FreeBSD, but somehow only got triggered on that platform and only for 2 processes, for some reason. I've committed a patch for it in [3adb59cb].

mpichbot commented 8 years ago

Originally by balaji on 2013-04-30 21:49:31 -0500


Leaving this ticket open for the VPATH and strict build problems mentioned above.

mpichbot commented 8 years ago

Originally by balay on 2013-05-01 14:16:44 -0500


Replying to balaji:

Thanks. I'm able to reproduce it now. I've tracked down the bug; it had nothing to do with FreeBSD, but somehow only got triggered on that platform and only for 2 processes, for some reason. I've committed a patch for it in [3adb59cb].

Great! mpich-master-v3.0.4-106-g3adb59c.tar.gz works now. Thanks!

mpichbot commented 8 years ago

Originally by balaji on 2013-05-03 13:22:54 -0500


Thanks. The VPATH fix has been committed in [f65916e1]. This ticket now only tracks the strict build issue.