pmodels / mpich

Official MPICH Repository
http://www.mpich.org

critical section macro initialization is incorrect #1900

Closed mpichbot closed 8 years ago

mpichbot commented 8 years ago

Originally by robl on 2013-07-15 08:56:03 -0500


Building stock MPICH with memory debugging results in a seg fault

Steps to reproduce:

 qsub -t 15 -n 8 -A SSSPP --env=BG_COREDUMPBINARY=0 ./coll_perf -fname blah

Program will fail with signal 11 and provide the following backtrace:

Program terminated with signal 11, Segmentation fault.
#0  __l2_load_op (str=0x1481460 "MPI_THREAD_SINGLE", lineno=726,
    fname=0x14ab180 "/home/robl/src/mpich/src/util/param/param_vals.c")
    at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:109
109         return *__l2_op_ptr(ptr, op);
(gdb) up
#1  L2_AtomicLoadIncrementBounded (str=0x1481460 "MPI_THREAD_SINGLE",
lineno=726,
    fname=0x14ab180 "/home/robl/src/mpich/src/util/param/param_vals.c")
    at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:186
186         return __l2_load_op(ptr,
L2_ATOMIC_OPCODE_LOAD_INCREMENT_BOUNDED);
(gdb) up
#2  MPIDI_Mutex_acquire (str=0x1481460 "MPI_THREAD_SINGLE", lineno=726,
    fname=0x14ab180 "/home/robl/src/mpich/src/util/param/param_vals.c")
    at /home/robl/src/mpich/src/mpid/pamid/include/mpidi_mutex.h:112
112         rc = L2_AtomicLoadIncrementBounded(&mutex->counter);
(gdb) up
#3  MPIU_trstrdup (str=0x1481460 "MPI_THREAD_SINGLE", lineno=726,
    fname=0x14ab180 "/home/robl/src/mpich/src/util/param/param_vals.c")
    at /home/robl/src/mpich/src/util/mem/trmem.c:99
99          MPIU_THREAD_CS_ENTER(MEMALLOC,);
(gdb) up
#4  0x0000000001036b1c in MPIR_Param_init_params ()
    at /home/robl/src/mpich/src/util/param/param_vals.c:726
726             MPIR_PARAM_DEFAULT_THREAD_LEVEL = MPIU_Strdup(tmp_str);
(gdb) up
#5  0x0000000001010494 in PMPI_Init (argc=0x1dbfffbad0, argv=0x1dbfffbad8)
    at /home/robl/src/mpich/src/mpi/init/init.c:100
100         mpi_errno = MPIR_Param_init_params();
(gdb) up
#6  0x0000000001000ac4 in main (argc=3, argv=0x1dbfffbec8)
    at /home/robl/src/mpich/src/mpi/romio/test/coll_perf.c:33
33          MPI_Init(&argc,&argv);
mpichbot commented 8 years ago

Originally by blocksom on 2013-07-15 09:28:25 -0500


Rob, could you post the entire configure command you used? Thanks!

Also, is this a blocker for mpich v3.1?

mpichbot commented 8 years ago

Originally by robl on 2013-07-15 09:32:52 -0500


Using the guidance here: https://wiki.mpich.org/mpich/index.php/BGQ

(so with the gcc-based bgq toolchain in path)

/home/robl/src/mpich/configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --with-atomic-primitives \
--enable-handle-allocation=tls --enable-refcount=lock-free \
--disable-predefined-refcount --prefix=/home/robl/soft/mpich-bgq-dbg \
--enable-g=all

This is not a blocker for 3.1 -- it's just me trying to find ways to replace valgrind.

mpichbot commented 8 years ago

Originally by blocksom on 2013-07-16 10:42:51 -0500


I configure'd like this:

../configure --host=powerpc64-bgq-linux --with-device=pamid --with-file-system=bg+bglockless --prefix=/bghome/blocksom/development/mpich/install --enable-g=all

and the coll_perf test failed, but due to a different error:

10:40:34 /bgusr/blocksom/tmp$ runjob --block R00-M0-N10 --np 4 --label --envs BG_COREDUMPONEXIT=1 : coll_perf -fname blah 2> runjob.out
stdout[0]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8
stdout[2]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8
stdout[3]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8
stdout[1]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8

10:40:42 /bgusr/blocksom/tmp$ grep std....0 runjob.out | head
stderr[0]: Global array size 128 x 128 x 128 integers
stderr[0]: Collective write time = 0.152554 sec, Collective write bandwidth = 52.440551 Mbytes/sec
stderr[0]: Collective read time = 0.036090 sec, Collective read bandwidth = 221.670332 Mbytes/sec
stderr[0]: [0] 16 at [0x00000019c58b0a38], [2] 16 at [0x00000019c58b0a38], [3] 16 at [0x00000019c58b0a38], [1] 16 at [0x00000019c58b0a38], ../../../src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c[103]
stderr[0]: [0] 80 at [0x00000019c582f958], [2] 80 at [0x00000019c582f958], [3] 80 at [0x00000019c582f958], [1] 80 at [0x00000019c582f958], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[0]: [0] 8 at [0x00000019c582f898], [2] 8 at [0x00000019c582f898], [3] 8 at [0x00000019c582f898], [1] 8 at [0x00000019c582f898], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]
stderr[0]: [0] 80 at [0x00000019c582f798], [2] 80 at [0x00000019c582f798], [3] 80 at [0x00000019c582f798], [1] 80 at [0x00000019c582f798], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[622]
stderr[0]: [0] 8 at [0x00000019c582f6d8], [2] 8 at [0x00000019c582f6d8], [3] 8 at [0x00000019c582f6d8], [1] 8 at [0x00000019c582f6d8], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[620]
stderr[0]: [0] 80 at [0x00000019c582f5d8], [2] 80 at [0x00000019c582f5d8], [3] 80 at [0x00000019c582f5d8], [1] 80 at [0x00000019c582f5d8], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[0]: [0] 8 at [0x00000019c582f518], [2] 8 at [0x00000019c582f518], [3] 8 at [0x00000019c582f518], [1] 8 at [0x00000019c582f518], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]

10:41:12 /bgusr/blocksom/tmp$ grep std....1 runjob.out | head
stderr[1]: ../../../src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c[103]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[622]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[620]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[622]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[620]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]

I'll try your specific configure next...

mpichbot commented 8 years ago

Originally by robl on 2013-07-16 10:54:44 -0500


Excellent. I guess it was the more aggressive '--with-atomic-primitives' and friends?

coll_perf did not fail. That spew you see is the MPICH memory allocation debugging telling us about memory and resource leaks. ad_bg_aggrs.c has one at line 103, for example:

 100     if(s > aggrsInPsetSize)
 101     {
 102       if(aggrsInPset) ADIOI_Free(aggrsInPset);
 103       aggrsInPset   = (int *) ADIOI_Malloc (s *sizeof(int));
 104       aggrsInPsetSize = s;
 105     }

I doubt it's a huge contributor to overall memory usage, but it's something we would have found with valgrind (if we had it) that we now know about!
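
For reference, here is a minimal, self-contained C sketch of the pattern being flagged (illustrative only; this is not the actual ROMIO code or its eventual fix, and the names are stand-ins): a cached buffer is grown on demand but never released, so a tracing allocator reports it at exit, and freeing it from a cleanup hook would silence the report.

    #include <stdlib.h>

    /* Hypothetical stand-ins for the statics in ad_bg_aggrs.c. */
    static int *aggrsInPset = NULL;
    static int  aggrsInPsetSize = 0;

    /* Grow the cached buffer on demand.  The old buffer is freed, but the
     * last one lives until process exit -- which is exactly what the
     * memory-debugging output above reports. */
    static void grow_cache(int s)
    {
        if (s > aggrsInPsetSize) {
            free(aggrsInPset);
            aggrsInPset = malloc(s * sizeof(int));
            aggrsInPsetSize = s;
        }
    }

    /* Releasing the cache from a finalize/cleanup hook plugs the leak. */
    static void cache_cleanup(void)
    {
        free(aggrsInPset);
        aggrsInPset = NULL;
        aggrsInPsetSize = 0;
    }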

mpichbot commented 8 years ago

Originally by blocksom on 2013-07-16 11:08:13 -0500


With this configure:

../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --with-atomic-primitives \
--enable-handle-allocation=tls --enable-refcount=lock-free \
--disable-predefined-refcount \
--prefix=/bghome/blocksom/development/mpich/install \
--enable-g=all

I get these warnings - which seem bad.

11:04:32 {ticket-1900} ~/development/mpich/build/src/mpi/romio/test$ make
cd .. && /bin/sh ./config.status test/fperf.f
config.status: creating test/fperf.f
  F77      fperf.o
  F77LD    fperf
/bgsys/drivers/toolchain/V1R2M1_base/gnu-linux/lib/gcc/powerpc64-bgq-linux/4.4.6/../../../../powerpc64-bgq-linux/bin/ld: Warning: size of symbol `mpifcmb1_' changed from 20 in fperf.o to 40 in /bghome/blocksom/development/mpich/install/lib/libmpich.a(setbot.o)
/bgsys/drivers/toolchain/V1R2M1_base/gnu-linux/lib/gcc/powerpc64-bgq-linux/4.4.6/../../../../powerpc64-bgq-linux/bin/ld: Warning: size of symbol `mpifcmb2_' changed from 20 in fperf.o to 40 in /bghome/blocksom/development/mpich/install/lib/libmpich.a(setbot.o)
mpichbot commented 8 years ago

Originally by robl on 2013-07-16 11:17:01 -0500


If I had to guess, I'd say there were some stale objects in the romio/test directory? romio/test/Makefile is not nearly as clever about dependencies as the rest of MPICH.

mpichbot commented 8 years ago

Originally by blocksom on 2013-07-16 11:26:46 -0500


Replying to robl:

If I had to guess, I'd say there were some stale objects in the romio/test directory? romio/test/Makefile is not nearly as clever about dependencies as the rest of MPICH.

Nope .. this was from a "pristine" configure, make && make install, romio tests build.

mpichbot commented 8 years ago

Originally by blocksom on 2013-07-16 11:42:43 -0500


Ugh. If I configure like this (without --with-atomic-primitives) ...

../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --enable-handle-allocation=tls \
--enable-refcount=lock-free --disable-predefined-refcount \
--prefix=/bghome/blocksom/development/mpich/install \
--enable-g=all

I get a branch-to-NULL in some parameter checking code (src/util/param/param_vals.c:726):

11:35:34 /bgusr/blocksom/tmp$ tail -n16 core.2 
+++STACK
Frame Address     Saved Link Reg
0000001dbfffb6c0  0000000000000000
0000001dbfffb740  000000000103707c
0000001dbfffb840  00000000010103e4
0000001dbfffb900  0000000001000590
0000001dbfffbaa0  00000000013fab48
0000001dbfffbd80  00000000013fae44
0000001dbfffbe40  0000000000000000
---STACK
Interrupt Summary:
  System Calls 68
  External Input Interrupts 1
  Data TLB Interrupts 1
---ID
---LCB

11:35:42 /bgusr/blocksom/tmp$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-addr2line -ipfe coll_perf 
000000000103707c
MPIR_Param_init_params at /bghome/blocksom/development/mpich/build/../src/util/param/param_vals.c:726

11:39:55 /bgusr/blocksom/tmp$ nl -ba -s ': ' /bghome/blocksom/development/mpich/build/../src/util/param/param_vals.c | sed -n 720,730p
   720:     MPIR_PARAM_GET_DEFAULT_STRING(DEFAULT_THREAD_LEVEL, &tmp_str);
   721:     rc = MPL_env2str("MPICH_DEFAULT_THREAD_LEVEL", &tmp_str);
   722:     MPIU_ERR_CHKANDJUMP1((-1 == rc),mpi_errno,MPI_ERR_OTHER,"**envvarparse","**envvarparse %s","MPICH_DEFAULT_THREAD_LEVEL");
   723:     rc = MPL_env2str("MPIR_PARAM_DEFAULT_THREAD_LEVEL", &tmp_str);
   724:     MPIU_ERR_CHKANDJUMP1((-1 == rc),mpi_errno,MPI_ERR_OTHER,"**envvarparse","**envvarparse %s","MPIR_PARAM_DEFAULT_THREAD_LEVEL");
   725:     if (tmp_str != NULL) {
   726:         MPIR_PARAM_DEFAULT_THREAD_LEVEL = MPIU_Strdup(tmp_str);
   727:         MPIR_Param_assert(MPIR_PARAM_DEFAULT_THREAD_LEVEL);
   728:         if (MPIR_PARAM_DEFAULT_THREAD_LEVEL == NULL) {
   729:             MPIU_CHKMEM_SETERR(mpi_errno, strlen(tmp_str), "dup of string for MPIR_PARAM_DEFAULT_THREAD_LEVEL");
   730:             goto fn_fail;
mpichbot commented 8 years ago

Originally by blocksom on 2013-07-16 11:58:16 -0500


... and I get the same segfault at the same location with the --with-atomic-primitives configure option. Why can't I reproduce your runtime error?

mpichbot commented 8 years ago

Originally by robl on 2013-07-16 12:27:41 -0500


OK, I pulled 2de997d, regenerated configure, and re-configured like you, from a brand new build directory:

../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --enable-handle-allocation=tls \
--enable-refcount=lock-free --disable-predefined-refcount \
--prefix=${HOME}/soft/mpich-dbg --enable-g=all && make -j -l 16 && make install

I run coll_perf with 8 nodes and now I think you and I are back in sync:

% coreprocessor.pl -nogui -c=. -b=./coll_perf   
Reading disassembly for ./coll_perf
filename: ./core.5   (filecount 1)
done reading cores
filename: ./core.3   (filecount 2)
done reading cores
filename: ./core.6   (filecount 3)
done reading cores
filename: ./core.1   (filecount 4)
done reading cores
filename: ./core.7   (filecount 5)
done reading cores
filename: ./core.2   (filecount 6)
done reading cores
filename: ./core.0   (filecount 7)
done reading cores
filename: ./core.4   (filecount 8)
done reading cores
0 :Node (9)
1 :    <traceback not fetched> (1)
1 :    0000000000000000 (8)
2 :        .__libc_start_main (8)
3 :            .generic_start_main (8)
4 :                .main (8)
5 :                    .PMPI_Init (8)
6 :                        .MPIR_Param_init_params (8)
7 :                            0000000000000000 (8)
% bgq_stack coll_perf core.0   
------------------------------------------------------------------------
Program   : /gpfs/vesta-home/robl/src/mpich/build-1900/src/mpi/romio/test/./coll_perf
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN 

0000000000000000
??
??:0

0000000001036fac
MPIR_Param_init_params
/home/robl/src/mpich/build-1900/../src/util/param/param_vals.c:726

0000000001010924
PMPI_Init
/home/robl/src/mpich/build-1900/../src/mpi/init/init.c:100

0000000001000ad0
main
/home/robl/src/mpich/build-1900/src/mpi/romio/test/../../../../../src/mpi/romio/test/coll_perf.c:33

00000000013f6a28
generic_start_main
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

00000000013f6d24
__libc_start_main
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0
mpichbot commented 8 years ago

Originally by blocksom on 2013-07-18 12:12:51 -0500


I don't understand why this works for non-bgq platforms, but on bgq the root of the problem is that the MEMALLOC mutex is being used before it is initialized.

In MPI_Init(), the MPIR_Param_init_params() function is invoked, which calls MPIU_Strdup(); MPIU_Strdup is defined to be the function MPIU_trstrdup() only when the configure option --enable-g=all is used.

MPIU_trstrdup() uses the MEMALLOC critical section macros; however, the critical section mutexes are not yet initialized at this point.

$ nl -ba -s ': ' src/util/mem/trmem.c | sed -n 95,104p
    95: 
    96: void *MPIU_trstrdup(const char *str, int lineno, const char fname[])
    97: {
    98:     void *retval;
    99:     MPIU_THREAD_CS_ENTER(MEMALLOC,);
   100:     retval = MPL_trstrdup(str, lineno, fname);
   101:     MPIU_THREAD_CS_EXIT(MEMALLOC,);
   102:     return retval;
   103: }
   104: 

These mutexes are initialized only when MPI_Init() invokes MPIR_Init_thread() on line 121 of src/mpi/init/init.c (after the call to MPIR_Param_init_params) ...

$ nl -ba -s ': ' src/mpi/init/init.c | sed -n 98,123p
    98:     /* ... body of routine ... */
    99: 
   100:     mpi_errno = MPIR_Param_init_params();
   101:     if (mpi_errno) MPIU_ERR_POP(mpi_errno);
   102: 
   103:     if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_MULTIPLE"))
   104:         threadLevel = MPI_THREAD_MULTIPLE;
   105:     else if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_SERIALIZED"))
   106:         threadLevel = MPI_THREAD_SERIALIZED;
   107:     else if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_FUNNELED"))
   108:         threadLevel = MPI_THREAD_FUNNELED;
   109:     else if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_SINGLE"))
   110:         threadLevel = MPI_THREAD_SINGLE;
   111:     else {
   112:         MPIU_Error_printf("Unrecognized thread level %s\n", MPIR_PARAM_DEFAULT_THREAD_LEVEL);
   113:         exit(1);
   114:     }
   115: 
   116:     /* If the user requested for asynchronous progress, request for
   117:      * THREAD_MULTIPLE. */
   118:     if (MPIR_PARAM_ASYNC_PROGRESS)
   119:         threadLevel = MPI_THREAD_MULTIPLE;
   120: 
   121:     mpi_errno = MPIR_Init_thread( argc, argv, threadLevel, &provided );
   122:     if (mpi_errno != MPI_SUCCESS) goto fn_fail;
   123: 

... which then calls the macro MPIU_THREAD_CS_INIT on line 272 of the file src/mpi/init/initthread.c.

$ nl -ba -s ': ' src/mpi/init/initthread.c | sed -n 256,273p
   256: int MPIR_Init_thread(int * argc, char ***argv, int required, int * provided)
   257: {
   258:     int mpi_errno = MPI_SUCCESS;
   259:     int has_args;
   260:     int has_env;
   261:     int thread_provided;
   262:     int exit_init_cs_on_failure = 0;
   263:     MPID_Info *info_ptr;
   264: 
   265:     /* For any code in the device that wants to check for runtime 
   266:        decisions on the value of isThreaded, set a provisional
   267:        value here. We could let the MPID_Init routine override this */
   268: #ifdef HAVE_RUNTIME_THREADCHECK
   269:     MPIR_ThreadInfo.isThreaded = (required == MPI_THREAD_MULTIPLE);
   270: #endif
   271: 
   272:     MPIU_THREAD_CS_INIT;
   273: 

If I move the call to MPIU_THREAD_CS_INIT to line 99 of src/mpi/init/init.c then the romio coll_perf test will pass - but I don't think this is the correct way of fixing this problem, it's just a hack.
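
To make the ordering concrete, here is a toy, self-contained C model of the problem (this is not MPICH code; every name is a stand-in): the tracing strdup wrapper takes a mutex that only becomes usable after the critical-section init runs, and the hack described above amounts to running that init before the parameter code.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    static pthread_mutex_t memalloc_mutex;    /* stands in for the MEMALLOC mutex  */

    static void cs_init(void)                 /* stands in for MPIU_THREAD_CS_INIT */
    {
        pthread_mutex_init(&memalloc_mutex, NULL);
    }

    static char *tr_strdup(const char *s)     /* stands in for MPIU_trstrdup */
    {
        /* If cs_init() has not run yet, this locks an uninitialized mutex --
         * the analogue of the pamid L2-atomic crash in the backtraces above. */
        pthread_mutex_lock(&memalloc_mutex);
        char *copy = malloc(strlen(s) + 1);
        if (copy) strcpy(copy, s);
        pthread_mutex_unlock(&memalloc_mutex);
        return copy;
    }

    static char *param_init(void)             /* stands in for MPIR_Param_init_params */
    {
        return tr_strdup("MPI_THREAD_SINGLE");
    }

    int main(void)
    {
        cs_init();                    /* the "hack": set up the mutexes first ...    */
        char *level = param_init();   /* ... so the tracing allocator is safe to use */
        free(level);
        return 0;
    }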

mpichbot commented 8 years ago

Originally by blocksom on 2013-07-19 08:50:54 -0500


I don't think this is actually related to anything specific to bgq or pamid .. only that the bug is revealed with pamid:bgq.

I'm going to remove the ibm-integ keyword and would like to reassign to Pavan, but I don't have authority to do trac ticket assignments.

mpichbot commented 8 years ago

Originally by robl on 2013-07-19 09:08:50 -0500


Thanks for investigating, Mike.

mpichbot commented 8 years ago

Originally by balaji on 2014-01-01 20:56:25 -0600


Thanks for the analysis, Mike. It's very useful.

It looks like we need the cvar initialization to know what thread-level is required. But without knowing the thread-level, we don't have the right macros for cvar initialization. One possible option is to assume a thread-level for the cvar initialization, and then "fix it" based on the user-provided environment variables.

We could assume THREAD_FUNNELED, since multiple threads are not allowed to call MPI_INIT{_THREAD} anyway. Then we can use the environment information to upgrade it.

Alternatively, we could assume THREAD_MULTIPLE to be safe, and later downgrade it based on environment information.

There's a proposal for MPI-4 to force MPI_INIT{_THREAD} to always be thread-safe. The latter option might be better for that. I'll think about this some more and code this up tomorrow.
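
A rough sketch of that option (all function names here are hypothetical; this is not the eventual patch): assume a conservative thread level so the mutexes are usable, run the cvar initialization, and then adjust the level from the environment.

    #include <stdlib.h>
    #include <string.h>

    enum level { SINGLE, FUNNELED, SERIALIZED, MULTIPLE };

    /* Hypothetical stubs standing in for the real initialization steps. */
    static void init_thread_cs(enum level l)    { (void)l; /* set up mutexes         */ }
    static void init_cvars(void)                { /* may now malloc/strdup safely    */ }
    static void upgrade_thread_cs(enum level l) { (void)l; /* e.g. relax to MULTIPLE */ }

    static enum level requested_level_from_env(void)
    {
        const char *s = getenv("MPICH_DEFAULT_THREAD_LEVEL");
        return (s && !strcmp(s, "MPI_THREAD_MULTIPLE")) ? MULTIPLE : FUNNELED;
    }

    int main(void)
    {
        /* Assume FUNNELED: only one thread may call MPI_Init{_thread} anyway. */
        enum level assumed = FUNNELED;
        init_thread_cs(assumed);
        init_cvars();

        /* Now that the environment/cvars are readable, upgrade if needed. */
        enum level requested = requested_level_from_env();
        if (requested > assumed)
            upgrade_thread_cs(requested);
        return 0;
    }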

mpichbot commented 8 years ago

Originally by robl on 2014-01-02 15:17:38 -0600


I can't test Pavan's proposed change on Blue Gene (where I noticed the problem in the first place): when I configure mpich-review/ticket-1900-robl, I get the following error:

configure: ===== done with src/mpi/romio configure =====
configure: sourcing src/mpi/romio/localdefs
checking size of OPA_ptr_t... 0
checking the sizeof MPI_Offset... configure: WARNING: Unable to determine the size of MPI_Offset
unknown
checking whether the Fortran Offset type works with Fortran 77... yes
checking whether the Fortran Offset type works with Fortran 90... yes
configure: error: Unable to convert MPI_SIZEOF_OFFSET to a hex string.  This is either because we are building on a very strange platform or there is a bug somewhere in configure.
mpichbot commented 8 years ago

Originally by balaji on 2014-01-02 15:42:03 -0600


Robl: did you run autogen.sh again?

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-02 15:47:07 -0600


... and could you paste your configure command into the ticket?

mpichbot commented 8 years ago

Originally by robl on 2014-01-02 16:20:16 -0600


autogen was the first thing I thought of. No dice.

Configure command is same as earlier in this ticket (but that was 6 months ago so ...)

../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --with-atomic-primitives \
--enable-handle-allocation=tls --enable-refcount=lock-free \
--disable-predefined-refcount --prefix=/home/robl/soft/mpich-bgq-dbg \
--enable-g=all

No tweaks to CC or FC or anything like that. This is on one of the Mira login nodes.

mpichbot commented 8 years ago

Originally by robl on 2014-01-02 16:37:46 -0600


Oh, and that (standard compilers) was my problem. I forgot to put /bgsys/drivers/V1R2M0/ppc64/gnu-linux/bin in my path. Carrying on....

mpichbot commented 8 years ago

Originally by robl on 2014-01-03 13:27:21 -0600


I cannot ack this: I get a segfault super-early:

0 :Node (8)
1 :    <traceback not fetched> (1)
1 :    0000000000000000 (7)
2 :        .__libc_start_main (7)
3 :            .generic_start_main (7)
4 :                .main (7)
5 :                    .PMPI_Init (7)
6 :                        .MPIR_T_env_init (7)
7 :                            0000000000000000 (7)

I have a binary core file from Blue Gene that provides the following backtrace:

Program terminated with signal 11, Segmentation fault.
#0  __l2_load_op (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:109
109         return *__l2_op_ptr(ptr, op);
(gdb) where
#0  __l2_load_op (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:109
#1  L2_AtomicLoadIncrementBounded (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:186
#2  MPIDI_Mutex_acquire (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /home/robl/src/mpich/src/mpid/pamid/include/mpidi_mutex.h:112
#3  MPIU_trmalloc (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /home/robl/src/mpich/src/util/mem/trmem.c:28
#4  0x0000000001013a7c in MPIR_T_enum_env_init ()
    at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:31
#5  MPIR_T_env_init () at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:72
#6  0x0000000001010bc4 in PMPI_Init (argc=0x1dbfffbad0, argv=0x1dbfffbad8)
    at /home/robl/src/mpich/src/mpi/init/init.c:145
#7  0x0000000001000ac4 in main (argc=3, argv=0x1dbfffbec8)
    at /home/robl/src/mpich/src/mpi/romio/test/coll_perf.c:33

Need more information?

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 00:06:59 -0600


I've pushed a different patch to mpich-review/ticket-1900-robl. It also includes the win-req patch that I sent out to IBM. Can you try it out and let me know if it works correctly for you? It would be great if someone from IBM could take a look at it and sign off too, since it affects pamid as well.

Note that this patch only fixes the runtime thread-safety code (--enable-threads=runtime), and not cases where runtime thread-safety is not enabled. I'm not sure there's a good way to fix this when runtime thread-safety is not enabled, since we always need to check at runtime at least if the mutexes have been initialized or not.

For a complete fix, we might need to completely get rid of the ability to disable runtime checks. That is, we give these options for --enable-threads={single,funneled,serialized,multiple,device}. The "device" option will be default. The internal runtime check will always happen. The configure option will only behave as if the user initialized with that thread-safety level. This way, the MPI_INIT{_THREAD} part will always occur in MPI_THREAD_FUNNELED mode, irrespective of what the user requested. Sometime during INIT, we'll "upgrade" the thread-level to MULTIPLE if needed.

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-07 08:31:30 -0600


I've reviewed these commits and I think it will work, but there will be a performance degradation because of the additional branches in the point-to-point critical path - both on the send and recv side.

Can this code be re-worked so that the "safe" critical section macros (the ones with the branch) are used only during initialization, but then we can continue to use the original critical section macros after initialization in the performance-critical code path?

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 10:02:53 -0600


Mike: would there really be a performance degradation with this on any modern processor? Would an "unlikely" attribute help? I can't imagine the branch adding too much performance degradation compared to the mutexes and memory barriers the code is doing.

There are two reasons I did it this way:

  1. I was trying to avoid having two sets of macros, which is cumbersome to maintain.
  2. Without the branch, you'll always do the locks if someone configured MPI with THREAD_MULTIPLE support, even if the application didn't request THREAD_MULTIPLE. That is, you'll need separate MULTIPLE and non-MULTIPLE MPI libraries.
mpichbot commented 8 years ago

Originally by robl on 2014-01-07 10:49:20 -0600


This branch runs on Blue Gene now, though it's still reporting leaked resources (and will until #1947 is closed). I've signed off on 7db1a818 (HEAD, 1900-robl-so) "We always need the runtime thread-safety check", but I don't know the right invocation for adding a sign-off on older commits. Pushed to mpich-review/ticket-1900-robl-so.

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-07 11:04:32 -0600


Replying to balaji:

Mike: would there really be a performance degradation with this on any modern processor? Would an "unlikely" attribute help? I can't imagine the branch adding too much performance degradation compared to the mutexes and memory barriers the code is doing.

Unfortunately, my personal experience working with the A2 processor on BG/Q is that branches do make a big performance difference. The unlikely attribute helps, but not much. We spent a ridiculous amount of time reworking the code to remove branches and msyncs so that we could meet the performance requirements for Mira.

That said ... The supported BG/Q product will not move to the master branch any time soon, if at all, so this change wouldn't impact the product. Of course, your colleagues who are interested in MPI3 on BG/Q have no choice but to build their own mpich library from the master branch. The bgq code on the master branch isn't stable enough for me to run any performance comparisons, but I am expecting the performance to be worse than the product branch.

If there _isn't_ an easy/obvious way to rework the code to remove this BG/Q performance penalty, then I guess we will have to live with it and hope that the next exascale machine has a better branch predictor and icache than the A2. However, overly branchy code is bad - regardless of the processor - and I would recommend that mpich strive to remove branches wherever possible. This was one of the reasons that the BG/L team decided to make a new mpich device rather than extending the existing ch3 device. This may not strictly be the case anymore, but that was one of the justifications given for the decision way back when.

At the minimum I'd like to see the unlikely attribute added, though. Can we do that?

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 11:33:58 -0600


I'll rework it to add unlikely.

I'm not doubting that branches cause performance degradation. I'm just comparing it against mutexes and memory barriers, since these branches only occur when the code has been configured to always run in THREAD_MULTIPLE. In other words, what we have right now is: always do mutexes and memory barriers (even for single-threaded applications when the library is configured this way); what I changed it to is: check a flag before doing mutexes and memory barriers.
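
For illustration, a minimal sketch of that flag-guarded pattern (the macro names and bodies are hypothetical, not the exact MPICH macros): the lock is taken only when the runtime flag says the process is threaded, and a branch-prediction hint biases the check toward whichever mode the build is expected to run in.

    #include <pthread.h>

    /* GCC/Clang-style branch hints; whether "likely" or "unlikely" is the right
     * choice depends on whether THREAD_MULTIPLE is the common case. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int isThreaded = 0;    /* set for real during MPI_Init{_thread} */
    static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Hypothetical flag-guarded critical-section macros. */
    #define CS_ENTER() do { if (likely(isThreaded)) pthread_mutex_lock(&cs_mutex);   } while (0)
    #define CS_EXIT()  do { if (likely(isThreaded)) pthread_mutex_unlock(&cs_mutex); } while (0)

    static int shared_counter = 0;

    static void bump(void)
    {
        CS_ENTER();           /* no lock (or barrier) before init or when        */
        shared_counter++;     /* single-threaded; a real lock once isThreaded set */
        CS_EXIT();
    }

    int main(void)
    {
        bump();               /* pre-init / single-threaded: skips the mutex */
        isThreaded = 1;
        bump();               /* threaded mode: takes the mutex              */
        return 0;
    }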

For completeness, if we considered the two-sets-of-macros approach, how would we do that without another branch? Wouldn't we anyway need a flag to tell us whether we are initialized or not?

As a secondary issue, I'm curious why this code path is being taken for coll_perf, since it doesn't initialize MPI in THREAD_MULTIPLE mode. Are you forcing THREAD_MULTIPLE on all applications?

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 11:42:00 -0600


Replying to robl:

This branch runs on Blue Gene now, though it's still reporting leaked resources (and will until #1947 is closed). I've signed off on 7db1a818 (HEAD, 1900-robl-so) "We always need the runtime thread-safety check", but I don't know the right invocation for adding a sign-off on older commits. Pushed to mpich-review/ticket-1900-robl-so.

You can cherry-pick them to a different branch with the --signoff flag, and then push this new branch back as the signedoff branch.

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-07 11:50:33 -0600


Replying to balaji:

As a secondary issue, I'm curious why this code path is being taken for coll_perf, since it doesn't initialize MPI in THREAD_MULTIPLE mode. Are you forcing THREAD_MULTIPLE on all applications?

Hmmmm .. I'd have to look closer at the code. The product installs two builds of the code - one uses "per object" locking. In that mode we have to promote the thread level to multiple because of commthreads, etc., even if the user only invokes MPI_Init(). We don't necessarily do that for the libraries built with the "global lock" mode.

If I had time to rewrite the pamid device, I would register a different pami callback function for each permutation of the environment so that the recv code path had minimal branches. I'd try to do the same on the send side. You give up some inlining when using function pointers, but if you do it right you can eliminate lots of branches.
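
As a sketch of that function-pointer idea (hypothetical names; this is not pamid code): pick the receive handler once, at initialization, for the current threading mode, so the per-message hot path carries an indirect call instead of a mode branch.

    /* Two variants of a hypothetical receive handler. */
    static void recv_single(void *msg)   { (void)msg; /* no locking        */ }
    static void recv_threaded(void *msg) { (void)msg; /* lock, msync, etc. */ }

    /* Selected once, at init, for the current environment/thread level. */
    static void (*recv_cb)(void *) = recv_single;

    static void init_callbacks(int is_threaded)
    {
        recv_cb = is_threaded ? recv_threaded : recv_single;
    }

    int main(void)
    {
        init_callbacks(1);
        recv_cb((void *)0);   /* hot path: indirect call, no mode branch */
        return 0;
    }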

I'm fine with the code now. If we have performance problems in the future then we can tackle this issue again later.

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 12:03:14 -0600


OK. I did build it with per-object, so that makes sense.

I've pushed the updated patches to mpich-review/ticket-1900-blocksome. It contains the previous patches in this branch too, one of which was signedoff by Robl.

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 12:08:54 -0600


Btw, note that with unlikely, we prioritized non-THREAD_MULTIPLE codes. But that doesn't sound like what you need. For per-object, you probably want to prioritize THREAD_MULTIPLE codes since all programs are promoted to THREAD_MULTIPLE?

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-07 13:50:08 -0600


Replying to balaji:

Btw, note that with unlikely, we prioritized non-THREAD_MULTIPLE codes. But that doesn't sound like what you need. For per-object, you probably want to prioritize THREAD_MULTIPLE codes since all programs are promoted to THREAD_MULTIPLE?

Correct.

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 13:51:30 -0600


OK. I'll change it to likely. Also, there were some errors in the previous patch. The isThreaded variable should only come into play when the library is configured with MULTIPLE support. I'll push a new set of patches in.

mpichbot commented 8 years ago

Originally by balaji on 2014-01-07 15:24:01 -0600


I've pushed new patches into mpich-review/ticket-1900-blocksome.

Rob/Mike: can you guys review this branch?

Rob: I've removed your previous signoff, since there were several changes to the patch after you looked at it.

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-08 10:24:01 -0600


Sorry, but I have another question here ..

Since it is only the MEMALLOC mutex that is being used by the string functions, could we just add this if (likely(MPIR_ThreadInfo.isThreaded)) check to the MPIU_THREAD_CS_MEMALLOC_ENTER and MPIU_THREAD_CS_MEMALLOC_EXIT macros and leave the others alone? Or could the checks in the other macros only be enabled if mpich is configured with "extra runtime checks" (whatever that configure option is)?

mpichbot commented 8 years ago

Originally by balaji on 2014-01-08 14:16:50 -0600


That seems like a hacky solution. For now, we are only using memory management routines outside of the critical section, but in the future we might use more.

Furthermore, MPIU_THREAD_CS_MEMALLOC_{ENTER,EXIT} can be device-defined. The default implementation of MPIU_THREAD_CS_ENTER for MEMALLOC maps to MPIU_THREAD_CS_ENTER_MEMALLOC. But devices can map them to something else. For example, pamid maps it to MPIU_THREAD_CS_MEMALLOC_ENTER (note the reordering of words).
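
Roughly, the indirection looks like this (a simplified sketch with hypothetical macro bodies, not the real definitions): the generic enter macro expands to a per-name macro, and a device is free to supply its own definition of that per-name macro before the default one is seen.

    /* Default mapping: the generic macro pastes the name onto the end ... */
    #define MPIU_THREAD_CS_ENTER(name, ctx) MPIU_THREAD_CS_ENTER_##name(ctx)

    /* ... and the default per-name body takes the usual lock, unless a device
     * (e.g. pamid) has already defined it to route to its own implementation. */
    #ifndef MPIU_THREAD_CS_ENTER_MEMALLOC
    #define MPIU_THREAD_CS_ENTER_MEMALLOC(ctx) /* lock the MEMALLOC mutex */
    #endif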

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-08 14:19:56 -0600


Well .. ok. I still don't think that the current solution is perfect. But we can work to refine this later in 3.1.1 or 3.2 or whenever.

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-08 14:44:00 -0600


I've signed-off on the commits and pushed everything to the mpich-ibm/ticket-1900-blocksome-signed branch.

It would be good to get a sign-off from Rob Latham as well.

mpichbot commented 8 years ago

Originally by balaji on 2014-01-08 14:53:01 -0600


Replying to blocksom:

Well .. ok. I still don't think that the current solution is perfect. But we can work to refine this later in 3.1.1 or 3.2 or whenever.

Alright. If you can actually show me a noticeable performance difference caused by this branch in the THREAD_MULTIPLE case, I'll buy you a beer. :-)

mpichbot commented 8 years ago

Originally by blocksom on 2014-01-08 14:59:18 -0600


Replying to balaji:

Replying to blocksom:

Well .. ok. I still don't think that the current solution is perfect. But we can work to refine this later in 3.1.1 or 3.2 or whenever.

Alright. If you can actually show me a noticeable performance difference caused by this branch in the THREAD_MULTIPLE case, I'll buy you a beer. :-)

I dunno .. that sounds suspiciously like work. How about we buy each other beers at the next Forum meeting instead? That would be a bit more fun. :)

mpichbot commented 8 years ago

Originally by robl on 2014-01-08 16:26:14 -0600


Sigh. NACK. This backtrace looks so familiar that I had to double-check that it was not identical to last week's. Hm. It is identical. I pulled from mpich-ibm/ticket-1900-blocksome-signed.

(gdb) where
#0  __l2_load_op (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:109
#1  L2_AtomicLoadIncrementBounded (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:186
#2  MPIDI_Mutex_acquire (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /home/robl/src/mpich/src/mpid/pamid/include/mpidi_mutex.h:112
#3  MPIU_trmalloc (a=24, lineno=31, 
    fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
    at /home/robl/src/mpich/src/util/mem/trmem.c:28
#4  0x0000000001013a7c in MPIR_T_enum_env_init ()
    at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:31
#5  MPIR_T_env_init () at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:72
#6  0x0000000001010bc4 in PMPI_Init (argc=0x1dbfffbad0, argv=0x1dbfffbad8)
    at /home/robl/src/mpich/src/mpi/init/init.c:145
#7  0x0000000001000ac4 in main (argc=3, argv=0x1dbfffbec8)
    at /home/robl/src/mpich/src/mpi/romio/test/coll_perf.c:33
mpichbot commented 8 years ago

Originally by balaji on 2014-01-08 22:04:34 -0600


Hmm.. Weird.

I configured it this way:

../mpich/configure --host=powerpc64-bgq-linux --with-device=pamid \
    --with-file-system=bg+bglockless --disable-wrapper-rpath \
    --enable-thread-cs=per-object --with-atomic-primitives \
    --enable-handle-allocation=tls --enable-refcount=lock-free \
    --disable-predefined-refcount \
    --prefix=/home/balaji/software/mpich/build/install --enable-g=all

And ran it this way:

qsub -t 15 -n 8 --env=BG_COREDUMPBINARY=0 ./coll_perf -fname blah

The output file had some stuff:

leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated

The error file had some stuff:

[...snipped...]
Global array size 128 x 128 x 128 integers
Collective write time = 0.017283 sec, Collective write bandwidth = 462.889197 Mbytes/sec
Collective read time = 0.023002 sec, Collective read bandwidth = 347.789777 Mbytes/sec
[0] 504 at [0x00000019c583e858], [1] 504 at [0x00000019c583e858], [3] 504 at [0x00000019c583e858], [6] 504 at [0x00000019c583e858], [7] 504 at [0x00000019c583e858], [2] 504 at [0x00000019c583e858], ../mpich/src/mpi/comm/commutil.c[281]
[4] 504 at [0x00000019c583e858], ../mpich/src/mpi/comm/commutil.c[281]
[5] 504 at [0x00000019c583e858], ../mpich/src/mpi/comm/commutil.c[281]
[...snipped...]

Seems to be as expected (except for the bandwidth output going into the error file).

What am I missing?

mpichbot commented 8 years ago

Originally by robl on 2014-01-10 15:36:42 -0600


Thanks for trying it out, Pavan. Today, no segfaults, so you get my signoff.

Patches pushed to mpich-ibm/ticket-1900-blocksome-and-robl-signed -- aw yeah, robl loves the long branch names.

mpichbot commented 8 years ago

Originally by Pavan Balaji balaji@mcs.anl.gov on 2014-01-10 22:36:57 -0600


In 99485af86e826a946fe038a77f0b98244526dc4d: Thread critical-section initialization fixes.

This patch has two related parts.

  1. We initialize the isThreaded runtime check to FALSE before doing the cvar checks. This is because the cvar initialization uses memory allocation and string duping, which internally use mutexes. But the mutexes are not initialized yet, so disabling isThreaded would make sure we don't use them.
  2. Updates to the pamid device to respect the isThreaded variable.

This is needed since we were initializing cvars before setting the thread-level, so the thread-safety macros are not initialized at that point. See #1900.

Signed-off-by: Michael Blocksome <blocksom@us.ibm.com>
Signed-off-by: Rob Latham <robl@mcs.anl.gov>

mpichbot commented 8 years ago

Originally by Pavan Balaji balaji@mcs.anl.gov on 2014-01-10 22:36:58 -0600


In 1a0c3dbc9d8055314613016d5c64f801d9b87219: We always need the runtime thread-safety check.

Disabling runtime check for thread-safety is not a correct option. This would cause memory allocation and string manipulation routines to be unusable before the thread-safety level is set. Fixes #1900.

Signed-off-by: Michael Blocksome <blocksom@us.ibm.com>
Signed-off-by: Rob Latham <robl@mcs.anl.gov>

mpichbot commented 8 years ago

Originally by Pavan Balaji balaji@mcs.anl.gov on 2014-01-10 22:36:58 -0600


In 2ce5a7e2786fd07ffb13dc4650fc231b7e9ef829: Prioritize thread-multiple branch.

Give a "likely" attribute to the branch that checks if an MPI process is initialized in THREAD_MULTIPLE or not. This should reduce the cost of the branch for applications that require THREAD_MULTIPLE.

The per-object mode in the pamid branch is intended for the case where comm-threads are used. In this case, the application is mostly in THREAD_MULTIPLE mode, even if it didn't explicitly initialize it as such. The only time it'll not be in that mode is during initialization before the appropriate mutexes have been setup.

See #1900.

Signed-off-by: Michael Blocksome <blocksom@us.ibm.com>
Signed-off-by: Rob Latham <robl@mcs.anl.gov>