Closed: mpichbot closed this issue 8 years ago
Originally by blocksom on 2013-07-15 09:28:25 -0500
Rob, could you post the entire configure command you used? Thanks!
Also, is this a blocker for mpich v3.1?
Originally by robl on 2013-07-15 09:32:52 -0500
Using the guidance here: https://wiki.mpich.org/mpich/index.php/BGQ
(so with the gcc-based bgq toolchain in path)
/home/robl/src/mpich/configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --with-atomic-primitives \
--enable-handle-allocation=tls --enable-refcount=lock-free \
--disable-predefined-refcount --prefix=/home/robl/soft/mpich-bgq-dbg \
--enable-g=all
This is not a blocker for 3.1 -- it's just me trying to find ways to replace valgrind.
Originally by blocksom on 2013-07-16 10:42:51 -0500
I configured like this:
../configure --host=powerpc64-bgq-linux --with-device=pamid --with-file-system=bg+bglockless --prefix=/bghome/blocksom/development/mpich/install --enable-g=all
and the coll_perf test failed, but due to a different error:
10:40:34 /bgusr/blocksom/tmp$ runjob --block R00-M0-N10 --np 4 --label --envs BG_COREDUMPONEXIT=1 : coll_perf -fname blah 2> runjob.out
stdout[0]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8
stdout[2]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8
stdout[3]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8
stdout[1]: leaked context IDs detected: mask=0x166b668 mask[0]=0xfffffff8
10:40:42 /bgusr/blocksom/tmp$ grep std....0 runjob.out | head
stderr[0]: Global array size 128 x 128 x 128 integers
stderr[0]: Collective write time = 0.152554 sec, Collective write bandwidth = 52.440551 Mbytes/sec
stderr[0]: Collective read time = 0.036090 sec, Collective read bandwidth = 221.670332 Mbytes/sec
stderr[0]: [0] 16 at [0x00000019c58b0a38], [2] 16 at [0x00000019c58b0a38], [3] 16 at [0x00000019c58b0a38], [1] 16 at [0x00000019c58b0a38], ../../../src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c[103]
stderr[0]: [0] 80 at [0x00000019c582f958], [2] 80 at [0x00000019c582f958], [3] 80 at [0x00000019c582f958], [1] 80 at [0x00000019c582f958], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[0]: [0] 8 at [0x00000019c582f898], [2] 8 at [0x00000019c582f898], [3] 8 at [0x00000019c582f898], [1] 8 at [0x00000019c582f898], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]
stderr[0]: [0] 80 at [0x00000019c582f798], [2] 80 at [0x00000019c582f798], [3] 80 at [0x00000019c582f798], [1] 80 at [0x00000019c582f798], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[622]
stderr[0]: [0] 8 at [0x00000019c582f6d8], [2] 8 at [0x00000019c582f6d8], [3] 8 at [0x00000019c582f6d8], [1] 8 at [0x00000019c582f6d8], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[620]
stderr[0]: [0] 80 at [0x00000019c582f5d8], [2] 80 at [0x00000019c582f5d8], [3] 80 at [0x00000019c582f5d8], [1] 80 at [0x00000019c582f5d8], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[0]: [0] 8 at [0x00000019c582f518], [2] 8 at [0x00000019c582f518], [3] 8 at [0x00000019c582f518], [1] 8 at [0x00000019c582f518], ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]
10:41:12 /bgusr/blocksom/tmp$ grep std....1 runjob.out | head
stderr[1]: ../../../src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c[103]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[622]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[620]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[624]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[622]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[620]
stderr[1]: ../src/mpid/pamid/src/comm/mpid_selectcolls.c[626]
I'll try your specific configure next...
Originally by robl on 2013-07-16 10:54:44 -0500
Excellent. I guess it was the more aggressive '--with-atomic-primitives' and friends?
coll_perf did not fail. That spew you see is the MPICH memory allocation debugging telling us about memory and resource leaks. ad_bg_aggrs.c has one at line 103, for example:
100 if(s > aggrsInPsetSize)
101 {
102 if(aggrsInPset) ADIOI_Free(aggrsInPset);
103 aggrsInPset = (int *) ADIOI_Malloc (s *sizeof(int));
104 aggrsInPsetSize = s;
105 }
I doubt it's a huge contributor to overall memory usage, but it's something we would have found with valgrind (if we had it) that we now know about!
Originally by blocksom on 2013-07-16 11:08:13 -0500
With this configure:
../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --with-atomic-primitives \
--enable-handle-allocation=tls --enable-refcount=lock-free \
--disable-predefined-refcount \
--prefix=/bghome/blocksom/development/mpich/install \
--enable-g=all
I get these warnings, which seem bad.
11:04:32 {ticket-1900} ~/development/mpich/build/src/mpi/romio/test$ make
cd .. && /bin/sh ./config.status test/fperf.f
config.status: creating test/fperf.f
F77 fperf.o
F77LD fperf
/bgsys/drivers/toolchain/V1R2M1_base/gnu-linux/lib/gcc/powerpc64-bgq-linux/4.4.6/../../../../powerpc64-bgq-linux/bin/ld: Warning: size of symbol `mpifcmb1_' changed from 20 in fperf.o to 40 in /bghome/blocksom/development/mpich/install/lib/libmpich.a(setbot.o)
/bgsys/drivers/toolchain/V1R2M1_base/gnu-linux/lib/gcc/powerpc64-bgq-linux/4.4.6/../../../../powerpc64-bgq-linux/bin/ld: Warning: size of symbol `mpifcmb2_' changed from 20 in fperf.o to 40 in /bghome/blocksom/development/mpich/install/lib/libmpich.a(setbot.o)
Originally by robl on 2013-07-16 11:17:01 -0500
If I had to guess, I'd say there were some stale objects in the romio/test directory. romio/test/Makefile is not nearly as clever about dependencies as the rest of MPICH.
Originally by blocksom on 2013-07-16 11:26:46 -0500
Replying to robl:
If I had to guess, I'd say there were some stale objects in the romio/test directory. romio/test/Makefile is not nearly as clever about dependencies as the rest of MPICH.
Nope .. this was from a "pristine" configure, make && make install, romio tests build.
Originally by blocksom on 2013-07-16 11:42:43 -0500
Ugh. If I configure like this (without --with-atomic-primitives) ...
../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --enable-handle-allocation=tls \
--enable-refcount=lock-free --disable-predefined-refcount \
--prefix=/bghome/blocksom/development/mpich/install \
--enable-g=all
I get a branch-to-NULL in some parameter checking code (src/util/param/param_vals.c:726):
11:35:34 /bgusr/blocksom/tmp$ tail -n16 core.2
+++STACK
Frame Address Saved Link Reg
0000001dbfffb6c0 0000000000000000
0000001dbfffb740 000000000103707c
0000001dbfffb840 00000000010103e4
0000001dbfffb900 0000000001000590
0000001dbfffbaa0 00000000013fab48
0000001dbfffbd80 00000000013fae44
0000001dbfffbe40 0000000000000000
---STACK
Interrupt Summary:
System Calls 68
External Input Interrupts 1
Data TLB Interrupts 1
---ID
---LCB
11:35:42 /bgusr/blocksom/tmp$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-addr2line -ipfe coll_perf
000000000103707c
MPIR_Param_init_params at /bghome/blocksom/development/mpich/build/../src/util/param/param_vals.c:726
11:39:55 /bgusr/blocksom/tmp$ nl -ba -s ': ' /bghome/blocksom/development/mpich/build/../src/util/param/param_vals.c | sed -n 720,730p
720: MPIR_PARAM_GET_DEFAULT_STRING(DEFAULT_THREAD_LEVEL, &tmp_str);
721: rc = MPL_env2str("MPICH_DEFAULT_THREAD_LEVEL", &tmp_str);
722: MPIU_ERR_CHKANDJUMP1((-1 == rc),mpi_errno,MPI_ERR_OTHER,"**envvarparse","**envvarparse %s","MPICH_DEFAULT_THREAD_LEVEL");
723: rc = MPL_env2str("MPIR_PARAM_DEFAULT_THREAD_LEVEL", &tmp_str);
724: MPIU_ERR_CHKANDJUMP1((-1 == rc),mpi_errno,MPI_ERR_OTHER,"**envvarparse","**envvarparse %s","MPIR_PARAM_DEFAULT_THREAD_LEVEL");
725: if (tmp_str != NULL) {
726: MPIR_PARAM_DEFAULT_THREAD_LEVEL = MPIU_Strdup(tmp_str);
727: MPIR_Param_assert(MPIR_PARAM_DEFAULT_THREAD_LEVEL);
728: if (MPIR_PARAM_DEFAULT_THREAD_LEVEL == NULL) {
729: MPIU_CHKMEM_SETERR(mpi_errno, strlen(tmp_str), "dup of string for MPIR_PARAM_DEFAULT_THREAD_LEVEL");
730: goto fn_fail;
Originally by blocksom on 2013-07-16 11:58:16 -0500
... and I get the same segfault at the same location with the --with-atomic-primitives configure option. Why can't I reproduce your runtime error?
Originally by robl on 2013-07-16 12:27:41 -0500
OK, I pulled 2de997d, regenerated configure, and re-configured like you, from a brand-new build directory:
../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --enable-handle-allocation=tls \
--enable-refcount=lock-free --disable-predefined-refcount \
--prefix=${HOME}/soft/mpich-dbg --enable-g=all && make -j -l 16 && make install
I run coll_perf with 8 nodes and now I think you and I are back in sync:
% coreprocessor.pl -nogui -c=. -b=./coll_perf
Reading disassembly for ./coll_perf
filename: ./core.5 (filecount 1)
done reading cores
filename: ./core.3 (filecount 2)
done reading cores
filename: ./core.6 (filecount 3)
done reading cores
filename: ./core.1 (filecount 4)
done reading cores
filename: ./core.7 (filecount 5)
done reading cores
filename: ./core.2 (filecount 6)
done reading cores
filename: ./core.0 (filecount 7)
done reading cores
filename: ./core.4 (filecount 8)
done reading cores
0 :Node (9)
1 : <traceback not fetched> (1)
1 : 0000000000000000 (8)
2 : .__libc_start_main (8)
3 : .generic_start_main (8)
4 : .main (8)
5 : .PMPI_Init (8)
6 : .MPIR_Param_init_params (8)
7 : 0000000000000000 (8)
% bgq_stack coll_perf core.0
------------------------------------------------------------------------
Program : /gpfs/vesta-home/robl/src/mpich/build-1900/src/mpi/romio/test/./coll_perf
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN
0000000000000000
??
??:0
0000000001036fac
MPIR_Param_init_params
/home/robl/src/mpich/build-1900/../src/util/param/param_vals.c:726
0000000001010924
PMPI_Init
/home/robl/src/mpich/build-1900/../src/mpi/init/init.c:100
0000000001000ad0
main
/home/robl/src/mpich/build-1900/src/mpi/romio/test/../../../../../src/mpi/romio/test/coll_perf.c:33
00000000013f6a28
generic_start_main
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226
00000000013f6d24
__libc_start_main
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194
0000000000000000
??
??:0
Originally by blocksom on 2013-07-18 12:12:51 -0500
I don't understand why this works for non-bgq platforms, but on bgq the root of the problem is that the MEMALLOC mutex is being used before it is initialized.
In the MPI_Init() function, the MPIR_Param_init_params() function is invoked, which calls MPIU_Strdup, which is defined to be the function MPIU_trstrdup() only when the configure option --enable-g=all is used.
MPIU_trstrdup() uses the MEMALLOC critical section macros; however, these critical section mutexes are not yet defined at this point.
$ nl -ba -s ': ' src/util/mem/trmem.c | sed -n 95,104p
95:
96: void *MPIU_trstrdup(const char *str, int lineno, const char fname[])
97: {
98: void *retval;
99: MPIU_THREAD_CS_ENTER(MEMALLOC,);
100: retval = MPL_trstrdup(str, lineno, fname);
101: MPIU_THREAD_CS_EXIT(MEMALLOC,);
102: return retval;
103: }
104:
These are defined when MPI_Init() invokes MPIR_Init_thread() on line 121 (after the call to MPIR_Param_init_params()) in the file src/mpi/init/init.c ...
$ nl -ba -s ': ' src/mpi/init/init.c | sed -n 98,123p
98: /* ... body of routine ... */
99:
100: mpi_errno = MPIR_Param_init_params();
101: if (mpi_errno) MPIU_ERR_POP(mpi_errno);
102:
103: if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_MULTIPLE"))
104: threadLevel = MPI_THREAD_MULTIPLE;
105: else if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_SERIALIZED"))
106: threadLevel = MPI_THREAD_SERIALIZED;
107: else if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_FUNNELED"))
108: threadLevel = MPI_THREAD_FUNNELED;
109: else if (!strcmp(MPIR_PARAM_DEFAULT_THREAD_LEVEL, "MPI_THREAD_SINGLE"))
110: threadLevel = MPI_THREAD_SINGLE;
111: else {
112: MPIU_Error_printf("Unrecognized thread level %s\n", MPIR_PARAM_DEFAULT_THREAD_LEVEL);
113: exit(1);
114: }
115:
116: /* If the user requested for asynchronous progress, request for
117: * THREAD_MULTIPLE. */
118: if (MPIR_PARAM_ASYNC_PROGRESS)
119: threadLevel = MPI_THREAD_MULTIPLE;
120:
121: mpi_errno = MPIR_Init_thread( argc, argv, threadLevel, &provided );
122: if (mpi_errno != MPI_SUCCESS) goto fn_fail;
123:
... which then calls the macro MPIU_THREAD_CS_INIT on line 272 of the file src/mpi/init/initthread.c.
$ nl -ba -s ': ' src/mpi/init/initthread.c | sed -n 256,273p
256: int MPIR_Init_thread(int * argc, char ***argv, int required, int * provided)
257: {
258: int mpi_errno = MPI_SUCCESS;
259: int has_args;
260: int has_env;
261: int thread_provided;
262: int exit_init_cs_on_failure = 0;
263: MPID_Info *info_ptr;
264:
265: /* For any code in the device that wants to check for runtime
266: decisions on the value of isThreaded, set a provisional
267: value here. We could let the MPID_Init routine override this */
268: #ifdef HAVE_RUNTIME_THREADCHECK
269: MPIR_ThreadInfo.isThreaded = (required == MPI_THREAD_MULTIPLE);
270: #endif
271:
272: MPIU_THREAD_CS_INIT;
273:
If I move the call to MPIU_THREAD_CS_INIT to line 99 of src/mpi/init/init.c, then the romio coll_perf test will pass - but I don't think this is the correct way of fixing this problem; it's just a hack.
Originally by blocksom on 2013-07-19 08:50:54 -0500
I don't think this is actually related to anything specific to bgq or pamid .. only that the bug is revealed with pamid:bgq.
I'm going to remove the ibm-integ keyword and would like to reassign to Pavan, but I don't have authority to do trac ticket assignments.
Originally by robl on 2013-07-19 09:08:50 -0500
Thanks for investigating, Mike.
Originally by balaji on 2014-01-01 20:56:25 -0600
Thanks for the analysis, Mike. It's very useful.
It looks like we need the cvar initialization to know what thread-level is required. But without knowing the thread-level, we don't have the right macros for cvar initialization. One possible option is to assume a thread-level for the cvar initialization, and then "fix it" based on the user-provided environment variables.
We could assume THREAD_FUNNELED, since multiple threads are not allowed to call MPI_INIT{_THREAD} anyway. Then we can use the environment information to upgrade it.
Alternatively, we could assume THREAD_MULTIPLE to be safe, and later downgrade it based on environment information.
There's a proposal for MPI-4 to force MPI_INIT{_THREAD} to always be thread-safe. The latter option might be better for that. I'll think about this some more and code this up tomorrow.
Originally by robl on 2014-01-02 15:17:38 -0600
I can't test Pavan's proposed change on Blue Gene (where I noticed the problem in the first place): when I configure mpich-review/ticket-1900-robl, I get the following error:
configure: ===== done with src/mpi/romio configure =====
configure: sourcing src/mpi/romio/localdefs
checking size of OPA_ptr_t... 0
checking the sizeof MPI_Offset... configure: WARNING: Unable to determine the size of MPI_Offset
unknown
checking whether the Fortran Offset type works with Fortran 77... yes
checking whether the Fortran Offset type works with Fortran 90... yes
configure: error: Unable to convert MPI_SIZEOF_OFFSET to a hex string. This is either because we are building on a very strange platform or there is a bug somewhere in configure.
Originally by balaji on 2014-01-02 15:42:03 -0600
Robl: did you run autogen.sh again?
Originally by blocksom on 2014-01-02 15:47:07 -0600
... and could you paste your configure command into the ticket?
Originally by robl on 2014-01-02 16:20:16 -0600
Autogen was the first thing I thought of. No dice.
The configure command is the same as earlier in this ticket (but that was 6 months ago, so ...):
../configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --with-atomic-primitives \
--enable-handle-allocation=tls --enable-refcount=lock-free \
--disable-predefined-refcount --prefix=/home/robl/soft/mpich-bgq-dbg \
--enable-g=all
No tweaks to CC or FC or anything like that. This is on one of the Mira login nodes.
Originally by robl on 2014-01-02 16:37:46 -0600
Oh, and that (standard compilers) was my problem. I forgot to put /bgsys/drivers/V1R2M0/ppc64/gnu-linux/bin in my path. Carrying on...
Originally by robl on 2014-01-03 13:27:21 -0600
I cannot ack this: I get a segfault super-early:
0 :Node (8)
1 : <traceback not fetched> (1)
1 : 0000000000000000 (7)
2 : .__libc_start_main (7)
3 : .generic_start_main (7)
4 : .main (7)
5 : .PMPI_Init (7)
6 : .MPIR_T_env_init (7)
7 : 0000000000000000 (7)
I have a binary core file from Blue Gene that provides the following backtrace:
Program terminated with signal 11, Segmentation fault.
#0 __l2_load_op (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:109
109 return *__l2_op_ptr(ptr, op);
(gdb) where
#0 __l2_load_op (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:109
#1 L2_AtomicLoadIncrementBounded (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:186
#2 MPIDI_Mutex_acquire (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /home/robl/src/mpich/src/mpid/pamid/include/mpidi_mutex.h:112
#3 MPIU_trmalloc (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /home/robl/src/mpich/src/util/mem/trmem.c:28
#4 0x0000000001013a7c in MPIR_T_enum_env_init ()
at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:31
#5 MPIR_T_env_init () at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:72
#6 0x0000000001010bc4 in PMPI_Init (argc=0x1dbfffbad0, argv=0x1dbfffbad8)
at /home/robl/src/mpich/src/mpi/init/init.c:145
#7 0x0000000001000ac4 in main (argc=3, argv=0x1dbfffbec8)
at /home/robl/src/mpich/src/mpi/romio/test/coll_perf.c:33
Need more information?
Originally by balaji on 2014-01-07 00:06:59 -0600
I've pushed a different patch to mpich-review/ticket-1900-robl. It also includes the win-req patch that I sent out to IBM. Can you try it out and let me know if it works correctly for you? It'll be great if someone from IBM can take a look and sign off too, since it affects pamid as well.
Note that this patch only fixes the runtime thread-safety code, and not cases where runtime thread-safety is not enabled (--enable-threads=runtime). I'm not sure there's a good way to fix this when runtime thread-safety is not enabled, since we always need to check at runtime at least if the mutexes have been initialized or not.
For a complete fix, we might need to completely get rid of the ability to disable runtime checks. That is, we give these options for --enable-threads={single,funneled,serialized,multiple,device}. The "device" option will be default. The internal runtime check will always happen. The configure option will only behave as if the user initialized with that thread-safety level. This way, the MPI_INIT{_THREAD} part will always occur in MPI_THREAD_FUNNELED mode, irrespective of what the user requested. Sometime during INIT, we'll "upgrade" the thread-level to MULTIPLE if needed.
Originally by blocksom on 2014-01-07 08:31:30 -0600
I've reviewed these commits and I think it will work, but there will be a performance degradation because of the additional branches in the point-to-point critical path - both on the send and recv side.
Can this code be re-worked so that the "safe" critical section macros (the ones with the branch) are used only during initialization, but then we can continue to use the original critical section macros after initialization in the performance-critical code path?
Originally by balaji on 2014-01-07 10:02:53 -0600
Mike: is there really a performance degradation with this on any modern processor? Would an "unlikely" attribute help? I can't imagine the branch adding too much performance degradation compared to the mutexes and memory barriers the code is doing.
There are two reasons I did it this way:
Originally by robl on 2014-01-07 10:49:20 -0600
This branch runs on Blue Gene now, though it's still reporting leaked resources (and will until #1947 is closed). I've signed off on 7db1a818 (HEAD, 1900-robl-so) "We always need the runtime thread-safety check", but I don't know the right invocation for adding a sign-off on older commits. Pushed to mpich-review/ticket-1900-robl-so.
Originally by blocksom on 2014-01-07 11:04:32 -0600
Replying to balaji:
Mike: is there really be a performance degradation with this on any modern processor? With an "unlikely" attribute help? I can't imagine the branch adding too much performance degradation compared to the mutexes and memory barriers the code is doing.
Unfortunately, my personal experience working with the A2 processor on BG/Q is that branches do make a big performance difference. The unlikely attribute helps, but not much. We spent a ridiculous amount of time reworking the code to remove branches and msyncs so that we could meet the performance requirements for Mira.
That said ... The supported BG/Q product will not move to the master branch any time soon, if at all, so this change wouldn't impact the product. Of course, your colleagues who are interested in MPI3 on BG/Q have no choice but to build their own mpich library from the master branch. The bgq code on the master branch isn't stable enough for me to run any performance comparisons, but I am expecting the performance to be worse than the product branch.
At the minimum I'd like to see the unlikely attribute added, though. Can we do that?
Originally by balaji on 2014-01-07 11:33:58 -0600
I'll rework it to add unlikely.
I'm not doubting that branches cause performance degradation. I'm just comparing it against mutexes and memory barriers, since these branches only occur when the code has been configured to always run in THREAD_MULTIPLE. In other words, what we have right now is: always do mutexes and memory barriers (even for single-threaded applications when the library is configured this way); what I changed it to is: check a flag before doing mutexes and memory barriers.
For completeness, if we considered the two sets of macros approach, how would be do that without another branch? Wouldn't we anyway need a flag to tell us whether we are initialized or not?
As a secondary issue, I'm curious why this code path is being taken for coll_perf, since it doesn't initialize MPI in THREAD_MULTIPLE mode. Are you forcing THREAD_MULTIPLE on all applications?
Originally by balaji on 2014-01-07 11:42:00 -0600
Replying to robl:
This branch runs on Blue Gene now, though it's still reporting leaked resources (and will until #1947 is closed). I've signed off on 7db1a818 (HEAD, 1900-robl-so) "We always need the runtime thread-safety check", but don't know the right invocation for adding a sign-off on older commits. Pushed to mpich-review/ticket-1900-robl-so
You can cherry-pick them to a different branch with the --signoff flag, and then push this new branch back as the signedoff branch.
Originally by blocksom on 2014-01-07 11:50:33 -0600
Replying to balaji:
As a secondary issue, I'm curious why this code path is being taken for coll_perf, since it doesn't initialize MPI in THREAD_MULTIPLE mode. Are you forcing THREAD_MULTIPLE on all applications?
Hmmmm .. I'd have to look closer at the code. The product installs two builds of the code - one is the "per-object" locking. In that mode we have to promote the thread level to multiple because of commthreads, etc., even if the user only invokes MPI_Init(). We don't necessarily do that for the libraries built with the "global lock" mode.
If I had time to rewrite the pamid device, I would register a different pami callback function for each permutation of the environment so that the recv code path had minimal branches. I'd try to do the same on the send side. You give up some inlining when using function pointers, but if you do it right you can eliminate lots of branches.
I'm fine with the code now. If we have performance problems in the future then we can tackle this issue again later.
Originally by balaji on 2014-01-07 12:03:14 -0600
OK. I did build it with per-object, so that makes sense.
I've pushed the updated patches to mpich-review/ticket-1900-blocksome. It contains the previous patches in this branch too, one of which was signed off by Robl.
Originally by balaji on 2014-01-07 12:08:54 -0600
Btw, note that with unlikely, we prioritized non-THREAD_MULTIPLE codes. But that doesn't sound like what you need. For per-object, you probably want to prioritize THREAD_MULTIPLE codes since all programs are promoted to THREAD_MULTIPLE?
Originally by blocksom on 2014-01-07 13:50:08 -0600
Replying to balaji:
Btw, note that with unlikely, we prioritized non-THREAD_MULTIPLE codes. But that doesn't sound like what you need. For per-object, you probably want to prioritize THREAD_MULTIPLE codes since all programs are promoted to THREAD_MULTIPLE?
Correct.
Originally by balaji on 2014-01-07 13:51:30 -0600
OK. I'll change it to likely. Also, there were some errors in the previous patch. The isThreaded variable should only come into play when the library is configured with MULTIPLE support. I'll push a new set of patches in.
Originally by balaji on 2014-01-07 15:24:01 -0600
I've pushed new patches into mpich-review/ticket-1900-blocksome.
Rob/Mike: can you guys review this branch?
Rob: I've removed your previous signoff, since there were several changes to the patch after you looked at it.
Originally by blocksom on 2014-01-08 10:24:01 -0600
Sorry, but I have another question here ..
Since it is only the MEMALLOC mutex that is being used by the string functions, could we just add this if (likely(MPIR_ThreadInfo.isThreaded)) check to the MPIU_THREAD_CS_MEMALLOC_ENTER and MPIU_THREAD_CS_MEMALLOC_EXIT macros and leave the others alone? Or could the checks in the other macros be enabled only if mpich is configured with "extra runtime checks" (whatever that configure option is)?
Originally by balaji on 2014-01-08 14:16:50 -0600
That seems like a hacky solution. For now, we are only using memory management routines outside of the critical section, but in the future we might use more.
Furthermore, MPIU_THREAD_CS_MEMALLOC_{ENTER,EXIT} can be device-defined. The default implementation of MPIU_THREAD_CS_ENTER for MEMALLOC maps to MPIU_THREAD_CS_ENTER_MEMALLOC, but devices can map them to something else. For example, pamid maps it to MPIU_THREAD_CS_MEMALLOC_ENTER (note the reordering of words).
Originally by blocksom on 2014-01-08 14:19:56 -0600
Well .. ok. I still don't think that the current solution is perfect. But we can work to refine this later in 3.1.1 or 3.2 or whenever.
Originally by blocksom on 2014-01-08 14:44:00 -0600
I've signed off on the commits and pushed everything to the mpich-ibm/ticket-1900-blocksome-signed branch.
It would be good to get a sign-off from Rob Latham as well.
Originally by balaji on 2014-01-08 14:53:01 -0600
Replying to blocksom:
Well .. ok. I still don't think that the current solution is perfect. But we can work to refine this later in 3.1.1 or 3.2 or whenever.
Alright. If you can actually show me a noticeable performance difference caused by this branch in the THREAD_MULTIPLE case, I'll buy you a beer. :-)
Originally by blocksom on 2014-01-08 14:59:18 -0600
Replying to balaji:
Replying to blocksom:
Well .. ok. I still don't think that the current solution is perfect. But we can work to refine this later in 3.1.1 or 3.2 or whenever.
Alright. If you can actually show me a noticeable performance difference caused by this branch in the THREAD_MULTIPLE case, I'll buy you a beer. :-)
I dunno .. that sounds suspiciously like work. How about we buy each other beers at the next Forum meeting instead? That would be a bit more fun. :)
Originally by robl on 2014-01-08 16:26:14 -0600
Sigh. NACK. This backtrace looks so familiar that I had to double-check that it was not identical to last week's. Hm. It is identical. I pulled from mpich-ibm/ticket-1900-blocksome-signed:
(gdb) where
#0 __l2_load_op (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:109
#1 L2_AtomicLoadIncrementBounded (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /bgsys/drivers/V1R2M0/ppc64/spi/include/l2/atomic.h:186
#2 MPIDI_Mutex_acquire (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /home/robl/src/mpich/src/mpid/pamid/include/mpidi_mutex.h:112
#3 MPIU_trmalloc (a=24, lineno=31,
fname=0x1490b08 "/home/robl/src/mpich/src/mpi_t/mpit_initthread.c")
at /home/robl/src/mpich/src/util/mem/trmem.c:28
#4 0x0000000001013a7c in MPIR_T_enum_env_init ()
at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:31
#5 MPIR_T_env_init () at /home/robl/src/mpich/src/mpi_t/mpit_initthread.c:72
#6 0x0000000001010bc4 in PMPI_Init (argc=0x1dbfffbad0, argv=0x1dbfffbad8)
at /home/robl/src/mpich/src/mpi/init/init.c:145
#7 0x0000000001000ac4 in main (argc=3, argv=0x1dbfffbec8)
at /home/robl/src/mpich/src/mpi/romio/test/coll_perf.c:33
Originally by balaji on 2014-01-08 22:04:34 -0600
Hmm.. Weird.
I configured it this way:
../mpich/configure --host=powerpc64-bgq-linux --with-device=pamid \
--with-file-system=bg+bglockless --disable-wrapper-rpath \
--enable-thread-cs=per-object --with-atomic-primitives \
--enable-handle-allocation=tls --enable-refcount=lock-free \
--disable-predefined-refcount \
--prefix=/home/balaji/software/mpich/build/install --enable-g=all
And ran it this way:
qsub -t 15 -n 8 --env=BG_COREDUMPBINARY=0 ./coll_perf -fname blah
The output file had some stuff:
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
leaked context IDs detected: mask=0x166b440 mask[0]=0xfffffffc
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
In direct memory block for handle type REQUEST, 128 handles are still allocated
The error file had some stuff:
[...snipped...]
Global array size 128 x 128 x 128 integers
Collective write time = 0.017283 sec, Collective write bandwidth = 462.889197 Mbytes/sec
Collective read time = 0.023002 sec, Collective read bandwidth = 347.789777 Mbytes/sec
[0] 504 at [0x00000019c583e858], [1] 504 at [0x00000019c583e858], [3] 504 at [0x00000019c583e858], [6] 504 at [0x00000019c583e858], [7] 504 at [0x00000019c583e858], [2] 504 at [0x00000019c583e858], ../mpich/src/mpi/comm/commutil.c[281]
[4] 504 at [0x00000019c583e858], ../mpich/src/mpi/comm/commutil.c[281]
[5] 504 at [0x00000019c583e858], ../mpich/src/mpi/comm/commutil.c[281]
[...snipped...]
Seems to be as expected (except for the bandwidth output going into the error file).
What am I missing?
Originally by robl on 2014-01-10 15:36:42 -0600
Thanks for trying it out, Pavan. Today, no segfaults, so you get my signoff.
Patches pushed to mpich-ibm/ticket-1900-blocksome-and-robl-signed -- aw yeah, robl loves the long branch names.
Originally by Pavan Balaji balaji@mcs.anl.gov on 2014-01-10 22:36:57 -0600
In 99485af86e826a946fe038a77f0b98244526dc4d: Thread critical-section initialization fixes.
This patch has two related parts.
This is needed since we were initializing cvars before setting the thread-level, so the thread-safety macros are not initialized at that point. See #1900.
Signed-off-by: Michael Blocksome blocksom@us.ibm.com Signed-off-by: Rob Latham robl@mcs.anl.gov
Originally by Pavan Balaji balaji@mcs.anl.gov on 2014-01-10 22:36:58 -0600
In 1a0c3dbc9d8055314613016d5c64f801d9b87219: We always need the runtime thread-safety check.
Disabling runtime check for thread-safety is not a correct option. This would cause memory allocation and string manipulation routines to be unusable before the thread-safety level is set. Fixes #1900.
Signed-off-by: Michael Blocksome blocksom@us.ibm.com Signed-off-by: Rob Latham robl@mcs.anl.gov
Originally by Pavan Balaji balaji@mcs.anl.gov on 2014-01-10 22:36:58 -0600
In 2ce5a7e2786fd07ffb13dc4650fc231b7e9ef829: Prioritize thread-multiple branch.
Give a "likely" attribute to the branch that checks if an MPI process is initialized in THREAD_MULTIPLE or not. This should reduce the cost of the branch for applications that require THREAD_MULTIPLE.
The per-object mode in the pamid branch is intended for the case where comm-threads are used. In this case, the application is mostly in THREAD_MULTIPLE mode, even if it didn't explicitly initialize it as such. The only time it'll not be in that mode is during initialization before the appropriate mutexes have been setup.
See #1900.
Signed-off-by: Michael Blocksome blocksom@us.ibm.com Signed-off-by: Rob Latham robl@mcs.anl.gov
Originally by robl on 2013-07-15 08:56:03 -0500
Building stock MPICH with memory debugging results in a seg fault
Steps to reproduce:
Program will fail with signal 11 and provide the following backtrace: