pmodels / mpich

Official MPICH Repository
http://www.mpich.org

MPI_Comm_split fails with memory error on Blue Gene at scale #2004

Closed: mpichbot closed this issue 7 years ago

mpichbot commented 7 years ago

Originally by robl on 2014-01-22 14:41:23 -0600


A simple comm_split test case fails with glibc memory corruption:

*** glibc detected *** /gpfs/mira-home/robl/src/mpi-md-test/./comm_split_testcase2: malloc(): memory corruption: 0x0000001fc17f6ce0 ***

The attached test case mimics the way the "deferred open" optimization uses MPI_Comm_split. Some background: with deferred open, ROMIO has only the I/O aggregators open the file. The "deferred" part is that the optimization only happens if hints request it; should a process do independent I/O despite saying it would not, ROMIO opens the file at the time of that independent operation.

ROMIO uses MPI_Comm_split to create an "aggregator communicator".
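
(For reference, a minimal sketch of a split along these lines. The attached comm_split_testcase2.c is not reproduced here; the one-aggregator-per-16-ranks rule and the use of MPI_UNDEFINED below are illustrative assumptions, not ROMIO's actual aggregator-selection logic.)

    /* Minimal sketch: split MPI_COMM_WORLD so that only a subset of
     * "aggregator" ranks ends up in the new communicator.  The 1-in-16
     * aggregator rule is an assumption for illustration only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Comm aggcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* aggregators get color 1; everyone else passes MPI_UNDEFINED and
         * receives MPI_COMM_NULL back */
        int color = (rank % 16 == 0) ? 1 : MPI_UNDEFINED;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &aggcomm);

        if (aggcomm != MPI_COMM_NULL) {
            if (rank == 0)
                printf("aggregator communicator created\n");
            MPI_Comm_free(&aggcomm);
        }

        MPI_Finalize();
        return 0;
    }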

mpichbot commented 7 years ago

Originally by robl on 2014-01-22 14:41:45 -0600


Attachment added: comm_split_testcase2.c (1.5 KiB), a test case for MPI_Comm_split

mpichbot commented 7 years ago

Originally by blocksom on 2014-01-22 15:13:57 -0600


Could you paste your runjob command (ranks-per-node, np, etc) and any environment variables used? What is your configure command?

And - just to confirm - this is on the master branch, correct?

Thanks!

mpichbot commented 7 years ago

Originally by robl on 2014-02-03 16:04:40 -0600


Sorry for the latency: 'robl' doesn't get emails from trac!

The test is very simple and doesn't even need a directory:

    qsub -A SSSPPg -t 10 -n 8192 --mode c16 ./comm_split_testcase2

mpichbot commented 7 years ago

Originally by robl on 2014-02-03 16:08:02 -0600


Replying to blocksom:

Could you paste your runjob command (ranks-per-node, np, etc) and any environment variables used? What is your configure command?

And - just to confirm - this is on the master branch, correct?

This is on the master branch, configured like this:

export PATH=/bgsys/drivers/ppcfloor/gnu-linux/bin/:${PATH}
${HOME}/src/mpich/configure \
    --host=powerpc64-bgq-linux --with-device=pamid \
    --with-file-system=bg+bglockless --disable-wrapper-rpath \
    --prefix=/home/robl/soft/mpich-bgq-do

mpichbot commented 7 years ago

Originally by blocksom on 2014-02-03 16:15:06 -0600


could someone change the owner of this ticket to 'blocksom' .. not the 'blocksom@...' email address? Thanks!

mpichbot commented 7 years ago

Originally by blocksom on 2014-02-03 16:18:01 -0600


CC: Nysal

mpichbot commented 7 years ago

Originally by robl on 2014-02-03 16:28:42 -0600


Replying to blocksom:

could someone change the owner of this ticket to 'blocksom' .. not the 'blocksom@...' email address? Thanks!

You got it, Mike. Should I do that for any ticket I think you might be interested in? If 'robl' is the owner, I don't get emails, so I hope 'blocksom' gets notified of this update.

mpichbot commented 7 years ago

Originally by blocksom on 2014-02-03 16:33:25 -0600


Replying to robl:

Replying to blocksom:

could someone change the owner of this ticket to 'blocksom' .. not the 'blocksom@...' email address? Thanks!

You got it, Mike. Should I do that for any ticket I think you might be interested in? If 'robl' is the owner, I don't get emails, so I hope 'blocksom' gets notified of this update.

I got this update in my email.

If I'm the owner, then it should use my trac account, blocksom, but the CC field needs to be an actual email address - not a trac account.

mpichbot commented 7 years ago

Originally by robl on 2014-02-05 08:50:57 -0600


I re-ran the experiment to capture a backtrace. Here's the coreprocessor output:

0 :Node (479)
1 :    <traceback not fetched> (1)
1 :    0000000000000000 (478)
2 :        .__libc_start_main (478)
3 :            .generic_start_main (478)
4 :                .main (478)
5 :                    .PMPI_Comm_split (478)
6 :                        .MPIR_Comm_split_impl (478)
7 :                            .MPIR_Comm_commit (478)
8 :                                .MPIDI_Comm_create (478)
9 :                                    .MPIDI_Coll_comm_create (478)
10:                                        .geom_tasklist_create_wrapper (478)
11:                                            .PAMI_Geometry_create_tasklist (478)
12:                                               .PAMI::Client::start_barrier(PAMI::Geometry::Common*, PAMI::Geometry::Common*, unsigned long, void*, pami_attribute_name_t) (478)
13:                                                    .PAMI::Geometry::Geometry<PAMI::Geometry::Common>::ue_barrier(void (*)(void*, void*, pami_result_t), void*, unsigned long, void*) (478)
14:                                                        .CCMI::Adaptor::Barrier::BarrierT<CCMI::Schedule::MultinomialTreeT<CCMI::Schedule::TopologyMap, 4>, &(CCMI::Adaptor::P2PBarrier::binomial_analyze(PAMI::Geometry::Common*)), (PAMI::Geometry::topologyIndex_t)0, (PAMI::Geometry::ckeys_t)11>::start() (478)
15:                                                            .CCMI::Executor::BarrierExec::sendNext() (478)
16:                                                                .PAMI::BGQNativeInterfaceAS<PAMI::Device::MU::Context, PAMI::Device::MU::ShortAMMulticastModel, PAMI::Device::MU::NullMultisyncModel, PAMI::Device::MU::NullMulticombineModel, PAMI::MemoryAllocator<4160u, 64u, 1u, PAMI::Mutex::Noop> >::multicast(pami_multicast_t*, void*) (478)
17:                                                                    0000001fc0e86eb0 (27)
17:                                                                    0000001fc1ec5d30 (29)
17:                                                                    0000001fc1e86d10 (30)
17:                                                                    0000001fc5886d10 (30)
17:                                                                    0000001fc0e86d10 (362)

Here are the interesting parts of the stack trace from rank 52745, the one that actually encountered the memory error:

------------------------------------------------------------------------
Program   : /gpfs/mira-home/robl/src/mpi-md-test/./comm_split_testcase2
------------------------------------------------------------------------
+++ID Rank: 52745, TGID: 217, Core: 9, HWTID:0 TID: 217 State: RUN 

0000001fc0e86d10
??
??:0

00000000012a6590
PAMI::Device::Interface::MulticastModel<PAMI::Device::MU::ShortAMMulticastModel, PAMI::Device::MU::Context, 800u>::postMulticastImmediate(unsigned long, unsigned long, pami_multicast_t*, void*)
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/components/devices/bgq/mu2/model/AMMulticastModel.h:297

00000000012b334c  
CCMI::Executor::BarrierExec::sendNext()  
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/algorithms/executor/Barrier.h:253

00000000012b46c0
CCMI::Executor::BarrierExec::start()
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/algorithms/executor/Barrier.h:269

000000000111b674
PAMI::Geometry::Algorithm<PAMI::Geometry::Geometry<PAMI::Geometry::Common> >::generate(pami_xfer_t*)
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/algorithms/geometry/Algorithm.h:56

00000000012b0228
00000a62.long_branch_r2off._pami_core_uint64_lor+0
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/common/bgq/Client.h:1560  

00000000010f2df4
PAMI::Client::geometry_create_tasklist_impl(void**, unsigned long, pami_configuration_t*, unsigned long, void*, unsigned int, unsigned int*, unsigned long, void*, void (*)(void*, void*, pami_result_t), void*)
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/common/bgq/Client.h:1057

000000000101adc0
00000012.long_branch_r2off.__libc_start_main+0
:0

Lastly, here's the addr2line output for the glibc backtrace addresses printed to stdout:

% addr2line -e comm_split_testcase2 -C -f 0x13c3778 0x13c6e6c 0x13c9028 0x13c9ac0 0x12c7adc 0x1127114 0x11c189c
malloc_printerr
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/malloc/malloc.c:6327
_int_malloc
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/malloc/malloc.c:4438
__libc_malloc
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/malloc/malloc.c:3698
__posix_memalign
/bgsys/drivers/V1R2M0/ppc64/toolchain/gnu/glibc-2.12.2/malloc/malloc.c:6359
00003855.long_branch_r2off._ZN4PAMI6Memory13MemoryManager17MemoryManagerMetaINS1_18MemoryManagerAllocEE4initEPS1_PKc+0
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/components/memory/heap/HeapMemoryManager.h:119
PAMI::Topology::__subTopologyLocalToMe(PAMI::Topology*)
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/common/default/Topology.h:267
00000a62.long_branch_r2off._pami_core_uint64_lor+0
/bgsys/source/srcV1R2M0.14091/comm/sys/buildtools/pami/algorithms/geometry/Geometry.h:798
mpichbot commented 7 years ago

Originally by blocksom on 2014-02-13 09:51:27 -0600


I can't recreate this problem on a midplane in Rochester. Now I'm trying a midplane on vesta .. does it really require 8 racks to reproduce this problem?

My qsub command failed when I tried 8k nodes because my node count request was "unrealistic" :P

mpichbot commented 7 years ago

Originally by robl on 2014-02-13 21:39:56 -0600


Replying to blocksom:

does it really require 8 racks to reproduce this problem?

The title of this bug report contains the phrase "...at scale". I think I could get this to happen sometimes at 4 racks, but consistently only at 8 racks.

Testing cetus-sized jobs seemed to work. I doubt vesta will work, but you might get lucky with two racks' worth (you might need a short reservation for that; I don't know vesta's queueing policy).

You're going to need a bigger boat.

mpichbot commented 7 years ago

Originally by robl on 2014-04-25 08:58:42 -0500


We finally upgraded Mira to V1R2M1 and re-ran the experiment. My "reliably triggers memory corruption" test case does not trigger memory corruption with V1R2M1, with no change to MPICH from the runs against the older driver.

I built master (from revision [fdb733f39]) and this test case also passed.

So I guess something was fixed in PAMI?

The folks on the early-users list are still tracking down a comm_split error. Jeff Hammond is driving a PMR on that one, though, so maybe we close this bug and re-open it if the two turn out to be related.

mpichbot commented 7 years ago

Originally by blocksom on 2014-04-28 09:47:52 -0500


Yes .. there was a BG/Q PAMI fix in V1R2M1 efix 27.

From: Sameh S. Sharkawi <sssharka@us.ibm.com>
Date: Fri, 28 Feb 2014 19:24:06 +0000 (-0600)
Subject: CPS 9F4PRV: Fix for classroute hang when running out of MU resources

CPS 9F4PRV: Fix for classroute hang when running out of MU resources

1 - Release mutex before returning if decided to abort
2 - start_over before setting any IDs
3 - start_over for so many tries, then aborting when running out of resources

Signed-off-by: Michael Blocksome <blocksom@us.ibm.com>
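
For context, the three items above describe a bounded start-over-then-abort pattern around resource-ID allocation, with the lock released before giving up. Here is a rough sketch of that pattern; every name, bound, and structure below is invented for illustration, and this is not the actual PAMI classroute code:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_TRIES 16                /* assumed retry bound, for illustration */

    static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_free_id = 0;        /* stand-in for a small MU resource pool */

    /* hypothetical helper: succeeds while IDs remain, fails once exhausted */
    static bool try_reserve_id(int *id_out)
    {
        if (next_free_id >= 4)          /* pretend only 4 IDs exist */
            return false;
        *id_out = next_free_id++;
        return true;
    }

    static int allocate_id_or_abort(void)
    {
        pthread_mutex_lock(&id_lock);
        for (int tries = 0; tries < MAX_TRIES; tries++) {
            int id;
            if (try_reserve_id(&id)) {
                pthread_mutex_unlock(&id_lock);
                return id;              /* ID is set only after success */
            }
            /* "start_over": drop partial state and retry from scratch */
        }
        /* out of resources: release the mutex *before* giving up, so other
         * threads are not left waiting on the lock forever */
        pthread_mutex_unlock(&id_lock);
        fprintf(stderr, "no IDs left after %d tries, aborting\n", MAX_TRIES);
        abort();
    }

    int main(void)
    {
        for (int i = 0; i < 5; i++)     /* the fifth call runs out and aborts */
            printf("got id %d\n", allocate_id_or_abort());
        return 0;
    }

The "classroute hang" in the subject line presumably came from the pre-fix behaviour of bailing out while still holding the lock, which item 1 of the fix addresses.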