idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

MPI hangs on HPC #20060

Open fdkong opened 2 years ago

fdkong commented 2 years ago

Bug Description

There is a bug that is possibly in MVAPICH, TIMPI, or MOOSE. For some applications, all ranks end up waiting for messages during mesh partitioning and the computation hangs.

gstack 25923
Thread 3 (Thread 0x2aaadb745700 (LWP 26662)):
#0  0x00002aaab48f5de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00002aaaad1e701f in PerfGraphLivePrint::start() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#2  0x00002aaaafe4c2d0 in execute_native_thread_routine () at /tmp/menlkj/spack-stage/spack-stage-gcc-9.2.0-bxc7mvbmrfcrusa6ij7ux3exfqabmq5y/spack-src/libstdc++-v3/src/c++11/thread.cc:80
#3  0x00002aaab48f1ea5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00002aaab5e36b0d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2aaadb946700 (LWP 26665)):
#0  0x00002aaab48f5de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00002aaaad1e701f in PerfGraphLivePrint::start() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#2  0x00002aaaafe4c2d0 in execute_native_thread_routine () at /tmp/menlkj/spack-stage/spack-stage-gcc-9.2.0-bxc7mvbmrfcrusa6ij7ux3exfqabmq5y/spack-src/libstdc++-v3/src/c++11/thread.cc:80
#3  0x00002aaab48f1ea5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00002aaab5e36b0d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2aaaaab05b80 (LWP 25923)):
#0  0x00002aaab35122c9 in MPIDI_CH3I_SMP_writev_rndv_header () from /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpi.so.12
#1  0x00002aaab3513cb5 in MPIDI_CH3I_SMP_write_progress () from /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpi.so.12
#2  0x00002aaab350cd2f in MPIDI_CH3I_Progress_test () from /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpi.so.12
#3  0x00002aaab34795c0 in MPIR_Test_impl () from /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpi.so.12
#4  0x00002aaab3479862 in PMPI_Test () from /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpi.so.12
#5  0x00002aaaaf957fab in TIMPI::Request::test() () from /home/kongf/workhome/sawtooth/moosemvap/scripts/../libmesh/installed/lib/libtimpi_opt.so.6
#6  0x00002aaaae94880c in _ZN5TIMPI6detail24push_parallel_nbx_helperIKSt3mapIjSt6vectorISt4pairIN7Hilbert14HilbertIndicesEmESaIS7_EESt4lessIjESaIS4_IKjS9_EEEZNS_25push_parallel_vector_dataIRSG_ZNS_25pull_parallel_vector_dataImSF_ZNK7libMesh17MeshCommunication19find_global_indicesINSK_8MeshBase16element_iteratorEEEvRKNSK_8Parallel12CommunicatorERKNSK_11BoundingBoxERKT_SY_RS3_ImSaImEEEUljRKS9_S11_E_ZNKSM_ISO_EEvSS_SV_SY_SY_S11_EUljS13_RKS10_E0_EEvRKNS_12CommunicatorERKT0_RT1_RT2_PSX_EUljS9_E_EEvS1A_OSW_S1D_EUljS13_RNS_7RequestENS_10MessageTagEE_ZNSH_ISI_S1J_EEvS1A_S1K_S1D_EUlRjRS9_S1M_S1N_E0_S1J_EEvS1A_RSW_S1D_RKS1E_RKS1G_ () from /home/kongf/workhome/sawtooth/moosemvap/scripts/../libmesh/installed/lib/libmesh_opt.so.0
#7  0x00002aaaae9503bc in void TIMPI::pull_parallel_vector_data<unsigned long, std::map<unsigned int, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > > > > >, void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::Parallel::Communicator const&, libMesh::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned long, std::allocator<unsigned long> >&) const::{lambda(unsigned int, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > > const&, std::vector<unsigned long, std::allocator<unsigned long> >&)#1}, void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::Parallel::Communicator const&, libMesh::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned long, std::allocator<unsigned long> >&) const::{lambda(unsigned int, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&)#2}>(TIMPI::Communicator const&, std::map<unsigned int, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > > > > > const&, void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::Parallel::Communicator const&, libMesh::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned long, std::allocator<unsigned long> >&) const::{lambda(unsigned int, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > > const&, std::vector<unsigned long, std::allocator<unsigned long> >&)#1}&, void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::Parallel::Communicator const&, libMesh::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned long, std::allocator<unsigned long> >&) const::{lambda(unsigned int, std::vector<std::pair<Hilbert::HilbertIndices, unsigned long>, std::allocator<std::pair<Hilbert::HilbertIndices, unsigned long> > > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&)#2}&, unsigned long const*) () from /home/kongf/workhome/sawtooth/moosemvap/scripts/../libmesh/installed/lib/libmesh_opt.so.0
#8  0x00002aaaae951032 in void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::Parallel::Communicator const&, libMesh::BoundingBox const&, libMesh::MeshBase::element_iterator const&, libMesh::MeshBase::element_iterator const&, std::vector<unsigned long, std::allocator<unsigned long> >&) const () from /home/kongf/workhome/sawtooth/moosemvap/scripts/../libmesh/installed/lib/libmesh_opt.so.0
#9  0x00002aaaaeb402e8 in libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&, unsigned int) () from /home/kongf/workhome/sawtooth/moosemvap/scripts/../libmesh/installed/lib/libmesh_opt.so.0
#10 0x00002aaaaeb4d452 in libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from /home/kongf/workhome/sawtooth/moosemvap/scripts/../libmesh/installed/lib/libmesh_opt.so.0
#11 0x00002aaaae8fa3f4 in libMesh::MeshBase::prepare_for_use() () from /home/kongf/workhome/sawtooth/moosemvap/scripts/../libmesh/installed/lib/libmesh_opt.so.0
#12 0x00002aaaacdd6153 in FileMeshGenerator::generate() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#13 0x00002aaaacdd35af in MeshGenerator::generateInternal() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#14 0x00002aaaad3e8408 in MooseApp::executeMeshGenerators() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#15 0x00002aaaac8c0d0d in Action::timedAct() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#16 0x00002aaaac8cc2f1 in ActionWarehouse::executeActionsWithAction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#17 0x00002aaaac8ce627 in ActionWarehouse::executeAllActions() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#18 0x00002aaaad3d8ad2 in MooseApp::runInputFile() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#19 0x00002aaaac49b3f6 in MultiApp::createApp(unsigned int, double) () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#20 0x00002aaaac49c554 in MultiApp::createApps() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#21 0x00002aaaacb0b793 in FEProblemBase::addMultiApp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, InputParameters&) () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#22 0x00002aaaac8c0d0d in Action::timedAct() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#23 0x00002aaaac8cc2f1 in ActionWarehouse::executeActionsWithAction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#24 0x00002aaaac8ce5a9 in ActionWarehouse::executeAllActions() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#25 0x00002aaaad3d8ad2 in MooseApp::runInputFile() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#26 0x00002aaaad3dd274 in MooseApp::run() () from /home/kongf/workhome/sawtooth/moosemvap/framework/libmoose-opt.so.0
#27 0x000000000040241e in main ()

Steps to Reproduce

This can be reproduced on Sawtooth.

1) module load use.moose PETSc

2) git clone moose

3) build libmesh: ./scripts/update_and_rebuild_libmesh.sh

4) compile MOOSE: make -j24

5) run the following on an interactive node: mpirun -n 44 ../moose_test-opt -i master.i

The input files are too large to attach here. Please contact the ticket creator for the input files that reproduce the issue.

Impact

Users cannot run large-scale simulations on INL HPC.

roystgnr commented 2 years ago

We're fairly confident we've hit bugs in MVAPICH before. Nothing that manifests exactly like this ... but I would absolutely love to know whether the same code runs or fails on Sawtooth with, say, OpenMPI.

fdkong commented 2 years ago

The test has the problem when using mvapich2/2.3.3-gcc-9.2.0.

Occasionally, the same test ran just fine with mvapich2/2.3.5-gcc-8.4.0.

We need to dig into MPI to figure out the root cause. Then we could improve parallel_vector_push/pull by using more robust MPI APIs.

roystgnr commented 2 years ago

There are such things as less robust MPI APIs, but they're not the problem here. Either we've got some kind of race condition or mvapich has some kind of bug. From what @friedmud was telling us about leaks, I suspect it's the latter.

But at this point it seems clear we can't just hope to escape mostly unscathed. Either we need to demonstrate that it's a real implementation problem and pressure them for a fix, or we need to switch to a different MPI implementation (this might not be a good option; IIRC there are some fabrics where MVAPICH is at least twice as fast as competitors), or we need to switch to APIs that aren't subject to the leaks (this might be the worst option of all on large runs; our old implementation here was asymptotically slower as the number of processors increased).

fdkong commented 2 years ago

we need to switch to a different MPI implementation (this might not be a good option; IIRC there are some fabrics where MVAPICH is at least twice as fast as competitors)

I would like MOOSE to work with all reasonable MPI implementations.

At this point, I still think the bug(s) could come from MOOSE, TIMPI, or MPI. I do not want to say it is the fault of MVAPICH because a ton of other packages are working well on MVAPICH.

But anyway, we will figure that out once we dig to the bottom of it.

roystgnr commented 2 years ago

The most relevant parts of a Slack discussion from a few months ago:


Derek Gaston:

The "eager sends" ARE messed up in mvapich - and they refuse to fix it (I've been talking to them for a year about it)... it could easily run out of eager send memory. We may want to try disabling eager sends to see if that fixes it.

Roy Stogner:

That sounds almost too good to be true if it works ... but I am seeing a lot of MPIDI_CH3_PKT_EAGER_SEND in those error messages. And "Recv desc error" and "completion with error" sure sound like they're coming from a lower level than us.

Derek Gaston:

Yes - it could literally be the eager send pool running out of device memory.

(On the infiniband cards)

Basically: mvapich is trying to do too many eager sends (it shouldn't do any since we're doing asynchronous anyway!)


Derek Gaston:

This may take some tinkering. Looking at the messages that were failing to receive correctly - it looks like they may be just below the default eager cutoff (which is normally 12KB). What we could try to do is lower the eager limit using:

export MV2_IBA_EAGER_THRESHOLD=2k
export MV2_VBUF_TOTAL_SIZE=2k

You will want those in your launch script.

To give some idea of what's going on: mvapich attempts to send really small messages via a different communication algorithm called the "eager" algorithm. The way it works is that it copies your data into an internal buffer (on the infiniband card if it can) and then tries to push the data. As soon as the copy is done, mvapich reports back to the program that the send is complete (even if it's not). This means that a program can stack up millions of tiny messages to send without knowing that they are just piling up and not getting sent.

This whole idea is antiquated... and came about because it "sped up" old programs that use blocking MPI calls. By acting like the data is "sent", MPI can let the program continue to operate, somewhat overlapping communication and computation (and gaining efficiency). But since we do everything asynchronously anyway, we just end up with a whole bunch of unnecessary copying and tiny messages filling up memory.

So: if we lower the limit, then only really tiny messages would be sent this way. We could set the above to 0k to turn off eager sends altogether. I have no idea what that would do to the overall efficiency of our code (I'm sure there are still many places where we benefit from eager sends in libMesh/PETSc/Hypre).
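For illustration, a minimal, self-contained sketch of the communication pattern being described: many small nonblocking sends posted up front, before any matching receive completes. This is not the gist mentioned below; the message size and count are arbitrary illustrative choices.

#include <mpi.h>
#include <vector>

// Sketch only: post many tiny MPI_Isend's before any receive completes.
// An eager protocol copies each payload into an internal buffer and marks
// the request complete right away, so nothing here reveals that those
// internal buffers are piling up.
int main(int argc, char ** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n_msgs = 10000;    // many messages per destination (arbitrary)
  const int msg_bytes = 1024;  // well under a ~12KB eager cutoff (arbitrary)
  std::vector<char> payload(msg_bytes);
  std::vector<MPI_Request> requests;

  for (int dest = 0; dest < size; ++dest)
    if (dest != rank)
      for (int i = 0; i < n_msgs; ++i)
      {
        requests.emplace_back();
        MPI_Isend(payload.data(), msg_bytes, MPI_CHAR, dest, /*tag=*/0,
                  MPI_COMM_WORLD, &requests.back());
      }

  // Matching receives, posted only after every send is already "in flight".
  std::vector<char> incoming(msg_bytes);
  for (int src = 0; src < size; ++src)
    if (src != rank)
      for (int i = 0; i < n_msgs; ++i)
        MPI_Recv(incoming.data(), msg_bytes, MPI_CHAR, src, /*tag=*/0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
              MPI_STATUSES_IGNORE);
  MPI_Finalize();
  return 0;
}

If the eager buffers backing those sends are never recycled, memory use scales with the total number of isends issued over the run rather than with the number outstanding at any one time.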


Derek Gaston:

Ok - so this is about mvapich running out of vbuf memory because of the way they do eager sends and "leak" the buffer memory (It's not actually leaking... it's just never freed until the end of the simulation). We need to try OpenMPI (or some other MPI) to double check that it's fine. Like I say - I've been talking to the mvapich people about this for nearly a year. I even created a simple gist that shows the issue: https://gist.github.com/friedmud/9533d5997f06414c25f8c5c57a1eaf37


Derek Gaston:

They have a vbuf pool - that they continually "check out" memory from... but never return it to.

It is only returned during MPI_Finalize.

(So, it is not leaked... it is kept track of... it's just not returned to the pool and not reused)
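In other words, the behavior being described amounts to roughly the following toy pattern (a conceptual sketch only, not MVAPICH code): a pool whose checkout path always allocates and whose buffers only come back at finalize.

#include <cstddef>
#include <vector>

// Toy illustration of the described vbuf behavior: buffers are "checked out"
// for each eager isend but are only handed back in finalize(), so the pool's
// footprint grows monotonically over the life of the run.
class ToyVbufPool
{
public:
  char * checkout(std::size_t bytes)
  {
    checked_out_.emplace_back(bytes);   // always a fresh allocation
    return checked_out_.back().data();  // never recycled below
  }

  std::size_t bytes_held() const
  {
    std::size_t total = 0;
    for (const auto & buf : checked_out_)
      total += buf.size();
    return total;
  }

  // The only place memory is ever released -- analogous to MPI_Finalize.
  void finalize() { checked_out_.clear(); }

private:
  std::vector<std::vector<char>> checked_out_;
};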

Roy Stogner:

Ha!

"Not returned and not reused" is the definition of "leaked", to me. Doing it on purpose is an aggravating factor, not a mitigating one.

Derek Gaston:

The reason that no one else sees this is because they are not doing completely asynchronous algorithms like we are. I guess I should clarify that this only happens with isends. With regular sends - the eager memory is "freed"

Roy Stogner:

You can demonstrate this in about a hundred lines of code and they've still dragged their feet on fixing it for a year?

That's awful.

Derek Gaston:

They continue to deny there is a problem

(and drag their feet - sometimes months between communications - and they took the conversation off of their regular user list because they said it was generating too much noise)

So: let's try using a different MPI implementation and see if that clears it up.

Roy Stogner:

There's got to be a better way to frame this for the mvapich people... Instead of a single batch of sends, would it be possible for your test code to send batch after batch in such a way that other MPI implementations can run indefinitely but theirs will eventually run out of pool memory and die?

Derek Gaston:

I mean - just put a loop around my code

Roy Stogner:

And that'd be all it would take to do it? They're just never reusing that memory? I can come up with no words that would be both politic and accurate here.
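A "batch after batch" version of that stress test could look roughly like the sketch below (same caveats: sizes, counts, and the neighbor-exchange pattern are arbitrary illustrative choices). Every request completes before the next iteration, so a correct implementation can reuse its internal buffers indefinitely, while an implementation that never recycles its eager buffers keeps growing.

#include <mpi.h>
#include <vector>

int main(int argc, char ** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int peer = (rank + 1) % size;         // send to the next rank
  const int from = (rank + size - 1) % size;  // receive from the previous one
  const int msgs_per_batch = 1000;            // arbitrary
  const int msg_bytes = 1024;                 // arbitrary, below eager cutoff
  std::vector<char> out(msg_bytes);
  std::vector<char> in(static_cast<std::size_t>(msgs_per_batch) * msg_bytes);

  for (int batch = 0; batch < 100000; ++batch)  // raise until something breaks
  {
    std::vector<MPI_Request> reqs;
    for (int i = 0; i < msgs_per_batch; ++i)
    {
      reqs.emplace_back();
      MPI_Irecv(in.data() + i * msg_bytes, msg_bytes, MPI_CHAR, from,
                /*tag=*/0, MPI_COMM_WORLD, &reqs.back());
      reqs.emplace_back();
      MPI_Isend(out.data(), msg_bytes, MPI_CHAR, peer, /*tag=*/0,
                MPI_COMM_WORLD, &reqs.back());
    }
    // Every send and receive from this batch finishes here, so nothing
    // *should* accumulate from one iteration to the next.
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(),
                MPI_STATUSES_IGNORE);
  }

  MPI_Finalize();
  return 0;
}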


Larry Aagesen:

So I found that with more memory per core, both the OpenMPI run and a run with the buffer size set to 128k ran all the way to completion.

roystgnr commented 2 years ago

So based on that,

I would like MOOSE to work with all reasonable MPI implementations.

I would too. But I don't think "if you keep using asynchronous I/O indefinitely you will eventually die" qualifies as reasonable.

At this point, I still think the bug(s) could come from MOOSE, TIMPI, or MPI

I certainly wouldn't be surprised if we had bugs in addition. But this seems to be a bug not in MPI, but in that one particular MPI implementation.

I do not want to say it is the fault of MVAPICH because a ton of other packages are working well on MVAPICH.

That's the right prior to have, isn't it? But take a look at Derek's gist. Just a couple hundred lines, no obvious bugs.

roystgnr commented 2 years ago

Your test case runs fine for me in both opt and dbg modes. I'm using the newest libMesh, though, and I'm trying it locally rather than on Sawtooth.

maeneas commented 2 years ago

Trying on Sawtooth now...

maeneas commented 2 years ago

Duplicated the issue on Sawtooth.

maeneas commented 2 years ago

Setting MV2_IBA_EAGER_THRESHOLD=2k and MV2_VBUF_TOTAL_SIZE=2k has no impact on the issue on Sawtooth.

maeneas commented 2 years ago

The MVAPICH team is looking into this now.

friedmud commented 2 years ago

Relatedly... see if you can get them to fix the memory "leak" they have with isend. See my gist here: https://gist.github.com/friedmud/9533d5997f06414c25f8c5c57a1eaf37

It's not a true leak because the memory is not "lost" - it's just that a new temporary buffer gets created for each isend and doesn't get released until MPI_Finalize.

I tried to talk to them about it but couldn't get them to do anything about it.

friedmud commented 2 years ago

I believe that the above buffer issue is the root of the problem we're seeing here...

permcody commented 2 years ago

@maeneas - Do we have TotalView available these days for MPI debugging? I'm not sure if that would help here or not, but we could support purchasing licenses if it makes sense.

maeneas commented 2 years ago

export MV2_USE_SHARED_MEM=0 solves the hang problem. Still investigating further.

maeneas commented 2 years ago

We don't have any TotalView licenses available anymore. I just debug with gdb anyway.

maeneas commented 2 years ago

Adjusting MV2_SMP_SEND_BUF_SIZE has some impact on whether this bug shows up. Setting MV2_USE_SHARED_MEM=0 solves the problem uniformly in all tests but probably negatively impacts performance.

permcody commented 1 year ago

@roystgnr, @friedmud - This is the same issue that Patrick hit last week. We could try these workarounds again. Also @loganharbour