su2code / SU2

SU2: An Open-Source Suite for Multiphysics Simulation and Design
https://su2code.github.io

MPI Failure when Running with 16+ Cores #550

Closed: clarkpede closed this issue 5 years ago

clarkpede commented 6 years ago

I've recently run into a problem with periodic geometry when I run a RANS problem on 16 cores or more (256+ MPI tasks). While initializing the Jacobian structure for the turbulence model, I run into one of two errors, depending on the core count.

The first error aborts the job with the following message:

Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(249).................: MPI_Sendrecv(sbuf=0x2ee74f0, scount=10, MPI_DOUBLE, dest=19, stag=0, rbuf=0x2ee68e0, rcount=385, MPI_
MPIDI_CH3U_Receive_data_found(144): Message from rank 25 and tag 0 truncated; 3200 bytes received but buffer size is 3080
aborting job
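
For readers decoding the message above: the byte counts can be converted back to element counts to see how far apart the two sides are. The following is only a minimal arithmetic sketch, assuming 8-byte doubles; the helper name `truncation_report` is made up for illustration and is not part of SU2 or MPI.

```python
# Assumption: one MPI_DOUBLE occupies 8 bytes, as on all common platforms.
BYTES_PER_DOUBLE = 8


def truncation_report(bytes_received: int, buffer_bytes: int) -> dict:
    """Convert the byte counts from an MPI truncation message to element counts."""
    sent = bytes_received // BYTES_PER_DOUBLE      # elements the sender shipped
    expected = buffer_bytes // BYTES_PER_DOUBLE    # elements the receiver allowed for
    return {"sent": sent, "expected": expected, "excess": sent - expected}


# Values taken from the error message: 3200 bytes received, 3080-byte buffer.
report = truncation_report(3200, 3080)
print(report)  # {'sent': 400, 'expected': 385, 'excess': 15}
```

The 385 matches `rcount=385` in the error stack, so rank 25 sent 15 more doubles than the receiving rank had sized its buffer for.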

The second error causes the solver to hang indefinitely at the "Initialize Jacobian structure (SA model)" step. I'm guessing that an MPI send/receive is left dangling.
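
Both symptoms would be consistent with the ranks disagreeing about who exchanges how much data. As a rough illustration only, here is a pure-Python sketch of an offline consistency check over a communication plan. The dictionaries, the rank numbers other than 25 and 19, and the function `check_exchange` are all hypothetical; this is not how SU2 stores or validates its periodic exchange.

```python
def check_exchange(sends, recvs):
    """Compare planned sends against planned receives.

    sends maps (src, dst) -> number of elements src will send to dst.
    recvs maps (dst, src) -> number of elements dst expects from src.
    Returns a list of (kind, src, dst, sent, expected) tuples.
    """
    problems = []
    for (src, dst), sent in sorted(sends.items()):
        expected = recvs.get((dst, src))
        if expected is None:
            # Nobody is listening for this message.
            problems.append(("unmatched send", src, dst, sent, None))
        elif sent > expected:
            # MPI allows a shorter message than the buffer, but not a longer one.
            problems.append(("truncation", src, dst, sent, expected))
    for (dst, src), expected in sorted(recvs.items()):
        if (src, dst) not in sends:
            # A blocking receive with no sender never returns: a hang.
            problems.append(("unmatched receive", src, dst, None, expected))
    return problems


# Hypothetical plan: rank 4 stands in for the rank that reported the error.
sends = {(25, 4): 400, (4, 19): 10}
recvs = {(4, 25): 385, (19, 4): 10, (7, 3): 12}

for problem in check_exchange(sends, recvs):
    print(problem)
```

In this made-up plan, the 400-versus-385 pair reproduces the truncation, and the receive on rank 7 that no rank sends to is the kind of dangling operation that would block forever.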

I have not seen these problems at lower core counts (2-4 cores with 2-32 MPI tasks).

The errors seem to be tied to the way the periodic send/receives are set up. If I change the periodic boundaries to far-field boundaries, the error vanishes.

I've also done a lot of work to weed out possible causes:

I've got a minimal example that you can use to test this for yourself, in the attached files. It should be self-explanatory.

MPI_Failure_Example.tar.gz

economon commented 6 years ago

@clarkpede : Thank you for the detailed post, and your hunch is likely correct. Unfortunately, the existing periodic BC implementation has some limitations due to how closely it is coupled to the old mesh partitioning routines during setup.

Those original partitioning routines had become difficult to maintain or extend (even quick fixes were very hard), so they were rewritten from scratch in PR #513. Now, the periodic BC is also being rewritten cleanly (hopefully for the last time :) ). A prototype that already works for Euler problems can be seen in feature_periodic, and the rest is in progress. I am aware that several folks need this, so please know that a new version is coming.

clarkpede commented 6 years ago

Alright. Thanks for the update. I'll wait to close this issue until the periodic BCs are fixed.

economon commented 5 years ago

Resolved by #652. @clarkpede : if you still find any issues with the new implementation, do not hesitate to open a new issue.