jessr92 / LES-WRF-MPI

Exploring coupling the distributed Large Eddy Simulation and Weather Research and Forecasting model

MPI Send/Recv Issues #1

Closed jessr92 closed 9 years ago

jessr92 commented 9 years ago

The errors are:

[Gordons-MacBook-Pro:70529] *** An error occurred in MPI_Wait
[Gordons-MacBook-Pro:70529] *** reported by process [140333196050433,140733193388037]
[Gordons-MacBook-Pro:70529] *** on communicator MPI_COMM_WORLD
[Gordons-MacBook-Pro:70529] *** MPI_ERR_TRUNCATE: message truncated
[Gordons-MacBook-Pro:70529] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Gordons-MacBook-Pro:70529] ***    and potentially your MPI job)

on a Mac, and on Linux:

Fatal error in MPI_Irecv: Message truncated, error stack:
MPI_Irecv(148)......................: MPI_Irecv(buf=0x7fff26356c90, count=3496, MPI_REAL, src=7, tag=1, MPI_COMM_WORLD, request=0x7fff2637c698) failed
MPIDI_CH3U_Request_unpack_uebuf(605): Message truncated; 14136 bytes received but buffer size is 13984
Fatal error in MPI_Irecv: Message truncated, error stack:
MPI_Irecv(148)......................: MPI_Irecv(buf=0x7fff376235a0, count=3496, MPI_REAL, src=6, tag=1, MPI_COMM_WORLD, request=0x7fff37648fa8) failed
MPIDI_CH3U_Request_unpack_uebuf(605): Message truncated; 14136 bytes received but buffer size is 13984
jessr92 commented 9 years ago

Looks like the non-blocking send/receive and wait code for the halo exchange is slightly off: some processes finish the exchange for, say, array v and move on to array w before their neighbours have. Array w is a slightly different size, so the messages get mixed up and a message meant for w is matched by a receive still waiting on v. The Linux log fits this: the receive expects count=3496 MPI_REAL (13984 bytes) but 14136 bytes (3534 reals) arrive, i.e. a halo for a slightly larger array. It also seems quite odd that it doesn't happen all of the time, which points to a race between the exchanges rather than a deterministic bug.
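For reference, a minimal sketch of a halo exchange that avoids this kind of cross-matching: each array gets its own tag, and the exchange is completed with MPI_Waitall before the next array's exchange starts. This is not the repository's code; the subroutine name, argument names, and the flattened | bottom halo | interior | top halo | layout are assumptions for illustration.

```fortran
! Hypothetical sketch, not the actual LES-WRF-MPI halo code.
! Assumes fieldSize >= 3*haloSize and the flattened layout described above.
subroutine exchangeTopBottomHalo(field, fieldSize, haloSize, topNeighbour, &
                                 bottomNeighbour, arrayTag, communicator)
    use mpi
    implicit none
    integer, intent(in) :: fieldSize, haloSize, topNeighbour, bottomNeighbour
    integer, intent(in) :: arrayTag, communicator
    real, intent(inout) :: field(fieldSize)
    integer :: requests(4), ierror

    ! Post the receives first, then the sends. MPI_PROC_NULL can be passed as
    ! either neighbour so the same call works on edge processes.
    call MPI_Irecv(field(1), haloSize, MPI_REAL, bottomNeighbour, arrayTag, &
                   communicator, requests(1), ierror)
    call MPI_Irecv(field(fieldSize-haloSize+1), haloSize, MPI_REAL, topNeighbour, &
                   arrayTag, communicator, requests(2), ierror)
    call MPI_Isend(field(haloSize+1), haloSize, MPI_REAL, bottomNeighbour, &
                   arrayTag, communicator, requests(3), ierror)
    call MPI_Isend(field(fieldSize-2*haloSize+1), haloSize, MPI_REAL, topNeighbour, &
                   arrayTag, communicator, requests(4), ierror)

    ! Completing all four requests here, with a tag unique to this array,
    ! means a late message for the next array can never be matched by one of
    ! these receives.
    call MPI_Waitall(4, requests, MPI_STATUSES_IGNORE, ierror)
end subroutine exchangeTopBottomHalo
```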

jessr92 commented 9 years ago

If the topSend/bottomRecv and topRecv/bottomSend code is commented out, the code works for PROC_PER_COL > 3 (tried PROC_PER_COL=4 with PROC_PER_ROW=2).

jessr92 commented 9 years ago

Renamed the issue since I've found how to trigger it with any values of PROC_PER_COL and PROC_PER_ROW.

jessr92 commented 9 years ago

Seems to have been fixed... investigating...

jessr92 commented 9 years ago

Closing for now... row/col combinations 4/1, 1/4, and 2/2 show no send/recv issues. Will test larger process counts once togian becomes free.

jessr92 commented 9 years ago

Reopening since there seem to be other issues. Ranks 0 to 5 seem to be doing the halo exchange for one array (say p in boundp2) whereas ranks 6 and 7 seem to be doing the halo exchange for f, which doesn't make sense given the MPI_Barrier() calls...
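One thing worth noting, as a sketch rather than the repository's code: MPI_Barrier only synchronises when ranks pass the call; it neither completes outstanding non-blocking requests nor affects message matching, which is purely by (communicator, source, tag). So a receive posted for p that is still pending after a barrier can match a send for f that reuses the same tag, and the size difference then shows up as MPI_ERR_TRUNCATE. The subroutine and variable names below are made up.

```fortran
! Illustration of the hazard only, not code from the repository.
subroutine barrierMatchingHazard(pHalo, pCount, fHalo, fCount, neighbour, sharedTag)
    use mpi
    implicit none
    integer, intent(in) :: pCount, fCount, neighbour, sharedTag
    real, intent(inout) :: pHalo(pCount)
    real, intent(in)    :: fHalo(fCount)
    integer :: pRequest, fRequest, ierror

    ! Receive for p's halo, not yet completed.
    call MPI_Irecv(pHalo, pCount, MPI_REAL, neighbour, sharedTag, &
                   MPI_COMM_WORLD, pRequest, ierror)

    ! The barrier orders when ranks pass this line, nothing more; pRequest is
    ! still pending afterwards and will match the next message from
    ! 'neighbour' carrying 'sharedTag', whichever array it belongs to.
    call MPI_Barrier(MPI_COMM_WORLD, ierror)

    ! If the neighbour runs the same code, its f message can be matched by
    ! the pending p receive above; when fCount > pCount that is exactly the
    ! "message truncated" failure in the logs.
    call MPI_Isend(fHalo, fCount, MPI_REAL, neighbour, sharedTag, &
                   MPI_COMM_WORLD, fRequest, ierror)

    call MPI_Wait(pRequest, MPI_STATUS_IGNORE, ierror)   ! may get f's data, or truncate
    call MPI_Wait(fRequest, MPI_STATUS_IGNORE, ierror)
end subroutine barrierMatchingHazard
```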

jessr92 commented 9 years ago

This is related to the logic (one if statement and the do loops) in press.f95 that calls the boundary subroutines. The boundary routines in press are disabled for now; a sketch of the suspected pitfall follows below.
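Hypothetical sketch of that pitfall: if the if statement or loop bounds around the call site differ between ranks, only some ranks enter the routine that communicates, and the rest move on to the next array's exchange. Keeping the communication unconditional and guarding only the purely local physical-boundary work inside the routine keeps every rank's sequence of MPI calls identical. The subroutine, the token one-element exchange, and the names (applyPressureBoundary, onGlobalBoundary) are invented for illustration, not taken from press.f95.

```fortran
! Hypothetical sketch, not the real press.f95 logic.
subroutine applyPressureBoundary(p, pSize, onGlobalBoundary, neighbour, tag)
    use mpi
    implicit none
    integer, intent(in) :: pSize, neighbour, tag
    real, intent(inout) :: p(pSize)
    logical, intent(in) :: onGlobalBoundary
    integer :: requests(2), ierror

    ! Token one-element exchange standing in for the real halo swap: executed
    ! on EVERY rank, never hidden behind a rank-dependent if at the call site.
    call MPI_Irecv(p(pSize), 1, MPI_REAL, neighbour, tag, MPI_COMM_WORLD, requests(1), ierror)
    call MPI_Isend(p(1),     1, MPI_REAL, neighbour, tag, MPI_COMM_WORLD, requests(2), ierror)
    call MPI_Waitall(2, requests, MPI_STATUSES_IGNORE, ierror)

    ! Purely local physical-boundary work: safe to guard per rank, since it
    ! involves no communication.
    if (onGlobalBoundary) p(1) = 0.0
end subroutine applyPressureBoundary
```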