The source code...
FYI: The last working commit was on May 23: 4743a00. Eric
Thanks, @ericch1. I don't think the commit you pointed to above is the culprit. However, my somewhat large PR (#3622) got merged in that day, which is the more likely culprit. I'll try to reproduce the error on our end and report back.
Thanks @pavanbalaji. You're right: the commit 60035bb probably has nothing to do with the bug, but it is the one that our nightly validations tested on May 24... Also, the commit 4743a00 was tested the day before and was good.
So the problem lies somewhere in between... but it still remains...
Anyway, I hope it is nothing too difficult to find now...
I added a verification at the end of the program:
it prints all the errors, and the total number of errors encountered is used as the return value (a sketch of such a check appears after the log below)...
So it can be used as a kind of validation test... Now a successful run prints nothing and returns "0", but master is giving:
mpirun -n 2 bugmaster
ERROR at element #34, node 0: expected: 61 != received: 52
ERROR at element #34, node 1: expected: 105 != received: 76
ERROR at element #34, node 2: expected: 106 != received: 104
ERROR at element #34, node 3: expected: 201 != received: 74
ERROR at element #34, node 4: expected: 208 != received: 197
ERROR at element #34, node 5: expected: 205 != received: 195
ERROR at element #35, node 0: expected: 40 != received: 62
ERROR at element #35, node 2: expected: 107 != received: 65
ERROR at element #35, node 3: expected: 200 != received: 202
ERROR at element #35, node 4: expected: 212 != received: 203
ERROR at element #35, node 5: expected: 209 != received: 40
ERROR at element #36, node 0: expected: 62 != received: 63
ERROR at element #36, node 1: expected: 107 != received: 66
ERROR at element #36, node 2: expected: 105 != received: 106
ERROR at element #36, node 3: expected: 210 != received: 41
ERROR at element #36, node 4: expected: 212 != received: 207
ERROR at element #36, node 5: expected: 202 != received: 206
ERROR at element #37, node 1: expected: 108 != received: 106
ERROR at element #37, node 2: expected: 106 != received: 105
ERROR at element #37, node 3: expected: 213 != received: 204
ERROR at element #37, node 4: expected: 216 != received: 208
ERROR at element #37, node 5: expected: 204 != received: 200
echo $?
22
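For illustration, such an end-of-run check could look like the minimal sketch below; the function name, array shapes, and sample data are hypothetical, not the actual code attached to the issue.

#include <stdio.h>

/* Hypothetical sketch of the end-of-run verification described above:
 * compare each received node value against the expected one, print one
 * line per mismatch, and use the mismatch count as the process return
 * value, so "echo $?" doubles as a pass/fail signal (0 = success). */
static int verify(const long *expected, const long *received,
                  int num_elements, int nodes_per_element)
{
    int errors = 0;
    for (int e = 0; e < num_elements; ++e) {
        for (int n = 0; n < nodes_per_element; ++n) {
            int i = e * nodes_per_element + n;
            if (expected[i] != received[i]) {
                printf("ERROR at element #%d, node %d: expected: %ld != received: %ld\n",
                       e, n, expected[i], received[i]);
                ++errors;
            }
        }
    }
    return errors;
}

int main(void)
{
    long expected[] = {61, 105, 106};
    long received[] = {61, 999, 106};
    return verify(expected, received, 1, 3); /* exits with 1: one mismatch */
}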
Just added the faulty rank also... updated the file above...
@ericch1 I just tested your example with the latest master (029f843ede8a5c0b6877e62f87c30c3d5fba2938), and it went fine. Could you provide your config options (try head config.log)?
That's strange! Here is my config.log: http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2019.06.07.05h41m16s_config.log
The most important part:
./configure --prefix=/opt/mpich-3.x_debug --enable-debuginfo --enable-g=dbg,meminit CPPFLAGS=-I/usr/include/valgrind --with-device=ch3:sock --enable-romio
I confirmed the bug with --with-device=ch3:sock.
I narrowed the example down to:
#include "mpi.h"
#include "stdio.h"
#include "stdlib.h"
// #define SHOWBUG
int main(int argc, char *argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Datatype MPI_TYPE_XXX;
#ifdef SHOWBUG
MPI_Type_contiguous(1, MPI_LONG, &MPI_TYPE_XXX);
MPI_Type_commit(&MPI_TYPE_XXX);
#else
MPI_TYPE_XXX = MPI_LONG;
#endif
int tag = 1;
if (rank==0){
long send_buf[121];
for(int i=0;i<121;i++){
send_buf[i]=(long)i;
}
MPI_Datatype send_type;
int cnt_1[] = { 1,1,1,1,1,1,2,2,2,2,1,1,4,6,7,5,5,5,5,5};
int off_1[] = { 1,3,5,7,9,11,14,18,22,26,32,35,40,50,62,70,82,93,103,113};
MPI_Type_indexed(20,cnt_1,off_1,MPI_TYPE_XXX,&send_type);
MPI_Type_commit(&send_type);
MPI_Send(send_buf, 1, send_type, 1, tag, MPI_COMM_WORLD);
}
else{
MPI_Recv(recv_buf, 58, MPI_TYPE_XXX, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for (int i = 33; i<39; ++i) {
printf("Rank 1: Element # %d : %ld\n",i, recv_buf[i]);
}
}
MPI_Finalize();
return 0;
}
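For convenience, the two paths can also be toggled from the command line instead of editing the source (the file name here is hypothetical):
mpicc -DSHOWBUG -o showbug showbug.c
mpirun -n 2 ./showbug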
With the SHOWBUG macro commented out, the example uses MPI_LONG directly to build the indexed type and gets the correct answer: 70, 71, 72, 73, 74, 82, where 70-74 come from the block of 5 at displacement 70.
With the comment removed, it uses a derived type that is semantically identical to MPI_LONG, and the error shows up: 70, 67, 68, 69, 70, 82, where 67, 68, 69, 70 come from the wrong block.
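To make the expected mapping explicit, here is a small stand-alone sketch (not part of the reproducer) that expands the (count, displacement) blocks of the indexed type into the flat list of send-buffer indices, confirming that receive elements 33-38 should read 70, 71, 72, 73, 74, 82:

#include <stdio.h>

/* Expand the indexed type's (count, displacement) blocks by hand into the
 * sequence of send-buffer indices that rank 1 receives in order. This is
 * just the definition of MPI_Type_indexed applied manually. */
int main(void)
{
    int cnt[] = {1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 4, 6, 7, 5, 5, 5, 5, 5};
    int off[] = {1, 3, 5, 7, 9, 11, 14, 18, 22, 26, 32, 35, 40, 50, 62, 70, 82, 93, 103, 113};
    int flat[58];
    int k = 0;

    for (int b = 0; b < 20; ++b)        /* expand each block */
        for (int j = 0; j < cnt[b]; ++j)
            flat[k++] = off[b] + j;

    /* The elements printed by the reproducer (33..38). */
    for (int i = 33; i < 39; ++i)
        printf("recv_buf[%d] comes from send_buf[%d]\n", i, flat[i]);
    return 0;
}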
EDIT: found the culprit: MPL_IOV_LIMIT is 16, which causes MPIR_Typerep_pack to run twice: first 15 IOV entries (plus the header), then the rest. MPII_Segment_seek (the bug is in MPII_Segment_manipulate) didn't leave segp at a clean boundary.
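For readers unfamiliar with segment packing, the toy sketch below (plain C, not MPICH source) illustrates the failure mode: a non-contiguous layout packed in two passes, where the second pass must resume exactly at the element where the first stopped. If the resume position is off by even one element, the receiver sees values from a neighboring block, the same symptom as in the logs above.

#include <stdio.h>

/* Pack up to `limit` elements of a flattened non-contiguous layout,
 * starting at element `start`; return the position where the next pass
 * must resume. A seek that restores the wrong position (the analogue of
 * the MPII_Segment_seek bug) would duplicate or skip elements. */
static int pack_pass(const long *src, const int *flat, int total,
                     int start, int limit, long *dst)
{
    int n = 0;
    while (start + n < total && n < limit) {
        dst[start + n] = src[flat[start + n]]; /* copy one element */
        ++n;
    }
    return start + n; /* resume point for the next pass */
}

int main(void)
{
    long src[16];
    for (int i = 0; i < 16; ++i) src[i] = i;
    int flat[] = {1, 3, 5, 7, 9, 11, 13, 15}; /* toy indexed layout */
    long dst[8];

    int pos = pack_pass(src, flat, 8, 0, 5, dst); /* first pass: 5 elems */
    pack_pass(src, flat, 8, pos, 5, dst);         /* resume exactly at pos */
    /* Resuming at pos - 1 instead would re-pack one element and shift
     * the rest, corrupting the stream just like the report above. */
    for (int i = 0; i < 8; ++i) printf("%ld ", dst[i]);
    printf("\n");
    return 0;
}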
Hi, everything is fine on mpich/master for us this morning! Thanks a lot! :) Eric
@ericch1 Thanks for your report!
As reported on the list: Hi,
I have worked to reproduce the bug in a reasonably small example.
On any good MPI version, the output of the attached example is:
Rank 1: Element # 33 : {40,106,105,204,208,200},8
Rank 1: Element # 34 : {61,105,106,201,208,205},8
Rank 1: Element # 35 : {40,105,107,200,212,209},8
Rank 1: Element # 36 : {62,107,105,210,212,202},8
Rank 1: Element # 37 : {40,108,106,213,216,204},8
Rank 1: Element # 38 : {42,110,85,223,227,108},8
On the mpich/master branch, since May 24, commit 60035bb7b9766db15041033959e4209f0ad936b, the output is:
Rank 1: Element # 33 : {40,106,105,204,208,200},8
Rank 1: Element # 34 : {52,76,104,74,197,195},8
Rank 1: Element # 35 : {62,105,65,202,203,40},8
Rank 1: Element # 36 : {63,66,106,41,207,206},8
Rank 1: Element # 37 : {40,106,105,204,208,200},8
Rank 1: Element # 38 : {42,110,85,223,227,108},8
I hope you will be able to work with my example to reproduce and solve this problem!
Compile & run with:
mpic++ -o bugmaster mpich_bug_indexed_type.cc mpirun -n 2 bugmaster
Thanks a lot,
Eric