pmodels / mpich

Official MPICH Repository
http://www.mpich.org

Something went wrong since the May 24 commit 60035bb with MPI_Type_indexed #3830

Closed ericch1 closed 5 years ago

ericch1 commented 5 years ago

As reported on the list: Hi,

I have worked to reproduce the bug in a somewhat small example.

On any good MPI version, the output of the attached example is:

Rank 1: Element # 33 : {40,106,105,204,208,200},8
Rank 1: Element # 34 : {61,105,106,201,208,205},8
Rank 1: Element # 35 : {40,105,107,200,212,209},8
Rank 1: Element # 36 : {62,107,105,210,212,202},8
Rank 1: Element # 37 : {40,108,106,213,216,204},8
Rank 1: Element # 38 : {42,110,85,223,227,108},8

On the mpich/master branch, since May 24 (commit 60035bb7b9766db15041033959e4209f0ad936b), the output is:

Rank 1: Element # 33 : {40,106,105,204,208,200},8
Rank 1: Element # 34 : {52,76,104,74,197,195},8
Rank 1: Element # 35 : {62,105,65,202,203,40},8
Rank 1: Element # 36 : {63,66,106,41,207,206},8
Rank 1: Element # 37 : {40,106,105,204,208,200},8
Rank 1: Element # 38 : {42,110,85,223,227,108},8

I hope you will be able to use my example to reproduce and solve this problem!

Compile & run with:

mpic++ -o bugmaster mpich_bug_indexed_type.cc
mpirun -n 2 bugmaster

Thanks a lot,

Eric

ericch1 commented 5 years ago

The source code...

mpich_bug_indexed_type.cc.txt

ericch1 commented 5 years ago

FYI: The last working commit was from May 23: 4743a00. Eric

pavanbalaji commented 5 years ago

Thanks, @ericch1. I don't think the commit you pointed to above is the culprit. However, my somewhat large PR (#3622) got merged in that day, which is the more likely culprit. I'll try to reproduce the error on our end and report back.

ericch1 commented 5 years ago

Thanks @pavanbalaji. You're right: commit 60035bb probably has nothing to do with the bug, but it is the one that our nightly validation tested on May 24. Commit 4743a00 was tested the day before and was good.

So the problem was introduced somewhere in between, and it is still present.

Anyway, I hope it will not be too difficult to find now.

ericch1 commented 5 years ago

I added a verification step at the end of the program.

It prints every error, and the total number of errors encountered is used as the return value.

So it can be used as a kind of validation test. A successful run now prints nothing and returns "0", but master gives:

 mpirun -n 2 bugmaster 
ERROR at element #34, node 0: expected: 61 != received: 52
ERROR at element #34, node 1: expected: 105 != received: 76
ERROR at element #34, node 2: expected: 106 != received: 104
ERROR at element #34, node 3: expected: 201 != received: 74
ERROR at element #34, node 4: expected: 208 != received: 197
ERROR at element #34, node 5: expected: 205 != received: 195
ERROR at element #35, node 0: expected: 40 != received: 62
ERROR at element #35, node 2: expected: 107 != received: 65
ERROR at element #35, node 3: expected: 200 != received: 202
ERROR at element #35, node 4: expected: 212 != received: 203
ERROR at element #35, node 5: expected: 209 != received: 40
ERROR at element #36, node 0: expected: 62 != received: 63
ERROR at element #36, node 1: expected: 107 != received: 66
ERROR at element #36, node 2: expected: 105 != received: 106
ERROR at element #36, node 3: expected: 210 != received: 41
ERROR at element #36, node 4: expected: 212 != received: 207
ERROR at element #36, node 5: expected: 202 != received: 206
ERROR at element #37, node 1: expected: 108 != received: 106
ERROR at element #37, node 2: expected: 106 != received: 105
ERROR at element #37, node 3: expected: 213 != received: 204
ERROR at element #37, node 4: expected: 216 != received: 208
ERROR at element #37, node 5: expected: 204 != received: 200

echo $?
22

ericch1 commented 5 years ago

mpich_bug_indexed_type.cc.txt

ericch1 commented 5 years ago

I also added the faulty rank to the output and updated the file above.

hzhou commented 5 years ago

@ericch1 I just tested your example with the latest master (029f843ede8a5c0b6877e62f87c30c3d5fba2938), and it went fine. Could you provide your config options (try head config.log)?

ericch1 commented 5 years ago

That's strange! Here is my config.log: http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2019.06.07.05h41m16s_config.log

Most importantly:

./configure --prefix=/opt/mpich-3.x_debug --enable-debuginfo --enable-g=dbg,meminit CPPFLAGS=-I/usr/include/valgrind --with-device=ch3:sock --enable-romio

hzhou commented 5 years ago

I confirmed the bug with --with-device=ch3:sock.

hzhou commented 5 years ago

I narrowed the example down to:

#include "mpi.h"
#include "stdio.h"
#include "stdlib.h"

// #define SHOWBUG

int main(int argc, char *argv[])
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Datatype MPI_TYPE_XXX;
#ifdef SHOWBUG
  MPI_Type_contiguous(1, MPI_LONG, &MPI_TYPE_XXX);
  MPI_Type_commit(&MPI_TYPE_XXX);
#else
  MPI_TYPE_XXX = MPI_LONG;
#endif

    int tag = 1;
    if (rank==0){
        long send_buf[121];
        for(int i=0;i<121;i++){
            send_buf[i]=(long)i;
        }

        MPI_Datatype send_type;
        int cnt_1[] = { 1,1,1,1,1,1,2,2,2,2,1,1,4,6,7,5,5,5,5,5};
        int off_1[] = { 1,3,5,7,9,11,14,18,22,26,32,35,40,50,62,70,82,93,103,113};
        MPI_Type_indexed(20,cnt_1,off_1,MPI_TYPE_XXX,&send_type);
        MPI_Type_commit(&send_type);

        MPI_Send(send_buf, 1, send_type, 1, tag, MPI_COMM_WORLD);
    }
    else{
        /* The indexed type describes 58 longs in total (the sum of cnt_1). */
        long recv_buf[58];
        MPI_Recv(recv_buf, 58, MPI_TYPE_XXX, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 33; i<39; ++i) {
            printf("Rank 1: Element # %d : %ld\n",i, recv_buf[i]);
        }
    }

  MPI_Finalize();
  return 0;
}

With the SHOWBUG macro left commented out, the indexed type is built from MPI_LONG and the result is correct: 70, 71, 72, 73, 74, 82, where 70-74 come from the block of 5 at displacement 70. With the macro defined (comment removed), the indexed type is built from a derived type that is identical to MPI_LONG, and the error shows up: 70, 67, 68, 69, 70, 82, where 67, 68, 69, 70 come from the wrong block.

EDIT: Found the culprit. MPL_IOV_LIMIT is 16, which causes MPIR_Typerep_pack to run twice: first for 15 entries (plus the header), then for the rest. MPII_Segment_seek (through a bug in MPII_Segment_manipulate) did not leave segp at a clean boundary.

ericch1 commented 5 years ago

Hi, everything is fine on mpich/master for us this morning! Thanks a lot! :) Eric

hzhou commented 5 years ago

@ericch1 Thanks for your report!