Open mpichbot opened 7 years ago
Originally by robl on 2014-06-20 15:01:15 -0500
I can't get this to reproduce on my laptop or on fusion...
Originally by robl on 2014-06-23 13:14:44 -0500
Paul adds:
Assuming you have a small number of nodes, you should be able to get a failure if you change this:
array_of_gsizes[0] = 128*17;
array_of_gsizes[1] = 128*9;
array_of_gsizes[2] = 128*11;
to this:
array_of_gsizes[0] = 64*17;
array_of_gsizes[1] = 64*9;
array_of_gsizes[2] = 64*11;
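As a rough check of how much smaller the reduced test is, the element counts of the two global arrays can be compared. This is only a sketch: `global_elements` is a hypothetical helper, not part of the test program, and it takes the dimensions to be `scale*17`, `scale*9` and `scale*11` as in the lines above.

```c
#include <stdint.h>

/* Hypothetical helper: number of elements in a 3-D global array whose
 * dimensions are scale*17, scale*9 and scale*11.
 * The product is computed in 64 bits because, for scale = 128, it is
 * larger than a 32-bit signed int can hold. */
static int64_t global_elements(int64_t scale)
{
    return (scale * 17) * (scale * 9) * (scale * 11);
}
```

With `scale = 128` this gives 3,529,506,816 elements and with `scale = 64` it gives 441,188,352, so halving each dimension leaves 1/8 of the original array.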
Originally by robl on 2014-06-16 15:16:16 -0500
Paul Coffman experimented with ROMIO on power:
So I wanted to cut it down to 1/8 so I set it to this:
But then it blows up in the write
In a subsequent email:
I dug into this with one of the developers, and he was certain there is some sort of memory corruption going on above the pami/lapi layer in mpich. Basically in the ADIOI_GPFS_Calc_others_req call in ad_gpfs_aggrs.c on line 785:
the address of sendBufForOffsets is invalid, causing a signal 11 segfault. The simplest way I can reproduce this is with 8 ranks spread across 2 PE nodes, 4 ranks on each node. If I run all 8 ranks on 1 node it works, so maybe this has something to do with the internode tables or something in the MPID_Comm struct. I don't understand the logic above this that determines sendBufForOffsets (this includes my added debug print statement):
If we use the p2p algorithm, there is no problem. So what's off about the alltoallv approach? RobL will try to replicate on a two-processor system.
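For anyone following along who has not used the alltoallv pattern: each rank packs its outgoing data into one buffer and passes per-destination counts and displacements, so a wrong count or displacement makes the library read from a bad address. The sketch below only illustrates how such displacements are usually derived from the counts. It is not the ROMIO code in ad_gpfs_aggrs.c, and `build_displs` is a hypothetical name.

```c
/* Hypothetical helper: fill displs[i] with the starting index of the block
 * destined for rank i, given the per-rank counts (an exclusive prefix sum).
 * Returns the total element count, i.e. the minimum length the packed send
 * buffer must have. int is used to match the count and displacement
 * arguments of MPI_Alltoallv. */
static int build_displs(const int *counts, int nprocs, int *displs)
{
    int total = 0;
    for (int i = 0; i < nprocs; i++) {
        displs[i] = total;   /* block i starts where block i-1 ended */
        total += counts[i];
    }
    return total;
}
```

When debugging, printing the counts, the displacements and the returned total on each rank next to the buffer address is one way to see whether the send side is consistent before the collective is called.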