pmodels / mpich

Official MPICH Repository
http://www.mpich.org

bug/jenkins: timeouts for noncontiguous data transfer in am path #3886

Open pavanbalaji opened 5 years ago

pavanbalaji commented 5 years ago

OFI's noncontiguous data movement management currently uses IOVs instead of packing data for RMA operations (all configurations) and send/recv operations (in the AM-only configuration). This needs to be improved to use pack/unpack when the data density is low (average size of the contiguous buffers is small).
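To make the "data density" criterion concrete, here is a minimal sketch in plain C, not MPICH code. The names `should_pack`, `pack_iov`, and the cutoff `PACK_THRESHOLD_BYTES` are hypothetical; a real threshold would have to be tuned per provider. It computes the average contiguous segment size of an IOV list and packs into a single buffer when that average is small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical cutoff: below this average segment size, packing is assumed
 * to be cheaper than handing the provider a long IOV list. */
#define PACK_THRESHOLD_BYTES 1024

/* Returns 1 if the average contiguous segment is smaller than the cutoff. */
static int should_pack(const struct iovec *iov, size_t iov_count)
{
    size_t total = 0;

    assert(iov != NULL && iov_count > 0);
    for (size_t i = 0; i < iov_count; i++)
        total += iov[i].iov_len;

    return total / iov_count < PACK_THRESHOLD_BYTES;
}

/* Copies every segment into one contiguous buffer; returns bytes written. */
static size_t pack_iov(const struct iovec *iov, size_t iov_count,
                       char *packed, size_t packed_size)
{
    size_t off = 0;

    for (size_t i = 0; i < iov_count; i++) {
        assert(off + iov[i].iov_len <= packed_size);
        memcpy(packed + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    return off;
}
```

With this shape, a caller would check `should_pack()` once per operation and fall back to the existing IOV path when it returns 0.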

hzhou commented 5 years ago

We are seeing consistent failures in the direct-nm configuration that look very similar to the am-only failures reported in this issue. The details are pasted below; for now we assume they have the same cause. If investigation shows they are unrelated, we will open a new issue.

Jenkins - mpich-review-ch4-ofi - #147 - gnu,direct-nm,centos64

summary_junit_xml.1059 - ./rma/lockall_dt_flush 4 -type=MPI_INT -count=65530 -seed=209 -testsize=16
 Error Details

not ok 1059 - ./rma/lockall_dt_flush 4

 Stack Trace

not ok 1059 - ./rma/lockall_dt_flush 4
  ---
  Directory: ./rma
  File: lockall_dt_flush
  Num-procs: 4
  Timeout: 180
  Date: "Tue Aug  6 11:29:40 2019"
  ...
## Test output (expected 'No Errors'):
## [mpiexec@pmrs-centos64-240-01.cels.anl.gov] APPLICATION TIMED OUT, TIMEOUT = 180s

hzhou commented 4 years ago

For send-recv AM operations, the solution is to add an alternative lmt protocol in which the sender sends the data in multiple segments and the receiver calls MPIR_Typerep_unpack for each segment.

It is much easier to implement this new lmt protocol once the refactoring PR #4323 gets merged.
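As a rough illustration of the segmented idea, here is a self-contained sketch in plain C that does not use MPICH internals. The `strided_layout` struct and the functions `unpack_segment` and `transfer_in_segments` are hypothetical stand-ins: `unpack_segment` plays the role that a per-segment MPIR_Typerep_unpack call would play, and "sending" is simulated by a direct function call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical receive-side description of a strided (vector-like) layout. */
struct strided_layout {
    char *base;      /* start of the receive buffer */
    size_t blocklen; /* bytes per contiguous block */
    size_t stride;   /* bytes between block starts, >= blocklen */
    size_t count;    /* number of blocks */
};

/* Scatter one incoming segment of packed bytes into the layout, starting at
 * packed offset `offset`. Returns the packed offset after this segment. */
static size_t unpack_segment(const struct strided_layout *l, size_t offset,
                             const char *seg, size_t seg_len)
{
    assert(l->blocklen > 0 && l->stride >= l->blocklen);
    assert(offset + seg_len <= l->blocklen * l->count);

    while (seg_len > 0) {
        size_t block = offset / l->blocklen;  /* which block we are in */
        size_t within = offset % l->blocklen; /* position inside that block */
        size_t n = l->blocklen - within;      /* room left in that block */

        if (n > seg_len)
            n = seg_len;
        memcpy(l->base + block * l->stride + within, seg, n);
        seg += n;
        offset += n;
        seg_len -= n;
    }
    return offset;
}

/* Sender-side loop: deliver the packed stream in segments of at most
 * seg_size bytes. The receiver only needs to remember the running offset. */
static size_t transfer_in_segments(const struct strided_layout *l,
                                   const char *packed, size_t total,
                                   size_t seg_size)
{
    size_t offset = 0;

    assert(seg_size > 0);
    while (offset < total) {
        size_t n = total - offset < seg_size ? total - offset : seg_size;
        offset = unpack_segment(l, offset, packed + offset, n);
    }
    return offset;
}
```

The point of the sketch is that segments may end in the middle of a block, so the receiver must carry the packed offset across segments rather than restart the unpack for each one.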