Open pavanbalaji opened 5 years ago
We are seeing consistent failures in the direct-nm configuration that look very similar to the am-only failures reported in this issue. The details are pasted here, and for now we will assume it is the same issue. We will need to open a new issue if investigation shows that it is unrelated.
Jenkins - mpich-review-ch4-ofi - #147 - gnu,direct-nm,centos64
summary_junit_xml.1059 - ./rma/lockall_dt_flush 4 -type=MPI_INT -count=65530 -seed=209 -testsize=16
Error Details
not ok 1059 - ./rma/lockall_dt_flush 4
Stack Trace
not ok 1059 - ./rma/lockall_dt_flush 4
---
Directory: ./rma
File: lockall_dt_flush
Num-procs: 4
Timeout: 180
Date: "Tue Aug 6 11:29:40 2019"
...
## Test output (expected 'No Errors'):
## [mpiexec@pmrs-centos64-240-01.cels.anl.gov] APPLICATION TIMED OUT, TIMEOUT = 180s
For send-recv am operations, the solution is to have an alternative lmt protocol where the sender sends the data in multiple segments and the receiver calls MPIR_Typerep_unpack for each segment.
It will be much easier to implement this new lmt protocol once the refactoring PR #4323 is merged.
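To make the segmented idea above concrete, here is a minimal standalone sketch in plain C. It is not MPICH code: `strided_layout`, `pack_segment`, `unpack_segment`, and `segmented_transfer` are hypothetical names, the strided layout stands in for a real noncontiguous datatype, and the "network" is just a local staging buffer. The point is only that the receiver can unpack each segment as it arrives, given the packed offset, without first assembling the whole message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a noncontiguous datatype: `count` blocks of
 * `blocklen` bytes each, with consecutive blocks `stride` bytes apart. */
typedef struct {
    size_t count;
    size_t blocklen;
    size_t stride;
} strided_layout;

/* Sender side: copy `len` packed bytes, starting at packed offset `offset`,
 * out of the strided buffer `src` into the contiguous segment `seg`. */
void pack_segment(const char *src, const strided_layout *l,
                  size_t offset, size_t len, char *seg)
{
    size_t done = 0;
    while (done < len) {
        size_t pos = offset + done;
        size_t blk = pos / l->blocklen;      /* block holding this byte */
        size_t in_blk = pos % l->blocklen;   /* offset inside that block */
        size_t n = l->blocklen - in_blk;     /* bytes left in the block */
        if (n > len - done)
            n = len - done;
        memcpy(seg + done, src + blk * l->stride + in_blk, n);
        done += n;
    }
}

/* Receiver side: the mirror image of pack_segment. This plays the role
 * that a per-segment unpack call would play in the real protocol. */
void unpack_segment(const char *seg, const strided_layout *l,
                    size_t offset, size_t len, char *dst)
{
    size_t done = 0;
    while (done < len) {
        size_t pos = offset + done;
        size_t blk = pos / l->blocklen;
        size_t in_blk = pos % l->blocklen;
        size_t n = l->blocklen - in_blk;
        if (n > len - done)
            n = len - done;
        memcpy(dst + blk * l->stride + in_blk, seg + done, n);
        done += n;
    }
}

/* Simulated transfer: each segment is packed, "sent" through a staging
 * buffer, and unpacked immediately. Returns the number of segments used. */
size_t segmented_transfer(const char *src, char *dst,
                          const strided_layout *l, size_t seg_size)
{
    char seg[256];
    size_t total = l->count * l->blocklen;
    size_t nseg = 0;

    assert(seg_size > 0 && seg_size <= sizeof(seg));
    for (size_t off = 0; off < total; off += seg_size) {
        size_t len = (total - off < seg_size) ? (total - off) : seg_size;
        pack_segment(src, l, off, len, seg);
        unpack_segment(seg, l, off, len, dst);
        nseg++;
    }
    return nseg;
}
```

Segment boundaries do not have to line up with block boundaries here; a segment may end in the middle of a block, and the next segment resumes from the same packed offset.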
OFI's noncontiguous data movement management currently uses IOVs instead of packing data for RMA operations (all configurations) and send/recv operations (in the AM-only configuration). This needs to be improved to use pack/unpack when the data density is low, that is, when the average size of the contiguous buffers is small.
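One way to read the "data density" criterion is as a simple check on the average contiguous chunk size of an IOV list. The sketch below is illustrative only: `avg_contig_size`, `should_pack`, and the `PACK_THRESHOLD_BYTES` value are assumptions made for this example, not existing MPICH or libfabric names or tuned values.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec (POSIX) */

/* Hypothetical cutoff: below this average chunk size, prefer pack/unpack.
 * A real value would have to be tuned per provider and platform. */
#define PACK_THRESHOLD_BYTES 64

/* Average number of bytes per contiguous chunk; 0 for an empty list. */
size_t avg_contig_size(const struct iovec *iov, size_t n)
{
    size_t total = 0;
    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        total += iov[i].iov_len;
    return total / n;
}

/* Returns 1 if the data is sparse enough that packing looks preferable,
 * 0 if sending the IOV list directly looks fine. */
int should_pack(const struct iovec *iov, size_t n)
{
    if (n == 0)
        return 0;
    return avg_contig_size(iov, n) < PACK_THRESHOLD_BYTES;
}
```

For example, an IOV list of many 4-byte chunks (such as a strided MPI_INT layout) would take the pack path under this threshold, while a list of a few multi-kilobyte chunks would keep using IOVs.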