ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/gni: Inject path is experiencing intermittent failures #3694

Closed jswaro closed 3 years ago

jswaro commented 6 years ago

From within the Jenkins job directory, these are the occurrences of failed tests within the last 30 days or last 25 builds:

    grep -R "[FAIL]" . 2>&1 | grep -v "No such file" | awk '{print $2}' | sort | uniq -c

      1 dgram_rma_basic::inject_writedata:
      5 dgram_rma_scalable::inject_write:
      5 dgram_rma_scalable::inject_writedata:
      2 dgram_rma_stx_basic::inject_write:
      6 dgram_rma_stx_basic::inject_writedata:
      9 dgram_rma_stx_scalable::inject_write:
      8 dgram_rma_stx_scalable::inject_writedata:
      6 mr_notifier::multiple:
      1 rdm_rma_basic::inject_write:
      2 rdm_rma_scalable::inject_write:
      7 rdm_rma_scalable::inject_writedata:
      1 rdm_rma_scalable::inject_writedata_retrans:
      8 rdm_rma_scalable::inject_write_retrans:
      2 rdm_rma_stx_basic::inject_write:
      3 rdm_rma_stx_basic::inject_writedata:
      3 rdm_rma_stx_basic::inject_writedata_retrans:
      1 rdm_rma_stx_basic::inject_write_retrans:
      1 rdm_rma_stx_scalable::inject_write:
      5 rdm_rma_stx_scalable::inject_writedata:
      5 rdm_rma_stx_scalable::inject_writedata_retrans:
      9 rdm_rma_stx_scalable::inject_write_retrans:
      1 rdm_rma_stx_scalable::trigger:
      1 rdm_sr_eager_auto::senddata_eager_auto:
      1 scalablem_basic::all:

tonyzinger commented 6 years ago

In my test case, I am using the fi_tinject API.

For example, if I do 6 transfers of 192 bytes with a max_inject_size of 64, each transfer is split into 3 chunks, so I end up doing 18 tinject transfers of 64 bytes. The threshold counter value that I need for my data to land successfully is 24 (6 transfers * (192 / 64) + 6 transfers). I would have expected my threshold counter to be 18 (6 transfers * (192 / 64)).

In another example, if I do 6 transfers of 160 bytes with a max_inject_size of 64, I end up doing 12 tinject transfers of 64 bytes and 6 tinject transfers of 32 bytes. If I set the threshold counter to the value from the first example, i.e. 24, the counter wait never reaches the threshold. However, if I set the threshold counter to 18 (6 transfers * ((160 / 64) + 1), using integer division), the counter reaches the threshold, but the data for the last partial buffer does not land in my buffer. The first 2 full buffers of each transfer land successfully.
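The chunk arithmetic in the two examples above can be sketched in C as follows. This is only an illustration of the expected counts, not libfabric code: the helper names `inject_op_count`, `last_chunk_size`, and `expected_threshold` are hypothetical, and the sketch assumes each inject operation increments the counter exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of inject operations needed to send `len`
 * bytes when each operation carries at most `max_inject_size` bytes
 * (ceiling division). */
size_t inject_op_count(size_t len, size_t max_inject_size)
{
    assert(max_inject_size > 0);
    return (len + max_inject_size - 1) / max_inject_size;
}

/* Hypothetical helper: size in bytes of the final chunk of one transfer.
 * A transfer that divides evenly ends with a full chunk. */
size_t last_chunk_size(size_t len, size_t max_inject_size)
{
    assert(max_inject_size > 0);
    size_t rem = len % max_inject_size;
    return (len != 0 && rem == 0) ? max_inject_size : rem;
}

/* Hypothetical helper: counter threshold one would expect, ASSUMING every
 * inject operation bumps the counter exactly once. */
size_t expected_threshold(size_t transfers, size_t len,
                          size_t max_inject_size)
{
    return transfers * inject_op_count(len, max_inject_size);
}
```

Under that assumption, `expected_threshold(6, 192, 64)` and `expected_threshold(6, 160, 64)` both evaluate to 18, and `last_chunk_size(160, 64)` is 32. The value of 24 observed in the first example is not explained by this model, which is the discrepancy being reported.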

shefty commented 3 years ago

No activity on this issue in over 3 years. Either it is no longer relevant or, I think we can say, it will not be fixed.