ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

GNI provider: Remote counters are sometimes not incremented #1420

Open tonyzinger opened 6 years ago

tonyzinger commented 6 years ago

I have a test that performs an all-to-one communication pattern. On the receiving rank, the counter does not reach the expected number of transfers. I call fi_cntr_read() to get the current counter value, add the number of transfers I expect to receive, and pass the result to fi_cntr_wait() as the threshold. In my program the wait timeout is 500 ms, and I retry fi_cntr_wait() 20 times before returning an error. The sending ranks use the fi_inject_write() and fi_inject_writedata() APIs.

I am attaching the debug output for the receiving rank: RMA_gni_rank_0_pid_5606.log

hppritcha commented 6 years ago

@tonyzinger could you supply a test case?

tonyzinger commented 6 years ago

On jupiter, in the directory /home/users/ajz/issue_1420, read the README file and then run the test case with the execute_test_case.sh script.