I have a test that uses an all-to-one communication pattern. On the receiving rank, the counter never reaches the expected number of transfers.
I call fi_cntr_read() to get the current counter value, add the number of transfers I expect to receive, and pass the result to fi_cntr_wait() as the threshold. The fi_cntr_wait() timeout is 500 ms, and I retry the call up to 20 times before returning an error.
The sending ranks are using the fi_inject_write() and fi_inject_writedata() APIs.
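For reference, the receive-side logic described above can be sketched roughly as follows. This is not the actual test code. The counter sits behind a hypothetical `cntr_ops` table so the sketch runs without a fabric: in the real test, `read` would be fi_cntr_read() and `wait` would be fi_cntr_wait(). `mock_cntr`, `mock_read`, `mock_wait`, and `wait_for_transfers` are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a libfabric counter. In the real test,
 * read() corresponds to fi_cntr_read() and wait() to fi_cntr_wait(). */
struct cntr_ops {
    uint64_t (*read)(void *cntr);
    /* Returns 0 once the counter reaches threshold, nonzero on timeout. */
    int (*wait)(void *cntr, uint64_t threshold, int timeout_ms);
};

/* Mock counter: each wait() call simulates arrivals_per_wait new transfers. */
struct mock_cntr {
    uint64_t value;
    uint64_t arrivals_per_wait;
};

static uint64_t mock_read(void *cntr)
{
    return ((struct mock_cntr *)cntr)->value;
}

static int mock_wait(void *cntr, uint64_t threshold, int timeout_ms)
{
    struct mock_cntr *c = cntr;
    (void)timeout_ms; /* the mock does not actually sleep */
    c->value += c->arrivals_per_wait;
    return c->value >= threshold ? 0 : -1;
}

/* Read the current value, add the expected transfer count, and wait on
 * that threshold, retrying up to `retries` times. Returns 0 on success,
 * -1 if every attempt timed out; on failure *last_seen (if non-NULL)
 * holds the final counter value for the error report. */
static int wait_for_transfers(const struct cntr_ops *ops, void *cntr,
                              uint64_t expected, int timeout_ms,
                              int retries, uint64_t *last_seen)
{
    assert(ops && ops->read && ops->wait && retries > 0);

    uint64_t threshold = ops->read(cntr) + expected;

    for (int i = 0; i < retries; i++) {
        if (ops->wait(cntr, threshold, timeout_ms) == 0)
            return 0;
    }
    if (last_seen)
        *last_seen = ops->read(cntr);
    return -1;
}
```

In my test this corresponds to calling `wait_for_transfers(&ops, cntr, expected, 500, 20, &seen)`. Note that in this pattern the threshold depends on when the baseline read happens: if any sender's write has already been counted before the read, the computed threshold will be higher than the counter can ever reach.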
I am attaching the debug output for the receiving rank: RMA_gni_rank_0_pid_5606.log