ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/
Other
547 stars 375 forks source link

Peer cntr support in peer API #9082

Closed shijin-aws closed 1 year ago

shijin-aws commented 1 year ago

Currently, peer API doesn't cover cntr, which makes the cntr completion cannot be shared between owner and peer providers.

There was some discussion to potentially deprecate counter support, but currently MPICH and Intel MPI still use fi_cntr* API in one-sided operations.

I had some offline discussion with @aingerson that we could either add counter support to the peer API, or make the owner handle that, i.e. peer calls owner's cq write and owner sees that it also would increment a counter and increments it in addition to writing the completion.

Any feedback & insights? @shefty

shefty commented 1 year ago

So, I don't foresee us dropping counter support. Best case is we might be able to limit their use, but even that's open.

Since counters are independent from writing CQ entries, it makes more sense to add them to the peer APIs IMO.

shijin-aws commented 1 year ago

Thanks @shefty. I think the peer counter API & workflow will be similar to peer CQ. I need to look into it and get back when I have a proposal of it.

shijin-aws commented 1 year ago

Can we make this issue a blocker for 1.19 release? I got the impression that the current release timeline is end of July, and we cannot make efa released for 1.19 before this is fixed.

shefty commented 1 year ago

Is the current implementation broken in this regard? I'm fine making this a blocker. The release schedule isn't driven by anything other than trying to be somewhat predictable.

shijin-aws commented 1 year ago

Yes, efa has onboarded peer API in main branch, but in recent MPICH/Intel MPI test we found this bug that running OMB's one-sided benchmarks will hang on single node because the cntr completion between shm and efa could not share.

We missed the cntr support when making efa onboard the peer API. And we don't want to revert the peer API change, so we have to get this fixed.

shijin-aws commented 1 year ago

I opened a PR to introduce the API: https://github.com/ofiwg/libfabric/pull/9111. I will close this issue and move the discussion to the PR.