ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

Use of multiple domains per fabric can cause missed kdreg events #1064

Open ztiffany opened 7 years ago

ztiffany commented 7 years ago

Each domain holds an MR cache. Each MR cache needs notification of all kdreg events.

kdreg only allows a single subscriber to mm events. Because of that restriction, a process opens a single kdreg descriptor, which is shared by all domains in that process. All domains race to dequeue events from the single event stream on that shared descriptor, so an event dequeued by one domain is not seen by the others, and a domain can miss an event it needed.
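The race described above can be illustrated with a small, deterministic simulation. This is only a sketch: `event_stream`, `domain`, `stream_push`, `domain_poll`, and `demo_missed_event` are hypothetical names invented for illustration, not kdreg or libfabric APIs, and the real MR cache tracks many regions rather than one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 16

/* Hypothetical stand-in for the single per-process kdreg event stream. */
struct event_stream {
    unsigned long cookies[MAX_EVENTS]; /* one cookie per invalidated region */
    size_t head, tail;
};

/* Hypothetical stand-in for a domain's MR cache, tracking one region. */
struct domain {
    unsigned long cached_cookie; /* region this domain has cached */
    bool stale_entry_flushed;    /* set once the matching event is seen */
};

static void stream_push(struct event_stream *s, unsigned long cookie)
{
    assert(s->tail < MAX_EVENTS);
    s->cookies[s->tail++] = cookie;
}

/*
 * Dequeue one event on behalf of a domain. The event is consumed here,
 * so no other domain will see it. Returns false if the stream is empty.
 */
static bool domain_poll(struct domain *d, struct event_stream *s)
{
    if (s->head == s->tail)
        return false;

    unsigned long cookie = s->cookies[s->head++];
    if (cookie == d->cached_cookie)
        d->stale_entry_flushed = true;
    /* Otherwise the event is dropped, even if another domain needed it. */
    return true;
}

/*
 * Scenario: the memory behind domain B's region goes away, but domain A
 * polls first. Returns true if B missed the event it needed.
 */
static bool demo_missed_event(void)
{
    struct event_stream s = {0};
    struct domain a = { .cached_cookie = 0xA };
    struct domain b = { .cached_cookie = 0xB };

    stream_push(&s, 0xB);  /* event intended for B's cache */
    domain_poll(&a, &s);   /* A wins the race and consumes it */
    domain_poll(&b, &s);   /* the stream is already empty for B */

    return !b.stale_entry_flushed;
}
```

In this sketch `demo_missed_event()` returns true: domain B keeps a stale cache entry because domain A dequeued the only copy of the event.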

jswaro commented 7 years ago

Each MR cache only needs notifications of kdreg events that it explicitly called notify for. In some cases, both domains may have registered the same region and will both request notification for it. It is unclear how this should be handled.
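One possible way to handle the shared-region case, offered only as a sketch and not as the approach the project adopted, is to have a single reader dequeue events and fan each one out to every domain that asked for notification on that region. All names below (`dispatcher`, `subscription`, `dispatcher_notify`, `dispatcher_deliver`) are hypothetical, and regions are keyed by start address purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DOMAINS 8
#define MAX_REGIONS 16

/* Hypothetical record of which domains want events for one region. */
struct subscription {
    unsigned long addr;   /* start address of the monitored region */
    unsigned int domains; /* bit i set: domain i requested notification */
};

/* Hypothetical per-process dispatcher that owns the single event stream. */
struct dispatcher {
    struct subscription subs[MAX_REGIONS];
    size_t nsubs;
    unsigned int pending[MAX_DOMAINS]; /* events delivered to each domain */
};

/* Record that `domain` wants events for `addr`; merge if already monitored. */
static void dispatcher_notify(struct dispatcher *d, int domain,
                              unsigned long addr)
{
    assert(domain >= 0 && domain < MAX_DOMAINS);

    for (size_t i = 0; i < d->nsubs; i++) {
        if (d->subs[i].addr == addr) {
            d->subs[i].domains |= 1u << domain;
            return;
        }
    }

    assert(d->nsubs < MAX_REGIONS);
    d->subs[d->nsubs].addr = addr;
    d->subs[d->nsubs].domains = 1u << domain;
    d->nsubs++;
}

/*
 * Deliver one dequeued event to every domain subscribed to `addr`.
 * Returns the number of domains that were notified.
 */
static int dispatcher_deliver(struct dispatcher *d, unsigned long addr)
{
    int delivered = 0;

    for (size_t i = 0; i < d->nsubs; i++) {
        if (d->subs[i].addr != addr)
            continue;
        for (int dom = 0; dom < MAX_DOMAINS; dom++) {
            if (d->subs[i].domains & (1u << dom)) {
                d->pending[dom]++;
                delivered++;
            }
        }
    }
    return delivered;
}
```

With this layout, two domains that both call `dispatcher_notify` for the same address each receive the event, and a domain that never asked about a region is not disturbed by it. The trade-off is that the dispatcher becomes a shared structure that would need locking in a multithreaded process, which the sketch omits.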

hppritcha commented 7 years ago

I think the best thing here is to fix kdreg. Originally udreg/kdreg was meant to be a single entity, but we're using kdreg standalone and seeing the impact here. I think each kdreg context can use its own mmu_notifier handle, so it should be pretty straightforward to relax the artificial restriction of one kdreg context per process.

jswaro commented 7 years ago

This is causing the issue in #1350.