ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

gni provider does not fill in buf and len on CQ (fi_cq_data_entry) entry #1009

Closed tonyzinger closed 7 years ago

tonyzinger commented 8 years ago

When I execute my test using the gni provider, the CQ entry does not have the buf and len fields set to the appropriate values. With the sockets provider, these fields are filled in.

Output from my test case using the gni and sockets providers:

Provider: gni
0: [nid00166, Rank: 0] Using Provider: 'gni'
1: [nid00167, Rank: 1] Using Provider: 'gni'
1: [nid00167, Rank: 1] fi_writedata: Sent to: 0, Index: 0, Data: 0x0000000100010101
0: [nid00166, Rank: 0] fi_writedata: Sent to: 1, Index: 0, Data: 0x0000000000000000
1: [nid00167, Rank: 1] fi_cq_read(source), flags: 0x204, buffer: 0x0, length: 0, data: 0x0
0: [nid00166, Rank: 0] fi_cq_read(source), flags: 0x204, buffer: 0x0, length: 0, data: 0x0
0: [nid00166, Rank: 0] fi_cq_read(receive), flags: 0x22004, buffer: 0x0, length: 0, data: 0x100010196
1: [nid00167, Rank: 1] fi_cq_read(receive), flags: 0x22004, buffer: 0x0, length: 0, data: 0x96

Provider: sockets
0: [nid00036, Rank: 0] Using Provider: 'sockets'
1: [nid00037, Rank: 1] Using Provider: 'sockets'
0: [nid00036, Rank: 0] fi_writedata: Sent to: 1, Index: 0, Data: 0x0000000000000000
1: [nid00037, Rank: 1] fi_writedata: Sent to: 0, Index: 0, Data: 0x0000000100010101
1: [nid00037, Rank: 1] fi_cq_read(source), flags: 0x20204, buffer: 0x67e080, length: 16, data: 0x100010196
0: [nid00036, Rank: 0] fi_cq_read(source), flags: 0x20204, buffer: 0x67cf40, length: 16, data: 0x96
0: [nid00036, Rank: 0] fi_cq_read(receive), flags: 0x22004, buffer: 0x67d100, length: 16, data: 0x100010196
1: [nid00037, Rank: 1] fi_cq_read(receive), flags: 0x22004, buffer: 0x67e240, length: 16, data: 0x96
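
The flags values in the output above can be decoded by hand. As a minimal sketch, assuming the bit positions below match those in rdma/fabric.h (they are reproduced here from memory under hypothetical SK_-prefixed names so the snippet builds without libfabric, and should be checked against the installed header), a small helper can turn a completion's flags into names. The helper decode_cq_flags is illustrative and is not part of libfabric.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed copies of libfabric flag bits; verify against rdma/fabric.h. */
#define SK_FI_RMA            (1ULL << 2)
#define SK_FI_WRITE          (1ULL << 9)
#define SK_FI_RECV           (1ULL << 10)
#define SK_FI_REMOTE_WRITE   (1ULL << 13)
#define SK_FI_MULTI_RECV     (1ULL << 16)
#define SK_FI_REMOTE_CQ_DATA (1ULL << 17)

struct sk_flag_name {
    uint64_t bit;
    const char *name;
};

/* Ordered by ascending bit position, which fixes the output order. */
static const struct sk_flag_name sk_flag_names[] = {
    { SK_FI_RMA,            "FI_RMA" },
    { SK_FI_WRITE,          "FI_WRITE" },
    { SK_FI_RECV,           "FI_RECV" },
    { SK_FI_REMOTE_WRITE,   "FI_REMOTE_WRITE" },
    { SK_FI_MULTI_RECV,     "FI_MULTI_RECV" },
    { SK_FI_REMOTE_CQ_DATA, "FI_REMOTE_CQ_DATA" },
};

/*
 * Hypothetical helper: writes a '|'-separated list of known flag names
 * into out. Bits not in the table are appended once as a hex value.
 * Output is truncated (but NUL-terminated) if out is too small.
 */
static char *decode_cq_flags(uint64_t flags, char *out, size_t outlen)
{
    size_t used = 0;
    size_t count = sizeof(sk_flag_names) / sizeof(sk_flag_names[0]);

    if (outlen == 0)
        return out;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        if (!(flags & sk_flag_names[i].bit))
            continue;
        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? "|" : "", sk_flag_names[i].name);
        if (n < 0 || (size_t)n >= outlen - used)
            return out; /* buffer full: stop decoding */
        used += (size_t)n;
        flags &= ~sk_flag_names[i].bit;
    }

    if (flags != 0)
        snprintf(out + used, outlen - used, "%s0x%llx",
                 used ? "|" : "", (unsigned long long)flags);
    return out;
}
```

Under those assumptions, 0x204 decodes to FI_RMA|FI_WRITE, 0x20204 to FI_RMA|FI_WRITE|FI_REMOTE_CQ_DATA, and 0x22004 to FI_RMA|FI_REMOTE_WRITE|FI_REMOTE_CQ_DATA. None of the completions in either run carries FI_RECV, so all of them are RMA completions rather than message receives.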

hppritcha commented 8 years ago

@tonyzinger is this when using the FI_MULTI_RECV option for the receive buffer?

tonyzinger commented 8 years ago

No. My test does not use FI_MULTI_RECV.

This same test using the sockets provider has these fields filled in as documented when I created this issue.

sungeunchoi commented 8 years ago

It seems we pass zeros for len, buf, data, and tag for RMA and atomic operations, except in the case of RMA with FI_REMOTE_CQ_DATA, where we pass in non-zero data.

ztiffany commented 7 years ago

According to the fi_cq.3 man page, "len" only applies to receive operations and "buf" only applies to FI_MULTI_RECV operations.

sungeunchoi commented 7 years ago

What @ztiffany says is for FI_CQ_FORMAT_DATA, which I think we implement correctly. Tony is using FI_CQ_FORMAT_TAGGED, which by the name sounds like it should only be for tagged messaging. I think the question is what should happen in the case when you use FI_CQ_FORMAT_TAGGED for RMA (or atomic) operations.

tonyzinger commented 7 years ago

In the test case output, I am using FI_CQ_FORMAT_DATA. The CQ being processed is on the receiving rank. That is, the fi_writedata is initiated on a remote (transmitting) rank.

Also, this test case was executed using the gni and sockets providers. For the sockets provider, the "buf" and "len" fields are filled in. Is there really a reason that the gni provider does not fill in these fields? As a user, the more I know about an event the better it is.

ztiffany commented 7 years ago

The reason is just to match the man page. If we want to match sockets, then wouldn't it be best to start by proposing a change to the man page (which is effectively the API spec)?

sungeunchoi commented 7 years ago

@tonyzinger I think Zach is right. See the COMPLETION FIELDS section of the fi_cq(3) man page.

tonyzinger commented 7 years ago

The fi_cq(3) man page states for "buf":

buf : The buf field is only valid for completed receive operations, and only applies when the receive buffer was posted with the FI_MULTI_RECV flag. In this case, buf points to the starting location where the receive data was placed.

So the gni provider is doing what is stated in the man page.

However, for "len", it states:

len : This len field only applies to completed receive operations. It indicates the size of received message data -- i.e. how many data bytes were actually placed into the associated receive buffer. If an endpoint has been configured with the FI_MSG_PREFIX mode, the len also reflects the size of the prefix buffer.

For my test case this is a completed receive operation, so it should be filled in.

tonyzinger commented 7 years ago

This is the reply I got from Sean Hefty to the ofiwg issue I created asking for clarification of the CQ len field completion information:

This field applies to completions for fi_recv, fi_trecv, and related calls. It represents "how many data bytes were actually placed into the associated receive buffer". It is for message transfers only (send, tsend), not RMA or atomics.

So this issue can be closed.
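
The outcome of the thread can be summarized in code. The following is only a sketch of how an application might guard its use of len and buf, based on the man page text and the clarification quoted above. It assumes the FI_RECV bit position matches rdma/fabric.h (reproduced here under a hypothetical SK_ name), uses a local struct that mirrors the field layout of fi_cq_data_entry so it builds without libfabric, and assumes the application itself tracks whether a receive buffer was posted with FI_MULTI_RECV. The helper names are illustrative and are not libfabric APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed copy of a libfabric flag bit; verify against rdma/fabric.h. */
#define SK_FI_RECV (1ULL << 10)

/* Local stand-in mirroring struct fi_cq_data_entry (assumption). */
struct sk_cq_data_entry {
    void     *op_context;
    uint64_t  flags;
    size_t    len;
    void     *buf;
    uint64_t  data;
};

/*
 * len is meaningful only for completed receive operations
 * (fi_recv, fi_trecv and related calls), not for RMA or atomics.
 */
static bool sk_len_is_valid(const struct sk_cq_data_entry *e)
{
    return (e->flags & SK_FI_RECV) != 0;
}

/*
 * buf is meaningful only for completed receives whose buffer was
 * posted with FI_MULTI_RECV. Whether it was is application state
 * (for example, recorded behind op_context), so it is passed in.
 */
static bool sk_buf_is_valid(const struct sk_cq_data_entry *e,
                            bool posted_with_multi_recv)
{
    return sk_len_is_valid(e) && posted_with_multi_recv;
}
```

With the flags from the test output (0x204, 0x20204, 0x22004), both helpers return false, which is consistent with the gni provider leaving len and buf at zero for fi_writedata completions.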