Proposal for enhancement to support additional Persistent Memory use cases

jswaro commented 4 years ago

Introduction

Libfabric requires modifications to support RMA and atomic operations targeted at remote memory registrations backed by persistent memory devices. These modifications should be made with the intent to drive support for persistent memory usage by applications that rely on communications middleware such as SHMEM in a manner that is consistent with byte-based/stream-based addressable memory formats. Existing proposals (initial proposal) support NVMe/PMoF approaches, which this approach should support flat memory, non-block addressed memory structures and devices.

Changes may be required in as many as three areas:

Memory registration calls
- This allows a memory region to be registered as being capable of persistence. This has already been introduced into the upstream libfabric GITHUB, but should be reviewed to ensure it matches use case requirements.
Completion semantics
- These changes allow a completion event or notification to be deferred until the referenced data has reached the persistence domain at the target. This has already been introduced into the upstream libfabric GITHUB, but should be reviewed to ensure it matches use case requirements.
Consumer control of persistence
- As presently implemented in the upstream libfabric GITHUB, persistence is determined on a transaction-by-transaction basis. It was acknowledged at the time that this is a simplistic implementation. We need to reach consensus on the following:
- Should persistence be signaled on the basis of the target memory region? For example, one can imagine a scheme where data targeted at a particular memory region is automatically pushed into the persistence domain by the target, obviating the need for any sort of commit operation.
- Is an explicit 'commit' operation of some type required, and if so, what is the scope of that commit operation? Is there a persistence fence defined such that every operation prior to the fence is made persistent by a commit operation?

Proposal

The experimental work in the OFIWG/libfabric branch is sufficient for the needs of SHMEM, with exception to the granularity of event generation. When the current implementation generates events, it would generate commit-level completion events with every operation. That type of operation would make the delivery of completion events take longer than necessary for most operations, so SHMEM would need finer control over commit flushing behavior.

To satisfy this, the following is being proposed:

A new API: fi_commit (See definitions: fi_commit) The new API would be used to generate a commit instruction to a target peer. The instruction would be defined by a set of memory registration keys, or regions by which the target could issue a commit to persistent memory.
- A single request to fi_commit should generate a control message to target hardware or software emulation environment to flush the contents of memory targets. Memory targets are defined by the iov structures, and key fields – and the number of memory targets are defined by the count field. The destination address is handled by the dest_addr field. The flags field is held reserved at this time to allow for flexibility in the API design to future proof against options we might not conceive of until after the prototype is complete, and the context available for the user and returned with the completion
- Since this API behaves like a data transfer API, it is expected that this API would generate a completion event to the local completion queue associated with the EP from which the transaction was initiated against.
- At the target, this should generate an event to the target's event queue – if and only if the provider supports software emulated events. If a provider is capable of hardware level commits to persistent memory, the transaction should be consumed transparently by the hardware, and does not need to generate an event at the target. This will require an additional event definition in libfabric (See definition for fi_eq_commit_entry)
A new EQ event definition (fi_eq_commit_entry) to support software-emulated persistence for devices that cannot provide hardware support
- The iov, and count variables mirror the original iov, and count contents of the originating request.
- The flags may be a diminished set of flags from the original transaction under the assumption that only some flags would have meaning at the target and sending originator-only flags to the target would have little value to the target process.
Additional flags or capabilities
- A provider should be able to indicate whether they support software emulated notifications of fi_commit, or whether they can handle hardware requests for commits to persistent memory
- An additional flag should be introduced to the fi_info structure under modes: FI_COMMIT_MANUAL (or something else)
  - This flag would indicate to the application that events may be generated to the event queue for consumption by the application. Commit events would be generated upon receipt of a commit message from a remote peer, and the application would be responsible for handling the event.
  - Lack of the FI_COMMIT_MANUAL flag, and the presence of the FI_RMA_PMEM (or FI_PMEM) flag in the info structure should imply that the hardware is capable of handling the commit requests to persistent memory and the application does not need to read the event queue for commit events.
Change of flag definition
- The FI_RMA_PMEM flag should be changed to FI_PMEM to indicate that the provider is PMEM aware, and supports RMA/AMO/MSG operations to and from persistent memory.
- There may be little value in supporting messaging interfaces, but it is something that could supported.
Addition of an event handler registration for handling event queue entries within the provider context (See Definition: fi_eq_event_handler)
- Essentially, this becomes a registered callback for the target application to handle specific event types. We can use this mechanism with the target application to allow the provider to handle events internally using a function provided by the application. The function would contain the logic necessary to handle the event
- Specific to PMEM, a function handler would be used by the target application to handle commits to persistent memory as they were delivered without requiring a fi_eq_read and some form of acknowledgement around the commit action. With the handler, the commit could be handled entirely by the function provided by the application, and the return code from the application provided call-back would be sufficient for a software emulation in the provider to produce the return message to the sender that the commit transaction is fully complete. The use of a handler allows us to make the commit transaction as light-weight, or heavy-weight as necessary.

Definitions:

_ficommit

ssize_t fi_commit(struct fid_ep *ep, 
                             const struct fi_rma_iov *iov,
                             size_t count, 
                             fi_addr_t dest_addr, 
                             uint64_t flags, 
                             void *context);

_fi_eq_commitentry

struct fi_eq_commit_entry {
    fid_t                       fid;            /* fid associated with request */
    const struct fi_rma_iov    *iov;            /* iovec of memory regions to be committed to persistent memory */
    size_t                      count;          /* number of iovec/key entries */
    uint64_t                    flags;          /* operation-specific flags */
};

_fi_eq_eventhandler

typedef ssize_t (*fi_eq_event_handler_t)(struct fid_eq *eq, 
    uint64_t event_type, 
    void *event_data, 
    uint64_t len, 
    void *context);

ssize_t fi_eq_register_handler(struct fid_eq *eq, 
    uint64_t event_type, 
    fi_eq_event_handler_t handler, 
    void *context);

Use cases supported by this proposal:

As an application writer, I need to commit multiple previously-sent data transfers to the persistence domain
- Previous functionality allows for a commit for every message as is the case for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a per-transaction basis. The need in this use case is performance-oriented, to allow less strict delivery model to the NIC for most messages followed up with a 'flush' of the NIC to the persistence domain. This allows most messages targeted to the persistence domain to complete with a less strict delivery model, and provides a mechanism to ensure that those data transfers are eventually persisted.
As an application writer, I would like to be able to support persistent data models with libfabric over providers that do not provide hardware support for persistent memory devices
- The GNI provider, and other providers won't be able to support PMEM use cases, or at least not right away. To provide the support for PMEM in prototypes, or in providers that will never have PMEM support, a software-emulated approach was suggested to bridge the gap in functionality. In order for the target to know that something needs to be flushed to the persistence domain, the new EQ event was created. In addition to the EQ event, it was discussed that it could be useful for applications to provide function handlers that could be called in the event that a EQ would be delivered to facilitate a more passive libfabric application. If a handler was provided to libfabric, then the application itself could focus on serving requests for access to the persistence domain and sharing of the persistent memory.

jswaro commented 4 years ago

@shefty

shefty commented 4 years ago

I think it would be best if this topic were also posted to the mailing list for broader discussion. And I'm sure we'll want to discuss it in a future ofiwg.

jswaro commented 4 years ago

Agreed. I'm slowly getting this material up like I said I would.

jswaro commented 4 years ago

There is a pull request that I'll hopefully generate later today which provides a functional prototype of a subset of the material discussed here for the GNI provider. Most of the work will be in rebasing the work to the current state of master.

tschuett commented 4 years ago

https://tools.ietf.org/html/draft-talpey-rdma-commit-01

mblockso commented 4 years ago

I think it would be best if this topic were also posted to the mailing list for broader discussion. And I'm sure we'll want to discuss it in a future ofiwg.

@shefty and @jswaro .. do you want comments on the proposal added to this issue, the associated PR, or the mailing list?

shefty commented 4 years ago

IMO, email discussion has the best chance of gathering the most input.

shefty commented 4 years ago

I'm proposing the following API changes to expand persistent memory support. This is in addition to the existing API definitions, and is an alternative to other changes being discussed..

#define FI_SAVE    (1ULL << 32)

We can work on the name, but this is basically it. :)

This is an operational flag that can be passed into fi_writemsg. When specified, it indicates that the target memory region(s) should be updated to reflect all prior data transfers, such that they have the same completion semantic as the save operation.

E.g. fi_writemsg(..., FI_SAVE | FI_COMMIT_COMPLETE) to a persistent memory region behaves the same as a lower-level flush operation.

An FI_SAVE operation does not transfer data to the target region. It acts as a limited fencing operation for operations of the same type to the same region. E.g. a save write command does not complete until all previous writes to the same region have completed.

CQ data may be included as part of the operation. Data and message ordering is unchanged.

The flag can be added to fi_readmsg using a similar approach, but that can be deferred.

Likewise, we can extend this to fi_atomicmsg and fi_fetch_atomicmsg by defining FI_ATOMIC_PMEM. When used with an atomic operation, the data is 'saved' atomically using the data type specified with the save command. Updates to the data still use the data type specified in the previous atomic calls. This too can be deferred.

The flag can also apply to msg and tagged operations by defining that all prior messages to the specified peer reach the same completion semantic.

shefty commented 4 years ago

This specific comment will try to capture the features to support, and the current API mapping in order to highlight the gaps. The proposal(s) to address the gaps are captured elsewhere. Comment will be updated as needed to make it easier to track the features.

8-byte atomic write ordered with RDMA writes OFI defines a more generic atomic write. Message ordering is controlled through fi_tx_attr::msg_order flags. Data ordering is controlled through fi_ep_attr::max_order_waw_size. The existing API should be sufficient.
flush data for persistency The low-level flush operation ensures previous RDMA and atomic write operations to a given target region are persistent prior to completing. The target region may be accessible through multiple endpoints and NIC ports. Also, low-level transports require write after write message and data ordering, which is assumed by the flush operation. OFI defines FI_COMMIT_COMPLETE for persistent completion semantics. This provides limited support, handling only the following mapping: RMA write followed by a matching flush. A more generic mechanism needs to be defined, which would allow for a less strict completion on the RMA writes, with the persistent command following. This is possible today through the FI_FENCE flag, but that could result in stalls in the messaging.
flush data for global visibility This is similar to 2, with application and fabric visibility replacing persistency. OFI defines FI_DELIVERY_COMPLETE as a visibility completion semantic. This has similar limits as mentioned above.
data verification of a specified region There is no equivalent existing functionality, but it is aligned with discussions around SmartNIC and FPGA support, which defines generic offload functionality.
batching of data persistency / visibility This is a more generic description for 2 and 3 above. The request is to allow operations to complete at a lower, higher performing, completion level, which is followed by a single operation that moves the previous transfers to a higher completion level together.
RMA operations targeting persistent memory The current application use case is limited to RMA writes to persistent memory. Support for RMA read operations would be an extension.
atomic operations targeting persistent memory The low-level transports defined messages to support an 8-byte atomic write. The SHMEM use case could make use of other atomic operations to persistent memory regions. This may require that the flushing semantics mentioned in step 2 be data type aware when atomics are concerned.
global visibility of atomic operations This is similar to 7, but only requires that atomic results be visible. This may require that the flushing semantics mentioned in step 3 be data type aware when atomics are used.
application can indicate if it will access received data Applications may receive data over RMA or atomic transfers that the local application itself may not access, or at least not access immediately. Conveying this information to the provider may allow optimizations in network buffering and caching. The current request is to define this behavior per memory region. As different regions can be defined for the same virtual addresses, this likely seems sufficient. There is no current OFI semantic for this, unless it is interpreted indirectly as part of another flag.

grom72 commented 4 years ago

8-byte atomic write ordered with RDMA writes OFI defines a more generic atomic write. Message ordering is controlled through fi_tx_attr::msg_order flags. Data ordering is controlled through fi_ep_attr::max_order_waw_size.
The existing API should be sufficient. RDMA Atomic Write has different ordering rules than RDMA Write. RDMA Write can bypass RDMA Atomic Write and RDMA Flush but RDMA Atomic Write could not bypass RDMA Write and also could not bypass Flush.

Why do not we use here fi_atomic with FI_UINT64 and FI_ATOMIC_WRITE?

flush data for persistency The low-level flush operation ensures previous RDMA and atomic write operations to a given target region are persistent prior to completing. The target region may be accessible through multiple endpoints and NIC ports. Also, low-level transports require write after write message and data ordering, which is assumed by the flush operation. OFI defines FI_COMMIT_COMPLETE for persistent completion semantics. This provides limited support, handling only the following mapping: RMA write followed by a matching flush. A more generic mechanism needs to be defined, which would allow for a less strict completion on the RMA writes, with the persistent command following. This is possible today through the FI_FENCE > flag, but that could result in stalls in the messaging.

RDMA supports the following scenario - popular in database log update:

a sequence of Writes (without completion)
followed by Flush (without completion) to the same memory region that Writes use
followed by Atomic Write (without completion) to another memory region
followed by Flush with completion expected to the same memory region that the Atomic Write uses.

How can we express step 2 using FI_SAVE, FI_FENCE, FI_COMMIT_COMPLETE ( the last flag is not needed here as completion is not needed at this step yet)?

flush data for global visibility This is similar to 2, with application and fabric visibility replacing persistency. OFI defines FI_DELIVERY_COMPLETE as a visibility completion semantic. This has similar limits as > mentioned above.

Additionally to FI_DELIVERY_COMPLETE and FI_COMMIT_COMPLETE, we need also a mechanism that tells RDMA HW during memory registration either to use or not use cache when accessing (persistent) memory. That could be FI_UNCACHED flag to fi_mr_reg. This should work together with the existing fi_mr_reg FI_RMA_PMEM flag indicating RNIC either that data are willing to be stored only (uncached) or stored but also be ready for further processing. This functionality would be OK per memory region registration.

Data verify There is no equivalent existing functionality, but it is aligned with discussions around SmartNIC and FPGA support, which defines generic offload functionality.

AN SW alternative for this needed functionality would be a read operation that produced locally expected CRC value.

shefty commented 4 years ago

8-byte atomic write ordered with RDMA writes Why do not we use here fi_atomic with FI_UINT64 and FI_ATOMIC_WRITE?

That is the proposal

RDMA supports the following scenario - popular in database log update:

a sequence of Writes (without completion)

followed by Flush (without completion) to the same memory region that Writes use

followed by Atomic Write (without completion) to another memory region

followed by Flush with completion expected to the same memory region that the Atomic Write uses.

How can we express step 2 using FI_SAVE, FI_FENCE, FI_COMMIT_COMPLETE ( the last flag is not needed here as completion is not needed at this step yet)?

Selective completions must be enabled. Atomic write after RMA write order is true.

fi_write, fi_write, ...
fi_writemsg(ranges from step 1, FI_SAVE | FI_COMMIT_COMPLETE) 3/4. fi_atomic(FI_ATOMIC_WRITE, FI_UINT64, new destination, FI_COMMIT_COMPLETE | FI_COMPLETE)

The fi_atomic in step 3/4 can be split into 2 separate function calls by the app, but based on the desired semantic, that isn't needed.

we need also a mechanism that tells RDMA HW during memory registration either to use or not use cache when accessing (persistent) memory. That could be FI_UNCACHED flag to fi_mr_reg.

Caching RMA data is independent of PMEM. There should be an application use case driving this, and any flag should be defined relative to the application semantic. E.g. target will not access the data, or access will be deferred.

grom72 commented 4 years ago

Selective completions must be enabled. Atomic write after RMA write order is true. 1. fi_write, fi_write, ... 2. fi_writemsg(ranges from step 1, FI_SAVE | FI_COMMIT_COMPLETE) 3/4. fi_atomic(FI_ATOMIC_WRITE, FI_UINT64, new destination, FI_COMMIT_COMPLETE | FI_COMPLETE) The fi_atomic in step 3/4 can be split into 2 separate function calls by the app, but based on the desired semantic, that isn't needed.

Can we have any flag to make the Step 2 a non-blocking one? With the given set of flags the pipeline will not move to Step 3/4 until Step 2 is completed. In the case of raw RDMA verbs, we are able to post Atomic Write immediately after Flush not waiting for Flush completion.

we need also a mechanism that tells RDMA HW during memory registration either to use or not use cache when accessing (persistent) memory. That could be FI_UNCACHED flag to fi_mr_reg. Caching RMA data is independent of PMEM. There should be an application use case driving this, and any flag should be defined relative to the application semantic. E.g. target will not access the data, or access will be deferred.

Caching is not an issue with volatile memory. Typically data are send to a target node for further processing and cache is a primary destination in such a case. This is not true in the case of PMEM. Data are transferred to PMEM mainly to be stored securely. They are processed later, so by default caching PMEM data shall be disabled.

There is only one well-documented scenario where caching is required for PMEM. This is PMEM caching used in OLTP scenario:

A compute node sends an update to be processed by a storage node.
Update (write(s), atomic write) is followed by Flush to ensure data are securely stored in PMEM
Compute node could continue processing as soon as Flush (FI_COMMIT_COMPLETE) is received - no need to wait for the final storage node update.
The storage node starts processing as soon as Flush has been confirmed toward the compute node.
Data are expected to be already in the cache, otherwise, data must be read from memory. PMEM Caching shall be enabled in this scenario but shall be disabled in other ones.

shefty commented 4 years ago

Stalls are related to transport ordering behavior. If the transport guarantees write-after-write and atomic-write-after-write ordering, then this flow should not produce a stall:

fi_write(...)
fi_writemsg(..., FI_COMMIT_COMPLETE | FI_SAVE)
fi_atomic(...)

If the transport allows writes to bypass writes, then, yes, there could be a stall with fi_writemsg(), as that must wait for the prior writes to complete. The stall is not necessarily at the sender, however. It could handled at the target NIC by delaying the the save operation until earlier write complete. Since the fi_write calls can be handled out of order and flow across different network paths, the overall performance may still be better than if the stall were avoided.

I don't want to assume that a specific implementation path is better. Any solution should allow for both relaxed message ordering and relaxed data placement, up to the point where persistency or visibility is requested. Transports without those capabilities (e.g. InfiniBand or iWarp) obviously lose those benefits, but may have an easier time implementing the persistent and visibility semantics.

grom72 commented 4 years ago

fi_writemsg(...,Fi_COMMIT_COMPLETE | FI_SAVE) is OK and could be implemented by ibv_post_send(... IBV_WR_FLUSH...) / new opcode /. We do not have a mechanism to ask a verbs provider not to deliver completion in case of fi_writemsg is followed by fi_atomic.
Relaxed message ordering will be an issue with fi_atomic write as RDMA.Atomic Write has built-in fencing mechanism and waits for all previously initiated write and flush operations. Does it mean that we should use RDMA.Atomic Write only when FI_FENCE flag is used? Or is it OK that in case of verbs provider fi_atomic(..., FI_COMMITE_COMPLETE) will always be mapped to fencing RDMA.Atomic Write?
UINT_64 type is the closest one to this implemented by RDMA.Atomic Write (8 bytes payload). Shall we assume that only UINT_64 type will be mapped to RDMA.Atomic Write or shall we also support other (smaller datatypes)? Or shall we define new data type e.g. FI_UINT8_ARRAY_8 to says that we are not assume anything about endianness on both sides of the connection?

shefty commented 4 years ago

The generation of a completion event to an application is not the desired optimization here. All requests will "complete", and by that I mean that the request is considered done by the hardware and its state flushed from the transmit queue. Hiding the completion from the application has almost no impact on performance, and based on past experience, it actually results in a negative impact overall because of HW implementation details.

The purpose behind the flush operation (FI_SAVE proposal) is to strengthen the completion semantic of previous transfers. It doesn't matter if previous RDMA write operations generate completion events, what matters is when those operations are done and can be retired from processing.

If the app requires atomic write after rma write ordering, they can specify that through the API. The FI_FENCE flag would provide ordering, but is only needed if WAW ordering is not already guaranteed by the provider. Otherwise FI_COMMIT_COMPLETE is sufficient. Committing an atomic operation does not result in committing prior RMA writes.
The API allows for atomic writes to an array of any datatype, subject to provider restrictions. I agree that we'll need some mechanism to indicate to the provider to ignore endianness. Because this affects u16, u32, and u64, we may want to use a flag here. This would impact both atomic read and write operations.

github-actions[bot] commented 2 years ago

There has been no activity on this issue for more than 360 days. Marking it stale.

ofiwg / libfabric

Proposal for enhancement to support additional Persistent Memory use cases #5874