ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Atomic Consistency Question #8380

Closed: a-szegel closed this issue 1 year ago

a-szegel commented 1 year ago

From the man pages:

https://ofiwg.github.io/libfabric/v1.17.0/man/fi_atomic.3.html

The correctness of atomic operations on a target memory region is guaranteed only when performed by a single actor for a given window of time. An actor is defined as a single libfabric domain (identified by the domain name, and not an open instance of that domain),

An actor is tied to a specific domain name (NIC). Is the domain indicated on the requesting side or the receiving side?
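For reference, the initiator-side call in question looks roughly like this (a minimal sketch using the fi_atomic() signature from the man page; ep, the addressing values, and local_buf are hypothetical names assumed to be set up elsewhere):

    #include <rdma/fabric.h>
    #include <rdma/fi_atomic.h>

    /* Minimal sketch: one FI_UINT64 FI_SUM applied to a remote region.
     * This call only initiates the operation; the update itself is
     * performed at the target identified by dest_addr/addr/key. */
    static ssize_t post_atomic_sum(struct fid_ep *ep, fi_addr_t dest_addr,
                                   uint64_t addr, uint64_t key,
                                   const uint64_t *local_buf, void *context)
    {
        return fi_atomic(ep, local_buf, 1, NULL, dest_addr,
                         addr, key, FI_UINT64, FI_SUM, context);
    }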

shefty commented 1 year ago

If the domain is a NIC, this is associated with the hardware NIC that is performing the operation, which is the target. It's possible for a NIC to be both the initiator and target.

shefty commented 1 year ago

I think what we may want is 'remote computation' semantics, where atomicity isn't guaranteed, but message and data ordering still apply. That would give atomic-like semantics as long as the application restricts its computations to a single target endpoint.
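As a very rough sketch of what an application might request with today's interfaces (these are the existing msg_order bits; a real 'remote computation' mode would need something new, so treat this as illustrative only):

    #include <rdma/fabric.h>

    /* Illustrative only: ask for send-after-send and atomic
     * write-after-write ordering on the transmit context, so operations
     * issued to a single target endpoint are applied in order even
     * without cross-initiator atomicity guarantees. */
    static struct fi_info *remote_compute_hints(void)
    {
        struct fi_info *hints = fi_allocinfo();

        if (!hints)
            return NULL;
        hints->caps = FI_ATOMIC | FI_RMA;
        hints->tx_attr->msg_order = FI_ORDER_SAS | FI_ORDER_ATOMIC_WAW;
        return hints;
    }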

a-szegel commented 1 year ago

https://github.com/ofiwg/libfabric/pull/8381

a-szegel commented 1 year ago

Would remote computation semantics be a different API, or a user info flag?

a-szegel commented 1 year ago

I think the current definition is aligned with what built-in hardware atomics provide. If we are emulating atomics... they are going to be slow no matter what. I don't think it is a bad thing that we have to do more work to correctly emulate something that a user is expecting. I would be hesitant to loosen our definition of atomics or provide a new API unless we know it would be useful to our users.

shefty commented 1 year ago

I haven't given it any thought beyond the above suggestion. It would be easiest to reuse the same APIs, at least for the providers (i.e. ep->atomic->write/readwrite/compwrite()). Those could have new inline wrappers, if needed. For the app, these could be defined as endpoint-based atomics, though I'm hesitant to call them atomics. This would need to be exposed to the app somehow, and we'd need to decide whether a provider would ever need to support both current atomics and remote compute semantics.
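For context, the provider entry points referenced above are the function pointers in struct fi_ops_atomic; abbreviated here (see rdma/fi_atomic.h for the full definition):

    struct fi_ops_atomic {
        size_t size;
        ssize_t (*write)(struct fid_ep *ep, const void *buf, size_t count,
                         void *desc, fi_addr_t dest_addr, uint64_t addr,
                         uint64_t key, enum fi_datatype datatype,
                         enum fi_op op, void *context);
        /* writev, writemsg, inject, and the *valid queries omitted */
        ssize_t (*readwrite)(struct fid_ep *ep, const void *buf, size_t count,
                             void *desc, void *result, void *result_desc,
                             fi_addr_t dest_addr, uint64_t addr, uint64_t key,
                             enum fi_datatype datatype, enum fi_op op,
                             void *context);
        ssize_t (*compwrite)(struct fid_ep *ep, const void *buf, size_t count,
                             void *desc, const void *compare,
                             void *compare_desc, void *result,
                             void *result_desc, fi_addr_t dest_addr,
                             uint64_t addr, uint64_t key,
                             enum fi_datatype datatype, enum fi_op op,
                             void *context);
        /* ... */
    };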

For apps that use MSG endpoints, they likely need atomic support between endpoints. For apps using RDM endpoints, remote compute may be sufficient.

shefty commented 1 year ago

I wouldn't say we're emulating atomics. We're using CPU atomics. If the target memory region is host memory, using the CPU, rather than having the NIC access the region over PCI, may be faster. So I wouldn't even go as far as to say it's slower. I don't know whether any NIC can perform atomic updates to GPU memory. If the target region is on the GPU, scheduling the operation on the GPU may be the preferred path.
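As a rough illustration of the host-memory case (hypothetical helper, not provider code), applying a received 64-bit FI_SUM with a C11 CPU atomic:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical target-side helper: apply a received 64-bit FI_SUM to
     * host memory using a CPU atomic rather than a NIC PCI atomic. */
    static void apply_sum_u64(void *target, uint64_t operand)
    {
        atomic_fetch_add_explicit((_Atomic uint64_t *) target, operand,
                                  memory_order_relaxed);
    }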

What we don't have is an industry standard for how a NIC can launch some sort of GPU kernel to execute.

a-szegel commented 1 year ago

The MPI Spec defines Atomic Consistency as follows:

The outcome of concurrent accumulate operations to the same location with the same predefined datatype is as if the accumulates were done at that location in some serial order. Additional restrictions on the operation apply; see the info key accumulate_ops in Section 12.2.1. Concurrent accumulate operations with different origin and target pairs are not ordered. Thus, there is no guarantee that the entire call to an accumulate operation is executed atomically. The effect of this lack of atomicity is limited: The previous correctness conditions imply that a location updated by a call to an accumulate operation cannot be accessed by a load or an RMA call other than accumulate until the accumulate operation has completed (at the target). Different interleavings can lead to different results only to the extent that computer arithmetics are not truly associative or commutative. The outcome of accumulate operations with overlapping types of different sizes or target displacements is undefined.

It appears that MPI has a tighter consistency requirement for atomic operations than libfabric. Libfabric promises that atomicity is only valid per NIC on the target, while MPI promises that concurrent operations on a target will always execute in some serial order. Since atomic operations are one-sided, MPI implementations can attempt to serialize on the sending side, because they have no control over the target side. A multi-NIC target adhering to the libfabric API has no way of meeting the MPI consistency requirements (if I am reading this correctly).

The CPU atomics in libfabric are tighter than the Libfabric API requires, which means that our host atomics do currently work for MPI.

I think we may need to tighten the definition of Libfabric atomics so we can meet the MPI requirements (or have the MPI standard make their definition looser so they can support hardware atomics on multi-NIC instances).
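To make the MPI requirement concrete, here is a minimal sketch (window setup and rank layout assumed elsewhere) where multiple origins concurrently accumulate into the same target location; MPI requires the outcome to be as if the MPI_SUM operations happened in some serial order:

    #include <mpi.h>
    #include <stdint.h>

    /* Each calling rank adds 1 to a uint64_t counter at displacement 0 in
     * target_rank's window.  Concurrent callers must see a result as if
     * the accumulates were serialized, regardless of which NIC carried
     * each operation. */
    static void add_one(MPI_Win win, int target_rank)
    {
        uint64_t one = 1;

        MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
        MPI_Accumulate(&one, 1, MPI_UINT64_T, target_rank,
                       0, 1, MPI_UINT64_T, MPI_SUM, win);
        MPI_Win_unlock(target_rank, win);
    }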

shefty commented 1 year ago

libfabric defines atomicity per domain at the target. A domain does not necessarily map to a NIC. (We frequently describe a domain as a NIC, but that's only to help people understand what a domain is.) In multi-rail cases, a single domain can span multiple NICs. It is the responsibility of the provider to ensure that atomic operations meet the atomic requirements. How the provider accomplishes this is implementation specific. It could use a single NIC for all atomic operations, or always use CPU atomics at the peer.

The initiator does not have control over the target implementation, but the initiator and target must agree on the message protocol used for atomics, and both sides must agree to the atomicity definition.
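A small sketch of what "same actor" means in practice (hypothetical helper): two fi_info results fall under one atomicity actor only if they name the same domain, since the man page scopes this to the domain name rather than to an open domain instance:

    #include <string.h>
    #include <rdma/fabric.h>

    /* Hypothetical check: atomicity is scoped to the domain name, not to
     * an open domain instance, so endpoints opened from infos with the
     * same domain name belong to the same actor. */
    static int same_atomic_actor(const struct fi_info *a,
                                 const struct fi_info *b)
    {
        return a->domain_attr && b->domain_attr &&
               !strcmp(a->domain_attr->name, b->domain_attr->name);
    }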

a-szegel commented 1 year ago

So on multi-NIC systems, MPI needs to use a multi-rail provider (every NIC on the instance in one domain) for Libfabric to be able to meet their consistency requirements? How do we communicate to MPI which providers are safe to use for atomics that meet their tighter definition of consistency (serialized atomic operations across every rank on the instance)?
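The closest existing query I'm aware of is fi_query_atomic(), which only reports whether a domain supports a given datatype/op combination, not the consistency scope being discussed here (sketch below, names hypothetical):

    #include <rdma/fi_atomic.h>
    #include <rdma/fi_domain.h>

    /* Existing capability query: returns nonzero if this domain supports
     * FI_SUM on FI_UINT64.  It says nothing about whether atomicity is
     * maintained across multiple NICs on the target. */
    static int supports_u64_sum(struct fid_domain *domain)
    {
        struct fi_atomic_attr attr;

        return fi_query_atomic(domain, FI_UINT64, FI_SUM, &attr, 0) == 0 &&
               attr.count > 0;
    }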

shefty commented 1 year ago

No, MPI does not need to use a multi-rail provider. A multi-rail provider is only one option. If MPI chooses to use multiple providers, there are no atomicity guarantees between providers.

I don't see where libfabric does not meet MPI requirements.

a-szegel commented 1 year ago

I see what you are saying... MPI can open multiple instances of a single provider domain name to meet its atomicity requirements.

a-szegel commented 1 year ago

@lrbison just identified the missing piece of the MPI docs:

The same area in memory may appear in multiple windows, each associated with a different window object. However, concurrent communications to distinct, overlapping windows may lead to undefined results.

It appears that if two ranks expose the same memory, concurrent communications to that memory are undefined. The endpoint's progress engine will synchronize calls from one rank, and if multiple ranks expose the endpoint, we get undefined behavior. This means that libfabric's definition of atomicity is stricter than MPI's (which is very good).