Add new capability/modifier/mode which allows users to tell providers that they do/don't want the peer provider being used

a-szegel commented 10 months ago

Inject size is set at fi_info time. The EFA provider has an 8k inject size, and the SHM provider has a 4k inject size (we plan on making this configurable in the future). I want a way to know at fi_info time whether the user wants the peer provider to be used, and I don't want to use an environment variable (see https://github.com/ofiwg/libfabric/issues/9450).

Proposal

I think a Primary Capability Modifier best fits what we need. I propose adding a new primary modifier such as FI_NO_PEER_PROVIDER.

Capabilities are defined by the libfabric API here:

Capabilities may be grouped into three general categories: primary, secondary, and primary modifiers. Primary capabilities must explicitly be requested by an application, and a provider must enable support for only those primary capabilities which were selected. Primary modifiers are used to limit a primary capability, such as restricting an endpoint to being send-only. If no modifiers are specified for an applicable capability, all relevant modifiers are assumed. See above definitions for details. Secondary capabilities may optionally be requested by an application. If requested, a provider must support the capability or fail the fi_getinfo request (FI_ENODATA).

Primary capabilities: FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_MULTICAST, FI_NAMED_RX_CTX, FI_DIRECTED_RECV, FI_VARIABLE_MSG, FI_HMEM, FI_COLLECTIVE, FI_XPU, FI_AV_USER_ID

Primary modifiers: FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE

Secondary capabilities: FI_MULTI_RECV, FI_SOURCE, FI_RMA_EVENT, FI_SHARED_AV, FI_TRIGGER, FI_FENCE, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SOURCE_ERR, FI_RMA_PMEM.
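To make the proposal a bit more concrete, here is a minimal sketch of how a provider might test such a bit in the caps hint at fi_getinfo time. It is a self-contained model, not libfabric code: `MY_NO_PEER_PROVIDER` (and its bit position), `struct hints_model`, `request_no_peer()` and `peer_provider_wanted()` are all hypothetical names, and no such flag exists in the API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed FI_NO_PEER_PROVIDER modifier.
 * The bit position is arbitrary and not taken from libfabric. */
#define MY_NO_PEER_PROVIDER (1ULL << 40)

/* Hypothetical, trimmed-down model of the hints passed to fi_getinfo(). */
struct hints_model {
	uint64_t caps;
};

/* Application side: ask the provider not to use its peer provider. */
void request_no_peer(struct hints_model *hints)
{
	assert(hints != NULL); /* the application allocates hints first */
	hints->caps |= MY_NO_PEER_PROVIDER;
}

/* Provider side: true when the peer provider may be used.
 * NULL hints mean the application expressed no preference. */
bool peer_provider_wanted(const struct hints_model *hints)
{
	if (hints == NULL)
		return true;
	return (hints->caps & MY_NO_PEER_PROVIDER) == 0;
}
```

The point of the sketch is only the timing: the decision is available while the fi_info is being built, which is when the inject size has to be fixed.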

Describe alternatives you've considered

Using the endpoint level setopt
- CON: EP setopt() gets called after ep init, and SHM EP is already created, and must be destroyed
- CON: too late to properly set inject size
Using domain level fi_open_ops()
- CON: fi_open_ops() gets called after domain init, and some shared SHM resources are created on domain init, and must be destroyed
- CON: too late to properly set inject size
Returning multiple fi_info structures based on whether FI_LOCAL_COMM is set
- CON: EFA Provider supports local comm without SHM (it's just slow), so the pivot on FI_LOCAL_COMM doesn't work well
Using ENV VAR
- CON: https://github.com/ofiwg/libfabric/issues/9450
Hard coding based on environment factors
- CON: Does not support future unknown use cases well

aingerson commented 10 months ago

@a-szegel Is the inject size mismatch the primary reason behind wanting to toggle the offload on and off? Whether a provider uses a peer provider internally is a decision within that provider and doesn't make sense to me to expose it as a capability.

j-xiong commented 10 months ago

Agree. If a provider decides to expose peer provider usage as an option to the user, the best way is to use different provider names.

a-szegel commented 10 months ago

Inject size mismatch is the reason I want to change it at fi_info time. Otherwise, it would be ok setting it at fi_domain creation time... but I don't want to use an env variable, so I need some way of programmatically passing what I want into the provider (back to fi_info time).

a-szegel commented 10 months ago

I understand the inject_size mismatch can be solved by making peer's inject size configurable too.

aingerson commented 10 months ago

Until it is configurable, would it be possible for efa to do a dummy call into fi_getinfo(shm) during efa getinfo to query its inject size in order to properly select it if the inject size works? Otherwise, I think it's ok for efa to hard code that internally
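A rough sketch of that idea, with the shm query modeled as a callback so the example stays self-contained. Everything named here (`inject_query_fn`, `fake_shm_query()`, `pick_inject_size()`) is hypothetical, and reporting the smaller of the two sizes when shm is available is an assumption about what "properly select" would mean, not something the efa provider is known to do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for a dummy fi_getinfo() query
 * against the shm provider. Returns 0 on success. */
typedef int (*inject_query_fn)(size_t *inject_size);

/* Example stub that reports the 4k SHM inject size quoted in this issue. */
int fake_shm_query(size_t *inject_size)
{
	*inject_size = 4096;
	return 0;
}

/* Pick the inject size efa would report in its fi_info.
 * Assumption: when shm is usable, report the smaller of the two sizes
 * so that an injected message fits on either path. */
size_t pick_inject_size(size_t efa_size, inject_query_fn query_shm)
{
	size_t shm_size = 0;

	assert(efa_size > 0);
	if (query_shm == NULL || query_shm(&shm_size) != 0)
		return efa_size; /* shm not available: efa alone decides */
	return shm_size < efa_size ? shm_size : efa_size;
}
```

With the sizes mentioned above, `pick_inject_size(8192, fake_shm_query)` yields 4096, and passing `NULL` for the query keeps the efa value of 8192.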

shijin-aws commented 10 months ago

Migrating some offline conversation here:

Sean Hefty: The use of peer providers should be hidden from the app. But if provider control is desired, the way to indicate that would be through the prov_name attribute, which would allow greater flexibility in how providers are selected.

Shi Jin: But if the application doesn't know that peer providers exist, how could it know what prov_name to add beyond the owner provider it is using?

Sean Hefty: I said "should be hidden". Setting FI_NO_PEER_PROVIDER indicates that the app knows of the peer provider architecture, and also has insight into which providers pair with which others. The app then somehow decides to disable peers for some reason. I don't know how that gets done without using some environment variable. If the app somehow already has implicit knowledge of how providers are constructed and whether the peer APIs are being used for that, versus a provider like EFA simply building the shared memory support in directly (such that it's not a peer), then prov_name would allow for more explicit provider composition. For example, use exactly these 3 providers as peers, or layer provider X over Y and use Z as a peer with X.

shijin-aws commented 10 months ago

@j-xiong

> Agree. If a provider decides to expose peer provider usage as an option to the user, the best way is to use different provider names.

That's the tricky part: we never expose peer provider usage to the application, and the application only uses efa as the provider name. But some applications assume that with such a configuration all transmissions (including local comm) go through the NIC, which is not true if efa implicitly uses shm as a peer provider.

I am not even sure such an assumption is valid, but I am looking for ways to help them in this edge case without setting an environment variable.

aingerson commented 10 months ago

@shijin-aws I think the suggestion regarding provider names is to have behavior similar to what we do with FI_HMEM or rxm, where you have multiple fi_info entries for efa, with and without shm, and the provider names differ, i.e. efa and efa;shm. The way for an efa application to explicitly avoid shm would be to select the efa-only fi_info, but you could order them efa;shm, efa so that by default you would select the shm offload (if your default for shm is on). Would that address the case you're concerned with?

shijin-aws commented 10 months ago

@aingerson The suggestion is feasible, but rxm is not a core provider, so the case is not the same: both efa and shm are core providers, and shm can be an offload for intra-node comm in some situations... I think the question is actually: if an application only has efa in prov_name, is it required that only the efa provider is used? I think the answer is yes, and currently the efa provider is not obeying that requirement.

aingerson commented 10 months ago

@shijin-aws I think if the application just sets efa you could still return both options. The proposal in this issue is to add an option for the user to select efa with or without shm. Returning two different fi_infos, one with shm and one without, still allows them to select efa with or without shm. The difference is that instead of adding a capability passed into the fi_getinfo call (as suggested), you would return both options and allow the application to select which one it needs. The old case would still work as before (user requests efa and expects efa+shm, which is the first fi_info). The new case, disabling shm, would require application modification either way (in the other suggested method it would need to pass in the capability bit); here the application would skip the efa;shm info and use the efa-only info. The other option, though it feels hacky in my opinion, would be to have two fi_infos: efa (which uses shm) and efa^shm (with shm disabled). The application could then request prov_name="efa^shm" specifically, which would not return the regular efa+shm. Whichever solution you choose, you're going to need something at the application level to trigger the selection (whether you set hints->caps, set the provider name to "efa^shm", or filter through the list to skip the efa+shm option). But I think it needs to be something efa-specific, since the internal shm usage is an efa-specific option.
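For the "filter through the list" variant, the application-side logic could look roughly like the sketch below. It uses a hypothetical, trimmed-down `struct info_model` instead of the real `struct fi_info` (where the name lives in `fabric_attr->prov_name` and entries are chained through `next`) so that it stays self-contained; `find_by_prov_name()` is likewise a made-up helper, and the provider names are just the ones floated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down model of the list returned by fi_getinfo():
 * only the provider name and the next pointer are kept. */
struct info_model {
	const char *prov_name;
	struct info_model *next;
};

/* Return the first entry whose provider name matches exactly, or NULL
 * when no entry matches. */
const struct info_model *find_by_prov_name(const struct info_model *list,
					   const char *name)
{
	const struct info_model *cur;

	assert(name != NULL);
	for (cur = list; cur != NULL; cur = cur->next) {
		if (cur->prov_name != NULL && strcmp(cur->prov_name, name) == 0)
			return cur;
	}
	return NULL;
}
```

An unmodified application keeps taking the head of the list (efa;shm under the ordering described above), while an application that wants to avoid shm would call `find_by_prov_name(list, "efa")` and open its endpoint from that entry.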

a-szegel commented 9 months ago

We plan on adding a new option, FI_OPT_SHARED_MEMORY_PERMITTED, to the endpoint setopt enum that holds FI_OPT_FI_HMEM_P2P. We also plan on moving all SHM initialization inside the efa provider to EP creation.

j-xiong commented 8 months ago

Addressed by #9750.

ofiwg / libfabric

Add new capability/modifier/mode which allows users to tell providers that they do/don't want the peer provider being used #9630