Open shefty opened 1 year ago
A first step is to report the part of the collective implementation that is of interest to the app: offload vs. software. A future step might be to report tunable values.
Are flags sufficient to report algorithm / protocol?
We need to consider the link provider, which may combine multiple providers as 'one' even though only one of them may have collective acceleration.
This is indirectly related to PR #8264.
There is currently no mechanism for an app to determine how a provider implements collectives. For example, are the collective calls implemented in software or offloaded to a switch? Additionally, the app has no knowledge of which algorithm a collective implementation may use. Software could have a dozen options available. And although hardware may only support one algorithm today, that may not always be the case.
The request is to expose more details on the collective algorithms or protocols that a provider may support. Paired with that would be the ability of an application to control which algorithm/protocol a specific collective call should use.
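
To make the request concrete, here is a minimal sketch of what a flag-based report could look like. Everything in it is hypothetical: the `HYP_COLL_*` flags, `struct hyp_coll_caps`, and both helper functions are invented for illustration and are not part of libfabric. It assumes a single 64-bit flags word in which the low byte reports the implementation type (software vs. offload) and the next byte reports the supported algorithms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for illustration only; none of these exist in libfabric. */
#define HYP_COLL_IMPL_SW      (1ULL << 0)   /* collective runs in software */
#define HYP_COLL_IMPL_OFFLOAD (1ULL << 1)   /* collective is offloaded, e.g. to a switch */
#define HYP_COLL_ALGO_RING    (1ULL << 8)   /* ring algorithm available */
#define HYP_COLL_ALGO_TREE    (1ULL << 9)   /* tree algorithm available */
#define HYP_COLL_ALGO_RECDBL  (1ULL << 10)  /* recursive doubling available */

#define HYP_COLL_ALGO_MASK    0xff00ULL     /* bits reserved for algorithms */

/* Hypothetical capability report a provider might return per collective. */
struct hyp_coll_caps {
    uint64_t flags;
};

/* Isolate the lowest set bit of x; yields 0 when x is 0. */
static inline uint64_t hyp_lowest_bit(uint64_t x)
{
    return x & (~x + 1);
}

/* Return nonzero if the provider reports an offloaded implementation. */
static inline int hyp_coll_is_offloaded(const struct hyp_coll_caps *caps)
{
    assert(caps != NULL);
    return (caps->flags & HYP_COLL_IMPL_OFFLOAD) != 0;
}

/*
 * Choose the algorithm for a collective call. If any preferred algorithm
 * is supported, return one of them (the lowest set bit); otherwise fall
 * back to the lowest supported algorithm bit. Returns 0 if the provider
 * reports no algorithms at all.
 */
static inline uint64_t hyp_coll_select_algo(const struct hyp_coll_caps *caps,
                                            uint64_t preferred)
{
    uint64_t supported;

    assert(caps != NULL);
    supported = caps->flags & HYP_COLL_ALGO_MASK;
    if (preferred & supported)
        return hyp_lowest_bit(preferred & supported);
    return hyp_lowest_bit(supported);
}
```

A sketch like this suggests that flags could cover the implementation type and a small, fixed set of algorithms. It does not cover tunable values, which would need some other reporting mechanism.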