Problem
Many functionalities in the MPI standard require that the location of the underlying physical processes associated with MPI processes is well defined and that processes do not migrate between hardware resources. The standard, however, neglects this in many places, leaving users and developers with unclear ways to use MPI interfaces or with unmet requirements for optimizations, and leaving MPI implementations to come up with their own solutions.
Virtual topologies:
An optimized mapping of MPI processes to physical processes is only valid as long as the physical processes do not migrate. If a communicator is created and reordering is allowed for optimization, the implementation may take communication paths into account when assigning ranks. Process migration can prevent such optimizations or render them useless, e.g., if an implementation takes NUMA properties into account.
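To make the dependency concrete, here is a minimal sketch (not taken from the original text) of a reorder-enabled topology creation; any rank permutation the implementation computes here reflects the placement of the processes at creation time and is silently invalidated if they later migrate.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int size, dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);

        /* reorder = 1: the implementation may permute ranks so that grid
         * neighbors are close in the hardware topology as it is right now. */
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

        /* ... halo exchanges with MPI_Cart_shift neighbors; the benefit of
         * the optimized mapping is lost if physical processes migrate ... */

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }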
Communicators:
The creation of communicators with MPI_Comm_split_type() relies on the locality properties of the processes not changing during the call. Further, the new communicator is not guaranteed to retain these properties, so users cannot rely on them without additional checks/code (a user-side sketch follows after the quoted advice).
The standard mentions here:
Advice to users. Since the location of some of the MPI processes may change during the application execution, the communicators created with the value MPI_COMM_TYPE_SHARED before this change may not reflect an actual ability to share memory between MPI processes after this change. (End of advice to users.)
(MPI 4.0, p.339)
or
Advice to users. The set of hardware resources that an MPI process is able to utilize may change during the application execution (e.g., because of the relocation of an MPI process), in which case the communicators created with the value MPI_COMM_TYPE_HW_GUIDED before this change may not reflect the utilization of hardware resources of such process at any time after the communicator creation. (End of advice to users.)
(MPI 4.0, p.340)
or
The set of hardware resources an MPI process utilizes may change during the application execution (e.g., because of process relocation), in which case the communicators created with the value MPI_COMM_TYPE_HW_UNGUIDED before this change may not reflect the utilization of hardware resources for such process at any time after the communicator creation. (End of advice to users.)
(MPI 4.0, p.341)
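As an illustration of the additional user-side checks mentioned above, a minimal sketch (an assumption, not taken from the standard): the only portable re-check available today is to repeat the split and compare the resulting groups, which itself only reflects locality at the moment of the call. MPI_Init is assumed to have been called.

    MPI_Comm shmcomm, recheck;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    /* ... time passes; processes may have been migrated ... */

    /* Repeat the split and compare: if the groups differ, shmcomm may no
     * longer reflect an actual ability to share memory. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &recheck);
    int result;
    MPI_Comm_compare(shmcomm, recheck, &result);
    if (result != MPI_IDENT && result != MPI_CONGRUENT) {
        /* Locality changed; fall back to a non-shared-memory code path. */
    }
    MPI_Comm_free(&recheck);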
MPI shared memory:
MPI shared memory allows local and remote processes to access window memory via load/store operations through a baseptr obtained via, e.g., MPI_Win_allocate_shared or MPI_Win_shared_query. Especially in the case of load/store access from remote processes, this requires that the target does not migrate and thereby invalidate the baseptr, since the pointer is used outside of MPI.
Here the standard puts the burden onto the user, without help from MPI, cf. the advice to users for MPI_Comm_split_type with MPI_COMM_TYPE_SHARED on p.339:
It is the user’s responsibility to ensure that the communicator comm represents a group of processes that can create a shared memory segment that can be accessed by all processes in the group.
(MPI 4.0, p.559)
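A minimal sketch of the load/store pattern that depends on stable placement (assuming shmcomm was created with MPI_COMM_TYPE_SHARED as above; error handling omitted). The peer's base pointer obtained from MPI_Win_shared_query is used outside MPI, so MPI cannot intervene if the target process migrates.

    MPI_Win win;
    double *mybase, *peerbase;
    MPI_Aint qsize;
    int rank, nprocs, disp_unit;

    MPI_Comm_rank(shmcomm, &rank);
    MPI_Comm_size(shmcomm, &nprocs);
    MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, shmcomm, &mybase, &win);

    /* Base address of the next rank's segment, accessed directly via
     * load/store; this pointer bypasses MPI entirely. */
    MPI_Win_shared_query(win, (rank + 1) % nprocs, &qsize, &disp_unit,
                         &peerbase);

    MPI_Win_fence(0, win);
    peerbase[0] = 42.0;   /* direct store into the peer's memory */
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);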
MPI_Get_processor_name:
Is allowed to return a different value with each call, but this has no negative effect, as the result is purely informative for the user.
MPI_Sessions:
TBD
Proposal
We should address this in some way. Possible approaches:
Add new procedure to bind MPI processes to physical resources
int MPI_Process_bind(resource_specifier, info)
TODO: This seems to have been proposed before, add reference to the old proposal
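A purely hypothetical usage sketch of the proposed procedure; MPI_Process_bind, the resource specifier "numa_domain", and the "bind_policy" key below are placeholders and not part of the MPI standard.

    /* Hypothetical sketch: MPI_Process_bind does not exist in the standard. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "bind_policy", "strict");   /* placeholder hint */

    /* Request that this MPI process be bound to, and stay on, its current
     * hardware resource of the given kind for the rest of the execution. */
    MPI_Process_bind("numa_domain", info);

    MPI_Info_free(&info);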
Handle at communicator creation via info argument
Provide binding specification in info object to procedures that create communicators
info { 'resources_bound': 'true' }
PRO: MPI_Win_allocate_shared can check if provided communicator was created with the appropriate info/property
PRO: user has control
PRO: uses existing info interface
CON: binding is optional for the user and will require checks in the relevant code paths even for communicators intended for binding
CON: requires boilerplate code from users to create the info object
CON: not all communicator creation functions provide an info argument as input
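A sketch of how the info-based approach could look from the user side; the "resources_bound" key is the one proposed above and is not a standard info key.

    MPI_Info info;
    MPI_Comm shmcomm;
    MPI_Info_create(&info);
    MPI_Info_set(info, "resources_bound", "true");   /* proposed, non-standard key */

    /* Ask the implementation to bind the processes of the new communicator
     * to their current hardware resources. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        info, &shmcomm);
    MPI_Info_free(&info);

    /* MPI_Win_allocate_shared could later verify that shmcomm carries this
     * property before handing out load/store-capable base pointers. */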
Handle at the communicator level as a hidden communicator property
communicators provide a hidden binding feature/property
-> "High quality implementations will ensure that the created communicator ...
by binding the MPI processes in the group of the communicator to appropriate physical processes/resources."
PRO: MPI_Win_allocate_shared can check if provided communicator was created with the appropriate info/property
CON: Not clear if MPI_Win_allocate_shared should fail or show the old behaviour
CON: Users cannot explicitly request or verify the binding, since the property is hidden
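To illustrate the open question for this approach, a sketch of what "fail" would mean on the user side (assumption: the implementation reports an error instead of falling back to the old behaviour; shmcomm is an existing communicator as in the sketches above). Users would have to switch the error handler and check the return code.

    /* Assumption: MPI_Win_allocate_shared raises an error on a communicator
     * that lacks the (hidden) binding property instead of silently keeping
     * the old behaviour. */
    MPI_Comm_set_errhandler(shmcomm, MPI_ERRORS_RETURN);

    MPI_Win win;
    double *base;
    int err = MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                                      MPI_INFO_NULL, shmcomm, &base, &win);
    if (err != MPI_SUCCESS) {
        /* shmcomm was not created with the binding property, or shared
         * memory is otherwise not available. */
    }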
Changes to the Text
TBD
processes with same name in a pset?
Impact on Implementations
Will already have some kind of related hardware query functionality
Can allow optimizations in several functionalities if binding is ensured
If necessary, new interfaces need to be added
Impact on Users
Clarify and ensure intended usage of MPI functions
References and Pull Requests
TBD
Notes from discussions in the WG so far:
Solving the binding question at the global level seems not possible, as it would lead to too wide a scope for sessions, etc.
Solving it at the communicator level may be too narrow and lead to conflicts, e.g., MPI processes that are members of two communicators whose ranks can be translated between both, with both communicators having competing binding requirements (see the sketch after these notes).
MPI_Win_shared is broken in general, so should we care? Issues with it are also neglected in the fault tolerance approaches.
The binding problem seems to be one part of a larger conceptual problem: MPI follows the CSP model while facing requirements that call for an agent-based model.
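A sketch of the overlap mentioned above (illustrative only): the same MPI process is a member of two communicators created by different splits; if binding were a per-communicator property, each of them could impose its own, possibly conflicting, requirement on that process.

    int rank, rank_in_a, rank_in_b;
    MPI_Comm comm_a, comm_b;
    MPI_Group grp_a, grp_b;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two different, overlapping groupings of the same processes. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, 0, &comm_a);   /* even/odd */
    MPI_Comm_split(MPI_COMM_WORLD, rank / 2, 0, &comm_b);   /* pairs    */

    /* The calling process has a valid rank in both groups, i.e., the ranks
     * can be translated between the communicators, yet comm_a and comm_b
     * could each carry a competing binding requirement. */
    MPI_Comm_group(comm_a, &grp_a);
    MPI_Comm_group(comm_b, &grp_b);
    MPI_Comm_rank(comm_a, &rank_in_a);
    MPI_Group_translate_ranks(grp_a, 1, &rank_in_a, grp_b, &rank_in_b);

    MPI_Group_free(&grp_a);
    MPI_Group_free(&grp_b);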