Closed raffenet closed 2 years ago
Basically, the request is to provide an MPIX function that queries another process's status. This is useful for fault tolerance in general.
Proposed routine:
int MPIX_Comm_check_status(MPI_Comm comm, int rank, int *is_alive);
"status" can be confused with MPI_Status. My preferred name is MPIX_Comm_check_alive, which is more intuitive.
MPIX_Comm_failure_ack (proposed in ULFM) can be renamed to MPIX_Comm_check_alive_all 😃
Have you looked at https://www.sciencedirect.com/science/article/pii/S0167819118303867 ?
I have not. Thanks for the pointer!
All,
And it’s supporting FAILED images in Intel Fortran which we are talking about now.
Note that GNU Fortran uses Open Coarrays. Intel Fortran uses Intel MPI and would like to be able to use MPICH.
-John
(Sent by email in reply to Hui Zhou's comment of Friday, April 1, 2022, on "[pmodels/mpich] supporting Fortran coarrays (Issue #5788)".)
Hydra supports a "PMI_dead_processes" key in both PMI1 and PMI2 servers. It returns a comma-separated list of the world ranks that the process manager has detected to have failed. We used it to support the MPICH ULFM implementation in the past.
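To make the format concrete, here is a minimal sketch of turning such a value into an array of ranks. It assumes the value is a plain comma-separated list of non-negative integers, as described above; the function name parse_dead_ranks is hypothetical and not part of MPICH or Hydra, and anything else (for example a range such as "1-3") is rejected rather than interpreted.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not an MPICH or Hydra function.
 * Parses a comma-separated list of world ranks such as "1,3,7" into out[].
 * Returns the number of ranks stored, 0 for an empty or NULL value
 * (no failures reported), or -1 if the input is malformed or holds
 * more than max_out ranks. */
int parse_dead_ranks(const char *value, int *out, int max_out)
{
    int count = 0;
    const char *p = value;

    if (value == NULL || *value == '\0')
        return 0;

    for (;;) {
        char *end;
        long rank;

        errno = 0;
        rank = strtol(p, &end, 10);
        if (end == p || errno != 0 || rank < 0 || rank > INT_MAX)
            return -1;          /* not a valid non-negative rank */
        if (count >= max_out)
            return -1;          /* caller's buffer is too small */
        out[count++] = (int) rank;

        if (*end == '\0')
            break;              /* consumed the whole list */
        if (*end != ',')
            return -1;          /* unexpected separator, e.g. a range */
        p = end + 1;
    }
    return count;
}
```

A caller would pass the string obtained from the PMI key lookup together with a buffer sized for the world communicator.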
The proposed MPIX_Comm_get_failed(comm, &fgrp) (https://fault-tolerance.org/2019/08/26/simplifying-the-ack-get_acked-couple/) seems to be the most straightforward interface for this purpose (discovery only, mitigation later). I am trying to gauge how likely it is to be finalized and make it into the standard at some point.
Actually we still have MPIR_pmi_get_failed_procs in the PMI utils to get this information. We just need to glue it up with whatever API we choose to expose.
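As a rough illustration of what that glue could look like for a single-rank query, the sketch below answers an is_alive question directly from a comma-separated failed-rank string. The name check_alive_from_list is hypothetical and the signature only loosely mirrors the routine proposed earlier in this thread; it is not how MPICH implements anything. It works on world ranks, so a rank in an arbitrary communicator would first need translating to its world rank.

```c
#include <stdlib.h>

/* Hypothetical helper, not an MPICH function.
 * Sets *is_alive to 0 if world_rank appears in dead_list (a
 * comma-separated list of world ranks), and to 1 otherwise.
 * Returns 0 on success, or -1 if the list is malformed, in which
 * case *is_alive should not be relied on. */
int check_alive_from_list(const char *dead_list, int world_rank, int *is_alive)
{
    const char *p = dead_list;

    *is_alive = 1;              /* alive unless found in the list */
    if (p == NULL)
        return 0;

    while (*p != '\0') {
        char *end;
        long r = strtol(p, &end, 10);

        if (end == p)
            return -1;          /* no digits where a rank was expected */
        if (r == world_rank)
            *is_alive = 0;

        if (*end == ',')
            p = end + 1;        /* move on to the next entry */
        else if (*end == '\0')
            p = end;            /* end of list */
        else
            return -1;          /* unexpected character */
    }
    return 0;
}
```

Scanning the string on each call keeps the sketch short; an implementation that is queried often would more likely cache the parsed list.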
What does this mean for a user of MPICH? Is MPIX_Comm_get_failed implemented? Will it be part of the MPI standard or an MPICH extension?
Yes, MPIX_Comm_get_failed is implemented in MPICH, available in mpich-4.1a1. Until the ULFM proposal is voted in, it will remain an MPICH extension.
This issue comes from an email discussion with some Fortran compiler devs. There exists today support in Intel MPI to enable Fortran coarrays. The Fortran runtime does some additional handshake with MPI to support the STOPPED IMAGES and FAILED IMAGES features. Here are some of the handshake/query details.
The actual information here is probably provided by Hydra via PMI_Get, if I had to guess. We can use this issue as a way to collect more information and requirements, and link PR prototypes.