pmodels / mpich

Official MPICH Repository
http://www.mpich.org

supporting Fortran coarrays #5788

Closed raffenet closed 2 years ago

raffenet commented 2 years ago

This issue comes from an email discussion with some Fortran compiler developers. Intel MPI already supports Fortran coarrays: the Fortran runtime performs an additional handshake with MPI to support the STOPPED IMAGES and FAILED IMAGES features. Here are some of the handshake/query details.

We got them to implement a status-query routine like this:

    typedef enum {
        no_information = 0,
        exit_code_only
        /* --any-extensions-- */
    } img_report_type_t;

    typedef struct {
        img_report_type_t report_type;
        int image_exit_code;             /* If image terminated, 0 else. */
        char pad[32];                    /* For future extensions */
    } img_report_t;

    typedef enum {
        MPI_IMAGE_STATUS_UNKNOWN,      /* 0; 3-rd party process manager or the image is being started */
        MPI_IMAGE_STATUS_RUNNING,      /* 1 */
        MPI_IMAGE_STATUS_TERMINATED,   /* 2 */
        MPI_IMAGE_STATUS_NO_SUCH_IMAGE /* 3; Never existed */
    } MPI_img_status_t;

    typedef MPI_img_status_t I_MPI_Check_image_status_type (
        /*IN*/  unsigned int rank,
        /*IN*/  MPI_Comm comm,
        /*OUT*/ img_report_t * img_report);

    static I_MPI_Check_image_status_type *MPI_status_routine_ptr = NULL;

We then did the Linux or Windows “weak” definition, e.g.:

    extern I_MPI_Check_image_status_type I_MPI_Check_image_status;

The actual information here is probably provided by Hydra via PMI_Get, if I had to guess. We can use this issue as a way to collect more information and requirements, and link PR prototypes.
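
The "weak" definition handshake quoted above can be sketched roughly as follows. This is only an illustration, assuming GCC or Clang on an ELF platform (where an unresolved weak function reference has a NULL address); the Windows mechanism differs and is not shown. The names `demo_check_image_status`, `img_exit_report_t`, `bind_status_routine`, and `query_image_status` are hypothetical stand-ins, and the types are simplified so the sketch needs no MPI headers.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the types quoted in the issue. */
typedef enum {
    IMG_UNKNOWN = 0,
    IMG_RUNNING,
    IMG_TERMINATED,
    IMG_NO_SUCH_IMAGE
} img_status_t;

typedef struct {
    int image_exit_code;   /* exit code if the image terminated, 0 otherwise */
} img_exit_report_t;

typedef img_status_t check_image_status_fn(unsigned int rank,
                                           img_exit_report_t *report);

/* Weak reference: if no linked library defines this symbol, its address
 * is NULL instead of causing a link error (assumes GCC/Clang on ELF). */
extern check_image_status_fn demo_check_image_status __attribute__((weak));

static check_image_status_fn *status_routine_ptr = NULL;

/* Bind once at runtime startup.
 * Returns 1 if the library provides the hook, 0 otherwise. */
int bind_status_routine(void)
{
    status_routine_ptr = demo_check_image_status;
    return status_routine_ptr != NULL;
}

/* Query an image; reports IMG_UNKNOWN when the hook is not available. */
img_status_t query_image_status(unsigned int rank, img_exit_report_t *report)
{
    assert(report != NULL);
    report->image_exit_code = 0;
    if (status_routine_ptr == NULL)
        return IMG_UNKNOWN;
    return status_routine_ptr(rank, report);
}
```

With this pattern the Fortran runtime links against any MPI library and degrades to "unknown" when the library does not export the status routine.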

hzhou commented 2 years ago

Basically, the request is to provide an MPIX function that queries another process's status. This is useful for fault-tolerance in general.

hzhou commented 2 years ago

Proposed routine:

int MPIX_Comm_check_status(MPI_Comm comm, int rank, int *is_alive);

hzhou commented 2 years ago

Ref: https://github.com/mpiwg-ft/ft-issues/issues/15

hzhou commented 2 years ago

status can be confused with MPI_Status. My preferred name is MPIX_Comm_check_alive -- more intuitive.

hzhou commented 2 years ago

MPIX_Comm_failure_ack (proposed in ULFM) can be renamed to MPIX_Comm_check_alive_all 😃

abouteiller commented 2 years ago

Have you looked at https://www.sciencedirect.com/science/article/pii/S0167819118303867 ?

hzhou commented 2 years ago

> Have you looked at https://www.sciencedirect.com/science/article/pii/S0167819118303867 ?

I have not. Thanks for the pointer!

FRTL1 commented 2 years ago

All,

To be clear, it is support for FAILED IMAGES in Intel Fortran that we are talking about here.

Note that GNU Fortran uses Open Coarrays. Intel Fortran uses Intel MPI and would like to be able to use MPICH.

           -John


raffenet commented 2 years ago

Hydra supports a "PMI_dead_processes" key in both PMI1 and PMI2 servers. It returns a comma-separated list of world rank values that the process manager has detected to have failed. We used it to support the MPICH ULFM implementation in the past.
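
For illustration, a consumer of such a comma-separated rank list might parse it along these lines. This is a minimal sketch, not MPICH's actual parsing code: the function name `parse_failed_ranks` is hypothetical, and it assumes the value contains only plain non-negative integers such as `"1,3,7"`. Anything else, including range syntax, is rejected as malformed.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a list such as "1,3,7" into out[].
 * Returns the number of ranks parsed (0 for an empty string), or -1 if
 * the input is malformed or holds more than max_out entries. */
int parse_failed_ranks(const char *list, int *out, int max_out)
{
    assert(list != NULL && out != NULL && max_out >= 0);

    int n = 0;
    const char *p = list;
    while (*p != '\0') {
        char *end;
        errno = 0;
        long v = strtol(p, &end, 10);
        if (end == p || errno != 0 || v < 0 || v > INT_MAX)
            return -1;              /* not a valid non-negative rank */
        if (n >= max_out)
            return -1;              /* caller's buffer is too small */
        out[n++] = (int) v;

        if (*end == ',') {
            p = end + 1;
            if (*p == '\0')
                return -1;          /* trailing comma */
        } else if (*end == '\0') {
            p = end;                /* end of list */
        } else {
            return -1;              /* unexpected character, e.g. a range */
        }
    }
    return n;
}
```

A caller would fetch the key's value from the process manager, pass it to this function, and then map the resulting world ranks onto whatever communicator or group the exposed API works with.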

hzhou commented 2 years ago

The proposed MPIX_Comm_get_failed(comm, &fgrp) (https://fault-tolerance.org/2019/08/26/simplifying-the-ack-get_acked-couple/) seems to be the most straightforward interface for this purpose (discovery only, mitigation later). I am trying to gauge how likely it is to be finalized and make it into the standard at some point.

raffenet commented 2 years ago

> Hydra supports a "PMI_dead_processes" key in both PMI1 and PMI2 servers. It returns a comma-separated list of world rank values that the process manager has detected to have failed. We used it to support the MPICH ULFM implementation in the past.

Actually, we still have MPIR_pmi_get_failed_procs in the PMI utils to get this information. We just need to wire it up to whatever API we choose to expose.

FRTL1 commented 2 years ago

What does this mean for a user of MPICH? Is MPIX_Comm_get_failed implemented? Will it be part of the MPI standard or an MPICH extension?

hzhou commented 2 years ago

Yes, MPIX_Comm_get_failed is implemented in MPICH and available in mpich-4.1a1. Until the ULFM proposal is voted in, it will remain an MPICH extension.