sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

Defect: internal error when accessing array-coarray of derived type with allocatable component #739

Closed hsnyder closed 1 year ago

hsnyder commented 2 years ago

The following program produces incorrect results. The line marked ! NOTE should print 1, 2, 3, 4, but instead I get "OpenCoarrays internal error on image 2: libcaf_mpi::caf_sendget_by_ref(): can not allocate 0 bytes of memory."

Versions:

GCC 11.2.0
OpenCoarrays 2.9.2
MPICH 3.2

program bug
    type :: container
        integer, allocatable :: stuff(:)
    end type

    type(container) :: co_containers(10)[*]

    if (this_image() == 1) then
        allocate(co_containers(2)%stuff(4))
        co_containers(2)%stuff = [1,2,3,4]
    end if

    sync all

    if (this_image() == 2) then
        print *, co_containers(2)[1]%stuff  ! NOTE
    end if
end program

rouson commented 2 years ago

@vehre please let us know if you have the time and interest to work on this issue.

This seems similar to code that I referenced via a link in a comment on issue #700. I'm posting that link again below because the initial comment on issue #700 included only a reduced version of what I was trying to do. I hope to find time to create a separate issue for the larger example, but for now, here it is:

Link to the somewhat larger demonstrator

rouson commented 1 year ago

@vehre this has entered our critical path for a paper draft due in December. Any chance you can work on a fix soon?

vehre commented 1 year ago

@rouson I can find some time to work on this, but I have to report: The example in the description works for me on:

Only the example behind the "Link to the somewhat larger demonstrator" link crashes on init, with a PMPI_Win_allocate: Invalid topology,... error. If you want me to investigate further, just let me know.

rouson commented 1 year ago

@vehre Yes, please investigate further. @everythingfunctional encountered this same issue yesterday. I'll let him confirm his setup. I have a broken installation at the moment after problems with a macOS upgrade so I'm not able to confirm this immediately.

everythingfunctional commented 1 year ago

I'm running with:

The following reproducer is what led me back here:

module payload_m
    implicit none
    private
    public :: payload_t, empty_payload

    type :: payload_t
        !! A raw buffer to facilitate data transfer between images
        !!
        !! Facilitates view of the data as either a string or raw bytes.
        !! Typical usage will be either to
        !! * produce a string representation of the data, and then parse that string to recover the original data
        !! * use the `transfer` function to copy the raw bytes of the data
        private
        integer, allocatable, public :: payload_(:)
    contains
        private
        procedure, public :: raw_payload
        procedure, public :: string_payload
    end type

    interface payload_t
        pure module function from_raw(payload) result(new_payload)
            implicit none
            integer, intent(in) :: payload(:)
            type(payload_t) :: new_payload
        end function

        pure module function from_string(payload) result(new_payload)
            implicit none
            character(len=*), intent(in) :: payload
            type(payload_t) :: new_payload
        end function

        module procedure empty_payload
    end interface

    interface
        pure module function empty_payload()
            implicit none
            type(payload_t) :: empty_payload
        end function

        pure module function raw_payload(self)
            implicit none
            class(payload_t), intent(in) :: self
            integer, allocatable :: raw_payload(:)
        end function

        pure module function string_payload(self)
            implicit none
            class(payload_t), intent(in) :: self
            character(len=:), allocatable :: string_payload
        end function
    end interface

end module

submodule(payload_m) payload_s
    implicit none
contains
    module procedure from_raw
        new_payload%payload_ = payload
    end procedure

    module procedure from_string
        new_payload = payload_t([len(payload), transfer(payload,[integer::])])
    end procedure

    module procedure empty_payload
        empty_payload%payload_  = [integer::]
    end procedure

    module procedure raw_payload
        if (allocated(self%payload_)) then
            raw_payload = self%payload_
        else
            raw_payload = [integer::]
        end if
    end procedure

    module procedure string_payload
        if (allocated(self%payload_)) then
            if (size(self%payload_) > 0) then
                allocate(character(len=self%payload_(1)) :: string_payload)
                if (len(string_payload) > 0) &
                    string_payload = transfer(self%payload_(2:),string_payload)
            else
                allocate(character(len=0) :: string_payload)
            end if
        else
            allocate(character(len=0) :: string_payload)
        end if
    end procedure

end submodule

program example
    use payload_m, only: payload_t

    character(len=*), parameter :: MESSAGE = "Hello, World!"
    type(payload_t) :: mailbox[*]

    if (this_image() == 1) then
        mailbox = payload_t(MESSAGE)
    end if
    sync all
    if (this_image() /= 1) then
        mailbox = mailbox[1]
    end if

    print *, mailbox%string_payload(), " from image: ", this_image()
end program

which crashes as follows:

$ caf -g -fbacktrace main.f90 -o main
$ cafrun -n 2 ./main
 Hello, World! from image:            1

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f98c1e519ff in ???
#1  0x557dbdb21c12 in __payload_m_MOD_string_payload
    at /home/brad/examples/coarray-allocatable-components/main.f90:84
#2  0x557dbdb233b7 in example
    at /home/brad/examples/coarray-allocatable-components/main.f90:111
#3  0x557dbdb234ad in main
    at /home/brad/examples/coarray-allocatable-components/main.f90:98
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node stray exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Error: Command:
   `/usr/bin/mpiexec -n 2 ./main`
failed to run.

vehre commented 1 year ago

@rouson I have analysed the issue further: The code generated for managing the allocatable component of the type in the module does not take into account that the type could be used in a coarray. That is, no space is assigned (at compile time) to keep track of the allocation status and the coarray (slave) token. Furthermore, because of this, no code is generated to define an MPI window for the allocated memory, which prevents one image from accessing this memory on another image.

In other words: solving this issue is not a quick fix but a larger effort. One needs to find a way to portably generate the module's code so that it either always calls coarray-registration routines for every memory allocation in a module when the compile flag -fcoarray=lib is given, or creates two instances of the code for each module, one with coarray support and one without.

How to proceed?

rouson commented 1 year ago

@vehre thanks for the quick reply. If this had been a quick fix, Sourcery Institute could have funded it. Because it's a larger effort, I'll need to seek alternative funding. Please email me a rough estimate at your earliest convenience.

rouson commented 1 year ago

The reproducer submitted in the original comment for this issue has been fixed on the main branch and will appear in the 2.10.1 release, which currently has draft release notes. @everythingfunctional please create a new issue with the reproducer from your comment above and link to @vehre's comment.
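
For anyone hitting the module-scope case analysed above before it is addressed, one possible workaround is to avoid the allocatable component in the coarray altogether and use a fixed-capacity buffer, so that no per-component registration is needed. The following is only a minimal sketch under that assumption: the names fixed_payload_t, MAX_WORDS, n_words and words are hypothetical and not part of the reproducer, the capacity of 64 integers is arbitrary, and it has not been tested against the compiler and library versions mentioned in this thread.

```fortran
program fixed_mailbox
    implicit none

    ! Hypothetical capacity; the encoded message must fit in MAX_WORDS integers.
    integer, parameter :: MAX_WORDS = 64

    ! Hypothetical replacement for payload_t: no allocatable component, so the
    ! whole object lives inside the coarray's own memory.
    type :: fixed_payload_t
        integer :: n_words = 0
        integer :: words(MAX_WORDS) = 0
    end type

    character(len=*), parameter :: MESSAGE = "Hello, World!"
    type(fixed_payload_t) :: mailbox[*]
    character(len=len(MESSAGE)) :: received
    integer, allocatable :: encoded(:)

    if (this_image() == 1) then
        ! Pack the characters into default integers, as from_string does.
        encoded = transfer(MESSAGE, [integer ::])
        mailbox%n_words = size(encoded)
        mailbox%words(:size(encoded)) = encoded
    end if
    sync all
    if (this_image() /= 1) then
        ! Plain derived-type copy from image 1; no remote allocatable access.
        mailbox = mailbox[1]
    end if

    ! Unpack the used part of the buffer back into a string.
    received = transfer(mailbox%words(:mailbox%n_words), received)
    print *, received, " from image: ", this_image()
end program
```

The trade-off is a hard upper bound on the payload size and unused space in every image's mailbox, in exchange for not depending on remote access to allocatable components.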