starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

support for MPI data count larger than INT_MAX #43

Closed JieRen98 closed 3 months ago

JieRen98 commented 4 months ago

Is your feature request related to a problem? Please describe. MPI uses a plain int (32-bit) for the data count, as in:

int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)

When we want to send a larger buffer (count > INT_MAX), we need to split it into several chunks and send them one by one. However, StarPU does not support this yet (e.g., https://gitlab.inria.fr/starpu/starpu/-/blob/master/src/drivers/mpi/driver_mpi_common.c?ref_type=heads#L300).

Describe the solution you'd like Split the buffer into several chunks:

void send_large_byte_buffer(const void* data, size_t total_size, int dest, int tag, MPI_Comm comm) {
    const size_t max_int = INT_MAX;
    /* Ceiling division: number of sends of at most INT_MAX bytes each. */
    const size_t chunks = (total_size + max_int - 1) / max_int;

    for (size_t i = 0; i < chunks; ++i) {
        const size_t offset = i * max_int;
        /* The last chunk may be shorter than INT_MAX. */
        const int count = (total_size - offset) > max_int ? (int)max_int : (int)(total_size - offset);

        MPI_Send((const char*)data + offset, count, MPI_BYTE, dest, tag, comm);
    }
}

The signatures of functions such as __starpu_mpi_common_send_to_device, __starpu_mpi_common_send, etc., should be changed correspondingly.

Describe alternatives you've considered N/A

Additional context N/A

sthibaul commented 4 months ago

StarPU does not support it so far. (e.g., https://gitlab.inria.fr/starpu/starpu/-/blob/master/src/drivers/mpi/driver_mpi_common.c?ref_type=heads#L300)

I guess you are using starpu_mpi_task_insert etc., not the MPI master-slave driver support, so it's rather the MPI datatype definition from mpi/src/starpu_mpi_datatype.c that you need fixed.

I understand that you have an urgent deadline. Which StarPU data type are you using in your application?

JieRen98 commented 4 months ago

Greetings,

I am actually using Chameleon. I am not completely sure how Chameleon interacts with StarPU. I am going to use multiple precisions (FP64, FP32, FP16, and FP8).

Best, Jie


sthibaul commented 4 months ago

I am using Chameleon actually. I am not completely sure how Chameleon interacts with StarPU

Ok, then I guess you are using a matrix descriptor from Chameleon?

JieRen98 commented 4 months ago

Yes, specifically, my customized descriptor.


sthibaul commented 4 months ago

How is it customized? Essentially, the question is which starpu_data_something_register function is getting called in your case.

sthibaul commented 4 months ago

Put another way, do you have any data_register call beyond starpu_vector_data_register and starpu_mpi_data_register?

Is it using starpu_data_register directly?

sthibaul commented 4 months ago

Does it use starpu_mpi_interface_datatype_register?

JieRen98 commented 4 months ago

Sorry, I am having my lunch; I will answer your question in about 20 minutes.


sthibaul commented 4 months ago

Is MPI_Type_vector_c supported by your MPI implementation?

sthibaul commented 4 months ago

(basically, we would just want to use the _c variants of the MPI calls that we currently make: MPI_Irecv, MPI_Isend, MPI_Issend, MPI_Type_vector, MPI_Type_contiguous, MPI_Type_size)

JieRen98 commented 4 months ago

Chameleon uses both starpu_data_register and starpu_mpi_data_register to register tiles. I believe StarPU does not know what the type is but only the size count in bytes.

sthibaul commented 4 months ago

Then it must also be using starpu_mpi_interface_datatype_register to register the MPI type to be used? Otherwise StarPU does not even know the size count in bytes.

sthibaul commented 4 months ago

(starpu_data_register alone does not tell starpu the size count in bytes)

JieRen98 commented 4 months ago

Yes, you are right. Here is the type registration, although I do not understand it completely:

void
starpu_cham_tile_interface_init()
{
    if ( starpu_interface_cham_tile_ops.interfaceid == STARPU_UNKNOWN_INTERFACE_ID )
    {
        starpu_interface_cham_tile_ops.interfaceid = starpu_data_interface_get_next_id();
#if defined(CHAMELEON_USE_MPI_DATATYPES)
  #if defined(HAVE_STARPU_MPI_INTERFACE_DATATYPE_NODE_REGISTER)
        starpu_mpi_interface_datatype_node_register( starpu_interface_cham_tile_ops.interfaceid,
                                                    cti_allocate_datatype_node,
                                                    cti_free_datatype );
  #else
        starpu_mpi_interface_datatype_register( starpu_interface_cham_tile_ops.interfaceid,
                                                cti_allocate_datatype,
                                                cti_free_datatype );
  #endif
#endif
    }
}

This shows how Chameleon registers the tile. I thought attributes .allocsize and .tilesize would tell StarPU the size.

void
starpu_cham_tile_register( starpu_data_handle_t *handleptr,
                           int                   home_node,
                           CHAM_tile_t          *tile,
                           cham_flttype_t        flttype )
{
    size_t elemsize = CHAMELEON_Element_Size( flttype );
    starpu_cham_tile_interface_t cham_tile_interface =
        {
            .id         = STARPU_CHAM_TILE_INTERFACE_ID,
            .flttype    = flttype,
            .dev_handle = (intptr_t)(tile->mat),
            .allocsize  = -1,
            .tilesize   = tile->m * tile->n * elemsize,
        };
    memcpy( &(cham_tile_interface.tile), tile, sizeof( CHAM_tile_t ) );
    /* Overwrite the flttype in case it comes from a data conversion */
    cham_tile_interface.tile.flttype = flttype;

    if ( tile->format & CHAMELEON_TILE_FULLRANK ) {
        cham_tile_interface.allocsize = tile->m * tile->n * elemsize;
    }
    else if ( tile->format & CHAMELEON_TILE_DESC ) { /* Needed in case starpu ask for it */
        cham_tile_interface.allocsize = tile->m * tile->n * elemsize;
    }
    else if ( tile->format & CHAMELEON_TILE_HMAT ) {
        /* For hmat, allocated data will be handled by hmat library. StarPU cannot allocate it for the library */
        cham_tile_interface.allocsize = 0;
    }

    starpu_data_register( handleptr, home_node, &cham_tile_interface, &starpu_interface_cham_tile_ops );
}

sthibaul commented 4 months ago

Please also show cti_allocate_datatype_node and cti_allocate_datatype; that is most probably where you need a fix.

JieRen98 commented 4 months ago

Here you go:

#if defined(CHAMELEON_USE_MPI_DATATYPES)
int
cti_allocate_datatype_node( starpu_data_handle_t handle,
                            unsigned             node,
                            MPI_Datatype        *datatype )
{
    int ret;

    starpu_cham_tile_interface_t *cham_tile_interface = (starpu_cham_tile_interface_t *)
        starpu_data_get_interface_on_node( handle, node );

    size_t m  = cham_tile_interface->tile.m;
    size_t n  = cham_tile_interface->tile.n;
    size_t ld = cham_tile_interface->tile.ld;
    size_t elemsize = CHAMELEON_Element_Size( cham_tile_interface->flttype );

    ret = MPI_Type_vector( n, m * elemsize, ld * elemsize, MPI_BYTE, datatype );
    STARPU_ASSERT_MSG(ret == MPI_SUCCESS, "MPI_Type_vector failed");

    ret = MPI_Type_commit( datatype );
    STARPU_ASSERT_MSG(ret == MPI_SUCCESS, "MPI_Type_commit failed");

    return 0;
}

int
cti_allocate_datatype( starpu_data_handle_t handle,
                       MPI_Datatype        *datatype )
{
    return cti_allocate_datatype_node( handle, STARPU_MAIN_RAM, datatype );
}

void
cti_free_datatype( MPI_Datatype *datatype )
{
    MPI_Type_free( datatype );
}
#endif

sthibaul commented 4 months ago

Also, again,

Is MPI_Type_vector_c supported by your mpi implementation?

JieRen98 commented 4 months ago

MPI_Type_vector_c

I didn't find any line that includes MPI_Type_vector_c

sthibaul commented 4 months ago

Here you go:

    ret = MPI_Type_vector( n, m * elemsize, ld * elemsize, MPI_BYTE, datatype );

That's it: you want to use MPI_Type_vector_c instead. StarPU just MPI_Sends one element of this type.

sthibaul commented 4 months ago

MPI_Type_vector_c

I didn't find any line that includes MPI_Type_vector_c

Where did you not find it?

Put another way: which MPI implementation are you using?

sthibaul commented 4 months ago

I see that notably Open MPI doesn't seem to provide the _c variants, so in that case you need to use a for loop to make the series of MPI_Type_vector calls.

JieRen98 commented 4 months ago

Here you go:

    ret = MPI_Type_vector( n, m * elemsize, ld * elemsize, MPI_BYTE, datatype );

That's it: you want to use MPI_Type_vector_c instead. StarPU just MPI_Sends one of this type.

OK, Chameleon uses bytes and the leading dimension becomes ld * sizeof(t); that's fine, I guess. So do you mean I should change this to MPI_Type_vector_c to support large counts?

sthibaul commented 4 months ago

If your MPI implementation supports MPI_Type_vector_c, that's the simplest, yes. If not, you need to use a for loop to describe the data type piece by piece.

JieRen98 commented 4 months ago

MPI_Type_vector_c

I didn't find any line that includes MPI_Type_vector_c

Where did you not find it?

Put another way: which MPI implementation are you using?

I am using mpich. I meant that Chameleon does not use MPI_Type_vector_c, not that my MPI implementation lacks it.

sthibaul commented 4 months ago

So in the end it's the Chameleon code that needs fixing. StarPU will, however, want to do the same for its predefined vector/matrix/etc. types, so this issue stays open for that.

sthibaul commented 4 months ago

I am using mpich

mpich does have MPI_Type_vector_c, so you can simply add a _c to the MPI_Type_vector call, and that should work.

sthibaul commented 4 months ago

(mpich has apparently provided it since version 4)

JieRen98 commented 4 months ago

Thanks a lot! I was reading the StarPU code, but it seems I did not do it well. So starpu_mpi_interface_datatype_node_register is the key: the starpu_data_register and starpu_mpi_data_register calls have no effect on the integer overflow, as long as allocate_datatype_func is correctly defined (using MPI_Type_vector_c) when calling starpu_mpi_interface_datatype_node_register?

sthibaul commented 4 months ago

Yes, that's the idea. For application-defined interfaces, it's starpu_mpi_interface_datatype_node_register that tells StarPU how to send the data over MPI.

JieRen98 commented 4 months ago

Thanks a lot, you helped a lot!