helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

Datatype tiling for large communication #456

Open Markus-Goetz opened 4 years ago

Markus-Goetz commented 4 years ago

Description Heat provides various wrapped MPI calls to transmit data between processes (e.g. in resplit()). If the buffer of such a transmission is too large, i.e. its element count exceeds the int32 value range, MPI will quit with an error.

Could perhaps be fixed using the tiling implementation of the QR branch.

To Reproduce Steps to reproduce the behavior:

  1. Which module/class/function is affected? communications.py
  2. What are the circumstances under which the bug appears? Several, e.g.:
    a = ht.zeros(((INT32_MAX + 1) * processors, processors), split=0).resplit(1)
  3. What is the exact error message/erroneous behaviour? Depends on the MPI implementation
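For orientation, the sizes implied by the reproducer can be worked out in plain Python. This is only a sketch: `reproducer_sizes` is a hypothetical helper, the 4-byte item size is an assumption about the default dtype, and the per-peer block size assumes the redistribution exchanges one block per pair of processes, which is not confirmed anywhere in this thread.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit count can hold


def reproducer_sizes(nprocs, itemsize=4):
    """Sizes implied by ht.zeros(((INT32_MAX + 1) * nprocs, nprocs), split=0).

    itemsize=4 is an assumption (4-byte elements).
    """
    rows, cols = (INT32_MAX + 1) * nprocs, nprocs
    local_rows = rows // nprocs      # split=0 divides the rows evenly
    local_elements = local_rows * cols
    # Assumption: after resplit(1) each process owns cols // nprocs columns,
    # so the block one process hands to one peer has this many elements.
    block_elements = local_rows * (cols // nprocs)
    return {
        "local_elements": local_elements,
        "block_elements": block_elements,
        "local_bytes": local_elements * itemsize,
    }


sizes = reproducer_sizes(nprocs=4)
print(sizes["block_elements"] > INT32_MAX)
```

Under these assumptions each per-peer block holds exactly INT32_MAX + 1 elements, one more than a 32-bit count can express.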

Expected behavior No MPI error

Version Info any

coquelin77 commented 4 years ago

Small update on this one: PR #520 adds a new tiling class. In theory, it could be modified to handle this by sending only partial tiles, although that may require a fair number of changes.

mrfh92 commented 11 months ago

In principle relevant, although not of the highest priority, because this problem can usually be solved by increasing the number of processes. (Reviewed within #1109)

My question @Markus-Goetz: should this issue address the wrappers for the MPI operations (i.e. heat.comm.Send() performs several mpi4py.MPI.Comm.Send() calls if the data to send is too large), or should we rather adapt the usage of heat.comm.Send() in those algorithms where potentially large data are sent? The first idea sounds more elegant; however, w.r.t. #383 the second option may allow better refactoring of algorithms, including overlap of communication and computation.
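To make the first option concrete, here is a minimal sketch of a wrapper that splits one oversized send into several smaller ones. It is not Heat's implementation: `send_in_chunks` and `max_count` are hypothetical names, `comm` is assumed to be any object exposing an mpi4py-style `Send(buf, dest, tag)` method, and the buffer is treated as raw bytes. The receiving side would need a matching loop of receives with the same chunk boundaries.

```python
INT32_MAX = 2**31 - 1  # largest count a single MPI call can take as a C int


def send_in_chunks(comm, buf, dest, tag=0, max_count=INT32_MAX):
    """Send a contiguous bytes-like buffer as messages of at most max_count bytes.

    Returns the number of messages sent. `comm` only needs a Send() method,
    so a stub can be passed in for testing without MPI.
    """
    view = memoryview(buf).cast("B")  # flat byte view, no copy
    n_messages = 0
    for start in range(0, len(view), max_count):
        # Slicing a memoryview does not copy the underlying data.
        comm.Send(view[start:start + max_count], dest=dest, tag=tag)
        n_messages += 1
    return n_messages
```

Keeping the chunking in one wrapper leaves the calling algorithms unchanged, but it serialises the transfer, which is the trade-off against the second option raised above.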

ClaudiaComito commented 3 weeks ago

This is fixed by #1493, isn't it? @JuanPedroGHM @Markus-Goetz @mrfh92

Markus-Goetz commented 3 weeks ago

Not entirely. #1493 only solves non-contiguous strided access, but not the recursive wrapping needed when the number of elements exceeds INT32_MAX.
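One way to picture the recursive wrapping is as a base-`limit` decomposition of the element count. Level 0 is the basic element, and each higher level is a contiguous type made of `limit` units of the level below. The sketch below only computes the counts per level; `wrapping_plan` is a hypothetical helper, not part of Heat or #1493. With mpi4py, the nested types themselves could presumably be built with `MPI.Datatype.Create_contiguous`, but that part is left out here.

```python
INT32_MAX = 2**31 - 1


def wrapping_plan(n_elements, limit=INT32_MAX):
    """Return counts per nesting level for transferring n_elements.

    counts[k] is how many level-k units to send, where a level-k unit
    stands for limit**k basic elements. Every count is below `limit`,
    so each one fits into a 32-bit count argument.
    """
    if n_elements < 0 or limit < 2:
        raise ValueError("need n_elements >= 0 and limit >= 2")
    counts = []
    while n_elements > 0:
        n_elements, digit = divmod(n_elements, limit)
        counts.append(digit)
    return counts


# 2**31 elements = 1 basic element + 1 block of INT32_MAX elements
print(wrapping_plan(2**31))  # [1, 1]
```

A second nesting level is only needed once the count reaches INT32_MAX squared, so in practice one level of wrapping plus a remainder covers most cases.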