litebird / litebird_sim

Simulation tools for LiteBIRD
GNU General Public License v3.0

MPI communicator: OverflowError for pickled object larger than 2GB #257

Closed nraffuzz closed 7 months ago

nraffuzz commented 1 year ago

Using the internal MPI communicator of litebird_sim, I noticed that parallel operations such as broadcasting objects larger than 2GB raise an OverflowError. Something similar to this: OverflowError: integer 2176051861 does not fit in 'int'

This happens because the pickled object is larger than 2GB, so the message count overflows a 32-bit int. The problem is present in MPI-1/2/3. For more details, look here.
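To make the overflow concrete, here is a minimal sketch that compares the byte count from the error message with the largest value a signed 32-bit int can hold. The helper name `fits_in_c_int` is hypothetical and is not part of mpi4py or litebird_sim.

```python
# Largest value representable by a signed 32-bit int (2**31 - 1).
INT32_MAX = 2**31 - 1


def fits_in_c_int(nbytes):
    """Return True if a byte count fits in a signed 32-bit int.

    Hypothetical helper, used here only for illustration.
    """
    return 0 <= nbytes <= INT32_MAX


# Byte count taken from the OverflowError message reported above.
reported_size = 2176051861

print(fits_in_c_int(reported_size))   # False: the count is too large
print(reported_size - INT32_MAX)      # number of bytes over the limit
```

The same check could be applied to `len(pickle.dumps(obj))` before a broadcast, at the cost of pickling the object one extra time.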

I solved this issue by replacing the litebird_sim communicator with the auxiliary module mpi4py.util.pkl5.

In short, it is sufficient to import:

from mpi4py import MPI
from mpi4py.util import pkl5

and replace in your scripts comm = lbs.MPI_COMM_WORLD with comm = pkl5.Intracomm(MPI.COMM_WORLD)
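Put together, the replacement could look like the sketch below. It assumes a version of mpi4py that ships `mpi4py.util.pkl5`, and the helper `bcast_any_size` and the small `payload` dictionary are hypothetical names used only for illustration. The guarded import is also an assumption, added so that the script still runs serially when mpi4py is missing.

```python
try:
    from mpi4py import MPI
    from mpi4py.util import pkl5
except ImportError:
    # mpi4py (or its pkl5 module) is not available: run serially.
    MPI = None


def bcast_any_size(obj, root=0):
    """Broadcast `obj` from rank `root` to every rank.

    Hypothetical helper: it wraps MPI.COMM_WORLD in pkl5.Intracomm,
    as suggested above, so the pickled object may exceed 2GB.
    """
    if MPI is None:
        return obj  # Serial run: nothing to broadcast.
    comm = pkl5.Intracomm(MPI.COMM_WORLD)
    return comm.bcast(obj, root=root)


# Tiny stand-in for a large object; a real case would be several GB.
payload = {"tod": list(range(10))}
result = bcast_any_size(payload)
```

In a real script the wrapped communicator would normally be created once and reused, rather than rebuilt on every call.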

Is this something that should be included in litebird_sim in the future to avoid encountering this problem again?

ziotom78 commented 1 year ago

Hi @nraffuzz, thanks for spotting this! I gathered more information: it seems that pkl5 lifts this limitation at the expense of higher CPU and memory usage.

Since sending more than 2GB of data between MPI nodes is not something that happens often, I would rather stick with the current implementation and add a note in the manual, possibly under the page “Using MPI”.