Closed nraffuzz closed 7 months ago
Hi @nraffuzz, thanks for spotting this! I gathered more information: it seems that pkl5
is able to lift this limitation at the expense of more CPU usage and increased memory consumption.
Since sending more than 2 GB of data between MPI processes is not a common use case, I would rather stick with the current implementation and add a note to the manual, possibly under the page "Using MPI".
Using the internal MPI communicator of litebird_sim, I noticed that parallel operations on objects larger than 2 GB, such as broadcasting, raise an `OverflowError`. Something similar to this:

```
OverflowError: integer 2176051861 does not fit in 'int'
```
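The number in the traceback already tells the story: the message count exceeds the largest signed 32-bit integer. A quick plain-Python check (no MPI required; the payload size is taken from the error message above):

```python
# The MPI count argument is a C int: signed and 32-bit on common
# platforms, so its maximum value is 2**31 - 1.
INT32_MAX = 2**31 - 1  # 2147483647

# Byte size of the pickled payload, taken from the traceback above.
payload_bytes = 2176051861

# The count cannot be represented in a 32-bit int, so mpi4py raises
# OverflowError before any data is transferred.
assert payload_bytes > INT32_MAX
print(f"{payload_bytes - INT32_MAX} bytes past the 32-bit limit")
```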
This is due to the total size of the pickled object being larger than 2 GB, so the message count overflows a 32-bit `int`. This limitation is present in MPI-1/2/3. For more details, look here. I solved this issue by replacing the litebird_sim communicator with the auxiliary module `mpi4py.util.pkl5`.
In short, it is sufficient to import:

```python
from mpi4py import MPI
from mpi4py.util import pkl5
```

and replace in your scripts

```python
comm = lbs.MPI_COMM_WORLD
```

with

```python
comm = pkl5.Intracomm(MPI.COMM_WORLD)
```
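If the swap should also work on machines where mpi4py is not installed (e.g. when litebird_sim runs serially), it can be wrapped in a small helper. This is only a sketch; `get_big_message_comm` is a hypothetical name, not part of litebird_sim or mpi4py:

```python
def get_big_message_comm():
    """Return a communicator able to handle >2 GB pickled messages.

    Falls back to None when mpi4py is not available, so serial runs
    still work.
    """
    try:
        from mpi4py import MPI
        from mpi4py.util import pkl5

        # pkl5.Intracomm wraps the raw communicator and streams large
        # pickled buffers, avoiding the 32-bit count limit of the
        # plain pickle-based communication methods.
        return pkl5.Intracomm(MPI.COMM_WORLD)
    except ImportError:
        return None
```

A call such as `comm = get_big_message_comm()` can then replace `comm = lbs.MPI_COMM_WORLD` in scripts that may broadcast very large objects.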
Is this something that should be included in litebird_sim in the future to avoid encountering this problem again?