litebird / litebird_sim

Simulation tools for LiteBIRD
GNU General Public License v3.0

Bug running `sim.create_observations` in parallel inside a for loop #316

Closed: AleNovelli closed this issue 1 month ago

AleNovelli commented 3 months ago

Hi everybody, there seems to be a bug when sim.create_observations is run inside a for loop and the code is executed in parallel. Here is a minimal working example that reproduces the bug (I run it in parallel with mpirun -n 2 python reproduce_bug.py; if run serially, the code works fine):

import litebird_sim as lbs
import astropy

start_time = astropy.time.Time("2027-01-01T00:00:00")

imo = lbs.Imo() 
imo_version = "v1.3"

# set up MPI
comm = lbs.MPI_COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# how long you want each simulation to last
mission_time_days = 1  # 2

sim = lbs.Simulation(
    mpi_comm=comm,
    start_time=start_time,
    duration_s=int(mission_time_days * 24 * 3600.0),
)

detector = lbs.DetectorInfo.from_imo(
    url=f"/releases/{imo_version}/satellite/LFT/L1-040/000_000_003_QA_040_T/detector_info",
    imo=imo,
)

# detector.sampling_rate_hz = 1  # <- no effect

days_simulated = 0
for i in range(100):
    sim.create_observations(
        detectors=[detector],
        num_of_obs_per_detector=int(mission_time_days * 24 * 2),
        split_list_over_processes=False,
        n_blocks_det=1,
        n_blocks_time=size,
        tods=[],
    )

    days_simulated += mission_time_days
    comm.barrier()
    print("rank:", rank, "| days_simulated=", days_simulated)

The loop runs correctly until days_simulated reaches a cumulative 22 days (independently of detector.sampling_rate_hz, mission_time_days, and the number of cores used) and then crashes with the following error message:

$ mpirun -n 2 python reproduce_bug.py
[...]
Traceback (most recent call last):
  File "produce_hitmaps.py", line 129, in <module>
    sim.create_observations(
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/simulations.py", line 887, in create_observations
    cur_obs = Observation(
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/observations.py", line 162, in __init__
    self._set_attributes_from_list_of_dict(detectors, root)
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/observations.py", line 234, in _set_attributes_from_list_of_dict
    self.setattr_det_global(k, dict_of_array.get(k), root)
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/observations.py", line 557, in setattr_det_global
    comm_row = comm_grid.Split(comm_grid.rank // self._n_blocks_time)
  File "mpi4py/MPI/Comm.pyx", line 199, in mpi4py.MPI.Comm.Split
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error

It does not seem to be a RAM problem (caused, for example, by observations being appended instead of overwritten), as the code only uses a small fraction of my available RAM before crashing.
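My guess (and it is only a guess) is that the limit being hit is an MPI one rather than a memory one: the traceback ends in comm_grid.Split(), and every Observation created inside the loop appears to trigger such a split, so after enough iterations the number of live communicators may exceed what the MPI implementation can handle. If I am counting correctly, 22 days at 48 observations per day is roughly a thousand observations, which fits the idea that the crash point depends on the number of Split() calls rather than on elapsed time or memory. The stand-alone mpi4py snippet below (my own test code, completely independent of litebird_sim) mimics that pattern by repeatedly splitting COMM_WORLD without ever freeing the resulting communicators; if the hypothesis is right, it should eventually fail with a similar MPI error:

# Stand-alone sketch of the communicator-exhaustion hypothesis (not part of
# litebird_sim): keep splitting COMM_WORLD without freeing the resulting
# sub-communicators and check whether Split() eventually raises an MPI error.
from mpi4py import MPI

comm = MPI.COMM_WORLD
leaked = []  # keep references so none of the sub-communicators is ever freed

for i in range(100_000):
    try:
        # all ranks use the same color, like the row split in the traceback
        leaked.append(comm.Split(color=0))
    except MPI.Exception as exc:
        print(f"rank {comm.rank}: Split() failed after {i} communicators: {exc}")
        break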

For context: I encountered this bug while trying to produce hitmaps, up to full-mission length, for each pixel type of LiteBIRD (see detector.pixtype). I can't simply assign each node a different "step" of the for loop, because I don't have enough RAM to allocate a year's worth of pointings on each available node. I can provide the full version of my script if needed.
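For reference, the overall structure I am aiming for is roughly the following: build the observations for one small chunk of the mission, turn their pointings into HEALPix pixel indices, accumulate those into a running hitmap, and discard the chunk before moving on to the next one. In the sketch below (just an illustration of the accumulation pattern, not my actual script), get_pixel_indices_for_chunk is a dummy placeholder for the create_observations + pointing computation step, and nside is arbitrary:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nside = 512
npix = 12 * nside * nside      # number of HEALPix pixels
hits = np.zeros(npix, dtype=np.int64)  # running hitmap, kept across chunks

n_chunks = 10  # e.g. ten chunks of a few days each


def get_pixel_indices_for_chunk(chunk):
    # Placeholder: in the real script this creates the observations for one
    # chunk, computes the pointings, and converts them to pixel indices.
    rng = np.random.default_rng(chunk)
    return rng.integers(0, npix, size=100_000)


for chunk in range(n_chunks):
    pixels = get_pixel_indices_for_chunk(chunk)
    np.add.at(hits, pixels, 1)  # accumulate the hit counts for this chunk
    del pixels                  # the per-chunk pointings can be dropped now

# combine the partial hitmaps held by the different MPI ranks
total_hits = np.zeros_like(hits)
comm.Reduce(hits, total_hits, op=MPI.SUM, root=0)
if comm.rank == 0:
    print("total hits:", total_hits.sum())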

Please let me know if you have any idea of what might be causing this issue; I am finding it difficult to debug this. Ale

paganol commented 1 month ago

I think this was solved, right? Feel free to reopen it if that's not the case.