Hi everybody,
there seems to be a bug when sim.create_observations is called inside a for loop and the code is executed in parallel.
Here is a minimal working example that reproduces the bug.
(I run it in parallel with mpirun -n 2 python reproduce_bug.py; when run serially, the code works fine.)
import astropy.time
import litebird_sim as lbs

start_time = astropy.time.Time("2027-01-01T00:00:00")
imo = lbs.Imo()
imo_version = "v1.3"

# MPI setup
comm = lbs.MPI_COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# How long each simulation should last, in days
mission_time_days = 1  # 2

sim = lbs.Simulation(
    mpi_comm=comm,
    start_time=start_time,
    duration_s=int(mission_time_days * 24 * 3600.0),
)
detector = lbs.DetectorInfo.from_imo(
    url=f"/releases/{imo_version}/satellite/LFT/L1-040/000_000_003_QA_040_T/detector_info",
    imo=imo,
)
# detector.sampling_rate_hz = 1  # <- no effect on when the crash occurs

days_simulated = 0
for i in range(100):
    # Each iteration simulates another mission_time_days worth of observations
    sim.create_observations(
        detectors=[detector],
        num_of_obs_per_detector=int(mission_time_days * 24 * 2),
        split_list_over_processes=False,
        n_blocks_det=1,
        n_blocks_time=size,
        tods=[],
    )
    days_simulated += mission_time_days
    comm.barrier()
    print("rank:", rank, "| days_simulated=", days_simulated)
The loop runs correctly until days_simulated reaches a cumulative 22 days (independent of detector.sampling_rate_hz, mission_time_days, and the number of cores used) and then crashes with the following error message:
$ mpirun -n 2 python reproduce_bug.py
[...]
Traceback (most recent call last):
  File "produce_hitmaps.py", line 129, in <module>
    sim.create_observations(
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/simulations.py", line 887, in create_observations
    cur_obs = Observation(
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/observations.py", line 162, in __init__
    self._set_attributes_from_list_of_dict(detectors, root)
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/observations.py", line 234, in _set_attributes_from_list_of_dict
    self.setattr_det_global(k, dict_of_array.get(k), root)
  File "/home/anovelli/lbs_env/lib/python3.8/site-packages/litebird_sim/observations.py", line 557, in setattr_det_global
    comm_row = comm_grid.Split(comm_grid.rank // self._n_blocks_time)
  File "mpi4py/MPI/Comm.pyx", line 199, in mpi4py.MPI.Comm.Split
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
It does not seem to be a RAM problem (caused for example by observations being appended instead of overwritten) as this code only uses a small portion of my available RAM before crashing.
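As a rough sanity check, the arithmetic below counts how many observations the loop has requested by the time days_simulated reaches 22. It only uses numbers from the example above (num_of_obs_per_detector = mission_time_days * 24 * 2 and the 22-day crash point); the helper name observations_before_crash is made up for illustration. The total comes out the same for different values of mission_time_days, which would be consistent with the crash depending on the cumulative number of observations created rather than on the amount of data. That interpretation is only an assumption on my side, not something I have verified.

```python
# Count the observations requested from create_observations before the
# cumulative days_simulated reaches the value at which the crash appears.
OBS_PER_DAY = 24 * 2   # from num_of_obs_per_detector = mission_time_days * 24 * 2
DAYS_AT_CRASH = 22     # cumulative days_simulated reported in this post


def observations_before_crash(mission_time_days: int) -> int:
    """Total observations requested for one detector (hypothetical helper)."""
    obs_per_call = int(mission_time_days * OBS_PER_DAY)
    calls = DAYS_AT_CRASH // mission_time_days
    return calls * obs_per_call


for days in (1, 2):
    # Both settings give the same cumulative count
    print("mission_time_days =", days, "->", observations_before_crash(days))
```

With one detector this gives 1056 observations for both mission_time_days = 1 and mission_time_days = 2.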
For context:
I encountered this bug while trying to produce hitmaps, up to full-mission length, for each pixel type of LiteBIRD (see detector.pixtype). I can't just assign each node a different "step" of the for loop because I don't have enough RAM to allocate a year of pointing for each available node. I can provide the full version of my script if needed.
Please let me know if you have any idea of what might be causing this issue; I am finding it difficult to debug this.
Ale