Closed hollisakins closed 2 weeks ago
Thanks for raising the issue. I don't actually think this is caused by parallelization, since writing to the HDF5 file happens outside of any pool, i.e., only in the main process. I don't think I've ever encountered the issue myself, and it's probably hard to reproduce reliably. But let me look into whether I can implement a simple safeguard, e.g., having the process wait and try again after a few seconds if the write fails initially.
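For reference, a minimal sketch of the retry safeguard described above. This is not nautilus code, just an illustration: `retry_io` and its parameters are hypothetical names, and it assumes the "resource temporarily unavailable" error surfaces as a `BlockingIOError`/`OSError` (errno `EAGAIN`) when opening the locked HDF5 file.

```python
import time

def retry_io(func, attempts=5, delay=2.0):
    """Call func(), retrying on transient file-locking errors.

    BlockingIOError covers errno EAGAIN ("resource temporarily
    unavailable"), raised when another process holds the file lock.
    Re-raises the last error once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return func()
        except (BlockingIOError, OSError):
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Hypothetical usage around the checkpoint write:
#   with retry_io(lambda: h5py.File('checkpoint.h5', 'a')) as f:
#       ...
```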
Hi @hollisakins, can you tell me how you ran bagpipes? Some code would be ideal. In particular, I want to double-check whether you used MPI or nautilus' internal parallelization based on the multiprocessing module.
@hollisakins Closing this due to inactivity. Please feel free to re-open this issue if you think there's still a problem.
Hi,
I've been using nautilus as part of bagpipes and also on its own to run some model fits to galaxy spectra. I've been encountering the issue that the h5py checkpointing functionality fails occasionally, due to a "resource temporarily unavailable" error. I suspect this might be happening because I'm parallelizing on ~18 cores, which is probably overkill, but still might be necessary for some applications.
Luckily it's not a huge issue, since I can just restart the run and it will resume from where it left off. But if it's a simple fix, it might be nice to find a workaround. Perhaps each parallel job needs to read/write to a separate file?
Here's the full traceback: