johannesulf / nautilus

Neural Network-Boosted Importance Nested Sampling for Bayesian Statistics
https://nautilus-sampler.readthedocs.io
MIT License

h5py blocking IO when checkpointing in parallel runs #47

Open hollisakins opened 4 months ago

hollisakins commented 4 months ago

Hi,

I've been using nautilus as part of bagpipes, and also on its own, to run model fits to galaxy spectra. I've been running into an issue where the h5py checkpointing occasionally fails with a "resource temporarily unavailable" error. I suspect this might be happening because I'm parallelizing over ~18 cores, which is probably overkill but might still be necessary for some applications.

Luckily it's not a huge issue, since I can just restart the run and it will resume from where it left off. But if it's a simple fix, it might be nice to find a workaround. Perhaps each parallel job needs to read/write to a separate file?
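
For what it's worth, one workaround I've seen suggested for this class of error (untested here, and assuming the lock really does come from HDF5's file locking) is to disable locking via the `HDF5_USE_FILE_LOCKING` environment variable before h5py is imported:

```python
import os

# Assumption: the BlockingIOError comes from HDF5's advisory file locking.
# The variable must be set before h5py (and the underlying HDF5 library)
# is imported for it to take effect.
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

import h5py  # noqa: E402
```

Note that this disables a safety mechanism, so it's only sensible if no two processes ever write the file at the same time.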

Here's the full traceback:


  File "/Users/hba423/miniforge3/lib/python3.10/site-packages/nautilus/sampler.py", line 427, in run
    self.write_shell_update(self.filepath, -1)
  File "/Users/hba423/miniforge3/lib/python3.10/site-packages/nautilus/sampler.py", line 1293, in write_shell_update
    fstream = h5py.File(Path(filepath), 'r+')
  File "/Users/hba423/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 562, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/Users/hba423/miniforge3/lib/python3.10/site-packages/h5py/_hl/files.py", line 237, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
BlockingIOError: [Errno 35] Unable to open file (unable to lock file, errno = 35, error message = 'Resource temporarily unavailable')```
johannesulf commented 4 months ago

Thanks for raising the issue. I don't actually think this is caused by parallelization, since writing to the HDF5 file is done outside of any pool, i.e., only by the main process. I don't think I've ever encountered the issue myself, and it's probably hard to reproduce reliably. But let me look into whether I can implement some simple safeguards, i.e., having the process wait and try again after a few seconds if the initial attempt fails.
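
A minimal sketch of what such a safeguard could look like (just an illustration, not the actual nautilus code; `open_with_retry` is a hypothetical helper name):

```python
import time

import h5py


def open_with_retry(filepath, mode='r+', n_tries=5, wait=5.0):
    """Open an HDF5 file, retrying if the file lock is temporarily held."""
    for attempt in range(n_tries):
        try:
            return h5py.File(filepath, mode)
        except BlockingIOError:
            # Lock unavailable (errno 35): wait and try again, unless this
            # was the last attempt.
            if attempt == n_tries - 1:
                raise
            time.sleep(wait)
```

The `h5py.File(...)` call in `write_shell_update` from the traceback could then be routed through a helper like this.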

johannesulf commented 4 months ago

Hi @hollisakins, can you tell me how you ran bagpipes? Some code would be ideal. In particular, I want to double-check whether you used MPI or nautilus' internal parallelization based on the multiprocessing module.
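
For reference, the two modes look roughly like this (a toy sketch; the prior, likelihood, and `checkpoint.hdf5` path are placeholders standing in for the actual bagpipes fit):

```python
from multiprocessing import Pool

from nautilus import Prior, Sampler

# Toy problem standing in for the actual fit.
prior = Prior()
prior.add_parameter('x', dist=(0, 1))


def likelihood(param_dict):
    # Simple Gaussian log-likelihood centered at x = 0.5.
    return -0.5 * ((param_dict['x'] - 0.5) / 0.1)**2


if __name__ == '__main__':
    # Internal parallelization: pass a multiprocessing pool (or an integer
    # number of processes) via the pool keyword.
    with Pool(18) as pool:
        sampler = Sampler(prior, likelihood, pool=pool,
                          filepath='checkpoint.hdf5')
        sampler.run(verbose=True)

    # MPI alternative, launched with e.g. `mpiexec -n 18 python script.py`:
    # from mpi4py.futures import MPIPoolExecutor
    # sampler = Sampler(prior, likelihood, pool=MPIPoolExecutor(), ...)
```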