tomasstolker / species

Toolkit for atmospheric characterization of directly imaged exoplanets
https://species.readthedocs.io

Running multinest or ultranest in parallel? #89

Closed · gabrielastro closed this issue 7 months ago

gabrielastro commented 7 months ago

After setting up MultiNest in parallel, I am now actually trying to run it, following as closely as I can the recommendation from the "Fitting data with a grid of model spectra" tutorial: "It is therefore recommended to first add all the required data to the database and then only run SpeciesInit, FitModel, and the sampler (run_multinest or run_ultranest) in parallel with MPI". However, when running in parallel with N processors, N−1 of the processes raise:

Traceback (most recent call last):
  File "[…]/Skript.py", line 297, in <module>
    fit = FitModel(object_name='meinPlanet', model='ames-cond', inc_spec=True)
  File "[…]/species/fit/fit_model.py", line 445, in __init__
    self.object = ReadObject(object_name)
  File "[…]/species/read/read_object.py", line 47, in __init__
    with h5py.File(self.database, "r") as h5_file:
  File "[…]/.local/lib/python3.9/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "[…]/.local/lib/python3.9/site-packages/h5py/_hl/files.py", line 231, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
BlockingIOError: [Errno 11] Unable to open file (unable to lock file, errno = 11,
  error message = 'Resource temporarily unavailable')

Thus the call to FitModel(object_name='meinPlanet', model='ames-cond', inc_spec=True) seems to be the problem. One of the N processors prints Interpolating Spektrum... [DONE], and then the program stalls (unsurprisingly). Am I missing something simple? As far as I can tell, really only SpeciesInit() and FitModel() are called before the sampler. Thanks for any help!
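For reference, here is a minimal sketch of the parallel script I am running (the data were already added to the database in a separate, serial run beforehand; the imports follow the module paths from the traceback, and the run_multinest arguments are only illustrative, not my exact call):

from species.core.species_init import SpeciesInit
from species.fit.fit_model import FitModel

SpeciesInit()

# This is the line where the N-1 processes fail with the locking error:
fit = FitModel(object_name='meinPlanet', model='ames-cond', inc_spec=True)

# Arguments abbreviated here for illustration only:
fit.run_multinest(tag='meinPlanet', n_live_points=500)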

gabrielastro commented 7 months ago

By the way, the installed HDF5 library (1.12.2) should have SWMR (Single-Writer Multiple-Reader) support:

from h5py import version
from h5py import h5
print('  version.hdf5_version_tuple = ', version.hdf5_version_tuple)
print('  h5.get_config().swmr_min_hdf5_version = ', h5.get_config().swmr_min_hdf5_version)

yields, when run in the parallel environment:

  version.hdf5_version_tuple =  (1, 12, 2)
  h5.get_config().swmr_min_hdf5_version =  (1, 9, 178)

Therefore, from what I see in ~/.local/lib/python3.9/site-packages/h5py/_hl/files.py, SWMR should be possible. I tried passing swmr=True, i.e. with h5py.File(self.database, "r", swmr=True) as h5_file: in species/read/read_object.py, but this did not help.

Edit: Actually, it did help because now the error is at a different location:

"[…]/species/read/read_model.py", line 141, in open_database
 with h5py.File(self.database, "a") as hdf5_file:

but there it is done only to # Test if the spectra are present in the database (?), so write access ("a", append) should not be needed there. Other calls to the class do need write access, however. I looked into the code and it looks a bit too involved for me to change the different calls throughout, passing the parameters, etc., so I will leave this to the developer :wink:.
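Just to illustrate what I mean: if that check only needs to know whether the model spectra are already in the file, a read-only open should be enough. The file name and the group path 'models/ames-cond' below are only my guess at the layout, not necessarily the exact paths species uses:

import h5py

# Read-only open is sufficient for a pure presence test:
with h5py.File('species_database.hdf5', 'r') as hdf5_file:
    spectra_present = 'models/ames-cond' in hdf5_file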

tomasstolker commented 7 months ago

Thanks for opening this issue! The mode was indeed incorrectly set in ReadModel when the HDF5 file was opened. Should have been fixed in commit 093df1d.

gabrielastro commented 7 months ago

Excellent! Thank you very much for your quick fix. It works :heavy_check_mark:! Now multinest is running in parallel :smile:.

By the way (I can open a separate issue if you prefer): when run in parallel, it might be good if species prefixed the start-up messages (and maybe also the ones printed while setting up the fit, such as Interpolating Data… [DONE]) with the processor number. A common problem when trying to run programs in parallel is that, instead of one instance with N processes, N independent instances of the program run simultaneously (this typically comes from compiling against one MPI version and running with another). If something like [proc N of M processors] were printed, at least once at start-up, it would reassure the user that things are set up correctly. Currently there is no real way to tell.
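As a stand-alone sanity check, one can of course already do something like the following (saved as, say, check_mpi.py and started with mpiexec -n 4 python check_mpi.py), but having it inside species would be more convenient:

from mpi4py import MPI

# Each process reports its rank and the total number of processes:
comm = MPI.COMM_WORLD
print(f"proc {comm.Get_rank() + 1} of {comm.Get_size()}")

If this prints "proc x of 4" four times, the MPI set-up is fine; if it prints "proc 1 of 1" four times, four independent instances are running.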

tomasstolker commented 7 months ago

Feel free to create a pull request for that. It would be low priority to implement from my side since it seems to be running fine.

gabrielastro commented 7 months ago

Ok. Yes, it runs fine! As a minimal version, how about the following for the beginning of SpeciesInit() in species/core/species_init.py:

        try:
            from mpi4py import MPI

            mpi_rank = MPI.COMM_WORLD.Get_rank()
            mpi_size = MPI.COMM_WORLD.Get_size()
            species_parallel_msg = f"Proc. {mpi_rank} of {mpi_size}\n"
        except ModuleNotFoundError:
            species_parallel_msg = ""

        species_version = species.__version__
        species_msg = f"species v{species_version}"

        # Exclude the trailing newline so the separator matches the text width
        mess_len = max(len(species_msg), len(species_parallel_msg.rstrip("\n")))
        print(mess_len * "=")
        print(species_msg)
        print(species_parallel_msg, end="")
        print(mess_len * "=")

or maybe setting species_parallel_msg to a non-empty string only if additionally mpi_size > 1, but maybe it is better as is, so that the user may think "Oh? I could use more than one processor? How nice!" It is probably not quite coded in the official species style, so maybe this can be seen as an informal, poor-man's pull request :grin:? I tested it and it works fine both in parallel and without mpi4py support (setting sys.modules['mpi4py'] = None beforehand to fake its absence).
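For completeness, this is how I faked the missing mpi4py (it has to be done before species imports it):

import sys

# Pretend that mpi4py is not installed, to exercise the fallback branch:
sys.modules['mpi4py'] = None

from species.core.species_init import SpeciesInit

SpeciesInit()  # should now print only the version line, without "Proc. x of y"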

tomasstolker commented 7 months ago

Thanks for the suggestion! I have implemented this in commit ed70768 👍.

gabrielastro commented 7 months ago

Thanks a lot! I can confirm that UltraNest also works in parallel. Below the line Nested sampling with UltraNest, might it be good to have some function of UltraNest itself (i.e., not purely of species, as is now done at start-up) confirm to the user that N processors are indeed being used and seen by UltraNest? The same goes for MultiNest. This might save users with a faulty set-up a lot of time by making the diagnosis easier. (I will change the title of this thread, because parallel running now works with both samplers.)

On a related note, would it be possible to allow resume=False as an argument to fit.run_ultranest()? I guess it would correspond to 'overwrite' (i.e., when one is not trying to resume a run). Accepting True and False would make the argument more intuitive.
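Just to sketch what I have in mind (the surrounding variable names are placeholders, not the actual species internals), the boolean could simply be translated to UltraNest's string modes before the sampler is created:

import ultranest

# Hypothetical mapping inside run_ultranest; 'resume' is the user-facing boolean:
resume_mode = "resume" if resume else "overwrite"

sampler = ultranest.ReactiveNestedSampler(
    param_names,                # placeholder: the fitted parameter names
    loglike,                    # placeholder: the log-likelihood function
    transform=prior_transform,  # placeholder: the prior transform
    log_dir=output,
    resume=resume_mode,
)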

And finally, UltraNest has this very convenient compact output while running:

Creating directory for new run ultranest/run1
[ultranest] Sampling 1000 live points from prior ...
[ultranest] Widening roots to 1196 live points (have 1000 already) ...
[ultranest] Sampling 196 live points from prior ...
[ultranest] Widening roots to 1427 live points (have 1196 already) ...
[ultranest] Sampling 231 live points from prior ...
3235.3(14.70%) | Like=3253.39..3258.58 [3253.3935..3253.3948]*| it/evals=18315/2127774 eff=0.8413% N=1000

Especially the percentage is useful, and it remains compact. Would it be possible to have this for MultiNest too?

tomasstolker commented 7 months ago

Especially the percentage is useful, and it remains compact. Would it be possible to have this for MultiNest too?

That would be a feature request for (Py)MultiNest!

gabrielastro commented 7 months ago

Of course, that is one way :). I can try asking. Another thing that might be at the species level, though, is making MultiNest cancellable (interruptible) with Ctrl+C, as UltraNest is. Currently, I need to pause IPython with Ctrl+Z and kill %% it, which is inelegant and (over)kill. I tried to find which part of UltraNest does the elegant screen updates but could not find it; maybe the two (display and interruptibility) could be handled together…

tomasstolker commented 7 months ago

That would also be at the MultiNest level actually...

gabrielastro commented 7 months ago

Ok! Thanks. I guess one usually submits a script on a cluster, where the job can be killed, or uses a Jupyter notebook, where the kernel can be stopped, so this inelegant "pause and kill" is usually not much of an inconvenience…