tomasstolker / species

Toolkit for atmospheric characterization of directly imaged exoplanets
https://species.readthedocs.io
MIT License

Making unpacking/storing of atmospheric models more memory-efficient #83

Closed: gabrielastro closed this issue 7 months ago

gabrielastro commented 7 months ago
  1. When running the following several times, within a session or across sessions in the same directory:

    database.add_model(model='atmo-neq-weak', teff_range=(1500., 1800.))

    whether or not I change the bounds of teff_range, I always see:

    Unpacking ATMO NEQ weak model spectra (276 MB)... [DONE]

    and similarly for the tar files of other atmospheric models. I am not sure that the files really are getting unpacked again, but it takes a long time (e.g., for the 16-GB file exo-rem-highres.tgz), whereas I guess species should only need to check whether the needed files are already present (unpacked). If they are not, maybe only the needed ones could be extracted from the tar file, or at worst everything unpacked again (the current behaviour), or the user could be offered to have the files interpolated (as is currently done for files that are indeed missing). A sketch of such a check is shown after this list.

  2. Would it be sensible to have the possibility of storing the atmospheric models centrally? (An option for local copies could be kept in case that makes sense for some uses.) I have run folders at different locations for different projects, but every directory contains copies of the spectra (unchanged for years…), which requires downloading them every time.

  3. Unless I am mistaken, currently all files from a tarball are extracted instead of only the ones required by a database.add_model() call. Would it not make sense to extract only the needed spectra? Especially for work with high-resolution spectra, it might even make sense to offer the option of partial spectral files (i.e., extract each needed *_spec.dat file but delete it almost right away, keeping only an appropriately named partial-range file). This way, one could explore a bigger range (and/or a bigger number) of atmospheric parameters (Teff, logg, cloud parameters, etc.) over a small wavelength range without a memory explosion such as:

    RuntimeError: Disable slist on flush dest failure failed (file write failed: time = Mon Jan 29 16:17:50 2024
    , filename = 'species_database.hdf5', file descriptor = 22, errno = 5, error message = 'Input/output error', buf = 0x3c7a868, total write size = 664, bytes this sub-write = 664, bytes actually written = 18446744073709551615, offset = 0)

    I had been warned:

    UserWarning: Adding the full high-resolution grid of Exo-Rem to the HDF5 database may not be feasible since it requires a large amount of memory

    It might often suffice to limit the Teff range (the main culprit), but maybe not always. Not everyone has access to unlimited memory, and smaller files also mean faster reading in, etc.
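
To make point 1 concrete, here is a minimal sketch of the kind of check I have in mind, using only the Python standard library (the file and folder names are just examples, and this is of course not how species is currently implemented):

    import tarfile
    from pathlib import Path

    def unpack_missing(tar_path: str, dest_dir: str) -> None:
        """Extract only the tar members that are missing from dest_dir."""
        dest = Path(dest_dir)
        with tarfile.open(tar_path, "r:*") as tar:
            for member in tar.getmembers():
                target = dest / member.name
                # Comparing sizes guards against files left by an aborted run
                if target.is_file() and target.stat().st_size == member.size:
                    continue  # already unpacked correctly, so skip it
                tar.extract(member, path=dest)

    unpack_missing("exo-rem-highres.tgz", "data_folder")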

Sorry if these are naive beginner's questions showing that I do not understand the deeper structure of species :). I am hoping they will at least be useful to others, through your comments here and/or in the documentation (which is already very good)!

gabrielastro commented 7 months ago
  1. Also related: would it be possible to have in database.add_model() the option of limiting not only teff but also the other parameter ranges? Some parameter ranges are sometimes known to be much too wide (especially if the parameter will be fixed), so one could save some (a lot of) read-in time.

  2. Atmospheric model files can have maaaany significant digits:

    # Wavelength (um) - Flux (W m-2 um-1)
    5.000000000000000000e-01 1.046887465549093861e+02
    5.001237804050535640e-01 1.020516848591326635e+02
    5.002475914532845680e-01 1.008752535495290061e+02
    5.003714331522789438e-01 1.008323757761552457e+02

    but this could be dramatically reduced without loss of precision, which would reduce not only the needed RAM (I guess) but certainly the file size, by a large factor (ca. 3, or 0.5 dex!). A possible truncation is sketched below.
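
As an illustration, the truncation could be as simple as this numpy sketch (the file names are hypothetical, and the digit counts are just an example):

    import numpy as np

    # Rewrite a model spectrum with 8 significant digits for the wavelength
    # and 4 for the flux, which roughly halves the length of each line
    wavel, flux = np.loadtxt("spectrum.dat", unpack=True)
    np.savetxt("spectrum_small.dat",
               np.column_stack((wavel, flux)),
               fmt=("%.7e", "%.3e"),
               header="Wavelength (um) - Flux (W m-2 um-1)")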

tomasstolker commented 7 months ago

Thanks for opening another issue 😊!

  1. Indeed, the whole file is unpacked. Ideally it would check whether the files are present and have the correct size (i.e. that a previous attempt was not aborted), but this is not something I plan to implement (a PR would be welcome!). I can add an unpack parameter to add_model though.

  2. There is the configuration file, where you can set the download/unpack folder and also the HDF5 file that is to be used, so downloading, unpacking, and adding are only required once (see the example below this list).

  3. I have implemented this in commit 4804241eba3e8bf16babf21660cbb38bd02eb050, so now only the spectra files within the selected teff_range are unpacked. For now I am not planning to add support for partial files, since setting teff_range should be fine for most cases, but I would welcome a PR.

  4. Hm, not sure. It should be somewhat straightforward to implement, but my guess is that such additional restrictions are typically not needed. The model spectra only have to be added once, so it should be fine if that takes a few minutes (at most). In FitModel it is possible to fix a parameter though!

  5. I wouldn't know what a reasonable number would be... It may also depend on the model? So I would rather stick with the safe number of decimals 😉
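
For point 2, the configuration file (species_config.ini in the working folder) contains something like the following, with both paths adapted to where you want the data and database to live:

    [species]
    database = /path/to/species_database.hdf5
    data_folder = /path/to/data_folder/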

gabrielastro commented 7 months ago

Thanks for the quick answer to one more issue 😊. And thanks a lot for adding the unpack parameter! It works very nicely; everything is much faster now.

  1. Good point that it would require some checks to avoid bad surprises later on. There probably exist high-quality libraries implementing this, but this is not my area… Maybe an easy compromise would be a "use-at-your-own-risk" option that only checks whether the file exists (and warns the user that it is up to him or her to make sure the files are not corrupted)?

  2. Ah! perfect. Sorry for having missed this!

  3. Thanks! I agree that partial wavelength ranges are not needed (and would require some trouble to program).

4a. Even though add_model() is now much faster, it still takes seeeeveral minutes. Initialising with:

database.add_model(model='exo-rem-highres', teff_range=(1000., 1500.))

printed Unpacking 3171/9575 model spectra from Exo-REM (16 GB)... after eight minutes, and after a further ca. four minutes:

Please cite Charnay et al. (2018) when using Exo-REM in a publication
Reference URL: https://ui.adsabs.harvard.edu/abs/2018ApJ...854..172C/abstract
Wavelength range (um) = 0.67 - 250.0
Spectral resolution = 20000
Teff range (K) = 1000.0 - 1500.0

after which the unpacking began. Ten more minutes later it printed Adding Exo-REM model spectra... [DONE], and after ca. two more minutes it was done, ending with:

Number of stored grid points: 3300
Number of interpolated grid points: 128

So, 25 minutes in total. Programming the main part of the code obviously takes (me) much more time than this :), but still… The (university) computer is quite modern, with many cores and, I thought, sufficient RAM, and there are no network issues. Or do these numbers seem suspicious to you?

I guess it has to be done only once per database file. I was thinking of having a local database for every project to keep it tidier, but given the time it takes I will probably re-use the database across projects. Which is ok too! I am still a beginner.

Maybe restricting the other parameters in the same way as Teff would still be relatively easy to implement? Or maybe not because not all grids have the same axes…

4b. If database.add_model() is called again for the same model, the same process starts over, right? Maybe the following changes would be sensible: if wavel_range changes, keep the behaviour as is (everything just has to be overwritten anyway, to avoid complicated programmer-side checks and tracking), but if some other parameter range differs from what is stored in the database, add only the needed part instead of again unpacking and reading in everything. If a range (e.g., of Teff) gets reduced, models would need to be deleted from the database file, instead of all models from that family being deleted and only the desired ones being read in again. But again, if it is too complicated (messy) to program, maybe it is not needed…

  5. Actually, it may be simple:
    • The smallest Delta lambda for each grid is known, so keeping three more significant digits than needed to resolve this smallest spacing would be more than enough. Actually, since the wavelength is in scientific notation and the resolution R = lambda/Delta lambda is constant, the number of digits needed is easy: N_digits = ceil(log10(R)) + 3 is more than enough (a quick check is below).
    • What is the highest per-bin flux accuracy conceivable in the model files that you read in and/or in the observations? SNR = 1e3? So four, or at the very most five, significant digits would be more than enough, right?

Was that convincing ;)? These measures would reduce the file sizes dramatically!
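
A quick numerical check of the digit count (my own sketch; R = 20000 is the Exo-REM resolution printed above):

    import math

    # For a constant R = lambda/Delta lambda, consecutive wavelengths differ
    # in roughly the ceil(log10(R))-th significant digit, so keeping
    # ceil(log10(R)) + 3 digits preserves the sampling with margin to spare
    R = 20000
    n_digits = math.ceil(math.log10(R)) + 3  # = 8 significant digits

    wavel = 5.001237804050535640e-01  # a wavelength from the example file above
    print(f"{wavel:.{n_digits - 1}e}")  # 5.0012378e-01 instead of 19 digits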

tomasstolker commented 7 months ago
  1. That does take a long time! I think the high-res grid had not been used by anyone before... Okay, perhaps adding the additional parameter restrictions would help in that case. What you can also do is create a separate folder, copy into it only the model spectra files from data_folder that you want to use, and then add them with add_custom_model (a sketch is below this list).

  2. An SNR of 1000 would certainly be sufficient for planet observations, but I don't know whether that is also a typical convergence accuracy for the atmospheric models?
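
A rough sketch of that workaround (the folder layout and the file-name pattern are assumptions, so adapt them to the actual grid files; see the API documentation for the exact add_custom_model arguments):

    import shutil
    from pathlib import Path

    src = Path("data_folder/exo-rem-highres")
    dst = Path("data_folder/exo-rem-subset")
    dst.mkdir(exist_ok=True)

    # Keep only the spectra whose file name encodes a Teff in the wanted
    # range, assuming names like 'exo-rem_teff_1200_logg_4.0_spec.dat'
    for spec_file in src.glob("*_spec.dat"):
        teff = float(spec_file.name.split("_")[2])
        if 1000.0 <= teff <= 1500.0:
            shutil.copy(spec_file, dst / spec_file.name)

    # The subset folder can then be added with database.add_custom_model()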

tomasstolker commented 7 months ago
  1. I just realized that add_model has the wavel_range parameter. It does have to read in each file, but it then only stores the selected wavelength range.
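
For example (the argument values are only an illustration):

    database.add_model(model='exo-rem-highres',
                       teff_range=(1000., 1500.),
                       wavel_range=(1.0, 2.5))
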
gabrielastro commented 7 months ago

Remaining in this thread:

  1. I realised that when calling FitModel() with a value for teff in bounds but without having first called database.add_model(), the models for all Teff values are added, instead of only the needed range. Maybe it would make sense to align this behaviour with that of database.add_model(). Also, I guess this means that one (never?) needs to call database.add_model() explicitly and that FitModel() will take care of it.

  2. In passing: "safe some time" → "save some time".

  3. Right, perfect! Thanks.

  4. Thanks for pointing that out! By the way, about spec_res:

    spec_res : float, None
    Spectral resolution to which the spectra will be resampled.
    This parameter is optional since the spectra have already
    been resampled to a lower, constant resolution (typically
    :math:`R = 5000`). The argument is only used if
    ``wavel_range`` is not ``None``.

    I confused myself into thinking that it is meaningful to set this if the model resolution is much higher than that of the data, but that is not true; FitModel will take care of this. Maybe it would be good to add a comment on this in the documentation. Is there a typical scenario where spec_res would be useful? (An example call is shown below this list.) Also, what happens if the model resolution is worse than that of the data?

  5. Good suggestion, thank you!

  6. Maybe the SNR I mentioned is a red herring. Certainly, interpolating the models means that we cannot trust more than the third or fourth significant digit to be physically meaningful, so anything past this can be removed from the input file (while of course still working numerically in double precision or whatever is needed). That would dramatically reduce the file size.
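
For reference, the combination that the docstring describes would look something like this (the values are illustrative):

    database.add_model(model='exo-rem-highres',
                       teff_range=(1000., 1500.),
                       wavel_range=(1.0, 2.5),
                       spec_res=1000.)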

tomasstolker commented 7 months ago

There is more in species than FitModel 😉, so 1 and 6 seem too specific. The spec_res parameter of add_model was needed in an early version of the package. Currently, it only helps in terms of time/storage efficiency.

gabrielastro commented 7 months ago

Ok about 1 :smile:. What happens if the model resolution is worse than that of the data, though? But for 6, I cannot imagine how an interpolated model could have more than three or four significant digits; the rest is basically numerical noise… (There is no real harm, though; it only costs storage space, transfer time, and elegance :wink:.)

tomasstolker commented 7 months ago

Not all routines require interpolation so I think that your suggestion could introduce unwanted inaccuracies for some of the non-FitModel applications.

gabrielastro commented 7 months ago

Hmm… Ok, as you wish… To be continued another time. Or just try truncating the input files and see how many bug reports start coming in :grin:. Thanks in any case for this exchange!