openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[REVIEW]: cortecs: A Python package for compressing opacities #6104

Closed by editorialbot 1 month ago

editorialbot commented 10 months ago

Submitting author: @arjunsavel (Arjun B. Savel)
Repository: https://github.com/arjunsavel/cortecs
Branch with paper.md (empty if default branch):
Version: v1.0.1
Editor: @ivastar
Reviewers: @pscicluna, @doriannblain
Archive: 10.5281/zenodo.13208759

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b0c4f5012351abb476081ffe9226fbf4"><img src="https://joss.theoj.org/papers/b0c4f5012351abb476081ffe9226fbf4/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b0c4f5012351abb476081ffe9226fbf4/status.svg)](https://joss.theoj.org/papers/b0c4f5012351abb476081ffe9226fbf4)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@pscicluna & @doriannblain, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @ivastar know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Checklists

📝 Checklist for @doriannblain

📝 Checklist for @pscicluna

editorialbot commented 10 months ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 10 months ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.04 s (1093.0 files/s, 135369.1 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          24            577            984           1643
TeX                              1             11              0            268
Markdown                         3             60              0            200
YAML                             9             40             58            198
Jupyter Notebook                 4              0           1371            188
reStructuredText                 6             57             76             90
-------------------------------------------------------------------------------
SUM:                            47            745           2489           2587
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 10 months ago

Wordcount for paper.md is 826

editorialbot commented 10 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1038/s41586-021-03912-6 is OK
- 10.3847/1538-3881/ac658f is OK
- 10.3847/2041-8213/ace828 is OK
- 10.1051/0004-6361/201322971 is OK
- 10.1088/0004-637X/775/2/137 is OK
- 10.1098/rsta.2015.0202 is OK
- 10.1186/s40537-021-00444-8 is OK
- 10.3847/1538-4357/ac61d6 is OK
- 10.1109/BigData.2017.8257926 is OK
- 10.21105/joss.00024 is OK
- 10.1117/12.2312345 is OK
- 10.1051/0004-6361/202038306 is OK
- 10.1117/12.2561564 is OK
- 10.1088/0004-637X/707/1/24 is OK
- 10.1093/mnras/stad2586 is OK
- 10.3847/1538-3881/acd24d is OK
- 10.3847/1538-4357/ac423f is OK
- 10.1051/0004-6361/202347262 is OK
- 10.3847/1538-3881/accd65 is OK
- 10.1093/mnras/stac3388 is OK
- 10.1093/mnras/sty1877 is OK
- 10.1093/mnras/stt2011 is OK
- 10.1093/mnras/stz2778 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 10 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

doriannblain commented 10 months ago

Review checklist for @doriannblain

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

doriannblain commented 10 months ago

Hi all, I finished my review. I attached my summary in a separate file to slightly reduce the length of my reply.

General comment

I tested cortecs by following the installation procedure, and the "quickstart" section of the documentation. Those sections are overall well made, and I found the code relatively easy to use and to understand. The accompanying paper is also well-written. Unfortunately, a couple of key points are not discussed. However, this is an indubitably useful submission and I recommend it for publication after major revisions.

I have two main issues. The first one is that I find that the core functionality of cortecs (lowering the memory usage of opacities) is not well exposed in the documentation and lacks a more concrete example. My understanding is that a real-world usage of cortecs would be the following:

  1. Load opacities with an Opac object and fit them with a Fitter object,
  2. Save the "fitting parameters" on disk,
  3. When initializing a retrieval, load the previously saved fitting parameters,
  4. Use an eval function with the loaded fitting parameters to retrieve the opacities on-demand.

Step 1 is already well described. I suggest that the authors rework the "Quickstart" section to include steps 2 to 4 in a clearer way. Step 3 can simply be the loading step. The "Checking accuracy and speed" subsection is useful. It should be kept, but at the end of the section (and added in some form to the paper, see below).

My second main issue is that it is unclear what exactly the limitations of cortecs are (if there are any at all). The benchmark provided in the article, while convincing, only shows the case of a hot Jupiter. I acknowledge that this is a powerful example, as currently, most high-resolution studies are done on this kind of planet. At the same time, hot Jupiter emission spectra in the near-infrared (the benchmark case) usually have most of their contribution coming from a limited region of the opacity data pressure-temperature space, that is 800-1500 K and 0.01-10 bar, while opacity data pressure and temperature grids commonly extend from 100 to 3000 K, and 1e3 to 1e-6 bar.

While I don't think more benchmarks are necessary, the article should quantify the differences between the "decompressed" opacities and the original opacities over the entire wavelength-pressure-temperature space of the opacity data, at least for key species such as H2O and CO (this is already partially done in the documentation). A tool/function should also be provided to the users so they can easily evaluate these differences. In addition, it should be made clear in the documentation that this evaluation is part of the intended, standard cortecs workflow. This should also be accompanied by a recommendation to compare spectra built with and without using cortecs.

I have other comments, detailed in the sections below.

Code- and documentation-related comments

cross_section = np.log10(cross_section)


- In the same place, could the authors explain this choice of 1e-104 as a threshold value, instead of e.g. `sys.float_info.min`?
- Likewise, the error raised in cell 8 of the "Quickstart" section is well explained, but it would be better if that error is not raised at all (it also prevents an easy full rerun of the notebook). The code already has the correct condition to handle it, so why not return hyperparameters that would reconstruct constant opacities instead of raising an error? It would be much more practical for the user and expand the use cases of `cortecs`.
- For the PCA fitting method, I would need the following clarifications:
    - what is the purpose of calling `fit_mlr` in `do_pca` when `prep_pca` only returns `xMat`, which is calculated by `do_svd`?
    - what is the purpose of calculating `fNorm` in `fit_mlr` when `fit_pca` only returns `beta`?
    - Is there a reason for the SVD to be performed along the pressure grid (size 13) rather than the temperature grid (size 30)? Would not the "compression factor" be higher, up to 15 (or 30?) for `nc=1`, if the temperature grid was chosen instead?
- The `Evaluator.eval` function uses the temperature and pressure grid, so its arguments should be indices, not the actual temperature and pressure values.
- Related to the above point, `Evaluator.eval` should raise a clear message when the given pressure or temperature are out-of-bounds.
- A useful addition is the ability for the user to easily save and load "fitting parameters" (as outlined in my general comment). Right now, this is implemented for neural networks but not for PCA fits, and it is inoperative for Polynomial fits (the `save` argument of `fit_polynomial` does not do anything). I suggest implementing this feature for all the fitting frameworks.
- Since Tensorflow is optional for this package to work, I suggest making it an optional dependency.
- The code has minor logic weaknesses, deprecated docstrings, and could better follow the PEP8 convention. A PyCharm code inspection gave me several warnings and more than 100 weak warnings, most of them being PEP8 violations. The authors can also use `flake8` to catch those. These should be quickly fixable and improve the code quality.
- Some of the modules (e.g. fit_pca.py) have the `if __name__ == '__main__'` pattern, signalling them as scripts, which they should not be.

Paper-related comments
- Looking at the `cortecs` repository commits, it seems that A. B. Savel is the only contributor to the code. Can the authors expand on the contribution of M. Bedell and E. M.-R. Kempton?
- I would suggest using more specific terms when referring to "memory" in the paper (i.e., make a clear distinction between RAM, VRAM, non-volatile memory), to prevent potential confusions.
- l.17: the authors state that the code is available on `conda`, but the documentation states that it will be distributed to `conda` "at a later date". I do not think the package is on `conda` yet, can the authors solve this?
- l.33: the authors refer to the memory footprint "on disk". It would also be interesting to at least briefly mention how the RAM/VRAM usage usually compares with disk usage, as this is the main issue that `cortecs` is addressing.
- l.34: the IGRINS acronym should be defined.
- l.34—35: can the authors give a reference for the opacity data used in this example, as well as specify the size of the pressure and temperature grids?
- l.35: I suggest using the more precise term "resolving power" instead of "resolution".
- l.38—39: I suggest adding a figure illustrating the wavelength, pressure, and temperature dependencies of the opacities. This figure could also be used to show an example of a fit(s) with `cortecs`.
- Methods: 
    - Could the authors discuss their choice of using the term "compression/decompression" and not for example "fitting/modelling"? If "compression" is appropriate, then it should be specified if the compression is lossy (which I believe to be the case) or lossless, and why.
    - I think this section would benefit from a slightly more concrete/detailed explanation on how the "compression/decompression" is achieved, that is, saving and loading "fitting parameters" (I will use this term despite it not being exact for PCA and neural nets). A small example would also be valuable.
- l.52—55: the features detailed in this paragraph are not mentioned in the documentation. Besides, they are not fully implemented yet. However, these features seem peripheral. If that is indeed the case, implementing them is not necessary for this publication, and this paragraph could be removed. If those features are important, then examples for each of them should be provided in the documentation.
- l.63: 
    - the opacity data sizes before and after "compression" should be added, as well as a reference for the opacities used. In that case, it would also be interesting to see the size of the PCA vectors and PCA coefficients. A "compression" performance comparison of each method would also be valuable (this can be e.g. a small table listing compression factors).
    - in the "Quickstart" example, the combined size (i.e. number of elements) of those two arrays (PCA vectors and PCA coefficients) is 3.2 times smaller than the size of the cross-section array, so less than the factor 13 mentioned in the article (but still significant). Even by setting `nc=1`, this factor only reaches 6.5. I am probably missing something as 6.5 is suspiciously the half of 13, which is also the size of the pressure grid. In all cases, the setup leading to that 13 "compression factor" needs to be exposed. It should also be clarified that the maximum "compression factor" (for that method at least) depends on the size of the pressure grid, if that is indeed the case.
- l.66: the runtimes with "compressed" and "uncompressed" opacities should be indicated, accompanied by the hardware used. The data reading time, and data "decompression" time should also be indicated. Similarly to my previous comment, a performance comparison of all "compression" methods would be valuable.
- l.66—67:
    - Are the retrieval and data described here detailed in another (upcoming) article? If yes, this should be mentioned.
    - While convincing, Fig. 1 by itself is not sufficient to claim that `cortecs` can be used "in [all] high-resolution retrievals". For that, the article needs to display examples challenging the opacity evaluating functions, such as very cold or very hot planets, planets with a thin atmosphere, or planets with high (Z > 100) or extreme metallicity (e.g. steam, Earth-like or Venus-like atmospheres). I think this is beyond the scope of this paper though, so I would suggest instead writing "in at least some high-resolution retrievals". Adding for example a histogram showing the differences between the "decompressed" and "uncompressed" opacities (akin to what is done in the documentation), and a small discussion around it, could help the reader form an informed opinion.
- Fig. 1: 
    - The gold and teal labels are misaligned; this should be fixed.
    - In the caption: WASP-77A -> WASP-77Ab.
- The documentation mentions an `Optimizer` object that looks partially implemented in the code (it seems to only work for neural nets) and not mentioned in the paper. Do the authors have plans to develop on this feature in this submission?
pscicluna commented 9 months ago

Review checklist for @pscicluna

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

ivastar commented 8 months ago

Happy new year everyone here! Pinging @pscicluna with a reminder about the review. Please let us know when you might be able to complete it.

pscicluna commented 8 months ago

@ivastar sorry for being a little slow, but I should be done in a couple of days!

pscicluna commented 8 months ago

I apologise for the delay, but here is my review!

Summary:

The authors present an interesting and widely applicable software package. They are targeting its usage for dimensionality reduction of large opacity datasets for e.g. molecular gas, but I expect it could be applied much more widely (perhaps with small tweaks). For example, I expect that cortecs could also be useful for interpolation of opacities, including prediction uncertainty.

Similar to @doriannblain, I also find it a little difficult to see how cortecs would be integrated into an analysis pipeline. I imagine the idea is to start your pipeline by generating the compressed opacities, then (if necessary) distributing the evaluators to all the compute devices, then the sampler can run on all devices to get the posterior you’re after? If that’s the case, it would be a good idea to provide an example workflow, either in the docs or by describing it further in the paper.

I also find it hard to tell when cortecs can be deployed to maximum benefit, and when it should be avoided. For example, can it only be used for molecular gas opacities, or could it be used to compress atomic opacities, or even temperature-dependent dust opacities? How large and complex does a dataset need to be to profit from this treatment? And is there a dataset too big for cortecs?

Otherwise, I have a few fairly high-level comments on the paper and the docs. I have tried to stay away from commenting on small details for now.

Paper:

The text of the paper is clear and well-written, however I would appreciate a more extensive description of the methods. Meanwhile, since the benchmark only shows downstream analysis (posterior samples), it would be nice to see both some additional examples, and more detail on the examples. For example, I would find it very interesting to have a figure or two showing how the reconstructed opacities compare to the original data (including the residuals) for different choices of compression. For example, for both rich and sparse opacities, how much does the choice of the number of components affect the quality of the PCA reconstruction, or the choice of hyperparameters for the neural network (or even architecture)? Could you also show an extreme example where cortecs actually fails? This would illustrate when cortecs is effective and what its limitations are so that users can get a better understanding of how to use it best.

I also noticed that there is no description of the compression algorithms in the paper. This is fine for PCA and the polynomials, but some description of the architectures of networks that are available and what the default choices are would be very helpful.

In figure one, I noticed that the posteriors without compression show much more substructure than the ones with compression. Is this because some behaviour has been removed by the compression, resulting in fewer regions of the prior volume that multinest finds with high likelihood? Or is it because the run with compression was faster per iteration, and hence you could generate more samples for a smoother posterior?

Docs: I attempted to run the quickstart example, and found it did not load the data because the data appeared to have been saved with the keyword allow_pickle=True, however I do not seem to be able to set this keyword myself. As a result, I could not continue with the test. Since the same dataset is used for all the examples, I could not run any of them. As a result, I was also unable to verify the functionality of the package for the moment.

I would really like to see more detailed API documentation. What are the meanings of the parameters, how does changing them affect the output, etc? There are even some modules entirely missing, such as the optimisers.

Similar to the paper, the documentation does not describe what architecture(s) of neural networks can be used. What is the default network (fully-connected, I assume)? Can the user supply a different network architecture if they want to, and if so how must it be defined? How might the choice of network architecture affect the results?

I find the discussion of Optimisation of the fitting a bit difficult to follow, because some key details are missing. What is being optimised for - reconstruction accuracy, compression ratio, training/validation loss, … ? What is the algorithm being deployed for optimisation? Without this information, it is hard to tell when a user should try optimisation and what they should hope to get out of it.

Code:

I think that @doriannblain has already covered everything required on the code itself, so I won’t make any comment on it right now.

ivastar commented 8 months ago

@pscicluna and @doriannblain thank you for the detailed reviews!

I am passing the ball back to @arjunsavel. Please review the comments and implement/respond to them. Further discussion here in the thread or in issues on the repo is welcome. It would be great if you can respond in the next 3-4 weeks, i.e., by ~February 19, 2024.

arjunsavel commented 8 months ago

Sounds great, thank you all!

arjunsavel commented 8 months ago

Thank you to both referees for comprehensive and helpful responses! We’ve implemented the requested changes — see the below descriptions. Please let me know if you have any follow-up questions or concerns about the codebase. Thank you again!

Response to @doriannblain

I have two main issues. The first one is that I find that the core functionality of cortecs (lowering the memory usage of opacities) is not well exposed in the documentation and lacks a more concrete example. My understanding is that a real-world usage of cortecs would be the following:

Load opacities with an Opac object and fit them with a Fitter object,

Save the "fitting parameters" on disk,

When initializing a retrieval, load the previously saved fitting parameters,

Use an eval function with the loaded fitting parameters to retrieve the opacities on-demand.

Step 1 is already well described. I suggest that the authors rework the "Quickstart" section to include steps 2 to 4 in a clearer way. Step 3 can simply be the loading step. The "Checking accuracy and speed" subsection is useful. It should be kept, but at the end of the section (and added in some form to the paper, see below).

We have now included these steps in the tutorial and explicitly outlined these steps in the new “Workflow” section of the paper.
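In sketch form, the workflow now looks roughly like the following (simplified; the Quickstart notebook has the exact import paths, filenames, and keyword arguments, which should be treated as authoritative):

from cortecs.opac.opac import Opac
from cortecs.fit.fit import Fitter        # module paths for Fitter and Evaluator may differ slightly
from cortecs.eval.eval import Evaluator   # from the released package; see the Quickstart

# 1. Load opacities (the filenames here are placeholders) and fit/compress them
load_kwargs = {
    "T_filename": "temperatures.npy",   # placeholder
    "P_filename": "pressures.npy",      # placeholder
    "wl_filename": "wavelengths.npy",
}
opac_obj = Opac("cross_sections.npy", loader="platon", load_kwargs=load_kwargs)
fitter = Fitter(opac_obj, wav_ind=-2, nc=3)
fitter.fit()

# 2. Save the fitting parameters to disk
fitter.save("h2o_pca_params")

# 3. In the retrieval, load the saved parameters (see the tutorial) and build an evaluator
evaluator = Evaluator(opac_obj, fitter)

# 4. Evaluate the (log) opacity on demand; for the platon loader, T is in K, P in Pa, wavelength in m
evaluator.eval(1500.0, 1.0e5, 1.5e-6)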

My second main issue is that it is unclear what exactly the limitations of cortecs are (if there are any at all). The benchmark provided in the article, while convincing, only shows the case of a hot Jupiter. I acknowledge that this is a powerful example, as currently, most high-resolution studies are done on this kind of planet. At the same time, hot Jupiter emission spectra in the near-infrared (the benchmark case) usually have most of their contribution coming from a limited region of the opacity data pressure-temperature space, that is 800-1500 K and 0.01-10 bar, while opacity data pressure and temperature grids commonly extend from 100 to 3000 K, and 1e3 to 1e-6 bar. While I don't think more benchmarks are necessary, the article should quantify the differences between the "decompressed" opacities and the original opacities over the entire wavelength-pressure-temperature space of the opacity data, at least for key species such as H2O and CO (this is already partially done in the documentation). A tool/function should also be provided to the users so they can easily evaluate these differences. In addition, it should be made clear in the documentation that this evaluation is part of the intended, standard cortecs workflow. This should also be accompanied by a recommendation to compare spectra built with and without using cortecs.

Great point. Cortecs now includes a tool — cortecs.fit.metrics.calc_metrics — that easily calculates the absolute and percentage errors. We also have a stronger warning at the end of the quickstart.
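For example, usage along these lines (the exact call signature is a guess here; see the API documentation for the definitive one):

from cortecs.fit.metrics import calc_metrics

# hypothetical arguments, using the opac_obj and fitter from the sketch above
abs_err, pct_err = calc_metrics(opac_obj, fitter)
print(abs_err.max(), pct_err.max())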

I obtained infinite values in the "Quickstart" example when I ran the eval function at some pressures and temperatures (e.g. at T[1], P[5] and wl[-375]). In my test on the last 500 wavelength points, there were more than 17,000 infinite values (~1% of the tested range). This should be fixed.

We have addressed this bug — all values should be finite now.

Still in the "Quickstart" PCA example, and related to my second main issue, the highest finite error I obtained was 70% when evaluating the whole opacity file (at T[0], P[12] and wl[3949]). This is not necessarily worrying as the authors show a convincing corner plot in their article, but I would like them to comment on this. Also, this should be highlighted in the example and in the article, instead of simply showing the 9% error at one point.

We now comment on the global fit quality, and we now include a warning that users check that the accuracy level is sufficient.

The role of the nc argument in the PCA example should be explained, as it drives the "compression factor" of this method.

We have indicated that nc refers to the number of principal components. In a similar vein, we have also clarified the purpose of the wav_ind parameter.

The units of the pressure, temperature and wavelength arrays should be indicated in the example.

Done!

The installation test outlined in the documentation's "Installation" section does not run any actual test. Instead, this could be as simple as verifying that `python3 -c 'import cortecs'` does not raise any error.

We have added the suggested text to the installation instructions.

The authors do a good job of explaining the divide-by-zero warning occurring in the "Quickstart" section (cell 3). However, a better way is to prevent this warning from being raised altogether. I suggest replacing src/opac/io.py:157-160 with something like:

Thanks for this suggestion — we’ve now implemented it.

In the same place, could the authors explain this choice of 1e-104 as a threshold value, instead of e.g. sys.float_info.min?

Setting these values too low could break the fits. Perhaps a better solution would be to mask out these points in the fitting process. The value of -104 (in log10 space) is set as a boundary for standard opacity functions (e.g., as used by the PLATON code).
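For concreteness, the kind of floor we mean (an illustrative sketch, not the literal io.py code):

import numpy as np

CROSS_SECTION_FLOOR = 1e-104  # matches the boundary used by standard opacity codes such as PLATON

def safe_log_cross_section(cross_section):
    # floor tiny values before taking the log to avoid divide-by-zero warnings
    return np.log10(np.clip(cross_section, CROSS_SECTION_FLOOR, None))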

Likewise, the error raised in cell 8 of the "Quickstart" section is well explained, but it would be better if that error is not raised at all (it also prevents an easy full rerun of the notebook). The code already has the correct condition to handle it, so why not return hyperparameters that would reconstruct constant opacities instead of raising an error? It would be much more practical for the user and expand the use cases of cortecs.

To allow an easy rerun of the whole notebook, we’ve ensured that the first one passed to the function actually works. We’ve additionally passed a “force_fit_constant” parameter to “prep_pca”, which allows fitting constant opacities. The code can currently fit constant opacity functions at certain wavelengths, temperatures, or pressures — it’s just that for fitting an opacity function O(T, P, lambda), our scheme is to find the components that fit O(T, P, 0), then perform linear regression to fit those components to O(T, P, lambda).
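Schematically, that scheme looks like the following (a conceptual numpy sketch, not the literal cortecs implementation; the axis conventions in the released code may differ, see the shape discussion below):

import numpy as np

# toy log-opacity cube with shape (N_T, N_P, N_WAVE)
rng = np.random.default_rng(0)
cross_section = rng.normal(size=(30, 13, 200))
nc = 3  # number of principal components

# components come from an SVD of a single reference wavelength slice, taken along the pressure axis
ref = cross_section[:, :, 0].T                      # shape (N_P, N_T)
U, s, Vt = np.linalg.svd(ref, full_matrices=False)
vectors = U[:, : nc + 1]                            # shape (N_P, nc + 1)

# every wavelength is then fit by linear regression onto those components
n_t, n_p, n_wave = cross_section.shape
coeffs = np.empty((n_wave, nc + 1, n_t))
for i in range(n_wave):
    slice_pt = cross_section[:, :, i].T             # (N_P, N_T)
    beta, *_ = np.linalg.lstsq(vectors, slice_pt, rcond=None)
    coeffs[i] = beta                                # (nc + 1, N_T)

# reconstruction at wavelength i: (vectors @ coeffs[i]).T, with shape (N_T, N_P)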

what is the purpose of calling fit_mlr in do_pca when prep_pca only returns xMat, which is calculated by do_svd?

We thank the reviewer for a detailed review of our code. This is extraneous code that has now been omitted.

what is the purpose of calculating fNorm in fit_mlr when fit_pca only returns beta?

Similarly, this is extraneous code that has now been omitted.

Is there a reason for the SVD to be performed along the pressure grid (size 13) rather than the temperature grid (size 30)? Would not the "compression factor" be higher, up to 15 (or 30?) for nc=1, if the temperature grid was chosen instead?

Interesting point. We chose these axes because of how the arrays are shaped when fitting along multiple axes. The pca_coeffs array is shaped (N_WAVE, N_PC, N_PRES), and the vectors array is shaped (N_TEMP, N_PC). If we swapped which axis we fit along, then the PCA_COEFFS array would be larger. Naively, it seems like this would decrease the compression factor, though this may just be how we’re packaging our data?

The Evaluator.eval function uses the temperature and pressure grid, so its arguments should be indices, not the actual temperature and pressure values.

Two of the eval funcs (neural network and polynomial) use the actual temperature–pressure values, so it’s quicker to pass those values in than index the array. It also seems a bit more user-intuitive to evaluate the opacity function at a temperature and pressure value. Happy to change if you disagree!

Related to the above point, Evaluator.eval should raise a clear message when the given pressure or temperature are out-of-bounds.

We now raise temperature, pressure, and wavelength out-of-bounds errors.

A useful addition is the ability for the user to easily save and load "fitting parameters" (as outlined in my general comment). Right now, this is implemented for neural networks but not for PCA fits, and it is inoperative for Polynomial fits (the save argument of fit_polynomial does not do anything). I suggest implementing this feature for all the fitting frameworks.

We have now implemented parameter-saving for the remaining fitting frameworks.

Since Tensorflow is optional for this package to work, I suggest making it an optional dependency.

True, tensorflow is optional — and it can be a bit tricky to manage in an environment. We’ve made tensorflow and keras optional dependencies, as not all users will be performing neural network fits.

The code has minor logic weaknesses, deprecated docstrings, and could better follow the PEP8 convention. A PyCharm code inspection gave me several warnings and more than 100 weak warnings, most of them being PEP8 violations. The authors can also use flake8 to catch those. These should be quickly fixable and improve the code quality.

We addressed the minor logic weaknesses. The remaining flake8 errors are because lines are too long or from star imports. We prefer to keep line lengths as they are (we’re conforming to the black code formatting style), but we can change to autopep8 if the reviewer desires.

Some of the modules (e.g. fit_pca.py) have the `if __name__ == '__main__'` pattern, signalling them as scripts, which they should not be.

We have fixed this.

Looking at the cortecs repository commits, it seems that A. B. Savel is the only contributor to the code. Can the authors expand on the contribution of M. Bedell and E. M.-R. Kempton?

Thank you for the careful read of the paper. M. Bedell and E. M. R. Kempton both advised heavily on the design of this project.

I would suggest using more specific terms when referring to "memory" in the paper (i.e., make a clear distinction between RAM, VRAM, non-volatile memory), to prevent potential confusions.

Thank you for the suggestion — the paper reads more clearly now! We have replaced “memory” with specific qualifiers in all instances.

l.17: the authors state that the code is available on conda, but the documentation states that it will be distributed to conda "at a later date". I do not think the package is on conda yet, can the authors solve this?

We have amended the line in the paper to reflect that this package is not currently distributed on conda.

l.33: the authors refer to the memory footprint "on disk". It would also be interesting to at least briefly mention how the RAM/VRAM usage usually compares with disk usage, as this is the main issue that cortecs is addressing.

We address this as follows: "These on-disk memory quotes are relatively faithful to the in-memory RAM footprint of the data when stored as numpy arrays (2.1 GB for the uncompressed data and 160 MB for the compressed data)."

l.34: the IGRINS acronym should be defined.

We have clarified that IGRINS is a spectrograph and spelled out its acronym (Immersion GRating INfrared Spectrometer).

l.34—35: can the authors give a reference for the opacity data used in this example, as well as specify the size of the pressure and temperature grids?

We have clarified the opacity tables used.

l.35: I suggest using the more precise term "resolving power" instead of "resolution".

We have made the requested change.

l.38—39: I suggest adding a figure illustrating the wavelength, pressure, and temperature dependencies of the opacities. This figure could also be used to show an example of a fit(s) with cortecs.

We’ve included a new figure showing wavelength, pressure, and temperature dependencies of the opacities along with residuals against cortecs fits.

Could the authors discuss their choice of using the term "compression/decompression" and not for example "fitting/modelling"? If "compression" is appropriate, then it should be specified if the compression is lossy (which I believe to be the case) or lossless, and why.

We include the following text now: "We generally refer to this process as 'compression' as opposed to 'fitting' to emphasize that we do not seek to construct physically motivated, predictive, or comprehensible models of the opacity function. Rather, we simply seek representations of the opacity function that consume less RAM/VRAM. The compression methods we use are 'lossy,' in that the original opacity data cannot be exactly recovered with our methods. We find that the loss of accuracy is tolerable for at least the hot Jupiter application (see Benchmark below)."

I think this section would benefit from a slightly more concrete/detailed explanation on how the "compression/decompression" is achieved, that is, saving and loading "fitting parameters" (I will use this term despite it not being exact for PCA and neural nets). A small example would also be valuable.

We have included a new “workflow” section that hopefully illustrates this more clearly. Happy to elaborate!

l.52—55: the features detailed in this paragraph are not mentioned in the documentation. Besides, they are not fully implemented yet. However, these features seem peripheral. If that is indeed the case, implementing them is not necessary for this publication, and this paragraph could be removed. If those features are important, then examples for each of them should be provided in the documentation.

Referring to "In addition to these compression/decompression methods, cortecs provides utility scripts for working with large opacity files. For instance, cortecs can convert opacity files between popular formats, break opacity files into wavelength "chunks" for parallel computations across CPUs, and add overlap between chunked files for calculations that include Doppler shifts." — we have removed this paragraph.

the opacity data sizes before and after "compression" should be added, as well as a reference for the opacities used. In that case, it would also be interesting to see the size of the PCA vectors and PCA coefficients. A "compression" performance comparison of each method would also be valuable (this can be e.g. a small table listing compression factors).

We now list the data sizes before and after compression. We also provide a table demonstrating the compression factor achieved using other methods. We have also provided a table demonstrating these sizes for the benchmark dataset. Would it be helpful if we provided a second table for the Quickstart dataset?

in the "Quickstart" example, the combined size (i.e. number of elements) of those two arrays (PCA vectors and PCA coefficients) is 3.2 times smaller than the size of the cross-section array, so less than the factor 13 mentioned in the article (but still significant). Even by setting nc=1, this factor only reaches 6.5. I am probably missing something as 6.5 is suspiciously the half of 13, which is also the size of the pressure grid. In all cases, the setup leading to that 13 "compression factor" needs to be exposed. It should also be clarified that the maximum "compression factor" (for that method at least) depends on the size of the pressure grid, if that is indeed the case.

We have clarified the shape / size of the opacity data used in the benchmark. It is different from the Quickstart opacity data — i.e., it has 18 pressure points, as opposed to 13.

l.66: the runtimes with "compressed" and "uncompressed" opacities should be indicated, accompanied by the hardware used. The data reading time, and data "decompression" time should also be indicated. Similarly to my previous comment, a performance comparison of all "compression" methods would be valuable.

We have included these timing estimates in the paper.

Are the retrieval and data described here detailed in another (upcoming) article? If yes, this should be mentioned.

We now indicate this in the article: "The non-compressed retrieval uses the data and retrieval framework from [@line:2021]; these specific samples were generated for an upcoming article (Savel et al. 2024, submitted)."

While convincing, Fig. 1 by itself is not sufficient to claim that cortecs can be used "in [all] high-resolution retrievals". For that, the article needs to display examples challenging the opacity evaluating functions, such as very cold or very hot planets, planets with a thin atmosphere, or planets with high (Z > 100) or extreme metallicity (e.g. steam, Earth-like or Venus-like atmospheres). I think this is beyond the scope of this paper though, so I would suggest instead writing "in at least some high-resolution retrievals". Adding for example a histogram showing the differences between the "decompressed" and "uncompressed" opacities (akin to what is done in the documentation), and a small discussion around it, could help the reader form an informed opinion.

We have included the suggested wording, and we add the fitting results per a suggestion above.

The gold and teal labels are misaligned; this should be fixed.

We have fixed the alignment of the labels.

In the caption: WASP-77A -> WASP-77Ab.

Done!

The documentation mentions an Optimizer object that looks partially implemented in the code (it seems to only work for neural nets) and not mentioned in the paper. Do the authors have plans to develop on this feature in this submission?

Yes, we plan to develop this feature further. We have now extended the Optimizer to cover the PCA method as well. Doing so for the polynomial method would require a significant testing refactor — we’re happy to do so if it would make the case for Optimizer stronger.

Response to @pscicluna

Similar to @doriannblain, I also find it a little difficult to see how cortecs would be integrated into an analysis pipeline. I imagine the idea is to start your pipeline by generating the compressed opacities, then (if necessary) distributing the evaluators to all the compute devices, then the sampler can run on all devices to get the posterior you’re after? If that’s the case, it would be a good idea to provide an example workflow, either in the docs or by describing it further in the paper.

We now describe an example workflow after the methods section.

I also find it hard to tell when cortecs can be deployed to maximum benefit, and when it should be avoided. For example, can it only be used for molecular gas opacities, or could it be used to compress atomic opacities, or even temperature-dependent dust opacities? How large and complex does a dataset need to be to profit from this treatment? And is there a dataset too big for cortecs?

We describe in detail a number of potential failure modes at the end of the “Statement of Need” section. We are happy to mock up a toy model showing such failure if you’d like!

The text of the paper is clear and well-written, however I would appreciate a more extensive description of the methods. Meanwhile, since the benchmark only shows downstream analysis (posterior samples), it would be nice to see both some additional examples, and more detail on the examples. For example, I would find it very interesting to have a figure or two showing how the reconstructed opacities compare to the original data (including the residuals) for different choices of compression. For example, for both rich and sparse opacities, how much does the choice of the number of components affect the quality of the PCA reconstruction, or the choice of hyperparameters for the neural network (or even architecture)? Could you also show an extreme example where cortecs actually fails? This would illustrate when cortecs is effective and what its limitations are so that users can get a better understanding of how to use it best.

We include a new plot (new Fig. 1) showing temperature, pressure, and wavelength dependencies. The failure points of cortecs are described in our response to the second comment of this review.

I also noticed that there is no description of the compression algorithms in the paper. This is fine for PCA and the polynomials, but some description of the architectures of networks that are available and what the default choices are would be very helpful.

We have clarified in our paper the standard neural network architecture and how the user can specify non-standard architectures.
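For reference, the kind of small fully connected network we have in mind (an illustrative sketch; the default layer count and widths in cortecs may differ):

import tensorflow as tf

# map (temperature, pressure) at a single wavelength to a log cross section;
# the layer sizes below are illustrative, not necessarily the cortecs defaults
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(2,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ]
)
model.compile(optimizer="adam", loss="mse")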

In figure one, I noticed that the posteriors without compression show much more substructure than the ones with compression. Is this because some behaviour has been removed by the compression, resulting in fewer regions of the prior volume that multinest finds with high likelihood? Or is it because the run with compression was faster per iteration, and hence you could generate more samples for a smoother posterior?

We have clarified as follows: "The two posterior distributions exhibit slightly different substructure, which we attribute to the compressed results requiring 10% more samples to converge and residual differences between the compressed and uncompressed opacities."

I attempted to run the quickstart example, and found it did not load the data because the data appeared to have been saved with the keyword allow_pickle=True, however I do not seem to be able to set this keyword myself. As a result, I could not continue with the test. Since the same dataset is used for all the examples, I could not run any of them. As a result, I was also unable to verify the functionality of the package for the moment.

Thanks for catching this! We have included the “allow_pickle=True” argument to all calls of np.load. If this still does not work, would you mind telling us your Python and NumPy versions?

I would really like to see more detailed API documentation. What are the meanings of the parameters, how does changing them affect the output, etc? There are even some modules entirely missing, such as the optimisers.

We have included the optimizer module in the API documentation. Additionally, we have written more detailed API documentation and reformatted some docstrings so that autodoc picks them up. (todo: check that fit_neural_net is picked up)

Similar to the paper, the documentation does not describe what architecture(s) of neural networks can be used. What is the default network (fully-connected, I assume)? Can the user supply a different network architecture if they want to, and if so how must it be defined? How might the choice of network architecture affect the results?

We now clarify in the “Fitting with a neural network“ tutorial that the parameters to the neural network architecture can indeed be changed, and that there is in principle an optimal selection of network hyperparameters. The reader is then pointed to the “Optimizing fits“ notebook, which explores this idea further.

I find the discussion of Optimisation of the fitting a bit difficult to follow, because some key details are missing. What is being optimised for - reconstruction accuracy, compression ratio, training/validation loss, … ? What is the algorithm being deployed for optimisation? Without this information, it is hard to tell when a user should try optimisation and what they should hope to get out of it.

We clarify in the Optimizing fits tutorial: "This 'optimization' process refers to how well the original data can be reconstructed given a maximum size of the compression model."
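Schematically, this boils down to a small hyperparameter grid search scored by reconstruction error (a generic sketch; fit_and_score below is a stand-in, not a cortecs function):

import itertools

def fit_and_score(params):
    # stand-in: fit the compression model with these hyperparameters and return the
    # mean squared error of the reconstructed opacities; a size constraint on the
    # compressed representation could be checked here as well
    return 0.0  # placeholder

grid = {
    "n_layers": [2, 3, 4],     # illustrative hyperparameter values
    "n_neurons": [8, 16, 32],
}

best_params, best_mse = None, float("inf")
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    mse = fit_and_score(params)
    if mse < best_mse:
        best_params, best_mse = params, mse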

ivastar commented 8 months ago

@arjunsavel Wow! This has to be a record-breaking turn around of a review.

@pscicluna and @doriannblain please review the responses and follow up. If your comments have been satisfactorily answered, please approve the submission.

ivastar commented 8 months ago

@editorialbot generate pdf

editorialbot commented 8 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

doriannblain commented 8 months ago

Hi all, it was a quick update indeed! Here are my comments on the new version.

General comment

I appreciate the changes made by the authors. I think the paper and the documentation are much clearer and more rigorous now. The addition of the calc_metrics function and related discussion is commendable and very useful, as is the saving function. The code is easier to use now, and the tutorial runs more smoothly, thanks to the additional error handling implemented. I find the extended discussion on the workflow, the use of the word "compression", and the various changes in the paper satisfying. I think that the addition of the comparison and the more precise performance discussion makes the paper satisfy the "state of the field" criterion.

However, I still have two major issues with cortecs.

The first one is that I still cannot reproduce the factor 13 compression claimed in the paper, with PCA compression and nc = 3. However, I did not try with the referenced opacities (which, as the authors state, have 18 pressure points instead of 13). Could the authors provide a link to the opacities used? In the meantime, this is what I obtained following the "Quickstart" example (data sizes as reported by my OS):

The second one is that I find it very hard for most users to make the code work for any opacity format. I tried to test with petitRADTRANS opacities and had to perform non-trivial operations to get the Fitter object to work. I think this could be made much easier with little coding effort. Even with the already implemented opacities, the loading process is not explained in the documentation, except for the "platon" one in the "Quickstart" section, and I had to read the source code.

For the pRT opacities, this is the code I needed, I added suggestions and questions in comments with # ➡️:

import numpy as np

from cortecs.opac.io import loader_base
# (Opac, Fitter, and Evaluator are imported as in the "Quickstart" example)

# Change hardcoded attributes of the object
# ➡️ These could be Opac.__init__ optional arguments instead
loader_base.wl_key = "bin_edges"
loader_base.T_key = "t"
loader_base.P_key = "p"
loader_base.cross_section_key = "xsecarr"

# Loading "chimera" to use the loader_base object
# ➡️ This could be more intuitive (loader string and class name could match)
opac_obj = Opac("12C-16O__HITRAN.R1e6_0.3-28mu.xsec.petitRADTRANS.h5", loader="chimera")

# Select 1000 arbitrary wavelengths to avoid long computation time
opac_obj.wl = opac_obj.wl[1678861:1679861]  # ~ 5.2 µm, arbitrary choice
opac_obj.cross_section = opac_obj.cross_section[:, :, 1678861:1679861]

# Invert wavelength axis because the wavenumbers in the h5 file are in increasing order
# ➡️ This could be automatically detected and taken care of
opac_obj.wl = opac_obj.wl[::-1]  # ➡️ the wavelengths are assumed to be in cm-1 and converted to um. The source file units should be an Opac optional argument and a comment should be added next to this conversion
opac_obj.cross_section = opac_obj.cross_section[:, :, ::-1]  # reverse along the wavelength (last) axis

# Move T and P axis to match the cortecs expected axis ordering
# ➡️ This could be optional Opac.__init__ arguments 
#    (e.g., temperature_axis=0, pressure_axis=1, wavelength_axis=2 by default, 
#     and move the axis accordingly if they are changed)
opac_obj.cross_section = np.moveaxis(opac_obj.cross_section, 0, 1)

# Convert to log-space to prevent negative results when using eval
# ➡️ Is this behaviour expected? I think it should be prevented
opac_obj.cross_section = np.log10(opac_obj.cross_section)

# Fit the opacities and instantiate the evaluator
fitter = Fitter(opac_obj, wav_ind=-2, nc=3)
fitter.fit()
evaluator = Evaluator(opac_obj, fitter)

# Test to obtain one value
temperature = 364.9409723  # ➡️ the T grid does not use round values, this is why fetching for the nearest value in the grid is important
pressure = 100.0
wavelength = 5.222e-06

evaluator.eval(temperature, pressure, wavelength)

# Save
fitter.save("test_pca_prt")

I also have other minor remarks, listed below.

Code- and documentation-related comments

Paper-related comments

Answers

Setting these values too low could break the fits. Perhaps a better solution would be to mask out these points in the fitting process. -104 is set as a boundary for standard opacity functions (e.g., as used by the PLATON code).

Thanks for the clarification. In that case, I suggest adding a comment in the code that this value was chosen to follow the PLATON code.

The pca_coeffs array is shaped (N_WAVE, N_PC, N_PRES), and the vectors array is shaped (N_TEMP, N_PC).

Following the "Quickstart" example, I find instead that pca_coeffs is of shape (N_WAVE, N_PC, N_TEMP) (i.e. 4616, nc+1, 30), while vectors is of shape (N_PRESS, N_PC) (i.e. 13, nc+1). If that is not expected, then I suggest automatizing the fitted axe in favour of the second larger one (to prevent fitting over wavelengths).

We addressed the minor logic weaknesses. The remaining flake8 errors are because lines are too long or from star imports. We prefer to keep line lengths as they are (we’re conforming to the black code formatting style), but we can change to autopep8 if the reviewer desires.

I still get several warnings and weak warnings with code inspection in PyCharm. Also, the flake8 errors left are indeed mostly about line length, but not only: there are also errors related to the use of import *, which is considered bad practice in packages (sec. 6.1, paragraph 8). However, as I said, these are minor issues. Fixing these, as well as following general good practices, would improve the code quality (i.e. ease of maintenance, of debugging, and readability in the long term), but I do not consider them obstacles for publication.

Two of the eval funcs use the actual temperature–pressure values.

Indeed, I apologize. But in that case, you should prevent an error from being raised when the exact temperature/pressure needed is not passed. The isclose function is not sufficient; the code should instead fetch the nearest value. In the future, you could interpolate the opacities in that case instead.

We have included a new “workflow” section that hopefully illustrates this more clearly. Happy to elaborate!

Thanks, I think this is good enough now.

"l.52--55" [...] we have removed this paragraph.

This is indeed the paragraph I was referring to. Thank you.

Would it be helpful if we provided a second table for the Quickstart dataset?

Yes indeed. Maybe at the beginning of "Quickstart" to have an overview of the possibilities right away?

we’re happy to do so if it would make the case for Optimizer stronger.

I think if you plan to discuss Optimizer in this submission, then you should indeed implement it for the polynomial method. Otherwise, this can be the object of a future update.

pscicluna commented 8 months ago

Thanks for your speedy updates! The new additions markedly improve the paper. I do have some more comments, which hopefully will be fairly minor. I am, however, still having issues with the tutorials (see below) and will likely have more comments after that is fixed, as I have not really been able to interact with cortecs as yet.

The description of potential failure modes is very helpful. The authors' suggestion of a figure showing it would be greatly appreciated! Seeing how it fails will help users diagnose their own issues. Adding this to the documentation as well would be great.

The idea behind Figure 1 is definitely what I had in mind. However, I find it a bit difficult to read - the opacity spectra are so rich that it is hard to see any details. It is also unclear from the figure which compression method was used. It might be helpful to enlarge the figure and/or zoom in on narrow wavelength ranges of the spectra. I realise that the residual plot shows that the differences are generally small except in extreme cases, but seeing the details would be much more illustrative of the power of cortecs than the current plot. Ideally, the results of multiple different compression methods would be shown on the same data, to give a sense of the trade-offs between them.

The improved API docs are greatly appreciated - however, it appears that these changes have not yet synced to RTD. I assume this is because no new version has been tagged yet, in which case it's not a big deal.

I still find the description of some details a bit lacking. For example, the description of the optimisation is incomplete. I see from the code that it is performing a grid search to find the hyperparameters with the lowest MSE, but it would be helpful to have this described in the documentation. It's also not clear to me from the code how the maximum size of the compressed data is considered in the optimisation - the code looks to me like it is just performing a grid search over the hyperparameters, and not considering the size of the compressed data.

I noticed that the package requires that devs update the version number manually. This can be easy to forget, and result in a mismatch between tags and the version number in the code. It may be worth considering using setuptools-git-versioning and making the version a dynamic value in pyproject.toml. This ensures that the git tag (or even commit hash) is the ultimate arbiter of version numbers, and can help prevent mismatches. Since this would also involve updating the package from setup.py to pyproject.toml, it may be worth either waiting until the next major release, or trying to use python-versioneer instead of setuptools-git-versioning.

I'm still facing issues with loading the data for the quickstart tutorial. This may be related to my environment, but I can't be sure. Running on Colab with numpy==1.23.5 I receive the following error:

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
    440             try:
--> 441                 return pickle.load(fid, **pickle_kwargs)
    442             except Exception as e:

UnpicklingError: invalid load key, '{'.

The above exception was the direct cause of the following exception:

UnpicklingError                           Traceback (most recent call last)
3 frames
<ipython-input-5-98289f3fc416> in <cell line: 1>()
----> 1 opac_obj = Opac(cross_sec_filename, loader="platon", load_kwargs=load_kwargs)
      2 
      3 opac_obj

/usr/local/lib/python3.10/dist-packages/cortecs/opac/opac.py in __init__(self, filename, loader, load_kwargs)
     46         self.filename = filename
     47         self.load_obj = self._get_loader(loader)
---> 48         self.wl, self.T, self.P, self.cross_section = self.load_obj.load(
     49             filename, **load_kwargs
     50         )

/usr/local/lib/python3.10/dist-packages/cortecs/opac/io.py in load(self, cross_sec_filename, T_filename, P_filename, wl_filename)
    139         # temperatures are in K.
    140         # pressures are in Pa, I believe.
--> 141         wl = np.load(wl_filename, allow_pickle=True)
    142         T = np.load(T_filename, allow_pickle=True)
    143         P = np.load(P_filename, allow_pickle=True)

/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
    441                 return pickle.load(fid, **pickle_kwargs)
    442             except Exception as e:
--> 443                 raise pickle.UnpicklingError(
    444                     f"Failed to interpret file {file!r} as a pickle") from e
    445 

UnpicklingError: Failed to interpret file 'wavelengths.npy' as a pickle

However, running on Windows with python==3.11.7 and numpy==1.26.3 I get a different one:

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
File ...anaconda3\envs\cortecs_review\Lib\site-packages\numpy\lib\npyio.py:465, in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
    464 try:
--> 465     return pickle.load(fid, **pickle_kwargs)
    466 except Exception as e:

UnpicklingError: invalid load key, '\x0a'.

The above exception was the direct cause of the following exception:

UnpicklingError                           Traceback (most recent call last)
Cell In[7], line 1
----> 1 opac_obj = Opac(cross_sec_filename, loader="platon", load_kwargs=load_kwargs)
      3 opac_obj

File ...anaconda3\envs\cortecs_review\Lib\site-packages\cortecs\opac\opac.py:48, in Opac.__init__(self, filename, loader, load_kwargs)
     46 self.filename = filename
     47 self.load_obj = self._get_loader(loader)
---> 48 self.wl, self.T, self.P, self.cross_section = self.load_obj.load(
     49     filename, **load_kwargs
     50 )
     52 self.n_wav, self.n_t, self.n_p = len(self.wl), len(self.T), len(self.P)
     54 return

File ...anaconda3\envs\cortecs_review\Lib\site-packages\cortecs\opac\io.py:141, in loader_platon.load(self, cross_sec_filename, T_filename, P_filename, wl_filename)
    120 """
    121 loads in opacity data that's built for PLATON. To be passed on to Opac object.
    122 
   (...)
    136     name of wavelength file
    137 """
    138 # todo: check wl units. They're in meters here.
    139 # temperatures are in K.
    140 # pressures are in Pa, I believe.
--> 141 wl = np.load(wl_filename, allow_pickle=True)
    142 T = np.load(T_filename, allow_pickle=True)
    143 P = np.load(P_filename, allow_pickle=True)

File ...anaconda3\envs\cortecs_review\Lib\site-packages\numpy\lib\npyio.py:467, in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
    465     return pickle.load(fid, **pickle_kwargs)
    466 except Exception as e:
--> 467     raise pickle.UnpicklingError(
    468         f"Failed to interpret file {file!r} as a pickle") from e

UnpicklingError: Failed to interpret file 'wavelengths.npy' as a pickle

As you can see, there are strong similarities between the two errors, but the details are slightly different. However, I wonder if the larger problem is that pickle is not really intended to be completely portable? It might be better to use a more portable format, perhaps HDF5, to package the example data. For all its faults, even FITS would guarantee that the data would be widely portable.
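For example, something along these lines would keep the example data in a single portable file (just a sketch; the temperature/pressure file names and the dataset names are placeholders):

import h5py
import numpy as np

# read the current example inputs (wavelengths.npy is from the quickstart;
# the other two file names are placeholders)
wl = np.load("wavelengths.npy", allow_pickle=True)
T = np.load("temperatures.npy", allow_pickle=True)
P = np.load("pressures.npy", allow_pickle=True)

# repackage everything into one platform-independent HDF5 file
with h5py.File("quickstart_inputs.h5", "w") as f:
    f.create_dataset("wl", data=wl)
    f.create_dataset("T", data=T)
    f.create_dataset("P", data=P)

# reading back works the same way on any platform
with h5py.File("quickstart_inputs.h5", "r") as f:
    wl = f["wl"][:]
    T = f["T"][:]
    P = f["P"][:]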

Minor comments:

There are some small typesetting issues in the caption of Figure 1. It looks like some of the math-mode delimiters are missing, and as a result the exponents are not being rendered correctly.

Line 112/113: "These opacity data (cite them) were originally stored" (my emphasis) - please insert the citation here. Or, if it is the same data mentioned in the previous paragraph, make that clear.

ivastar commented 7 months ago

@arjunsavel please review the second round of comments. It would be especially useful if you can help @pscicluna diagnose the error so he can actually test out the code. It seems like he is testing on Colab, which should be relatively easy to diagnose.

pscicluna commented 7 months ago

@arjunsavel if it would facilitate a faster feedback loop, I can open an issue on the cortecs repo if you prefer. Or we can take the discussion offline.

arjunsavel commented 7 months ago

Thanks all! I'm cleaning up my next round of responses now.

@pscicluna, I've made a new branch over at https://github.com/arjunsavel/cortecs/tree/diff_filetype. I've output some of the quickstart data into text files there for you to check out. Please let me know if that works for you — we can definitely iterate in an issue in the cortecs repo. Thanks again for following up on this!

arjunsavel commented 7 months ago

also, @ivastar I'm thinking of submitting this paper as an arxiv preprint sometime soon — should I just ping the editorialbot?

arjunsavel commented 7 months ago

@editorialbot commands

editorialbot commented 7 months ago

Hello @arjunsavel, here are the things you can ask me to do:


# List all available commands
@editorialbot commands

# Get a list of all editors's GitHub handles
@editorialbot list editors

# Check the references of the paper for missing DOIs
@editorialbot check references

# Perform checks on the repository
@editorialbot check repository

# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist

# Set a value for branch
@editorialbot set joss-paper as branch

# Generates the pdf paper
@editorialbot generate pdf

# Generates a LaTeX preprint file
@editorialbot generate preprint

# Get a link to the complete list of reviewers
@editorialbot list reviewers
arjunsavel commented 7 months ago

@editorialbot generate preprint

editorialbot commented 7 months ago

:page_facing_up: Preprint file created: Find it here in the Artifacts list :page_facing_up:

arjunsavel commented 7 months ago

@editorialbot generate preprint

editorialbot commented 7 months ago

:page_facing_up: Preprint file created: Find it here in the Artifacts list :page_facing_up:

arjunsavel commented 7 months ago

Thank you again to the reviewers for such informative and thoughtful responses! We’ve addressed them below.

Also, we’ve added back the statement that the project is available on conda, because it is now conda-installable: https://github.com/conda-forge/cortecs-feedstock

Response to @doriannblain

The first one is that I still cannot reproduce the factor 13 compression claimed in the paper, with PCA compression and nc = 3. I however did not try with the referenced opacities (which, as the authors state, have 18 pressure points instead of 13). Could the authors provide a link to the opacities used?

Yes, the opacities are located here: https://www.dropbox.com/sh/0cxfolfmrs8ip37/AACV3dp5X3tJ-YpXaxkPtWE6a/RETRIEVAL/ABSCOEFF?dl=0&subfolder_nav_tracking=1

I think a more "theoretical" explanation of how the compression factor is achieved is necessary. For example, something along the lines of "instead of having n pressure points, we now have (n/~13 =) 3 (?) PCA coefficients [...]".

We include a statement to this effect at the beginning of the quickstart notebook now: “Compressing with PCA essentially allows us to swap an NTEMP x NPRESSURE x NWAV array for an NTEMP x NCOMPONENT x NWAV array, achieving approximately NPRESSURE/NCOMPONENT compression.”
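For concreteness, the bookkeeping behind that statement is just a shape swap (a toy illustration with grid sizes taken from the quickstart discussion, not the actual cortecs internals):

import numpy as np

n_temp, n_pressure, n_wav = 30, 13, 4616   # placeholder grid sizes
n_component = 3

full = np.zeros((n_temp, n_pressure, n_wav))          # original opacity grid
compressed = np.zeros((n_temp, n_component, n_wav))   # PCA coefficients

print(full.size / compressed.size)   # = n_pressure / n_component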

For the pRT opacities, this is the code I needed, I added suggestions and questions in comments with # :

Thank you so much for trying this out on a new opacity format! This is a very instructive example. I've added these as io options. I could also add a tutorial showing how to add a new opacity source, which would expose some more functionality to users.

I thank the authors for making the tensorflow and keras modules optional. However, these modules are still imported with cortecs.fit.fit (required to use the Fitter object), whether or not they have been installed. I suggest using a try/except ModuleNotFoundError block in the cortecs.fit.fit_neural_net submodule, or adding intermediate functions in the Fitter object that would import the required submodule functions as needed, e.g.:

We have wrapped this as suggested.
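For reference, the wrapping follows the standard optional-dependency pattern, roughly like this (a sketch, not the exact cortecs module layout):

try:
    import tensorflow as tf
    from tensorflow import keras
except ModuleNotFoundError:
    tf = None
    keras = None

def fit_neural_net(*args, **kwargs):
    if tf is None:
        raise ImportError(
            "tensorflow/keras are required for neural-network fits; "
            "install them to use this fitter."
        )
    # ... build and train the network here ...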

I appreciate the changes made in the "Test the installation" section. However, python3 -m unittest discover src still does not run any test. I see that there are tests implemented, but this is what I obtain:

Ah, I misunderstood — thank you for clarifying. I’ve updated the tests command (python -m unittest discover src/cortecs/tests), and that should work now.

I think the fact that the fitted T-P dependency of an opacity at the chosen wavelength (wav_ind) is assumed for all the opacity's wavelengths should be highlighted even more. I suggest adding a step where you explain how and why you chose wav_ind=-2.

We have highlighted as such: “Again, the temperature--pressure dependence at all wavelengths is fitted as a linear combination of the eigenvectors calculated at wav_ind.”

Related to the above, I think your optimize_pca function can find the best (or a "good enough") wavelength to perform the fit. This is very useful, so maybe this should be highlighted here as well.

We have highlighted as such: “See the optimizing fits notebook for more information on how to choose the best wav_ind.”

I noted several inconsistencies between the text and the code (e.g. "We pass nc=3", then in cell 7, nc=5). This should be fixed.

We have fixed these inconsistencies: “We pass nc=5” and “wav_ind=-2”.

I appreciate the extended discussion on nc, but I think it should be extended even more. What happens if one increase/decrease nc? What are the benefits/trade-offs? This should be discussed.

We add the following phrase: “Increasing nc can provide a better fit, but doing so will take up more RAM. Decreasing nc may yield a worse opacity fit, but it will consume less memory.”

"This is the wavelength at which the eigenvectors are found". Do you mean "calculated" instead of "found"?

Yes, we change to “calculated.”

It is not clear in the example what the eval_pca arguments are and what they do. You could spell them out in the cell and describe them in the text.

Done.

Currently, the save function saves the "fitting parameters" in 2 separate npy files. What about saving them in one npy file, or one npz file with numpy.savez_compressed, or even in a HDF file, since you have h5py as requirement?

Done! Now using savez_compressed.
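The round trip now looks roughly like this (placeholder arrays; the key names follow the quickstart discussion but are illustrative):

import numpy as np

pca_coeffs = np.random.rand(4616, 3, 30)   # placeholder fit products
vectors = np.random.rand(13, 3)

np.savez_compressed("test_pca.npz", pca_coeffs=pca_coeffs, vectors=vectors)

# both arrays come back from the single compressed file
with np.load("test_pca.npz") as data:
    pca_coeffs, vectors = data["pca_coeffs"], data["vectors"]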

I think the single point error evaluation (cells 14--18) and related comments could be removed entirely to only show the results from the calc_metrics function, which provide a more global view of the error.

Done.

The accuracy map in cell 6 that I obtain is different from the one in the documentation. I suspect it has not been updated.

We’ve rerun this and enforced seeds for reproducibility. Thank you!

The function save_neural_network does not exist. It should be save_neural_net (although in my opinion, the first name is the better one).

We’ve changed to “save_neural_net” for now (we will change this in the future).

The "Fitting with polynomials" example has not been updated, as I obtain a different value at the end of cell 31.

We’ve fixed this.

Same with the "Fitting with polynomials" example. Also, the all_wl argument does not exist.

We removed the “all_wl” argument.

I am glad that this is implemented. However, I think there should be no sampling of T and P by default. In the "Quickstart" example, sampling can be used, but it should be made explicit.

Done!

Also, the function requires the attribute species_weight, which is not in e.g. loader_base.

Yes, the function assumes that the opac object has been loaded in with the species_weight attribute assigned. I’ve added a try/except clause in the calc_metrics function. Thoughts?

l. 35: you could mention that in many cases, not all wavelengths need to be loaded.

We have added: “In many cases, not all wavelengths need to be loaded, e.g. if the user is down-sampling resolution.”

l. 49: I suggest rephrasing the sentence to make it clearer that "reasonable" comes from the benchmark, e.g. "Additionally, our benchmark only shows that the amount of error from our compression technique is reasonable in the spectra of exoplanet atmospheres at pressures greater than a microbar for a single composition".

Done.

l. 56 and 57: you should add a reference to support this claim.

Done (and relaxed to “often modeled in hydrostatic equilibrium”).

l. 63: I acknowledge that this is a hypothetical statement, but are you aware of an opacity with a sharp temperature-pressure dependence? You could add it as an example.

We’ve added a nod to the temperature-sensitive Lyman-alpha transitions.

I find the transparent lines hardly visible. Maybe zoom-in around 1.8 micron?

We have zoomed accordingly.

The units on the top panel and the units of the corresponding residuals on the bottom panel are not the same. Why?

Unit error. Thanks for catching!

Fix the "$" left in the captions.

Done.

About the 1e27 m column, I agree that you need an unrealistically long column, but are you sure of this value?

I did an order of magnitude estimate — should I link the calculation in the paper? I solved ds = tau / alpha, where tau = 1 for the optically thick scenario, alpha = kappa * (mass density), the mass density is derived from the ideal gas law, and kappa is set to 1e-33 cm^2/g (equivalent to a cross-section of sigma ~ 5e-60 m^2). I also changed my value to 1e34; my earlier calculation was expressed in Jupiter radii.
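For reference, the relation written out (just the back-of-the-envelope expression described above, not the paper's exact derivation):

$$
\mathrm{d}s = \frac{\tau}{\alpha} = \frac{\tau}{\kappa\,\rho}, \qquad \rho = \frac{\mu P}{k_B T},
$$

with $\tau = 1$ for the optically thick case and $\kappa = 10^{-33}\,\mathrm{cm^2\,g^{-1}}$.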

l. 112: as said above, I still cannot get the 13 compression factor. There is also a residual draft comment ("cite them") left.

Removed the comment.

l. 126: could you be more precise? An order of magnitude (e.g. a few minutes of differences for a run of several hours) would be enough.

We have clarified: “(about 5 hours of extra runtime on a roughly 2 day-long calculation)”

Fig. 2: I thank the authors for fixing the gold and teal labels, but now they read identical values, while it is clear that the posteriors have some differences.

Copied the wrong labels — thanks for catching this!

Table 1: I thank the authors for adding this table, but the decompression time for PCA (230 s) does not seem to match the value given at l. 120 (789 ns). Maybe in the first case the time is to obtain the opacities at all wavelengths, while in the other it is at one wavelength only? This should be mentioned.

Thank you for catching this — you are correct. We have clarified that the ns estimate is “Accessing/evaluating a single opacity”. And for the table, “The timing is over the entire wavelength range.”

Thanks for the clarification. In that case, I suggest adding a comment in the code that this value was chosen to follow the PLATON code.

The comment now reads: “# set a floor to the opacities, as per the floor used in the PLATON code's opacity files.”

Following the "Quickstart" example, I find instead that pca_coeffs is of shape (N_WAVE, N_PC, N_TEMP) (i.e. 4616, nc+1, 30), while vectors is of shape (N_PRESS, N_PC) (i.e. 13, nc+1). If that is not expected, then I suggest automatically selecting the fitted axis as the second-largest one (to prevent fitting over wavelengths).

Ah, great catch — thank you! We’ve automated this now.

Indeed, I apologize. But in that case, you should prevent an error from being raised when the exact temperature/pressure needed is not passed. The function isclose is not sufficient; the code should instead fetch the nearest value. In the future, you could interpolate the opacities in that case instead.

Great point. We now eschew isclose for finding the nearest value.
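Concretely, the lookup is now a nearest-index search along the grid, something like this (placeholder values, not the literal cortecs code):

import numpy as np

T_grid = np.array([500.0, 1000.0, 1500.0, 2000.0])   # placeholder temperature grid
T_query = 1234.0

idx = np.argmin(np.abs(T_grid - T_query))
T_nearest = T_grid[idx]   # -> 1000.0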

Yes indeed. Maybe at the beginning of "Quickstart" to have an overview of the possibilities right away?

Done!

I think if you plan to discuss the Optimizer in this submission, then you should indeed implement it for the polynomial method. Otherwise, this can be the object of a future update.

Sure, we can have it be the object of a future update. It’s not listed in the paper, I believe.

Response to @pscicluna

The description of potential failure modes is very helpful. The authors' suggestion of a figure showing it would be greatly appreciated! Seeing how it fails will help users diagnose their own issues. Adding this to the documentation as well would be great.

We have added a new Fig. 1 showing the optical depth and e^-tau as a function of grid size.

The idea behind Figure 1 is definitely what I had in mind. However, I find it a bit difficult to read - the opacity spectra are so rich that it is hard to see any details. It is also unclear from the figure which compression method was used. It might be helpful to enlarge the figure and/or zoom in on narrow wavelength ranges of the spectra. I realise that the residual plot shows that the differences are generally small except in extreme cases, but seeing the details would be much more illustrative of the power of cortecs than the current plot is. Ideally, the results of multiple different compression methods would be shown on the same data, to give a sense of the trade-offs between them.

I’ve zoomed in on this region per the other reviewer’s suggestion. I’ve also made the line widths thinner and made the plot taller. Is this helpful?

The improved API docs are greatly appreciated - however, it appears that these changes have not yet synced to RTD. I assume this is because no new version has been tagged yet, in which case it's not a big deal.

Thank you for catching this! We hadn’t included matplotlib in a few doc requirements, so a few modules could not be loaded. We have changed this, and the API should now reflect this accordingly.

I still find the description of some details a bit lacking. For example, the description of the optimisation is incomplete. I see from the code that it performs a grid search to find the hyperparameters with the lowest MSE, but it would be helpful to have this described in the documentation. It's also not clear to me from the code how the maximum size of the compressed data is considered in the optimisation - to me, the code looks like it is just performing a grid search over the hyperparameters and not considering the size of the compressed data.

I’ve added this to the documentation: “This process essentially performs a grid search to find the hyperparameters with the lowest mean-squared error.” I’ve also removed the size of the compressed data as a consideration, as this was left over from an earlier version of the code.
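For context, the optimization is conceptually just this kind of loop (a toy sketch, not the actual Optimizer code; the fit-and-evaluate function here is a stand-in):

import itertools
import numpy as np

def mse(truth, approx):
    return np.mean((truth - approx) ** 2)

def fit_and_evaluate(data, nc, wav_ind):
    # stand-in for "fit with these hyperparameters, return the reconstruction"
    noise = np.random.default_rng(nc).normal(scale=1e-3, size=data.shape)
    return data + noise

data = np.random.rand(30, 13, 100)            # placeholder opacity grid
grid = {"nc": [2, 3, 5], "wav_ind": [-2, 50, 99]}

best_params, best_err = None, np.inf
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    err = mse(data, fit_and_evaluate(data, **params))
    if err < best_err:
        best_params, best_err = params, err

print(best_params, best_err)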

I noticed that the package requires that devs update the version number manually. This can be easy to forget, and result in a mismatch between tags and the version number in the code. It may be worth considering using setuptools-git-versioning and making the version a dynamic value in pyproject.toml. This ensures that the git tag (or even commit hash) is the ultimate arbiter of version numbers, and can help prevent mismatches. Since this would also involve updating the package from setup.py to pyproject.toml, it may be worth either waiting until the next major release, or trying to use python-versioneer instead of setuptools-git-versioning.

This is a great point. We’ll defer this until our next major release, so that we don’t have to greatly update the packaging at this point.

I'm still facing issues with loading the data for the quickstart tutorial. This may be related to my environment, but I can't be sure. Running on Colab with numpy==1.23.5 I receive the following error:

Interesting! Thank you for investigating this. These are the PLATON files, so I don’t want to go too far into redoing the format entirely. That being said, I have .txt files in a new branch: https://github.com/arjunsavel/cortecs/tree/diff_filetype

ivastar commented 7 months ago

@arjunsavel have you been able to reproduce the issue @pscicluna is having in Colab? That should be fairly easy to reproduce, no? What versions of python/numpy are you using? Serializing files as pkl can be problematic across platforms and is generally not recommended. Since your code depends on pkl files, I think it is critical that this issue gets resolved. Does the diff_filetype branch actually use the txt files, or does it just contain them? The io.py code around the lines of the error looks unchanged.

doriannblain commented 7 months ago

Hi all, here is my review on the new version.

General comment

I once again appreciate most of the changes made by the author. The compression factor is now clearly explained, and the documentation is now clearer and cleaner. The way to launch the implemented automated tests is now correctly indicated in the documentation. I have not tested the optional tensorflow fix, but looking at the code I think it should work as expected. Having outputs in a single file now makes the outputs easier to manage.

I still have some gripes with the code behaviour, but I think fixing these is not required for this submission. The code works (mostly, see below) as expected if one limits themselves to the paper's use case and the provided examples. Making it fully work in more cases (including at least for the already implemented opacities) is very important but can be the subject of future updates.

I thank the author for providing the opacities they used in their paper; however, I still could not manage to reach the 13 compression factor reported: I obtained a factor 10 instead (which is still respectable). I might have done something wrong though. Below is the code I used.

import numpy as np
import sys
import os

sys.path.insert(0, os.path.abspath("../../src"))

import cortecs
from cortecs.opac.opac import *
from cortecs.fit.fit import *
from cortecs.fit.fit_pca import *
from cortecs.eval.eval import *
from cortecs.fit.metrics import *

opac_obj = Opac('xsecarrCO_HITEMP_HELIOS_samp_3800_10000_R500000.h5', loader="chimera")
fitter = Fitter(opac_obj, wav_ind=-2, nc=3)
fitter.fit()
fitter.save("test_pca_13_compression_factor")

On my OS, xsecarrCO_HITEMP_HELIOS_samp_3800_10000_R500000.h5 is reported to be 2.005 GB, consistent with the paper, but test_pca_13_compression_factor.npz is 194.309 MB, not 154 MB.

If I load test_pca_13_compression_factor.npz, I obtain a shape of (373260, 4, 18) for pca_coeffs, and (39, 4) for vectors. I also tried with nc=2 and obtained NCOMPONENT (+1) = 3, but this time I find a PCA file of 143.090 MB (so a compression factor of 14). This discrepancy is my only remaining major issue, but I think it can be solved quickly.

I have also other minor remarks listed below.

Code- and documentation-related comments

Paper-related comments

Answers

I could also add a tutorial showing how to add a new opacity source, which would expose some more functionality to users.

I think this would be great, and that you should definitely do it at some point. However, I do not think this is necessary for this submission.

We have highlighted as such: “See the optimizing fits notebook for more information on how to choose the best wav_ind.”

As mentioned above, the notebook does not talk about PCA optimizing, which is problematic. Otherwise, I think this part is clear enough.

We have added: “In many cases, not all wavelengths need to be loaded, e.g. if the user is down-sampling resolution.”

I was thinking more about wavelength selection during loading (this is possible with HDF5 without RAM overhead, e.g. h5py_file['xsec'][:, :, w_id_min:w_id_max]), but that's fine too.

I did an order of magnitude estimate — should I link the calculation in the paper?

That would be preferable. It is always good not to have to work out an equation yourself if you do not already know it. Otherwise, I agree with the 1e35 m value given in the arXiv version of the paper.

arjunsavel commented 7 months ago

Thanks again to @doriannblain for such careful attention to our work! Responses below.

I thank the author for providing the opacities they used in their paper; however, I still could not manage to reach the 13 compression factor reported: I obtained a factor 10 instead (which is still respectable). I might have done something wrong though. Below is the code I used.

Apologies for miscommunicating w.r.t. the number of PCA components. I indeed had NPC=2 for my calculations, and in the previous file format that led to sizes of 161.2 MB. My OS is also reporting the size of xsecarrCO_HITEMP_HELIOS_samp_3800_10000_R500000.h5 as 2.1 GB.

Thanks for adding the table. However, I'm surprised that in your article the Polynomials compression has the lowest deviation, while here it has by far the largest. Does that mean that the accuracy of the different methods heavily depends on the dataset? If so, that must be highlighted.

We have added this in italics: Note that the accuracy of the different methods heavily depends on the dataset.

About NCOMPONENT: if this is supposed to be the number of principal components, should not NCOMPONENT+1 instead of NCOMPONENT be used to represent the shape of pca_coeffs and vectors in the introduction?

Correct, thanks!

Section "Fitting the opacity", it is stated that passing nc=5 will launch a fit with in "3" eigenvectors. I think it should be "5" instead.

Good catch, thanks.

Instead of loading "test_pca.npz" twice, you could use:

Done, thanks for the suggestion!

The optimizing notebook that is linked as a guide for choosing wav_ind for PCA only discusses neural networks. You should at least add a work-in-progress section in that notebook for PCA optimization.

We briefly demonstrate PCA optimization now, as well.

calc_metrics does not work with the "CHIMERA" opacities, as species_weight is still needed to run the function. Instead of using a try/except block, you could use an if block to handle the case where species_weight is not present in the Opac object. Alternatively, species_weight should be an argument common to all loader_base objects (and be set to 1 if not relevant).

Now using an if block!
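The guard is essentially the following (a sketch, not the literal calc_metrics body):

def get_species_weight(opac):
    # fall back to a neutral weight when the loader does not define one
    if hasattr(opac, "species_weight") and opac.species_weight is not None:
        return opac.species_weight
    return 1.0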

I think there are leftover debug cells at the end of the notebook. These should be cleaned up.

Done, thanks!

In the fitting polynomial example, there are also some debugging cells left at the end of the notebook. These should be cleaned.

Done.

In the neural network example, I still get results different from the ones in the documentation, despite the seed. I attached figures of what I obtain.

Interesting. I get results much closer to yours when I run the notebook on my local machine. There’s something interesting deep in the readthedocs OS setup. For now, I’ll mention that there are some inherent discrepancies between operating systems, but that runs on an individual machine are reproducible.

l.35: the default value of load_kwargs is mutable, which can lead to unexpected behaviour because the default value can change during runtime. Instead, you should use None as the default value and then, inside the function, if load_kwargs is None: load_kwargs = {}.

Done, thanks for the suggestion!
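That is, the standard idiom (illustrative signature only; the real __init__ does more):

class Opac:
    def __init__(self, filename, loader="platon", load_kwargs=None):
        if load_kwargs is None:
            load_kwargs = {}   # a fresh dict per call, never shared between calls
        self.filename = filename
        self.load_kwargs = load_kwargs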

load_kwargs could be named with the standard **kwargs instead, which would be even cleaner than the above. It depends on if you see Opac as an extension of loader_base or not though.

I’ve kept this as load_kwargs to differentiate between loader_kwargs, added based on the below suggestion.

While I appreciate the addition of numerous useful attributes to the loader_base object, they cannot be set through an Opac object. There are multiple ways to fix this, but I think the quickest would be to add a loader_kwargs argument to Opac.__init__ and pass it to _get_loader. Otherwise, I think that conceptually Opac should be a child of loader_base, but this would require a fairly significant redesign of the code.

Added loader_kwargs argument.

test_fit.py, l.99: this line is creating a warning during the tests, you could replace it by np.ones((3, 3)) * np.nan.

Done!

I think the documentation really lacks a detailed notebook on how to load the opacities (this could also be a "Quickstart" section as it would be rather small). You could simply list those next to their respective loader key. Right now one has to look into the source code to know that several opacities are (mostly) handled, which is not ideal. But as stated above, this can be the object of future updates.

We’ve added this table in the quickstart.

Fig. 1: I do not think this figure is necessary. The 3 points above detailing the limitations of cortecs are sufficient in my opinion. If you want to keep the figure, I have multiple remarks.

Sounds good — I've removed the figure.

The label of the bottom panel should be log_10(absolute residual).

Changed to 'Absolute residuals of\n' + '$\log_{10}\sigma_{\lambda}$ (m$^2$)' (it’s the difference between the transparent and thick lines in the top plot)

Mention explicitely in the caption that sigma_lambda is the opacity.

Done.

Add the units of sigma_lambda in the caption.

Done!

Mention that you cut-off your opacity at 1e-104, explaining the shape of the residuals in teal and dark red.

Done.

Benchmark, 2nd paragraph: again, I find 194 MB instead of 154 MB. See the response to the referee's first point.

As mentioned above, the notebook does not talk about PCA optimizing, which is problematic. Otherwise, I think this part is clear enough.

We show PCA optimizing now, as well.

That would be preferable. It is always good not to have to work out an equation yourself if you do not already know it. Otherwise, I agree with the 1e35 m value given in the arXiv version of the paper.

We’ve included the equation, and are now more precise about the value.

arjunsavel commented 7 months ago

@ivastar @pscicluna I'm unable to reproduce the Colab error with the pickle files, unfortunately (on numpy==1.23.5, python 3.10.12).

I'm happy to move the quickstart away from pickle files and not use them internally. The issue is that some other tools (like PLATON) package their data in pickle files, which my package then interfaces with. I can chat with their developers to see if there's a workaround here!

doriannblain commented 6 months ago

Hi, is there a way I can see the latest version of the paper? The arXiv preprint is one month old and the JOSS preprint is 2 months old.

Otherwise, I tested the latest version of cortecs with the implemented fixes. I had a handful of issues.

All of these issues are however unrelated to what is reported in the paper, so fixing them is not strictly necessary.

In the "Quickstart" notebook: you still report a 9% error in log10 abundance, but the error printed is 0.055%.

About the compression factor, if you used nc=2, then in the paper it should read: "preserving 2 principal components". Also, please update the file sizes with those obtained with the latest version.

Otherwise, I'm satisfied with the changes. I now think the basic functionalities of the code are sufficiently documented, and the examples provided are enough to use PLATON opacities in a real-world case. The code could easily be much more general and may not handle all edge cases very well (although it does so much better now than in the first version), but it works as advertised.

I now just need to have access to the latest version of the article, so I can tick the last boxes of my checklist.

arjunsavel commented 6 months ago

Thanks for the helpful feedback, @doriannblain! I've addressed them below. I'll also regenerate the paper PDF in the next comment.

The tensorflow optional dependency is not correctly handled. I suggest replacing the first lines of fit_neural_net.py with:

Thanks. We’ve implemented this improved handling of the optional dependency.

The auto-detection of wavenumber sorting when loading an Opac object may not work properly in all cases. I suggest replacing l. 84 of io.py with: if np.all(np.diff(wl) <= 0):. This is still quite fragile, as one can imagine e.g. a faulty opacity file where the wavenumbers/wavelengths are not consistently ordered. A more complex but more robust check would be ideal, but on the other hand this should work in almost all cases already.

Great point. We’ve updated this line as suggested.
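That is, roughly (a sketch; which axis the wavelengths live on is a placeholder here, not the actual io.py layout):

import numpy as np

wl = np.array([3.0e-6, 2.0e-6, 1.0e-6])        # placeholder, stored in descending order
cross_section = np.random.rand(4, 5, wl.size)  # placeholder opacity grid

if np.all(np.diff(wl) <= 0):                   # wavelengths are non-increasing
    wl = wl[::-1]
    cross_section = cross_section[..., ::-1]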

calc_metrics may still not work as expected with non-PLATON opacities, because of the hardcoded conversion val = np.log10(val * species_weight * AMU * 1e-4).

This seems like a bit of a refactor to do correctly, so we’ll punt to later work. We’ve made an issue: https://github.com/arjunsavel/cortecs/issues/32

In the "Quickstart" notebook: you still report a 9% error in log10 abundance, but the error printed is 0.055%.

Done.

About the compression factor, if you used nc=2, then in the paper it should read: "preserving 2 principal components". Also, please update the file sizes with those obtained with the latest version.

Done.

arjunsavel commented 6 months ago

@editorialbot generate pdf

editorialbot commented 6 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

doriannblain commented 6 months ago

Thank you for the fixes, and for the article. I have only 2 comments left on the latter:

Once this is fixed, I will have no comments left and will complete my checklist.

arjunsavel commented 6 months ago

Thanks for your comments! I've addressed these changes. I appreciate all of your thoughtful feedback!

arjunsavel commented 6 months ago

@editorialbot generate pdf

editorialbot commented 6 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

doriannblain commented 6 months ago

Thank you for the update. My last remark is that at l. 126 of the article, 2.1 / 0.143 GB corresponds to a compression factor closer to 15 than 13.

Otherwise, I'm fine with the submission in its current state. I completed my checklist.

I will stay around to see if I can help with pscicluna's issues. Maybe this is an issue with the numpy version? In numpy's (1.26.3) source code, the pickle format in numpy.load is tried as a last resort, after checking if the file is in the .npz or the .npy format. I assume that the opacity files are indeed npy-formatted, and not pickle-formatted. So in pscicluna's case, numpy has somehow failed to identify the npy format. That could be because the npy identification is different in his version. This is strange though, because he has the same version as me and the same OS, and I have no issue. The only difference is that I have Python 3.12.1.

arjunsavel commented 5 months ago

@pscicluna just checking in! Were you able to access the data in the different format? Would the best path forward be to remove the pickle files in the quickstart and warn users that working with PLATON opacities might throw an error?

ivastar commented 5 months ago

Hi @arjunsavel, I was able to reproduce the quick start example by downloading some of the example files here: https://colab.research.google.com/drive/1vkCY_mCNzruTxRbMXjSXSybzChLdHn8N?usp=sharing

Is this the outstanding issue? @pscicluna can we iterate on this so we can complete the review? Maybe you can share the notebook where the issue appears?

ivastar commented 4 months ago

@arjunsavel and @pscicluna, pinging this issue again. I'm not sure if the example I tested above really addresses the issue you encountered. Could you please check and comment? The submission is really close and it would be great if we can wrap it up in the next couple of weeks. Thank you!

ivastar commented 4 months ago

@arjunsavel looks like we are a bit stuck. Can you please confirm that the example I was able to reproduce is the same that the reviewer had an issue with? If not, can you please give me more info on the issue?

arjunsavel commented 4 months ago

@ivastar thanks for taking a look at this! It seems like this is the same issue that @pscicluna ran into.

arjunsavel commented 3 months ago

@ivastar just following up here! What should my next steps be?

arjunsavel commented 2 months ago

@ivastar just checking in!