openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[REVIEW]: quickBayes: An analytical approach to Bayesian loglikelihoods #6230

Open editorialbot opened 8 months ago

editorialbot commented 8 months ago

Submitting author: @AnthonyLim23 (Anthony Lim)
Repository: https://github.com/ISISNeutronMuon/quickBayes
Branch with paper.md (empty if default branch): 82_paper
Version: 1.0.0b18
Editor: @osorensen
Reviewers: @JonasMoss, @mattpitkin, @prashjet
Archive: Pending

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/2439ebe608e0c76d3147e00735dab556"><img src="https://joss.theoj.org/papers/2439ebe608e0c76d3147e00735dab556/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/2439ebe608e0c76d3147e00735dab556/status.svg)](https://joss.theoj.org/papers/2439ebe608e0c76d3147e00735dab556)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@JonasMoss & @mattpitkin & @prashjet, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @osorensen know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Checklists

πŸ“ Checklist for @JonasMoss

πŸ“ Checklist for @mattpitkin

πŸ“ Checklist for @prashjet

editorialbot commented 8 months ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 8 months ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.15 s (676.3 files/s, 69408.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          69           1537           2092           5281
reStructuredText                16            201            316            279
YAML                             8             39             15            220
Markdown                         3             21              0             62
TeX                              1              3              0             40
DOS Batch                        1              8              1             26
make                             1              4              7              9
-------------------------------------------------------------------------------
SUM:                            99           1813           2431           5917
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 8 months ago

Wordcount for paper.md is 666

editorialbot commented 8 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/0921-4526(92)90036-R is OK
- 10.1103/RevModPhys.83.943 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 8 months ago

👉📄 Download article proof 📄 View article proof on GitHub 📄👈

mattpitkin commented 8 months ago

Review checklist for @mattpitkin

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

JonasMoss commented 8 months ago

Review checklist for @JonasMoss

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

mattpitkin commented 8 months ago

Hi @AnthonyLim23, I've just got a few minor things to suggest initially:

@software{quasielasticbayes,
      author = {{Sivia}, D. and {Howells}, S.},
      title = {quasielasticbayes version 0.2.0}
}

@article{bayesPaper,
     title = {Bayesian analysis of quasielastic neutron scattering data},
     journal = {Physica B: Condensed Matter},
     volume = {182},
     number = {4},
     pages = {341-348},
     year = {1992},
     note = {Quasielastic Neutron Scattering},
     issn = {0921-4526},
     doi = {10.1016/0921-4526(92)90036-R},
     url = {https://www.sciencedirect.com/science/article/pii/092145269290036R},
     author = {{Sivia}, D.~S. and {Carlile} C.~J. and {Howells}, W.~S. and {König}, S.},
     abstract = {We consider the analysis of quasielastic neutron scattering data from a Bayesian point-of-view. This enables us to use probability theory to assess how many quasielastic components there is most evidence for in the data, as well as providing an optimal estimate of their parameters. We review the theory briefly, describe an efficient algorithm for its implementation and illustrate its use with both simulated and real data.}
}

@article{bayesReview,
         title = {Bayesian inference in physics},
         author = {{von Toussaint}, Udo},
         journal = {Rev. Mod. Phys.},
         volume = {83},
         issue = {3},
         pages = {943--999},
         year = {2011},
         month = {Sep},
         publisher = {American Physical Society},
         doi = {10.1103/RevModPhys.83.943},
         url = {https://link.aps.org/doi/10.1103/RevModPhys.83.943}
}

@book{bayesBook,
      author = {{Sivia}, D. and {Skilling}, J.},
      edition = {Second},
      publisher = {Oxford University Press},
      title = {Data Analysis A Bayesian Tutorial},
      year = {2006},
      isbn = {9780198568322}
}
mattpitkin commented 8 months ago

In the developer section of the docs, could you please add a sentence or two stating how users/developers can:

2) Report issues or problems with the software
3) Seek support

i.e., just say that users can submit issues in the GitHub repo.

AnthonyLim23 commented 8 months ago

I have updated the PR for the paper with your bib file. I then edited the authors list for the software so the shared names are the same as the paper.

The other comments have been addressed in https://github.com/ISISNeutronMuon/quickBayes/pull/88

mattpitkin commented 8 months ago

@editorialbot generate pdf

editorialbot commented 8 months ago

👉📄 Download article proof 📄 View article proof on GitHub 📄👈

mattpitkin commented 8 months ago

@AnthonyLim23 It looks like I made a typo in my version of the .bib file: {Carlile} C.~J. is missing a comma, so it should be {Carlile}, C.~J.. Also, is there a date and URL that could be used for the Quasielasticbayes v0.2.0 reference?

While I'm at it, I spotted a minor typo: quickBayes is misspelled in the acknowledgements.

AnthonyLim23 commented 7 months ago

For the Quasielasticbayes 0.2.0 reference I don't have a date or URL. It was a programme written in the late '80s/early '90s for internal use by scientists, so it was never properly documented. I can put the GitHub repository that it has since been moved to (and updated to work with Python); what do you think?

osorensen commented 7 months ago

@AnthonyLim23, I think you can put the GitHub link that it has been moved to.

AnthonyLim23 commented 7 months ago

That should be everything updated.

JonasMoss commented 7 months ago

I have now read much of the documentation and can only recommend a full rewrite of it.

Since this package is about Bayesian statistics, I need to know:

To fix these problems it may help to take a look at, e.g., the docs for linear regression in pymc. Here it's clearly stated:

More precise terminology would also help, and I would recommend fewer attempts at intuitive explanations of the statistics.

Here are a couple of more detailed responses regarding the loglikelihood page.

osorensen commented 7 months ago

Thanks a lot for your thorough review of the documentation, @JonasMoss!

mattpitkin commented 7 months ago

I agree with @JonasMoss's comments: the documentation needs considerable work. The introduction should be a clear, succinct summary of what the package is for. The example given is not very clear and, if included (I would actually recommend removing it), needs to be explained in more detail. As stated in the previous comments, this should include: the model used, the model parameters that are being fit, how they are being fit, and an explanation of the input data, e.g., something about choosing two different noise levels to show a particular effect. (As an aside, in the introduction example with the Gaussian profiles, the docs/code state that these have an amplitude of 103, but the plots show a value of about 50; how is the amplitude being defined?)
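For example, if the amplitude is defined as the area under the Gaussian rather than its peak height (just a guess on my part), the numbers would roughly be consistent:

import numpy as np

# Hypothetical interpretation: "amplitude" = area under the Gaussian.
# The peak height is then amplitude / (sigma * sqrt(2*pi)), so an
# amplitude of 103 with a sigma of around 0.8 would peak near 50.
amplitude, sigma = 103.0, 0.8
peak_height = amplitude / (sigma * np.sqrt(2.0 * np.pi))
print(peak_height)  # ~51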

The API documentation also needs major work. Currently, the API docs exist, but following the chain of links shows that most functions/classes are only minimally documented, without any clear explanation of what they do or what the input arguments and outputs are. Having full API documentation is not a requirement for a JOSS paper, so long as there are several usage cases shown in examples that are well described and cover a decent range of the most used package functionality. I would recommend focusing on a few good-quality examples and removing most of the API from the docs unless it points to well-documented material.

AnthonyLim23 commented 7 months ago

In order to reduce the amount of work we all need to do, I want to make sure that we are agreed on what changes need to be made. I hope that is ok. I agree that the documentation needs more work, but I want it to be appropriate for what the package does.

This package is a bit strange as it uses some Bayesian statistics, but not directly. So I want to check how much detail on Bayesian statistics is appropriate.

My understanding is that normally Bayesian inference uses multiple "walkers"/"samples" that are initially distributed across the parameter space according to a prior distribution (usually flat). However, in this package a prior is not defined by the user. Instead, the user defines some initial start parameters, which are then used for a fit (the user can choose the fit engine/method). In that sense there is only a single point for the prior (unless a multistart method is used).

There is currently no posterior distribution produced by the code (I would like to add it someday, but would want to verify the results against DREAM from bumps). At present the code just produces the peak value of the distribution. This result is then used to determine which model is more likely.

The models are defined as fitting functions, which, when called, return an array of the corresponding y values. These are used in the optimisation to locate the local minima. I have included some "special" functions that allow the construction of more complicated models with minimal work (e.g. composite for addition and convolution). The models are then tried within a workflow to determine the most likely model (e.g. is the data linear or quadratic).
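A rough sketch of what I mean (the names here are just illustrative, not the actual quickBayes classes):

import numpy as np

def gaussian(x, amplitude, centre, sigma):
    # a simple fit function: returns the y values for the given parameters
    return amplitude * np.exp(-0.5 * ((x - centre) / sigma) ** 2)

def linear(x, slope, intercept):
    return slope * x + intercept

def composite(x, params):
    # a "composite" model built by adding a linear background and one peak
    slope, intercept, amplitude, centre, sigma = params
    return linear(x, slope, intercept) + gaussian(x, amplitude, centre, sigma)

# a fit engine then minimises the residuals of composite(x, params) against
# the measured (x, y, e) data, starting from the user-supplied start values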

Input data is generally just x, y, e and the user can also include additional x, y, e data that correspond to a resolution for Quasi Elastic Neutron Scattering instruments (e.g. Osiris at ISIS).

Most of the approximations are hidden in the derivation of the peak posterior probability. The big assumption is that it is evaluated at the local minimum of the cost function, which is why we can use fitting to calculate the peak posterior probability.

@JonasMoss, for a real-world example, is it sufficient to just have images and show a line of code reading some data (from the tests) for the input?

I have included the log rules because my colleagues are unlikely to know them and I wanted them to have the information they needed to get from the equation in the book to the one implemented in the code.
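(For reference, the rules in question are just the standard ones:

log(a*b) = log(a) + log(b)
log(a/b) = log(a) - log(b)
log(a^n) = n*log(a)

so the page only restates these before applying them to the equation from the book.)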

@mattpitkin for the API examples would you like to see examples for the individual pieces (e.g. fitting functions, fitting engines) or just a few example workflows/model selection? Most of the API docs are autogenerated. I did try to write the docs in the style of a tutorial, but I assume that those examples are insufficient. What changes would you recommend that I make to those examples to improve the API documentation?

mattpitkin commented 6 months ago

Hi @AnthonyLim23, sorry for the slow response.

@mattpitkin for the API examples would you like to see examples for the individual pieces (e.g. fitting functions, fitting engines) or just a few example workflows/model selection? Most of the API docs are autogenerated. I did try to write the docs in the style of a tutorial, but I assume that those examples are insufficient. What changes would you recommend that I make to those examples to improve the API documentation?

I think it would be good to see, say, two fully worked examples demonstrating a realistic pipeline of usage of the code, including either real or simulated datasets. One could be, say, determining the number of Gaussian-like "lines" in some data, and one could also incorporate some modelling of a background (or whatever you think appropriate). If there's a real analysis that it has been used on where the data is open and you are free to post it in the docs, that would be great, e.g., the QENS example that you have in the paper. You could write these examples in Jupyter notebooks and either just link to them in the docs or get Sphinx to render them, so that you can easily include figures (although this is not a requirement).
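For the simulated case, the dataset itself could be as simple as something like this (purely illustrative, not using the quickBayes API), with the example then asking how many lines the model comparison favours:

import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-10.0, 10.0, 200)

# two Gaussian "lines" on a flat background
signal = (
    5.0
    + 40.0 * np.exp(-0.5 * ((x - 2.0) / 0.7) ** 2)
    + 25.0 * np.exp(-0.5 * ((x + 3.0) / 1.2) ** 2)
)
e = np.full_like(x, 2.0)          # measurement uncertainties
y = signal + rng.normal(0.0, e)   # noisy data for the worked example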

Some other bits of the API do need work though. The fit functions all at least need a short doc string saying what they are and what input parameters they require, otherwise they are not really useful.

In terms of other general comments (and again in relation to @JonasMoss's comments), I think the first paragraph of the paper nicely describes the use case of the code, so that should be echoed in the docs. Essentially, this code is for model comparison, and for each model you are (I think/assume) approximating the marginalised likelihood for that model (also known as the evidence), where you are marginalising over all the unknown model parameters (can you hold any of the parameters at fixed values, or do you have to marginalise over them all?), as opposed to an analytical maximum likelihood approach. Maybe this is done using Laplace's method? You can state this in the docs and paper, but you should also briefly mention other approximation methods such as the AIC, BIC and nested sampling (which is the Monte Carlo method described at the end of the Sivia book). So, when you talk about loglikelihood in the docs I think you really mean marginal likelihood/evidence; please check if this is what you mean and make changes. You should also be explicit that priors on all model parameters are assumed to be flat and infinite in bounds (unless they are not?). This affects how you interpret the model comparison, so it does need to be stated. Is there any way to specifically set the prior ranges, or use different priors (say Gaussian priors, which can also be analytically marginalised over in certain cases)?
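To be concrete about what I mean by the Laplace approximation to the marginal likelihood under flat priors, here is a generic sketch (the standard formula, not the quickBayes API):

import numpy as np

def laplace_log_evidence(log_like_max, covariance, prior_ranges):
    # Gaussian (Laplace) approximation to the log marginal likelihood:
    #   log Z ~ log L_max + (k/2) log(2*pi) + 0.5 log|Sigma| - log(prior volume)
    # where Sigma is the parameter covariance at the best fit and the priors
    # are assumed flat over the given (lower, upper) ranges.
    k = len(prior_ranges)
    log_prior_volume = sum(np.log(hi - lo) for lo, hi in prior_ranges)
    _, log_det = np.linalg.slogdet(covariance)
    return log_like_max + 0.5 * k * np.log(2.0 * np.pi) + 0.5 * log_det - log_prior_volume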

In the docs, it currently states that:

"Fit functions are the mathematical representations of the data"

but I don't really think this is quite what they are. The fit functions are a (potentially approximate) mathematical model for a process/phenomenon that you have collected data to measure. The data will be a combination of that modelled process and some noise (where the noise could just be measurement noise, or a combination of other unmodelled processes). So, you could say something like: "The fit functions are approximate mathematical models of the process that we intend to measure.".

AnthonyLim23 commented 6 months ago

@mattpitkin I have data in the tests for QENS and simulated muon data (number of exp decays). I can do two examples using them.

I will look into the rest of the API docs and make them clearer.

I believe that it is the marginalised likelihood, but I will double check. At present all of the parameters have to be free; I do want to make it easier to fix parameters (the current method is https://github.com/ISISNeutronMuon/quickBayes/blob/main/src/quickBayes/fit_functions/stretch_exp_fixed.py, which is a bit clunky).

In terms of priors, from a user perspective it is purely the parameter range and start values for fit parameters. I believe the flat priors are baked into the assumptions used for deriving the analytical equation. I can touch on it, but the user will never be able to do anything about it.

prashjet commented 6 months ago

Review checklist for @prashjet

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

prashjet commented 6 months ago

Hi there. Thanks for your patience waiting for my review; I've just got round to it after a busy period.

After reading the documentation and paper, I am still quite confused about which calculations this software is doing, and what information those calculations are telling me. I think the comment from @JonasMoss provides an excellent checklist of what should be included in the documentation so that someone can clearly understand what your software does and when to use it. I won't repeat the entire checklist here, but I'll highlight some key points in a response to your last message. You say...

This package is a bit strange as it uses some Bayesian statistics, but not directly. So I want to check how much detail on Bayesian statistics is appropriate.

In principle, I think that black box software for Bayesian statistics may sometimes be useful (i.e. software that provides a solution, without necessarily going into details) but only if you can clearly explain what situations the black-box is suitable for. In the case of quickBayes, to answer this question you'll need to understand Equation 4.20 of the Sivia & Skilling book. This will require:

That last point is a little opaque, so I'll try to clarify it with an example. Currently, your loglikelihood function takes an argument N_peaks. In the code snippet on the loglikelihood page, you set N_peaks=1 for both examples of fit_functions, even though one of them is a model with a single Gaussian and one with two Gaussians. This leaves me asking what the meaning of N_peaks is, and how (if at all?) it relates to the number of peaks I have defined in my fit_function.

I think unravelling Equation 4.20 of Sivia & Skilling is likely to be a difficult task; however, given that this calculation is currently at the heart of your software package, I think it is unavoidable in order for anyone to know if quickBayes is the right tool for them. I think this will involve an entire re-write of the documentation.

p.s. I also had a problem with installation.

osorensen commented 6 months ago

Thanks to @JonasMoss, @prashjet and @mattpitkin for your thorough reviews!

@AnthonyLim23, I understand that a large number of issues need to be addressed here. It is ok if it takes time, but please keep us updated on any progress in this thread.

AnthonyLim23 commented 6 months ago

@osorensen thank you. It will take me a while as I have another project deadline that I need to focus on first.

AnthonyLim23 commented 4 months ago

I have outlined my plan for the new documentation in https://github.com/ISISNeutronMuon/quickBayes/issues/92; I think this should cover all of the comments.

I also have an issue to do the API changes https://github.com/ISISNeutronMuon/quickBayes/issues/93

I have a potential fix for https://github.com/ISISNeutronMuon/quickBayes/issues/89

osorensen commented 3 months ago

👋 @AnthonyLim23, how is it going with the work towards the issues mentioned above?

AnthonyLim23 commented 3 months ago

@osorensen I have written an introduction to Bayes and done a derivation of the main equation, using simplified notation (I have also extended it). I still need to clean it up a bit, and I also need to write the comparison to other methods and the examples. Unfortunately, I have to do this around other things. I am sorry it is taking so long. I hope to get some more done next week (I am happy to stick it in a PR so you can have a look).

I think I have fixed the issue with not being able to install it (https://github.com/ISISNeutronMuon/quickBayes/issues/89); the fix is https://github.com/ISISNeutronMuon/quickBayes/pull/90.

AnthonyLim23 commented 3 months ago

The docs are a work in progress, but what I have so far is here: https://quickbayes--94.org.readthedocs.build/en/94/. The PR is https://github.com/ISISNeutronMuon/quickBayes/pull/94.

I still need to do:

However, it should give a good indication of the new documentation and whether it is suitable.

osorensen commented 3 months ago

Thanks for the update @AnthonyLim23

osorensen commented 1 month ago

Could you please give us a progress update, @AnthonyLim23?

AnthonyLim23 commented 1 month ago

@osorensen I'm sorry this is taking so long; I am having to fit it in around other projects. I believe the only things left to do for the docs are:

I am hoping to get some time this week to do some work on it.