ropensci / software-review

rOpenSci Software Peer Review.
291 stars 104 forks source link

Submission of bssm for Bayesian state space modelling #489

Open helske opened 2 years ago

helske commented 2 years ago

Reviewers: Submitting Author: Jouni Helske (@helske) Other Package Authors: (delete if none) Name (@mvihola) Repository: https://github.com/helske/bssm Version submitted: 2.0.0 Submission type: Stats Badge grade: silver Editor: !--editor-->@bbolker<!--end-editor-- Reviewers: !--reviewers-list-->@kingaa<!--end-reviewers-list--

Due date for @kingaa: 2022-05-27

Archive: TBD Version accepted: TBD Language: en

Package: bssm
Type: Package
Title: Bayesian Inference of Non-Linear and Non-Gaussian State Space
        Models
Version: 2.0.0
Authors@R: 
    c(person(given = "Jouni",
           family = "Helske",
           role = c("aut", "cre"),
           email = "jouni.helske@iki.fi",
           comment = c(ORCID = "0000-0001-7130-793X")),
      person(given = "Matti",
           family = "Vihola",
           role = "aut",
           comment = c(ORCID = "0000-0002-8041-7222")))
Description: Efficient methods for Bayesian inference of state space models 
    via particle Markov chain Monte Carlo (MCMC) and MCMC based on parallel 
    importance sampling type weighted estimators 
    (Vihola, Helske, and Franks, 2020, <doi:10.1111/sjos.12492>). 
    Gaussian, Poisson, binomial, negative binomial, and Gamma
    observation densities and basic stochastic volatility models 
    with linear-Gaussian state dynamics, 
    as well as general non-linear Gaussian models and discretised 
    diffusion models are supported.
License: GPL (>= 2)
Depends: R (>= 3.5.0)
Suggests: 
    covr,
    ggplot2 (>= 2.0.0),
    KFAS (>= 1.2.1),
    knitr (>= 1.11),
    MASS,
    rmarkdown (>= 0.8.1),
    ramcmc,
    sde,
    sitmo,
    testthat
Imports: 
    magrittr,
    checkmate,
    coda (>= 0.18-1),
    diagis,
    dplyr,
    posterior,
    Rcpp (>= 0.12.3),
    rlang,
    tidyr
LinkingTo: ramcmc, Rcpp, RcppArmadillo, sitmo
SystemRequirements: C++11, pandoc (>= 1.12.3, needed for vignettes)
VignetteBuilder: knitr
BugReports: https://github.com/helske/bssm/issues
URL: https://github.com/helske/bssm
ByteCompile: true
Encoding: UTF-8
NeedsCompilation: yes
RoxygenNote: 7.1.2
Roxygen: list(markdown = TRUE, 
  roclets = c("namespace", "rd", "srr::srr_stats_roclet"))

Pre-submission Inquiry

General Information

State space models provide a flexible framework for statistical inference of a broad class of time series and other dynamic data. The bssm package aims to provide easy to use and efficient functions for the Bayesian estimation of commonly used as well as more general user-defined state space models, which are usable in various application areas.

This is the first software to implement the IS-MCMC by Vihola, Helske, and Franks (2020) and first R package to implement delayed acceptance pseudo-marginal MCMC for state space models. The IS-MCMC method is also available in walker package for a limited class of time-varying GLMss (a small subset of the models supported by this package). Some of the functionality for exponential family state space models is also available in KFAS, and those models can be converted easily to bssm format for Bayesian analysis.

Not applicable.

Badging

Silver sounds appropriate.

The bssm complies with a large number of standards both in the general category as well as in Bayesian Software category and their sub-categories. I see the package complying with several Time Series Software standards as well, although many of those standards do not seem to be well suited or applicable to general time series modelling via state space models and/or bssm, so at least for now I have focused on the General and Bayesian standards.

The modelling framework and the implemented algorithms are very general, and since the early versions, the usability and features of the bssm are greatly improved to quite general models and applications (Currently bssm has most of the same features and many more as in the popular KFAS package for state space modelling which has been used in various domains).

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

Code of conduct

ropensci-review-bot commented 2 years ago

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot commented 2 years ago

:rocket:

The following problem was found in your submission template:

:wave:

ropensci-review-bot commented 2 years ago

Checks for bssm (v2.0.0)

git hash: 835eba3a

Package License: GPL (>= 2)


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

:heavy_check_mark: All applicable standards [v0.1.0.007] have been documented in this package (92 complied with; 32 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has: - code in C++ (73% in 43 files) and R (27% in 31 files) - 2 authors - 4 vignettes - 5 internal data files - 9 imported packages - 77 exported functions (median 24 lines of code) - 261 non-exported functions in R (median 7 lines of code) - 291 R functions (median 29 lines of code) --- Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages The following terminology is used: - `loc` = "Lines of Code" - `fn` = "function" - `exp`/`not_exp` = exported / not exported The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile. |measure | value| percentile|noteworthy | |:------------------------|-------:|----------:|:----------| |files_R | 31| 89.1| | |files_src | 43| 98.4| | |files_vignettes | 9| 99.0| | |files_tests | 16| 93.5| | |loc_R | 3992| 93.2| | |loc_src | 10961| 93.8| | |loc_vignettes | 1452| 95.9|TRUE | |loc_tests | 1705| 90.7| | |num_vignettes | 4| 96.0|TRUE | |data_size_total | 1153190| 96.2|TRUE | |data_size_median | 2342| 69.4| | |n_fns_r | 338| 93.3| | |n_fns_r_exported | 77| 92.7| | |n_fns_r_not_exported | 261| 93.3| | |n_fns_src | 291| 98.0|TRUE | |n_fns_per_file_r | 6| 69.2| | |n_fns_per_file_src | 5| 43.4| | |num_params_per_fn | 4| 67.6| | |loc_per_fn_r | 8| 33.9| | |loc_per_fn_r_exp | 24| 55.8| | |loc_per_fn_r_not_exp | 7| 29.9| | |loc_per_fn_src | 29| 86.1| | |rel_whitespace_R | 17| 91.9| | |rel_whitespace_src | 15| 98.3|TRUE | |rel_whitespace_vignettes | 23| 97.5|TRUE | |rel_whitespace_tests | 22| 96.8|TRUE | |doclines_per_fn_exp | 78| 83.4| | |doclines_per_fn_not_exp | 0| 0.0|TRUE | |fn_call_network_size | 1035| 97.9|TRUE | ---

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice and other checks (click to open)

#### 3a. Continuous Integration Badges [![https\:\/\/github](https://github.com/helske/bssm/workflows/R-CMD-check/badge.svg)](https\:\/\/github) **GitHub Workflow Results** |name |conclusion |sha |date | |:-----------|:----------|:------|:----------| |R-CMD-check | |8c52ea |2021-11-25 | --- #### 3b. `goodpractice` results #### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/) R CMD check generated the following note: 1. checking installed package size ... NOTE installed size is 69.1Mb sub-directories of 1Mb or more: data 1.1Mb doc 3.4Mb libs 64.0Mb R CMD check generated the following check_fail: 1. rcmdcheck_reasonable_installed_size #### Test coverage with [covr](https://covr.r-lib.org/) Package coverage: 80.54 #### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp) The following functions have cyclocomplexity >= 15: function | cyclocomplexity --- | --- bsm_ng | 34 bsm_lg | 30 predict.mcmc_output | 30 check_y | 28 run_mcmc.nongaussian | 25 as_bssm | 22 create_regression | 19 run_mcmc.ssm_nlg | 19 run_mcmc.ssm_sde | 19 check_u | 17 run_mcmc.lineargaussian | 16 summary.mcmc_output | 16 #### Static code analyses with [lintr](https://github.com/jimhester/lintr) [lintr](https://github.com/jimhester/lintr) found the following 85 potential issues: message | number of times --- | --- Lines should not be more than 80 characters. | 85


Package Versions

|package |version | |:--------|:---------| |pkgstats |0.0.3.52 | |pkgcheck |0.0.2.149 | |srr |0.0.1.141 |


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

noamross commented 2 years ago

@ropensci-review-bot assign @bbolker as editor

ropensci-review-bot commented 2 years ago

Assigned! @bbolker is now the editor

noamross commented 2 years ago

@ropensci-review-bot seeking reviewers

ropensci-review-bot commented 2 years ago

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/489_status.svg)](https://github.com/ropensci/software-review/issues/489)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

Athene-ai commented 2 years ago

Ok thanks

noamross commented 2 years ago

@ropensci-review-bot assign @kingaa to reviewers

ropensci-review-bot commented 2 years ago

@kingaa added to the reviewers list. Review due date is 2022-05-27. Thanks @kingaa for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 2 years ago

@kingaa: If you haven't done so, please fill this form for us to update our reviewers records.

Athene-ai commented 2 years ago

Hi!

I would like to review this package

maelle commented 2 years ago

@Athene-ai I'd like to kindly remind you not to volunteer in all issues especially as you've already got one review in progress (thank you!). #523

We also tend not to ask the same people to review twice in a row.

Athene-ai commented 2 years ago

right .. so sorry for the mistake ..

kingaa commented 2 years ago

The documentation of standards is admirable.

Examining the source code, it appears to be very carefully written and in a good style. I expect it will be easy to track down bugs and to maintain the code written in this style.

Adopting the user's point of view, I attempted to plunge in. I found it more difficult to do so than I suspect the authors would like. This leads to some suggestions regarding the documentation.

First, it would be good if package?bssm gave documentation on the package as a whole, with orientation toward the major features for the novice user. The help available from library(help=bssm) is mnemonic but not a good introduction.

I attempted to follow the "bssm" vignette. The first example, which begins

data("nhtemp", package = "datasets")
prior <- halfnormal(1, 10)

left me wondering about the "half-normal" prior. The man page on priors is minimally informative. I do not see formulae for the prior densities, nor are there descriptions of their parametrization, nor even plots. The examples there seem mainly to be for automated-checking purposes: they shed little light for the user. Improved documentation would be helpful, as would examples of their actual usage, and some plots.

Continuing with the example, I did

bsm_model <- bsm_lg(y=nhtemp,sd_y=prior,sd_level=prior,sd_slope=prior)
mcmc_bsm <- run_mcmc(bsm_model, iter = 4e4, seed = 1)
summary(mcmc_bsm)
mcmc_bsm
plot(mcmc_bsm)

This did give some information, though the plot method failed. Following the vignette, I was able to plot the approximate posterior densities using

mcmc_bsm |> 
  as.data.frame() |> 
  ggplot(aes(x=value,group=variable))+
  geom_density()+
  facet_wrap(~variable,scales="free")

However, I imagine other experienced R users, like me, would appreciate more informative outputs from the summary and plot methods.

Also, I immediately found myself wanting to know:

I notice that, among the list of standards deemed inapplicable by the package authors are some that speak to these questions, and to the issues with understanding prior distributions which I mentioned before.

I also noticed that this vignette mentions and discusses, but does not demonstrate, nonlinear, non-Gaussian models. Since such models are a major feature of the package, some demonstration would be appreciated.

Turning to the "growth_model" vignette, I was intrigued to see that there are facilities for including snippets of C++ code. However, I was not able to follow the vignette sufficiently well as to be able to reproduce those calculations myself. I would appreciate more detailed, step-by-step instructions on how to compile the snippets shown. (For example, I get errors regarding the unknown namespace arma).

Though there is more exploration I would like to do, I will stop here for now. It is my understanding (though I am happy to be corrected) that this review process is intended to include back-and-forth. Some comments from the authors may help me complete what I hope will be a useful review.

helske commented 2 years ago

Thanks, @kingaa for your helpful comments. I have now updated to the package based on your suggestions, mainly by improving the documentation and adding a new plot method.

I opted to just refer to the R Journal paper and the vignette in ?bssm instead of repeating the material there, mainly because the mathematical formulas are more readable in those. I did however expand this now a bit by noting the main functions of the package (model building functions and run_mcmc), with some additional comments to the Nile example and pointers regarding what to do with the obtained samples from the run_mcmc.

Good point about the prior documentation, I now added bit more details about the definititions priors, although these are fairly standard in terms of the pdfs. I also now note that the prior for the general models (e.g. ones defined via ssm_ulg) are defined as a user-defined R function.

I also added a default plot method for the MCMC output, mimicking the classic density + trace plot style of coda etc. The reason why there aren't many default visualizations in the package is mainly that the exact needs tend to depend on the user and the model, so in the end, in my experience users (at least myself) tend to build their custom plots manually anyway. But this kind of default plot for the hyperparameters does of course make sense. For combined plot of priors and posteriors, it is pretty difficult to construct such a default plot because if the priors are defined via user defined function (in case of say ssm_ulg model), we can only access the joint log-prior density of the model parameters.

Regarding the summary method, I opted to give the typical details of the model in the print method, whereas summary method provides the actual summaries of the model parameters.

For the MCMC diagnostics there are some basic diagnostics available via check_diagnostics function, and the new default plot method provides some graphical hints about the convergence. We tend to refer to the posterior and bayesplot packages regarding these, instead of our own re-implementations.

Regarding the multiple chains, as stated in the NA standards, this is not automatically supported, but the posterior package provides a relatively easy way to combine the samples from multiple runs, as illustrated in ?as_draws_df.mcmc_output.

The non-linear models are indeed not discussed in detail in the main vignette, but in the other vignettes (growth_model and sde_model). I'm not sure what causes the namespace error on your end, as the model seems to compile fine on my computer and on CRAN. These are a bit difficult to debug and to be honest likely a bit difficult to use for those not familiar with the C++ (and perhaps with the Armadillo library).

maurolepore commented 1 year ago

Sorry to jump in. My EiC rotation just started and I'm checking the status of every open issue.

I'm curious about a two things:

@kingaa RE

Though there is more exploration I would like to do, I will stop here for now. It is my understanding (though I am happy to be corrected) that this review process is intended to include back-and-forth.

Typically we ask reviewers to complete their review in one go. The authors address the comments of both reviewers. Then the reviewers either (a) approve the changes or (b) request more changes.

However, before you use more of your valuable time let's see what response I get.

ldecicco-USGS commented 7 months ago

Since there was no response to the last query, I'm going to tag this issue as on "hold". If work resumes, we can update the tags as necessary.

ldecicco-USGS commented 7 months ago

@ropensci-review-bot put on hold

ropensci-review-bot commented 7 months ago

Submission on hold!

ropensci-review-bot commented 4 months ago

@ldecicco-USGS: Please review the holding status