ropensci / software-review

rOpenSci Software Peer Review.
286 stars 104 forks source link

Review of dfms: Efficient Dynamic Factor Models for R? #555

Closed SebKrantz closed 1 year ago

SebKrantz commented 1 year ago

Submitting Author Name: Sebastian Krantz Submitting Author Github Handle: !--author1-->@SebKrantz<!--end-author1-- Other Package Authors Github handles: !--author-others-->@rbagd<!--end-author-others-- Repository: https://github.com/SebKrantz/dfms Submission type: Pre-submission Language: en


Package: dfms
Version: 0.1.1
Title: Dynamic Factor Models
Authors@R: c(person("Sebastian", "Krantz", role = c("aut", "cre"), email = "sebastian.krantz@graduateinstitute.ch"),
             person("Rytis", "Bagdziunas", role = "aut"))
Description: Efficient estimation of Dynamic Factor Models using the Expectation Maximization (EM) algorithm 
  or Two-Step (2S) estimation, on datasets with missing data. The implementation follows advances in the econometric 
  literature: estimation can be done either by running the Kalman Filter and Smoother once with initial values 
  from PCA - following Doz, Giannone and Reichlin (2011) (2S) - or via iterated Kalman Filtering and Smoothing until EM 
  convergence - following Doz, Giannone and Reichlin (2012) - or using the adapted EM algorithm of Banbura and Modugno 
  (2014), allowing estimation with arbitrary patterns of missing data. The implementation makes heavy use of the 
  Armadillo C++ library and the collapse package, providing for particularly speedy estimation. A comprehensive set of 
  methods supports interpretation/visualization of the model and forecasting. Information criteria to choose the number 
  of factors are also provided - following Bai and Ng (2002).
  --- Key References: ---
  Doz, C., Giannone, D., & Reichlin, L. (2011). A two-step estimator for large approximate dynamic 
       factor models based on Kalman filtering. Journal of Econometrics, 164(1), 188-205.
  Doz, C., Giannone, D., & Reichlin, L. (2012). A quasi-maximum likelihood approach for large, approximate 
       dynamic factor models. Review of economics and statistics, 94(4), 1014-1024.
  Banbura, M., & Modugno, M. (2014). Maximum likelihood estimation of factor models on datasets with arbitrary 
       pattern of missing data. Journal of Applied Econometrics, 29(1), 133-160.
URL: https://sebkrantz.github.io/dfms/
BugReports: https://github.com/SebKrantz/dfms/issues
Depends: R (>= 3.0.0)
Imports: Rcpp (>= 1.0.1), collapse (>= 1.8.0)
LinkingTo: Rcpp, RcppArmadillo
Suggests: xts, vars, magrittr, testthat (>= 3.0.0), covr
License: GPL-3
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c ("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
Config/testthat/edition: 3

Scope

Dynamic factor models are a time series modelling and dimensionality reduction technique.

I'm working on this.

Anybody working with time series. The package is useful for dimensionality reduction and forecasting with a large amount of time series.

See README.md. In short: dfms is the much faster, provides multiple estimation methods, and has a comprehensive set of methods for exploring the model and forecasting. It is less specialized than economic nowcasting packages.

No Applicable.

First, I would like to ask if you think you'll be able to review this package in a statistical sense. Then, I will likely not be able to comply with all of your standards, as I intend to export some C++ level helper function (mainly efficient Kalman Filtering and Smoothing functions) without any checks on the inputs. My hope here is in part to provide infrastructure that more specialized software (such as nowcasting packages) can take advantage of. The iterative filtering and smoothing performed in the estimation of dynamic factor models via expectation maximization (EM) algorithms does not square well with R-level checks in those functions (which would be executed many times per second).

adamhsparks commented 1 year ago

Hi @SebKrantz, thanks for your inquiry.

I've checked with some of the other editors. We flexibility around input checking in low-level R or C code intended for export in other packages, but this package appears have a high-level interface and so that interface should be standards-compliant.

Additionally, some of us have had difficulties installing the package as it currently is written. @mpadge had difficulties installing due to "undefined symbol: dpocon_". Error perfectly repeatable in a rocker/tidyverse container. However, I am able to build it on my M1 MacBook with no errors. This should be addressed before we can proceed.

SebKrantz commented 1 year ago

Hi @adamhsparks. Thanks for these initial comments. Let me work a bit further on it still to ensure the package passes your checks as good as possible and we can discuss further about it. What I can definitely offer to do it rigorous checking of inputs at the C++ level in the Kalman Filtering functions. I've not done this so far because Armadillo already has build in checks and throws appropriate error messages, making it practically impossible to pass something wrong. But let me see if I can appease your tests.

Regarding the installation issue. I'm sorry for that, it turned out that BLAS and LAPACK libraries were linked through a Makevars.win file, instead of a global Makevars file. I have fixed this now.

SebKrantz commented 1 year ago

I have now brought dfms to a state where I am happy with it. The software is quite robust and inputs are rigorously checked. I still get some autotest issues, particularly in the main DFM() function, but I don't understand those, as all inputs parameters received the full extent of checking, including for data type, case insensitivity and permissible range.

I have quickly gone through the srr standards, and in my opinion dfms broadly meets most of them. I have made some comments in srr-stats-standards.R below standards where I have done things a bit differently.

Regarding unit testing, there is room for much improvement, in particular one could set up R translations of the authors original Matlab codes (which are provided under misc/) to test against. For now I have manually verified equivalence of dfms to these codes, and test against some hard-coded parameter values.

I general, my time on this package is a bit limited due to more important development commitments that I have, but it would be nice to get some review, and also to be able to release it to CRAN before Christmas. Let me know how you think about this.

adamhsparks commented 1 year ago

Hi @SebKrantz, good to hear that you've managed to improve the package.

There are no requirements for the package review and CRAN. That is, you can put it on CRAN and have it reviewed asynchronously, they don't affect each other.

We can initiate the review process as soon as you can open a new issue requesting the review. We would strive to have the first reviews done within three weeks of the reviewers being assigned. This of course is flexible to work with the reviewers' own commitments as well. So we might have something by Christmas for the reviews for you, but I'd not be surprised if it slipped a little past that.

SebKrantz commented 1 year ago

Thanks @adamhsparks for the clarification. I will then open an issue requesting review, and also prepare a first CRAN release.