ropensci / software-review

rOpenSci Software Peer Review.

Submission - melt: Multiple Empirical Likelihood Tests #550

Closed · markean closed this 1 year ago

markean commented 1 year ago

Date accepted: 2022-09-20

Submitting Author Name: Eunseop Kim
Submitting Author Github Handle: @markean
Repository: https://github.com/markean/melt
Version submitted: 1.6.0.9000
Submission type: Stats
Badge grade: silver
Editor: @Paula-Moraga
Reviewers: @pchausse, @awstringer1

Due date for @pchausse: 2022-09-05
Due date for @awstringer1: 2022-08-29

Archive: TBD
Version accepted: TBD
Language: en

Type: Package
Package: melt
Title: Multiple Empirical Likelihood Tests
Version: 1.6.0.9000
Authors@R: c(
    person("Eunseop", "Kim", , "kim.7302@osu.edu", role = c("aut", "cre")),
    person("Steven", "MacEachern", role = c("ctb", "ths")),
    person("Mario", "Peruggia", role = c("ctb", "ths"))
  )
Description: Performs multiple empirical likelihood tests for linear and
    generalized linear models. The core computational routines are
    implemented using the 'Eigen' C++ library and 'RcppEigen' interface,
    with OpenMP for parallel computation. Details of multiple testing
    procedures are given in Kim, MacEachern, and Peruggia (2021)
    <arXiv:2112.09206>.
License: GPL (>= 2)
URL: https://github.com/markean/melt, https://markean.github.io/melt/
BugReports: https://github.com/markean/melt/issues
Depends: 
    R (>= 4.0.0)
Imports: 
    graphics,
    methods,
    Rcpp,
    stats
Suggests: 
    covr,
    ggplot2,
    knitr,
    microbenchmark,
    rmarkdown,
    spelling,
    testthat (>= 3.0.0),
    withr
LinkingTo: 
    BH,
    dqrng,
    Rcpp,
    RcppEigen
VignetteBuilder: 
    knitr
Config/testthat/edition: 3
Encoding: UTF-8
Language: en-US
LazyData: true
NeedsCompilation: yes
Roxygen: list(markdown = TRUE, roclets = c("namespace", "rd",
    "srr::srr_stats_roclet"))
RoxygenNote: 7.2.0

Pre-submission Inquiry

General Information

Badging

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

Code of conduct

ropensci-review-bot commented 1 year ago

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot commented 1 year ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 1 year ago

Checks for melt (v1.6.0.9000)

git hash: 1b9bbebd

Package License: GPL (>= 2)


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

:heavy_check_mark: All applicable standards [v0.1.0] have been documented in this package (86 complied with; 30 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself) and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type |package | ncalls|
|:----------|:--------------|------:|
|internal |base | 146|
|internal |melt | 66|
|imports |methods | 83|
|imports |stats | 66|
|imports |graphics | 8|
|imports |Rcpp | NA|
|suggests |covr | NA|
|suggests |ggplot2 | NA|
|suggests |knitr | NA|
|suggests |microbenchmark | NA|
|suggests |rmarkdown | NA|
|suggests |spelling | NA|
|suggests |testthat | NA|
|suggests |withr | NA|
|linking_to |BH | NA|
|linking_to |dqrng | NA|
|linking_to |Rcpp | NA|
|linking_to |RcppEigen | NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()' and examining the 'external_calls' table.

base

list (15), q (14), attr (13), warning (13), numeric (12), c (7), nrow (7), integer (6), length (5), print (5), as.integer (4), match.call (4), names (4), ncol (4), tryCatch (4), do.call (3), if (3), mean (3), as.vector (2), cbind (2), logical (2), match (2), rownames (2), as.matrix (1), colMeans (1), is.null (1), matrix (1), NROW (1), rbind (1), sqrt (1), sum (1), t (1), table (1)

methods

setGeneric (36), setMethod (36), setClass (11)

melt

error (13), el_control (8), validate_optim (4), calibrate (3), compute_EL (3), validate_weights (3), compute_generic_EL (2), get_max_threads (2), test_GLM (2), test_LM (2), validate_family (2), compute_bootstrap_calibration (1), compute_confidence_intervals (1), compute_confidence_region (1), compute_ELD (1), el_eval (1), el_glm (1), el_lm (1), el_mean (1), el_sd (1), get_rank (1), test_hypothesis (1), test_multiple_hypotheses (1), validate_alpha (1), validate_b (1), validate_calibrate (1), validate_cv (1), validate_hypotheses (1), validate_hypothesis (1), validate_keep_data (1), validate_level (1), validate_lhs (1), validate_x (1)

stats

formula (14), df (8), offset (6), optim (6), na.action (5), weights (5), family (4), pchisq (3), step (3), contrasts (2), model.response (2), model.weights (2), sd (2), glm.fit (1), pf (1), qchisq (1), qf (1)

graphics

par (8)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in C++ (53% in 15 files) and R (47% in 29 files)
- 1 author
- 2 vignettes
- 1 internal data file
- 4 imported packages
- 59 exported R functions (median 4 lines of code)
- 120 non-exported R functions (median 7 lines of code)
- 91 C++ functions (median 14 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html). The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 29| 88.8| |
|files_src | 15| 95.4| |
|files_vignettes | 2| 85.7| |
|files_tests | 17| 95.3| |
|loc_R | 1629| 80.0| |
|loc_src | 1827| 74.0| |
|loc_vignettes | 129| 33.7| |
|loc_tests | 1243| 89.7| |
|num_vignettes | 2| 89.2| |
|data_size_total | 1311| 61.2| |
|data_size_median | 1311| 65.8| |
|n_fns_r | 179| 88.1| |
|n_fns_r_exported | 59| 90.2| |
|n_fns_r_not_exported | 120| 87.0| |
|n_fns_src | 91| 78.1| |
|n_fns_per_file_r | 3| 55.1| |
|n_fns_per_file_src | 6| 55.6| |
|num_params_per_fn | 2| 11.9| |
|loc_per_fn_r | 7| 16.0| |
|loc_per_fn_r_exp | 4| 4.3|TRUE |
|loc_per_fn_r_not_exp | 7| 18.0| |
|loc_per_fn_src | 14| 46.6| |
|rel_whitespace_R | 8| 59.8| |
|rel_whitespace_src | 5| 49.5| |
|rel_whitespace_vignettes | 8| 6.5| |
|rel_whitespace_tests | 8| 72.2| |
|doclines_per_fn_exp | 34| 40.2| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 253| 90.8| |

---

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


4. goodpractice and other checks

Details of goodpractice checks (click to open)

#### 3a. Continuous Integration Badges

[![check-standard.yaml](https://github.com/markean/melt/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/markean/melt/actions)

**GitHub Workflow Results**

| id|name |conclusion |sha | run_number|date |
|----------:|:--------------------------|:----------|:------|----------:|:----------|
| 2673111985|pages build and deployment |success |6ac77c | 46|2022-07-14 |
| 2673079345|pkgcheck |failure |1b9bbe | 51|2022-07-14 |
| 2673079344|pkgdown |success |1b9bbe | 49|2022-07-14 |
| 2673079342|R-CMD-check |success |1b9bbe | 71|2022-07-14 |
| 2673079354|test-coverage |success |1b9bbe | 153|2022-07-14 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following note:

1. checking installed package size ... NOTE
   installed size is 77.8Mb
   sub-directories of 1Mb or more: libs 77.2Mb

R CMD check generated the following check_fail:

1. rcmdcheck_reasonable_installed_size

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 99.78

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function has cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
el_glm | 18

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 5 potential issues:

message | number of times
--- | ---
Avoid library() and require() calls in packages | 4
Lines should not be more than 80 characters. | 1


Package Versions

|package |version |
|:--------|:---------|
|pkgstats |0.1.1.3 |
|pkgcheck |0.0.3.76 |
|srr |0.0.1.167 |


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

emilyriederer commented 1 year ago

Hi @markean - I just wanted to provide a brief update here. I apologize for the delay in next steps here; it's not for lack of excitement to kick off this review! I hope to have a handling editor ready to assign next week.

markean commented 1 year ago

Thanks for the update @emilyriederer!

emilyriederer commented 1 year ago

@ropensci-review-bot assign @Paula-Moraga as editor

ropensci-review-bot commented 1 year ago

Assigned! @Paula-Moraga is now the editor

emilyriederer commented 1 year ago

@markean , I'm delighted to share that @Paula-Moraga will serve as editor for this package. She is away this week but is available to commence next week. We will be in more touch soon!

markean commented 1 year ago

@emilyriederer, thanks for the news!

Paula-Moraga commented 1 year ago

Hi @markean. I am pleased to handle this package. I will look for reviewers.

Paula-Moraga commented 1 year ago

@ropensci-review-bot seeking reviewers

ropensci-review-bot commented 1 year ago

I'm sorry @Paula-Moraga, I'm afraid I can't do that. That's something only editors are allowed to do.

markean commented 1 year ago

Hi @Paula-Moraga. Thank you for your time and effort. Please let me know if there is anything I can help with in the review process. @emilyriederer Could you update the label?

emilyriederer commented 1 year ago

Done! I'll ask the bot team on Slack if they know why @Paula-Moraga couldn't herself since she is, in fact, the editor!

Paula-Moraga commented 1 year ago

@ropensci-review-bot assign @pchausse as reviewer

ropensci-review-bot commented 1 year ago

@pchausse added to the reviewers list. Review due date is 2022-08-25. Thanks @pchausse for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 1 year ago

@pchausse: If you haven't done so, please fill this form for us to update our reviewers records.

Paula-Moraga commented 1 year ago

@ropensci-review-bot set due date for @pchausse to 2022-09-05

ropensci-review-bot commented 1 year ago

Review due date for @pchausse is now 05-September-2022

markean commented 1 year ago

Hi @pchausse. Thank you so much for reviewing our package. Please let me know if you have any questions. A draft version of the PDF vignette in the vignettes folder might be of help.

Paula-Moraga commented 1 year ago

@ropensci-review-bot assign @awstringer1 as reviewer

ropensci-review-bot commented 1 year ago

@awstringer1 added to the reviewers list. Review due date is 2022-08-29. Thanks @awstringer1 for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 1 year ago

@awstringer1: If you haven't done so, please fill this form for us to update our reviewers records.

markean commented 1 year ago

Hi @awstringer1! Thank you for reviewing the package. Please let me know if you have any questions. A draft version of the PDF vignette in the vignettes folder might be of help.

@pchausse and @awstringer1: At an external request, a newer version of the package will be released to CRAN in August with some minor updates. It will not interfere with the review process, but I will tag it after the release. The reviewer comments will then be reflected in the package in a subsequent release.

awstringer1 commented 1 year ago

I am not sure if this is something the authors did, or something the rOpenSci template does, but the code line references in the "srr report" do not seem to match up with the locations of the tags in the roxygen documentation. They are a few lines off. This is mostly fine, but sometimes I cannot find the relevant tag, e.g. for "Standards in on line#306 of file R/AllGenerics.R:". Can the authors comment? Thank you.

mpadge commented 1 year ago

@awstringer1 That's because the repo has been updated since the report linked to above was generated. As stated in the automated checks, you can - and in this case should - generate a local, and up-to-date, version of the report by running srr::srr_report(path), where path is where your local clone of melt is. By default that will open the full report in your web browser, in which all reported line numbers should match current state of repo.

awstringer1 commented 1 year ago

Result of running srr_report on a fresh install of the development version of melt on my machine:

```r
> srr::srr_report(file.path(.libPaths(), 'melt'))
ℹ Loading melt
Writing 'NAMESPACE'
Writing 'NAMESPACE'
Loading required namespace: rmarkdown
Error in flist[sapply(blocks, inherits, "try-error")] : 
  invalid subscript type 'list'
In addition: Warning messages:
1: Failed to load at least one DLL: must specify DLL via a “DLLInfo” object. See getLoadedDLLs() 
2: In setup_ns_exports(path, export_all, export_imports) :
  Objects listed as exports, but not present in namespace: el_control, el_eval, el_glm, el_lm, el_mean, el_sd
```

I got a different error when trying on the CRAN version. Session information:

```r
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5
```
markean commented 1 year ago

@awstringer1 The path should be your local clone of the package, not the installed path. I can generate the report with srr::srr_report(path) after cloning. I guess that is what @mpadge meant above.

markean commented 1 year ago

@pchausse and @awstringer1: v1.7.0 has been released to CRAN. No changes were made with respect to the srr tags. The code on the repository will be updated after reviewers’ comments. Thank you.

awstringer1 commented 1 year ago

Here is my review, I hope it is helpful!

Package Review


Compliance with Standards

The following standards currently deemed non-applicable (through tags of @srrstatsNA) could potentially be applied to future versions of this software: (Please specify)

Please also comment on any standards which you consider either particularly well, or insufficiently, documented.

For packages aiming for silver or gold badges:

The package seems to have been developed with future general use in mind. The authors have provided an interface that is clearly intended to be used beyond the examples considered in their associated preprint. A large number of the standards described in the rOpenSci review materials seem to have been considered by the authors when designing their software.

Comments

Overall, the authors appear to have convincingly motivated that the package meets the rOpenSci standards. Many statements in the authors' @srrstats documentation tags contain incomplete references to the package documentation, which precludes a more thorough review. See below for a non-exhaustive list of examples. I have the following comments about the authors' justifications:


General Review

Documentation

The package includes all the following forms of documentation:

I do not see such a statement. I think this statement could be helpful: a short paragraph describing why empirical likelihood methods are useful in linear and generalized linear models could be informative to the user, since (a) empirical likelihood is a somewhat "specialized" topic and (b) there are other methods available for doing hypothesis tests in linear and generalized linear models, and such methods also rely on limiting Chi-square distributional arguments. It is at least not trivially clear why empirical likelihood is useful and why we'd want a package for it, so I think being very up front about this could better motivate the package.

I see no such statement.

The documentation is almost there, but I am going to leave this unchecked for now. I think adding more detailed "Details" in all core functions, even if this involves a lot of copying and pasting, is a good idea; see [https://r-pkgs.org/man.html](https://r-pkgs.org/man.html). For example, ?el_lm contains very detailed documentation, but ?el_mean does not. It is helpful to users to have everything on one page, even if the material is repeated across many functions' documentation.

Algorithms

The core algorithms are implemented in C++. I will do my best to provide appropriate comments here, but I am not an expert on C++.

Testing

The testing appears thorough, I have no comments.

Visualisation (where appropriate)

The visualizations provided seem to serve their stated purpose. No comments.

Package Design

External: the package is designed for the specific task of computing empirical likelihood ratio tests and intervals for the parameters of linear and generalized linear models. The package is set up to take input similar to lm and glm and then call the associated .fit function from core R. Some suggestions for broadening scope:

Internal: see comments about algorithms above. In general the use of C++ for the numerical analysis seems appropriate, notwithstanding the specific comments above.


Estimated hours spent reviewing: 8

markean commented 1 year ago

Hi @awstringer1, thank you for your review and feedback. We have made many changes to the main branch based on your comments. The package is in better shape than before. Please find the responses below.

Response to @awstringer1

Comments

  • "@srrstats {RE1.4} Some of the core distributional assumptions are documented.": this is potentially confusing to users. Either all of the regularity conditions should be clearly documented, or a link to a specific section in a paper where these conditions are clearly documented should be given. Although arguably common in the literature, the phrase "under some regularity conditions" is incomplete and not meaningful, and I suggest its removal.

The phrase is removed. Interested users are directed to the references for more details.

  • "@srrstats {G2.1, G2.1a} Assertions on types of inputs are clarified throughout the package documentation.": it would be helpful to the reviewers to give specific references to the package documentation. There are a number of similar "throughout the package documentation" statements that render it challenging to assess compliance with standards, beyond a general sense of such.

It is documented in the @param sections of el_mean.R, where the relevant srr tags are attached as well. These tags can also be located in any other user-facing functions of the package. All the user-facing functions state the required types of inputs explicitly.

  • Possible typo: "na.cation()" in el_glm.R, line 64.

Corrected to na.action().

  • "@srrstats {RE1.0, RE1.1} Formula interface is used to the formula argument, and how it is converted to a matrix input is documented as well.": a reference to the function, and section of documentation where this is located would better facilitate review.

It is documented in the @details section of el_lm.R, where the relevant srr tags are attached as well. An explanation of how the formula input relates to the model.frame and model.matrix objects is not provided, since there is nothing special about the usage of formula in the package. Interested users can refer to the documentation of model.frame() or model.matrix() in the stats package.

  • "@srrstats {RE4.11} logLik() method extracts the empirical log-likelihood value which can be interpreted as a measure of goodness-of-fit.": more detail could be provided about how this is to be used as a measure of goodness of fit, perhaps a link to where this is discussed in the package documentation or the preprint?

The srr tag has been moved to logL-methods.R. logL() and logLR() extract the empirical log-likelihood (ratio) at a specified parameter value. The formulation of the empirical likelihood problem can be viewed as allocating probabilities to each observation while minimizing a goodness-of-fit criterion (the Cressie–Read discrepancy) subject to constraints. Hence, the computed empirical log-likelihood (ratio) values can also be interpreted as a goodness-of-fit measure. @references sections are added to the package documentation in AllGenerics.R (lines 404--406, 441--443).

  • _"@srrstatsNA {RE2.3} It is beyond the scope of estimating functions that the package considers. There seems to be no simple way to pass this information to the internal routines. Users can use el_eval() to manually construct estimating functions with the offset included, but any further methods are not applicable to the output of el_eval().": this appears to be in response to standard RE2.3, which states: "RE2.3 Where applicable, Regression Software should enable data to be centred (for example, through converting to zero-mean equivalent values; or to z-scores) or offset (for example, to zero-intercept equivalent values) via additional parameters, with the effects of any such parameters clearly documented and tested."_ The present package uses lm and glm, as well as a standard formula interface, all of which facilitate centering and offsets, so it is not clear why these tasks are out of scope for the present package.

el_lm() and el_glm() now accept an optional offset argument. With offsets, the estimating functions were over-identified in the previous package design, and over-identified estimating equations are beyond the scope of the package. All the related C++ functions have been rewritten to accommodate offsets. The functionality with offsets has been tested internally, and it works with the other methods of the package. However, it is still at an experimental stage, and further changes could be made in a future release.
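A minimal sketch of the new interface (hypothetical data; el_glm() is assumed to follow the glm() conventions, with the offset on the scale of the linear predictor):

```r
## Hypothetical example: a Poisson rate model with log(exposure) as offset.
library(melt)
set.seed(1)
df <- data.frame(counts = rpois(100, 5),
                 x = rnorm(100),
                 exposure = runif(100, 1, 10))
fit <- el_glm(counts ~ x, family = poisson("log"), data = df,
              offset = log(exposure))
```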

Documentation

  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience in README

I do not see such a statement. I think this statement could be helpful: a short paragraph describing why empirical likelihood methods are useful in linear and generalized linear models could be informative to the user, since (a) empirical likelihood is a somewhat "specialized" topic and (b) there are other methods available for doing hypothesis tests in linear and generalized linear models, and such methods also rely on limiting Chi-square distributional arguments. It is at least not trivially clear why empirical likelihood is useful and why we'd want a package for it, so I think being very up front about this could better motivate the package.

The Overview section of the README file has been expanded to address the comment.

  • [ ] Community guidelines including contribution guidelines in the README or CONTRIBUTING

I see no such statement.

CONTRIBUTING is in the .github folder, since it is a GitHub-specific file.

The documentation is sufficient to enable general use of the package beyond one specific use case

The documentation is almost there, but I am going to leave this unchecked for now. I think adding more detailed "Details" in all core functions, even if this involves a lot of copying and pasting, is a good idea; see [https://r-pkgs.org/man.html](https://r-pkgs.org/man.html). For example, ?el_lm contains very detailed documentation, but ?el_mean does not. It is helpful to users to have everything on one page, even if the material is repeated across many functions' documentation.

A @details section has been added to el_mean.R.

Algorithms

EL.cpp line 103: is there justification for this step-halving mechanism? It is only applied once within the loop, so how can it ensure that the function value is increased? Would a while loop be more appropriate, or perhaps use of an algorithm that guarantees the desired property?

Related: how often does the Newton step here not increase the function value in practice? If often, perhaps Newton's method is unsuitable here, and a Trust Region procedure could be considered?

Much of the code for the computation is based on Owen (2001, chapter 12).

Here, set_el() performs different tasks depending on whether the convex hull constraint is satisfied or not. When the convex hull constraint is satisfied, the objective function is strictly concave without any constraint. Thus, the Newton–Raphson method converges quadratically to the unique solution. Step-halving is a safeguard against possible numerical instability.

When the convex hull constraint is violated, the PsuedoLog class (explained below) increases the function value until it reaches a set threshold, or until the algorithm hits the maximum number of iterations. Since the optimization problem is infeasible in this case, a while loop for the step-halving often runs indefinitely. The underlying principle is that we do not check the convex hull constraint explicitly for the EL class. Indirectly inspecting the constraint from the output is much faster and more efficient.

Currently, the package only considers infinitely differentiable estimating functions, and evaluating empirical likelihood itself with the Newton–Raphson method is not a challenging task. Other slower but safer methods, such as quasi-Newton or Levenberg–Marquardt, can be considered in the future when needed.
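For reference, the step-halving idea can be sketched in plain R. This is a generic illustration of Newton–Raphson with a bounded step-halving loop for a strictly concave objective, not the package's C++ routine; all names are hypothetical.

```r
## Generic sketch (not the package's C++ code): Newton-Raphson with a bounded
## step-halving loop for maximizing a strictly concave objective f.
newton_with_halving <- function(f, grad, hess, x0, maxit = 100, tol = 1e-08) {
  x <- x0
  for (i in seq_len(maxit)) {
    g <- grad(x)
    if (sqrt(sum(g^2)) < tol) break
    d <- solve(hess(x), -g)  # Newton direction; Hessian is negative definite
    s <- 1
    while (f(x + s * d) < f(x) && s > 1e-10) {
      s <- s / 2             # halve until f increases, with a floor on s
    }
    x <- x + s * d
  }
  x
}
## e.g., maximize f(x) = -x^4 + x from x0 = 2 (optimum near 0.63):
newton_with_halving(function(x) -x^4 + x,
                    function(x) -4 * x^3 + 1,
                    function(x) matrix(-12 * x^2), 2)
```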

EL.cpp line 181: is explicitly creating a projection matrix necessary? The combination of this being a dense matrix, and relying on explicitly calling inverse (which may have performance and stability implications), leads me to question whether operations involving this matrix could instead be coded more efficiently. For example, if we have $P = I - X(XX^T)^{-1}X^T$, and you want to compute a matrix/vector product $Pv$, you could write $Pv = v - Xb$ where $(XX^T)b = X^Tv$, and $b$ is obtained by solving this system. This avoids the need to ever compute and store an explicit inverse.

The new proj() in helpers.cpp (lines 31--35) performs the same projection operation. It does not compute an explicit inverse.

Below is a benchmark result of the same projection operation with a random 5 by 50 LHS matrix. The version without the inverse is much faster, and the gap will grow as the dimension of the matrix increases.

```
Unit: microseconds 
        expr   min    lq     mean  median     uq      max neval cld 
     inverse 9.491 9.906 11.25626 10.7715 12.307   45.142 10000   b 
 w/o inverse 3.073 3.293  4.49282  3.5470  4.102 7289.781 10000  a  
```
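The same comparison can be sketched in R, assuming the 5 by 50 LHS matrix projects vectors onto the orthogonal complement of its row space (a hypothetical reconstruction of the benchmarked operation, not the package's C++ code):

```r
## Hypothetical reconstruction of the two approaches benchmarked above.
set.seed(1)
l <- matrix(rnorm(5 * 50), 5, 50)  # 5 x 50 LHS matrix
v <- rnorm(50)
## with an explicit inverse: P v, where P = I - l' (l l')^{-1} l
pv1 <- v - t(l) %*% solve(l %*% t(l)) %*% (l %*% v)
## without the inverse: solve (l l') b = l v once, then P v = v - l' b
pv2 <- v - t(l) %*% solve(l %*% t(l), l %*% v)
all.equal(c(pv1), c(pv2))  # TRUE up to numerical error
```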


What is the purpose of the PsuedoLog class? Commenting this code (lines 347--394) could help facilitate a more thorough review. This looks like a generic mathematical operation; is there no pre-existing implementation to rely on?

A brief comment is added to EL.h (lines 129--137).

Numerical computation of empirical log-likelihood ratio statistics heavily relies on a pseudo-log function. It makes the associated optimization problem easier to solve without affecting the solution when the convex hull constraint is satisfied. When the condition is violated, the pseudo-log function also prevents the statistic from being undefined (Inf) so that the algorithm can fail gracefully. The PsuedoLog class has relevant member variables and static member functions needed for the computation. We have not found any pre-existing routines for these specific tasks.
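For reference, the textbook construction can be sketched in R as follows; this is the standard pseudo-logarithm in the style of Owen (2001), not the package's exact implementation, and its threshold and scaling are assumptions.

```r
## Sketch of a pseudo-logarithm in the style of Owen (2001): equal to log(z)
## for z >= 1/n, and extended below the threshold by a quadratic so that the
## function is finite and twice continuously differentiable everywhere.
pseudo_log <- function(z, n) {
  eps <- 1 / n
  out <- log(eps) - 1.5 + 2 * z / eps - z^2 / (2 * eps^2)
  ok <- z >= eps
  out[ok] <- log(z[ok])
  out
}
```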

Package Design

Could methods for objects of class lm and glm (and so on) be provided, so that the package provides its output to users who have already fit their models using these alternative methods, and hence compare them?

Providing empirical likelihood methods to existing lm or glm classes directly would require dedicated functions like summary_el() or confint_el(). Otherwise, they would mask existing ones like summary() or confint(), which would not be desirable. Although el_lm() and el_glm() purposefully mimic the syntax and output of lm() and glm(), these functions perform significance tests, validate inputs, and prepare objects for subsequent methods provided by the package. Since the options provided by el_lm() and el_glm() are different from lm() and glm(), currently we are not considering extending methods to existing S3 classes.

The output of el_lm is an object of class LM, a new class from the melt package. Could this output be made to inherit from the core R lm class, to use the associated functionality with which most R users are familiar? Or related, could the package provide a straightforward manner by which to fit an lm object from a fitted LM object? Again, thinking about a user who wishes to compare the two. Similar comments for glm.

The idea of inheriting from the lm class was considered at an early stage of development, but it was abandoned for the following reasons:

We plan to add some of the methods to LM directly in a future release.

Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

We are grateful for your insightful comments and suggestions on the package. Your name and role have been added to the updated DESCRIPTION file. We tagged you in the NEWS file for the update on the projection operation. We would also like to include @pchausse when we receive comments from both reviewers.

ropensci-review-bot commented 1 year ago

:calendar: @awstringer1 you have 2 days left before the due date for your review (2022-08-29).

Paula-Moraga commented 1 year ago

Many thanks @awstringer1 for your review! You can of course ignore the previous bot message :)

Paula-Moraga commented 1 year ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/550#issuecomment-1218524923 time 8

ropensci-review-bot commented 1 year ago

Logged review for awstringer1 (hours: 8)

Paula-Moraga commented 1 year ago

Many thanks @markean for answering @awstringer1's comments so quickly. @awstringer1 can tell you whether he is satisfied with them or has additional comments. @pchausse will submit his review soon. Thanks all!

pchausse commented 1 year ago

melt-review-chausse-exported.pdf

Here is my review. I was trying to make some of my points with some R code, but I think it does not work here. I will also attach the PDF.

My review is based on Version 1.7.0 from CRAN.

Package Review


Compliance with Standards

The following standards currently deemed non-applicable (through tags of @srrstatsNA) could potentially be applied to future versions of this software:

G1.6 Software should include code necessary to compare performance claims with alternative implementations in other R packages.: Other packages compute empirical likelihood tests and confidence intervals. We would like to see how the melt package compares with the others. For example, my momentfit (or gmm) package produces the same type of confidence intervals or confidence areas for linear models and the mean. The package emplik now includes regression models.

This is clearly not a package suited to a unique application. The algorithm to solve restricted empirical likelihood models is efficient and can be applied to many more models not yet implemented by the package. For example, the author seems to discard the idea of applying the method to over-identified models (from the NOT TO DO section of CONTRIBUTING.md), but it could easily be implemented since constrained empirical likelihood is in fact over-identified. I could also see other packages being based on melt for specific applications.

Comments

"@srrstats {RE1.4} Some of the core distributional assumptions are documented.": The list of regularity conditions is not that long. We do see some distribution assumptions in the EL documentation where the EL method is presented for general estimating equations $E[g(X,\theta)]=0$, but it would be important to add at least the assumption on the variance of $g(X,\theta)$. Users using only el_mean or el_sd should know what distribution assumptions are required. For these two functions, the detail section is empty. At least, we should see what the estimating equations are and then be given a link to the EL documentation for more details.

"@srrstats {RE4.11} logLik() method extracts the empirical log-likelihood value which can be interpreted as a measure of goodness-of-fit": It is really not clear how to interpret the value. Can we have a more detailed explanation? First, it is the empirical likelihood value of what? For example, if we estimate a linear model:

```r
set.seed(112233)
x <- rnorm(40); y <- rnorm(40)
fit <- el_lm(y ~ x)
print(logLik(fit), digits = 10)
```

This is the log-likelihood of what? If I understand correctly (it is not clear from the documentation), it is not the one from the model that restricts the slope coefficient to be 0, which is what we have in the slot logl, or what we get if we sum the log of the implied probabilities:

```r
print(fit@logl, digits = 10)
print(sum(fit@logp), digits = 10)
```

After more testing, it seems to be the one from the unrestricted model for which the implied probabilities are all equal to $1/n$:

```r
print(40 * log(1 / 40), digits = 10)
```

I think this needs to be explained in more detail.

"@srrstats {RE6.0} plot method is available for an ELD object that is returned by the eld() method.": The authors may consider adding some diagnostic-type plot methods for EL (or any other) class as we have for lm objects. It is great to have a plot for eld() to detect outliers, but it would be even better I think if the plot was applied directly to EL class. For example, a Cook Distance exists separately for lm, but we can plot it directly from lm objects by selecting which=4:

```r
## clearly not the best method, but you see the idea
setMethod("plot", "EL", function(x, which = 1, ...) {
    if (which == 1)
        plot(eld(x))
    else
        cat("other diagnostic plots if any\n")
})

plot(fit, which = 1)
```

General Review

This is a very well-designed package. The outputs of the different estimating or testing methods are easy to read, and the algorithms seem to be quite efficient. The following are mostly suggestions that, in my view, could improve the package.

Documentation

The package includes all the following forms of documentation:

I think we should see a more detailed presentation of why one would need to use the melt package to perform hypothesis tests or to construct confidence intervals. For example, when does empirical likelihood start to produce confidence intervals with better coverage than lm does for linear regression? In the example section of el_lm, we see an application with iid and normally distributed variables. In this case, even when the sample size is very small, a least squares confidence interval has exact coverage while EL does not. So why should we use the melt package?

There is room for improvement here.

The optim slot

We see that a slot optim exists in the LM object created by el_lm through the CEL class. But it is not clear from the documentation what these values are.

```r
set.seed(5649)
df <- data.frame(y = rnorm(50), x = rnorm(50))
fit <- el_lm(y ~ x, df)
slotNames(fit)
```

The CEL-class documentation says that par is $\hat{\theta}$ (subject to the constraints, which we do not have here) and lambda is $\hat{\lambda}$. But it is not:

```r
coef(fit)
fit@optim$par
fit@optim$lambda
```

What is par? What is lambda? Since the model is just-identified, the latter should be a vector of zeroes when there are no constraints. Are they the estimates when we restrict all coefficients to be zero (except for the intercept)? Can that be added to the documentation? If we explore further, we learn that this seems to be the case:

```r
set.seed(5649)
df2 <- data.frame(y = rnorm(50), x = rnorm(50), z = rnorm(50))
fit2 <- el_lm(y ~ x + z, df2)
fit2@optim$par
```

I think it is important information that may be useful for some users.

In the el_lm and el_glm (maybe others?) it is written in the details that the test for the null $H_0: \theta_1 = \theta_2 = \cdots = \theta_{p-1} = 0$ is in optim and the parameter-by-parameter tests are in sigTests, but there is no test result in optim. It only includes the parameters, the lambdas, the number of iterations, and the convergence code. Also, why is it important to return the implied probabilities (the logp slot) and the lambdas for the joint test and not for the individual ones?

The other EL slots

The slots are documented in EL, but it is not easy to get them for other objects. For example, in the el_glm documentation, the Value is "An object of class GLM". If we click on it, we only see that it inherits from LM. If we click on that, we see some slots, but not all. It inherits from CEL, so we click on it and only the optim slot is defined. We finally get the description of all slots by clicking on EL. It is not an easy process for someone like me who only works in terminals. It requires calling help() several times. It would be helpful if the slots were defined for all objects that inherit from EL.

Also, it is not always clear what the slots are. In EL, we see "A numeric vector of the log probabilities obtained from empirical likelihood" for the logp slot. But which probabilities? I figured out for el_lm that it is for the restricted model in which all parameters are 0 except the intercept. It should be specified. It is the same for logl, loglr, etc. This is something we see across the different help files. It is another reason the author should define all slots for all functions that return an EL object.

The Empirical Likelihood method

I would expect a little more detail about the EL method. It is well presented in the EL documentation for general estimating equations $g(X,\theta)$, but I think it would be helpful for the method to also be explained in el_lm, el_glm, el_sd, el_mean, and even el_eval. The EL method does not need to be repeated, but at least we should see the specific estimating equation used for each model and a link to EL for the details. For example, it is not clear to me which estimating equation is used for the standard deviation or for generalized linear models.

It is not clear what the purpose of el_eval is. It says that it is for general estimating equations, but the input is a matrix, so it is an estimating equation without parameters. The exact same result can be obtained with el_mean: the following two produce identical optim, logp, logl, and loglr:

```r
x <- matrix(rnorm(200), 50, 4)
fit <- el_eval(x)
fit2 <- el_mean(x, rep(0, 4))
```

So why does el_eval not simply call el_mean? If the output has to be a list, it can be returned as such.


Other comments

As a suggestion, the package could be greatly improved by adding more general nonlinear regression models. el_glm is already a nonlinear model, so it should not be too difficult to include nonlinear regression as in nls, or even more general cases. With a method to estimate general nonlinear models, it becomes straightforward to test nonlinear restrictions, at least those that can be written as $H_0: \theta_i = f(\theta_{-i})$. This should not be too difficult to implement, because it only requires replacing $\theta_i$ by $f(\theta_{-i})$ in the estimating equation function $g(X,\theta)$.

Many recent applications rely on nonlinear estimating equations that are often over-identified. This is the case for all methods recently developed in the missing value literature (or the causal effect literature, which is just a special case).

Algorithms

The confint method

It seems that all the methods for the EL, QGLM, and SD classes are almost identical and could be reduced to a single method applied to a union class. I can see some benefits from avoiding repeating the same code in several methods, especially if changes need to be applied to them. See the sketch below.
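A minimal sketch of the idea, assuming melt's classes are loaded and a confint() S4 generic exists (the union class name is hypothetical):

```r
## Hypothetical sketch: one method on a class union instead of three
## near-identical methods for EL, QGLM, and SD.
library(melt)
setClassUnion("ELfamily", members = c("EL", "QGLM", "SD"))
setMethod("confint", "ELfamily", function(object, parm, level = 0.95, ...) {
  ## single shared implementation would go here
})
```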

Why is getMethodEL(object) used to get the type of method for the EL and QGLM classes but not for the SD class? Is it because the getMethodEL method does not exist for this class? For the latter, the type is explicitly added (see line 214 of confint-methods.R).

The elt method

It would be nice to see the hypothesis being tested when the result is printed (see, for example, the linearHypothesis function from the car package).

Testing

Don't see any issue with the different tests.

Visualisation (where appropriate)

No issue here other than the above suggestion about the plot method.

Package Design


Estimated hours spent reviewing: 16

Paula-Moraga commented 1 year ago

Thanks so much @pchausse for your review!

Paula-Moraga commented 1 year ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/550#issuecomment-1235901756 time 16

ropensci-review-bot commented 1 year ago

Logged review for pchausse (hours: 16)

markean commented 1 year ago

Hi @pchausse, thank you for your review and feedback. Your comments have greatly improved the documentation. Some notable updates to the code include: a new logProb() function to extract the logp slot; plot() is now directly applicable to the EL class; elt() allows a symbolic description for specifying a hypothesis; and print() shows the tested hypothesis when applied to an ELT object. The repository currently has v1.7.0.9600 of the package. The CRAN version will be updated after the review process. Please find the responses below.

Response to @pchausse

Compliance with Standards

G1.6 Software should include code necessary to compare performance claims with alternative implementations in other R packages.: Other packages compute empirical likelihood tests and confidence intervals. We would like to see how the melt package compares with the others. For example, my momentfit (or gmm) package produces the same type of confidence intervals or confidence areas for linear models and the mean. The package emplik now includes regression models.

Since we do not make performance claims relative to other packages, please take the benchmark result below as a reference only. We constructed EL confidence intervals for the mean with 100 observations using the melt, momentfit, and emplik packages. Default settings were used for each package.

```r
library(microbenchmark)
library(melt)
library(emplik)
library(momentfit)
set.seed(134095)
x <- rnorm(100)
fit <- el_mean(x, 0)
out <- microbenchmark(
  ci_melt = confint(fit),
  ci_momentfit = confint(x, gelType = "EL"),
  ci_emplik = findUL(fun = function(theta, x) {
    el.test(x, mu = theta)
  }, MLE = mean(x), x = x),
  times = 1000
)
print(out, unit = "relative")
```

```
Unit: relative
         expr      min        lq      mean    median        uq       max neval cld
      ci_melt  1.00000  1.000000  1.000000  1.000000  1.000000  1.000000  1000 a  
 ci_momentfit  6.18093  5.738152  5.672995  5.549923  5.304345  5.693226  1000  b 
    ci_emplik 33.49088 29.969031 31.332941 30.790802 31.826105 69.604799  1000   c
```



Comments

"@srrstats {RE1.4} Some of the core distributional assumptions are documented.": The list of regularity conditions is not that long. We do see some distribution assumptions in the EL documentation where the EL method is presented for general estimating equations $E[g(X, \theta)] = 0$, but it would be important to add at least the assumption on the variance of $g(X, \theta)$. Users using only el_mean or el_sd should know what distribution assumptions are required. For these two functions, the detail section is empty. At least, we should see what the estimating equations are and then be given a link to the EL documentation for more details.

The updated documentation of el_mean() and el_sd() states in the Details section the optimization problems with the estimating functions. Assumptions on the variance are stated as well. The See Also section gives a link to the EL documentation. Links to relevant documentation have been added to all the other user-facing exported functions as well.
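For reference, the EL problem stated in those Details sections has the standard form below (a textbook formulation, not a quotation from the package documentation); for el_mean(), the estimating function is $g(X_i, \theta) = X_i - \theta$.

```latex
% Standard empirical likelihood problem for an estimating function g:
\max_{p_1, \dots, p_n} \sum_{i=1}^{n} \log(n p_i)
\quad \text{subject to} \quad
p_i \ge 0, \qquad \sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} p_i \, g(X_i, \theta) = 0.
```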

"@srrstats {RE4.11} logLik() method extracts the empirical log-likelihood value which can be interpreted as a measure of goodness-of-fit": It is really not clear how to interpret the value. Can we have a more detailed explanation?

logLik() only gives $-n\log n$ with $p_i = 1/n$ for all $i$. We have expanded the Details section of logLik() (line 436 of AllGenerics.R) with more details. logLik() was originally designed to offer other functionality, but its role has been superseded by logL(). The srr tag has also been moved to logL-methods.R.
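A small illustration of the distinction (assuming the logL() accessor behaves as described above):

```r
## logLik() returns the unconstrained maximum -n * log(n) (all p_i = 1/n),
## while logL() extracts the constrained empirical log-likelihood.
library(melt)
set.seed(112233)
df <- data.frame(y = rnorm(40), x = rnorm(40))
fit <- el_lm(y ~ x, data = df)
logLik(fit)  # equals 40 * log(1 / 40)
logL(fit)    # empirical log-likelihood under the hypothesis tested
```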

"@srrstats {RE6.0} plot method is available for an ELD object that is returned by the eld() method.": The authors may consider adding some diagnostic-type plot methods for EL (or any other) class as we have for lm objects. It is great to have a plot for eld() to detect outliers, but it would be even better I think if the plot was applied directly to EL class. For example, a Cook Distance exists separately for lm, but we can plot it directly from lm objects by selecting which=4:

plot() is now applicable to EL objects directly (line 35 of plot-methods.R). It implicitly calls eld(), since the package currently provides no other diagnostic plot. More options like which can be added in a future version.
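Assumed usage after this change (illustrative only):

```r
## plot() applied directly to a fitted EL object now draws the eld() plot.
library(melt)
set.seed(1)
fit <- el_mean(rnorm(50), par = 0)
plot(fit)
```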

General Review

Documentation

I think we should see a more detailed presentation of why one would need to use the melt package to perform hypothesis tests or to construct confidence intervals. For example, when does empirical likelihood start to produce confidence intervals with better coverage than lm does for linear regression? In the example section of el_lm, we see an application with iid and normally distributed variables. In this case, even when the sample size is very small, a least squares confidence interval has exact coverage while EL does not. So why should we use the melt package?

Please check the updated README file. We have added more details in the Overview section. The new Main functions section introduces some available functions with short descriptions. More examples have also been added in the Usage section. The examples of el_lm() now rely on a new internal dataset thiamethoxam.

The optim slot

What is par? What is lambda? Since the model is just-identified, the latter should be a vector of zeroes when there are no constraints. Are they the estimates when we restrict all coefficients to be zero (except for the intercept)? Can that be added to the documentation?

The updated Details sections of LM, GLM, and QGLM contain more information on the optim slot and its components. The following slots are all related to the overall test: optim, logp, logl, loglr, statistic, df, and pval.

In the el_lm and el_glm (maybe others?) it is written in the details that the test for the null $H_0: \theta_1 = \theta_2 = \cdots = \theta_{p-1} = 0$ is in optim and the parameter-by-parameter tests are in sigTests, but there is no test result in optim. It only includes the parameters, the lambdas, the number of iterations, and the convergence code. Also, why is it important to return the implied probabilities (the logp slot) and the lambdas for the joint test and not for the individual ones?

Besides optim, the overall test result needs to involve logp, logl, loglr, statistic, df, and pval. Since they are all derived from optim, we stated that the overall test result is in optim. To avoid confusion, we have removed the original sentence "The test results are returned as optim and sigTests, respectively." from el_lm() and el_glm(). The updated documentation of el_lm() and el_glm() now focuses on what the functions do rather than on specific slots of the return value.

We could include the full results (optim, logp, logl, loglr, statistic, df, and pval) for each significance test, but it would give the sigTests slot a nested structure with a possibly large size. Currently, sigTest() applied to SummaryLM returns the estimates, the test statistics, and the $p$-values. elt() can be used if the user wants more information on a specific test.

Additionally, we have added a new accessor method logProb() that extracts logp.

The other EL slots

The slots are documented in EL, but it is not easy to get them for other objects. For example, in the el_glm documentation, the Value is "An object of class GLM". If we click on it, we only see that it inherits from LM. If we click on that, we see some slots, but not all. It inherits from CEL, so we click on it and only the optim slot is defined. We finally get the description of all slots by clicking on EL. It is not an easy process for someone like me who only works in terminals. It requires calling help() several times. It would be helpful if the slots were defined for all objects that inherit from EL.

Also, it is not always clear what the slots are. In EL, we see "A numeric vector of the log probabilities obtained from empirical likelihood" for the logp slot. But which probabilities? I figured out for el_lm that it is for the restricted model in which all parameters are 0 except the intercept. It should be specified. It is the same for logl, loglr, etc. This is something we see across the different help files. It is another reason the author should define all slots for all functions that return an EL object.

The updated documentation (in AllClasses.R) lists all slots for all classes with some class-specific details.

The Empirical Likelihood method

I would expect a little more detail about the EL method. It is well presented in the EL documentation for general estimating equations $g(X, \theta)$, but I think it would be helpful for the method to also be explained in el_lm, el_glm, el_sd, el_mean, and even el_eval. The EL method does not need to be repeated, but at least we should see the specific estimating equation used for each model and a link to EL for the details. For example, it is not clear to me which estimating equation is used for the standard deviation or for generalized linear models.

Each model-building function (including el_eval()) now states the estimating equations and the associated EL problem in the Details section.

el_eval() is a function that can be used to compute EL when the user has a specific estimating function not covered by the package. For a just-identified estimating function $g$ with a specified parameter $\theta$, the user needs to prepare the input matrix g computed from the data. el_eval() only solves the optimization problem and returns the result. Since the main functions, such as elt() and elmt(), cannot be used with an arbitrary estimating equation, the return value is a list with the same components as EL. A short sketch follows below.
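A minimal sketch of this workflow, using the mean as a hypothetical example of a user-supplied estimating function:

```r
## For the mean of x at theta, the estimating function is g_i = x_i - theta;
## the user supplies the pre-computed n x 1 matrix of its values.
library(melt)
set.seed(1)
x <- rnorm(50)
theta <- 0
g <- as.matrix(x - theta)
fit <- el_eval(g)  # a list with the same components as an EL object
```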

Other comments

As a suggestion, the package could be greatly improved by adding more general nonlinear regression models. el_glm is already a nonlinear model, so it should not be too difficult to include nonlinear regression as in nls, or even more general cases. With a method to estimate general nonlinear models, it becomes straightforward to test nonlinear restrictions, at least those that can be written as $H_0: \theta_i = f(\theta_{-i})$. This should not be too difficult to implement, because it only requires replacing $\theta_i$ by $f(\theta_{-i})$ in the estimating equation function $g(X, \theta)$.

Although we agree that nls is probably the most natural direction for extending the package, development is currently focused on adding families and link functions for el_glm(). The next release (v1.8.0) will ship with an identity link option for the quasipoisson family, which was added recently in the C++ files. It will therefore take several more releases for el_glm() to provide most of the options on our roadmap.

Based on your comments above, nls methods could be implemented by improving el_eval() or adding new functions such as el_nls() or el_g(). These functions would take a formula or function argument for the estimating functions and an additional argument for the gradient functions. Then, in principle, we could construct confidence intervals or perform hypothesis tests. The main concern would be performance. The package relies exclusively on the Eigen C++ library and OpenMP directives for parallel computing, and user-defined R functions have no simple thread-safe way of accessing them. We will revisit this issue when the development of el_glm() is finished.

Algorithms

The confint method

It seems that all the methods for the EL, QGLM, and SD classes are almost identical and could be reduced to a single method applied to a union class. I can see some benefits from avoiding repeating the same code in several methods, especially if changes need to be applied to them.

Why is getMethodEL(object) used to get the type of method for the EL and QGLM classes but not for the SD class? Is it because the getMethodEL method does not exist for this class? For the latter, the type is explicitly added (see line 214 of confint-methods.R).

The confint() methods for the QGLM and SD classes have been removed. One method for the EL class can now be applied to all the other classes using a new internal generic getEstimates() (line 663 of AllGenerics.R). We have applied the same procedure to the confreg() and elmt() methods and reduced the amount of duplicate code.

The elt method

It would be nice to see the hypothesis being tested when the result is printed (see, for example, the linearHypothesis function from the car package).

The print() method now prints the tested hypothesis. A new internal function describe_hypothesis() in describe_hypothesis.R translates the hypothesis matrix into a character vector for printing. Additionally, elt() now accepts a character vector for the argument lhs, allowing a symbolic description of a hypothesis. The updated documentation of elt.R shows different ways to specify a hypothesis with new examples.
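For illustration, assumed usage of the symbolic interface (the exact syntax below follows the description above and is not quoted from the documentation):

```r
## Hypothetical examples of specifying hypotheses for elt() symbolically.
library(melt)
fit <- el_lm(mpg ~ wt + hp, data = mtcars)
elt(fit, lhs = c("wt", "hp"))  # H0: wt = 0 and hp = 0
elt(fit, lhs = "wt - hp")      # H0: wt - hp = 0
```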

Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Please check the updated DESCRIPTION file. We have also added you and @awstringer1 to the Acknowledgments section of the PDF vignette in vignettes/article.pdf.

Paula-Moraga commented 1 year ago

Many thanks @markean for working on the comments so quickly! @awstringer1 and @pchausse, can you please take a look and say whether your comments were well addressed or you have additional questions? Thanks!

awstringer1 commented 1 year ago

Thanks to @markean for the response. I was waiting until both reviews and responses were in, to avoid redundancy. All my major comments have been addressed.

My only minor comment is about the optimization. I don't think it is true that "the objective function is strictly concave without any constraint. Thus, the Newton–Raphson method converges quadratically to the unique solution."– see Nocedal and Wright's Numerical Optimization, Theorem 3.5, page 46. There are non-trivial initial conditions, namely that the starting value has to be "close enough" to the optimum, else it will diverge. If Newton with step-halving has worked so far, then that's good enough for me (and perhaps there is some additional theory here about why it works in this context that I do not know), but if it ever becomes a practical problem in the future, the authors could consider trust region optimization. See Chapter 4 in that same book, or the trust or trustOptim R packages. I have found this very useful in cases where Newton was not working satisfactorily.

pchausse commented 1 year ago

Thanks @markean for your response. All my comments have also been addressed.

markean commented 1 year ago

Thanks again, @pchausse and @awstringer1.

As for the optimization, it can be seen from Art Owen's Empirical Likelihood (2001, page 63) that the objective function is twice continuously differentiable with a Lipschitz continuous Hessian by using a pseudo-logarithm function. The initial $\lambda$ is set to the zero vector. This is related to the theoretical result that $\lambda(\theta_0) = o_p(n^{-1/2})$, where $\theta_0$ is the true value. Although there have been some suggestions for using a data-dependent initial value for $\lambda$ (e.g., Hjort, McKeague, and van Keilegom 2009, page 1104), we have not found non-convergence issues during the unit tests when the convex hull constraint is satisfied. We will keep track of the issue and consider alternatives, including trust region procedures, as we add more estimating functions and methods.

markean commented 1 year ago

Hi @Paula-Moraga. Can we proceed with this at your discretion if the reviewers have no further comments?

Paula-Moraga commented 1 year ago

Yes! Thanks so much @markean @awstringer1 @pchausse for your time and work. Very happy to approve the melt package!

Paula-Moraga commented 1 year ago

@ropensci-review-bot approve melt

ropensci-review-bot commented 1 year ago

Approved! Thanks @markean for submitting and @pchausse, @awstringer1 for your reviews! :grin:

To-dos:

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practices and tips; this chapter starts the 3rd section, which is about guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer via filling a short form.

markean commented 1 year ago

Thank you @Paula-Moraga!