[REVIEW]: RandomForestsGLS: An R package for Random Forests for dependent data

whedon commented 3 years ago

Submitting author: @ArkajyotiSaha Editor: @fabian-s Reviewers: @mnwright, @pdwaggoner Archive: 10.5281/zenodo.6257157

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/8c02fcd364d7c57b0936715328dda548"><img src="https://joss.theoj.org/papers/8c02fcd364d7c57b0936715328dda548/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/8c02fcd364d7c57b0936715328dda548/status.svg)](https://joss.theoj.org/papers/8c02fcd364d7c57b0936715328dda548)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@mnwright & @pdwaggoner, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @fabian-s know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @mnwright

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@ArkajyotiSaha) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @pdwaggoner

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@ArkajyotiSaha) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @mnwright, @pdwaggoner it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 3 years ago

Wordcount for paper.md is 2766

whedon commented 3 years ago

Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.03 s (594.6 files/s, 87811.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C++                              2            304             55            980
R                               11             92             42            547
Markdown                         2             76              0            142
Rmd                              1            102            143            137
TeX                              1             11              0            108
C                                1              4              4             20
C/C++ Header                     1             11             12             16
-------------------------------------------------------------------------------
SUM:                            19            600            256           1950
-------------------------------------------------------------------------------

Statistical information for the repository '08378bffd337f88f29666b71' was
gathered on 2021/09/29.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Arkajyoti Saha                   4          1837            431          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Arkajyoti Saha             1406           76.5          0.6                5.12

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1080/01621459.2021.1950003 is OK
- 10.1007/bf00058655 is OK
- 10.1023/A:1010933404324 is OK
- 10.7717/peerj.5518 is OK
- 10.1080/10106049.2019.1595177 is OK
- 10.1016/j.najef.2018.06.013 is OK
- 10.1080/01621459.2015.1044091 is OK
- 10.1109/99.660313 is OK
- 10.1201/9781315139470 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

fabian-s commented 3 years ago

👋🏼 @ArkajyotiSaha @mnwright @pdwaggoner

this is the review thread for the paper. All of our communications will happen here from now on.

Both reviewers have checklists at the top of this thread with the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. There are also links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. When doing so, please mention openjournals/joss-reviews#REVIEW_NUMBER so that a link is created to this thread (and I can keep an eye on what is happening). Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

We aim for reviews to be completed within about 2-4 weeks (Marvin already told me he might need a little bit more time, that's fine). Please let me know if you expect additional delays. We can also use Whedon (our bot) to set automatic reminders if you know you'll be away for a known period of time.

Please feel free to ping me (@fabian-s) if you have any questions/concerns.

pdwaggoner commented 3 years ago

@ArkajyotiSaha @fabian-s et al. - Overall, this package is great. A useful extension of RF, and a great complement to the paper introducing the method. My feedback is mostly focused on high level items and involves fixes to ease consumption of the paper and code, and thus application and interpretation. No PRs as nothing major needed to be changed, by me at least. I hope there are some useful comments here for the authors. Thanks!

Re: the code, how is optimization defined when param_estimate = TRUE in the context of unknown covariance parameters? More defining and defending this (ideally in the paper and code/documentation) would be useful.

Re: the paper, and specifically this criterion from JOSS: “A summary describing the high-level functionality and purpose of the software for a diverse, non-specialist audience.” The summary (and statement of need by extension) don’t fully meet this standard. The language does a good job focusing on the computational benefits of RandomForestsGLS, as well as its value in a statistical sense. But the functionality and focus of the package (rather than the method) are lacking. The details and value of the method, though needed at a high level to understand the package, are fully unpacked in the saha2021random paper. So, I wanted much more focus on introducing the software tool and convincing a non-specialist, skeptical audience of its need and value. To be sure, the details of the package construction and design are well-discussed. But the implementation of the package, and how it might be tied into the modal ML workflow, for example, are missing. Of note: Once addressed, I will check off the related item in the review form. Ping me (@pdwaggoner) once addressed so I can complete the review form.

Why only choose autoregression for the time series dependency? As with any method, there are several assumptions with this approach (namely, assuming autoregressive errors). It's definitely widely used, and AR is often the most common type of history dependence, and thus a good starting place. But I’d recommend, perhaps even for later package versions, that other time series methods be included in this framework, both parametric and nonparametric (e.g., ECM, ARFIMA, random walk, and so on).

Re: the paper, there were many grammatical issues throughout (e.g., “felicitates” in the Statement of Need), as well as informal syntax (contractions like “doesn't” used throughout). I recommend cleaning up and revising the manuscript several times across several readers. These types of mistakes are a bit distracting. Of note: Once addressed, I will check off the related item in the review form. Ping me (@pdwaggoner) once addressed so I can complete the review form.

Re: the paper, I wanted to see a more explicit and clearer definition of the core concept, “dependency”, up front. It is mentioned a lot throughout and in the title. The authors do a good job of relating the move from OLS -> GLS to the current move from RF -> RF-GLS. And there is a reference to “spatial and temporal correlation” in the Summary. But other than this, I was a bit confused and often left wondering about the many other contexts, definitions and cases that “dependency” could cover. So a crisper setup and definition for such a central concept would really benefit the paper and help situate the reader right off the bat.

Though in the vignette, I don’t get the purpose of the following in the RFGLS_estimate_timeseries.Rd manual page:

rmvn <- function(n, mu = 0, V = matrix(1)){
  p <- length(mu)
  if(any(is.na(match(dim(V),p))))
    stop("Dimension not right!")
  D <- chol(V)
  t(matrix(rnorm(n*p), ncol=p)\%*\%D + rep(mu,rep(n,p)))
}

I couldn’t see anywhere rmvn was called. Could’ve missed something.
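
For context, a helper of this shape is usually there to simulate correlated data for an example. Below is a minimal sketch of how it could be called, assuming an AR(1)-style covariance made up purely for illustration (`rho`, `V`, and `draws` are not taken from the package). The `\%*\%` in the .Rd source is just the Rd-escaped form of `%*%`.

```r
# rmvn as quoted above, with the Rd escape \%*\% written as plain %*%.
rmvn <- function(n, mu = 0, V = matrix(1)) {
  p <- length(mu)
  if (any(is.na(match(dim(V), p))))
    stop("Dimension not right!")
  D <- chol(V)  # upper-triangular factor with t(D) %*% D == V
  t(matrix(rnorm(n * p), ncol = p) %*% D + rep(mu, rep(n, p)))
}

set.seed(42)
p <- 3
rho <- 0.5                                   # hypothetical AR(1) correlation
V <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
draws <- rmvn(20000, mu = rep(0, p), V = V)  # p x n matrix: one draw per column

dim(draws)
max(abs(cov(t(draws)) - V))                  # small if the draws follow V
```

If the manual page never calls `rmvn`, the definition is probably left over from an earlier example and could be dropped.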

I could imagine core functions (e.g., RFGLS_estimate_spatial) being slow with big data sets. On replicating some of the parts of the vignette, it was pretty fast. But perhaps wrapping computation in a progress bar would be a nice UI addition. For example, something like:

RFGLS_estimate_spatial <- function(coords, y, X, Xtest = NULL, nrnodes = NULL, nthsize = 20, mtry = 1, pinv_choice = 1, n_omp = 1, ntree = 50, h = 1,
                                   sigma.sq = 1, tau.sq = 0.1, phi = 5, nu = 0.5, n.neighbors = 15, cov.model = "exponential", search.type = "tree",
                                   param_estimate = FALSE, verbose = FALSE){

progressr::with_progress( # start progress bar here via `progressr`

  n <- nrow(coords)
  nsample <- n
  if(is.null(nrnodes)){
    nrnodes <- 2 * nsample + 1
  }

  if(is.null(Xtest)){
    Xtest <- X
  }
  if(ncol(Xtest) != ncol(X)){ stop(paste("error: Xtest must have ",ncol(X)," columns\n"))}

  if(param_estimate){
    sp <- randomForest(X, y, nodesize = nthsize)
    sp_input_est <- predict(sp, X)
    rf_residual <- y - sp_input_est
    if(verbose){
      cat(paste(("----------------------------------------"), collapse="   "), "\n"); cat(paste(("\tParameter Estimation"), collapse="   "), "\n"); cat(paste(("----------------------------------------"), collapse="   "), "\n")
    }
    est_theta <- BRISC_estimation(coords, x = matrix(1,n,1), y = rf_residual, verbose = verbose, cov.model = cov.model)
    sigma.sq <- est_theta$Theta[1]
    tau.sq <- est_theta$Theta[2]
    phi <- est_theta$Theta[3]
    if(cov.model =="matern"){
      nu <- est_theta$Theta[4]
    }
  }

  cov.model.names <- c("exponential","spherical","matern","gaussian")
  cov.model.indx <- which(cov.model == cov.model.names) - 1
  storage.mode(cov.model.indx) <- "integer"

  ##Parameter values
  if(cov.model!="matern"){
    initiate <- c(sigma.sq, tau.sq, phi)
    names(initiate) <- c("sigma.sq", "tau.sq", "phi")
  }
  else{
    initiate <- c(sigma.sq, tau.sq, phi, nu)
    names(initiate) <- c("sigma.sq", "tau.sq", "phi", "nu")}

  alpha.sq.starting <- sqrt(tau.sq/sigma.sq)
  phi.starting <- sqrt(phi)
  nu.starting <- sqrt(nu)

  storage.mode(alpha.sq.starting) <- "double"
  storage.mode(phi.starting) <- "double"
  storage.mode(nu.starting) <- "double"

  search.type.names <- c("brute", "tree")
  if(!search.type %in% search.type.names){
    stop("error: specified search.type '",search.type,"' is not a valid option; choose from ", paste(search.type.names, collapse=", ", sep="") ,".")
  }
  search.type.indx <- which(search.type == search.type.names)-1
  storage.mode(search.type.indx) <- "integer"

  ##Option for Multithreading if compiled with OpenMp support
  n.omp.threads <- as.integer(n_omp)
  storage.mode(n.omp.threads) <- "integer"

  fix_nugget <- 1
  ##type conversion
  storage.mode(n) <- "integer"
  storage.mode(coords) <- "double"
  storage.mode(n.neighbors) <- "integer"
  storage.mode(verbose) <- "integer"

  if(verbose){
    cat(paste(("----------------------------------------"), collapse="   "), "\n"); cat(paste(("\tRFGLS Model Fitting"), collapse="   "), "\n"); cat(paste(("----------------------------------------"), collapse="   "), "\n")
  }

  res_BF <- .Call("RFGLS_BFcpp", n, n.neighbors, coords, cov.model.indx, alpha.sq.starting, phi.starting, nu.starting, search.type.indx, n.omp.threads, verbose, PACKAGE = "RandomForestsGLS")
  res_Z <- .Call("RFGLS_invZcpp", as.integer(length(res_BF$nnIndxLU)/2), as.integer(res_BF$nnIndx), as.integer(res_BF$nnIndxLU), as.integer(rep(0, length(res_BF$nnIndxLU)/2)), as.integer(0*res_BF$nnIndx), as.integer(rep(0, length(res_BF$nnIndxLU)/2 + 1)), as.integer(rep(0, length(res_BF$nnIndxLU)/2)), PACKAGE = "RandomForestsGLS")

  p <- ncol(X)
  storage.mode(p) <- "integer"
  storage.mode(nsample) <- "integer"

  storage.mode(nthsize) <- "integer"
  if(is.null(nrnodes)){
    nrnodes <- 2 * nsample + 1
  }
  storage.mode(nrnodes) <- "integer"

  storage.mode(mtry) <- "integer"
  treeSize <- 0
  storage.mode(treeSize) <- "integer"

  storage.mode(pinv_choice) <- "integer"

  ntest <- nrow(Xtest)
  storage.mode(ntest) <- "integer"
  if(is.null(h)){h <- 1}

  q <- 0
  storage.mode(q) <- "integer"

  local_seed <- sample(.Random.seed, 1)

  if(h > 1){
    cl <- makeCluster(h)
    clusterExport(cl=cl, varlist=c("X", "y", "res_BF", "res_Z", "mtry", "n", "p",
                                   "nsample", "nthsize", "nrnodes", "treeSize", "pinv_choice", "Xtest", "ntest",
                                   "n.omp.threads", "RFGLS_tree", "q", "local_seed"),envir=environment())
    if(verbose == TRUE){
      cat(paste(("----------------------------------------"), collapse="   "), "\n"); cat(paste(("\tRF Progress"), collapse="   "), "\n"); cat(paste(("----------------------------------------"), collapse="   "), "\n")
      pboptions(type = "txt", char = "=")
      result <- pblapply(1:ntree,RFGLS_tree, X, y, res_BF, res_Z, mtry, n, p,
                         nsample, nthsize, nrnodes, treeSize, pinv_choice, Xtest, ntest,
                         n.omp.threads, q, local_seed, cl = cl)
    }
    if(verbose != TRUE){result <- parLapply(cl,1:ntree,RFGLS_tree, X, y, res_BF, res_Z, mtry, n, p,
                                            nsample, nthsize, nrnodes, treeSize, pinv_choice, Xtest, ntest,
                                            n.omp.threads, q, local_seed)}
    stopCluster(cl)
  }
  if(h == 1){
    if(verbose == TRUE){
      cat(paste(("----------------------------------------"), collapse="   "), "\n"); cat(paste(("\tRF Progress"), collapse="   "), "\n"); cat(paste(("----------------------------------------"), collapse="   "), "\n")
      pboptions(type = "txt", char = "=")
      result <- pblapply(1:ntree,RFGLS_tree, X, y, res_BF, res_Z, mtry, n, p,
                         nsample, nthsize, nrnodes, treeSize, pinv_choice, Xtest, ntest,
                         n.omp.threads, q, local_seed)
    }

    if(verbose != TRUE){
      result <- lapply(1:ntree,RFGLS_tree, X, y, res_BF, res_Z, mtry, n, p,
                       nsample, nthsize, nrnodes, treeSize, pinv_choice, Xtest, ntest,
                       n.omp.threads, q, local_seed)
    }
  }

  RFGLS_out <- list()
  RFGLS_out$P_matrix <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$P_index))
  RFGLS_out$predicted_matrix <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$ytest))
  RFGLS_out$predicted <- rowMeans(RFGLS_out$predicted_matrix)
  RFGLS_out$X <- X
  RFGLS_out$y <- y
  RFGLS_out$coords <- coords
  RFGLS_out$RFGLS_object <- list()
  RFGLS_out$RFGLS_object$ldaughter <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$lDaughter))
  RFGLS_out$RFGLS_object$rdaughter <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$rDaughter))
  RFGLS_out$RFGLS_object$nodestatus <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$nodestatus))
  RFGLS_out$RFGLS_object$upper <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$upper))
  RFGLS_out$RFGLS_object$avnode <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$avnode))
  RFGLS_out$RFGLS_object$mbest <- do.call(cbind, lapply(1:ntree, function(i) result[[i]]$mbest))

) # close progress bar here

  return(RFGLS_out)
}

If you like this, happy to open a PR and drop it in the functions for each if it would help. Let me know.
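
To make the idea concrete without adding a dependency, here is a minimal base-R sketch that uses `utils::txtProgressBar` instead of `progressr` (a swapped-in technique, not what the snippet above uses). `fit_one_tree` and `fit_with_progress` are hypothetical stand-ins, not package functions. Two caveats about the snippet above: the pasted function already shows a bar through `pblapply` when `verbose = TRUE`, and the `with_progress(` call would need its body wrapped in braces to parse.

```r
# Hypothetical stand-in for the per-tree work; NOT the package's RFGLS_tree.
fit_one_tree <- function(i, X, y) {
  idx <- sample(nrow(X), replace = TRUE)  # bootstrap resample of the rows
  mean(y[idx])                            # placeholder for a fitted tree's output
}

# Hypothetical wrapper: runs ntree fits and ticks a text bar after each one.
fit_with_progress <- function(X, y, ntree = 50, verbose = TRUE) {
  if (verbose) {
    pb <- utils::txtProgressBar(min = 0, max = ntree, style = 3)
    on.exit(close(pb), add = TRUE)        # always close the bar on exit
  }
  result <- lapply(seq_len(ntree), function(i) {
    out <- fit_one_tree(i, X, y)
    if (verbose) utils::setTxtProgressBar(pb, i)
    out
  })
  unlist(result)
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- rnorm(100)
pred <- fit_with_progress(X, y, ntree = 20, verbose = FALSE)
```

Tying the bar to the existing `verbose` flag keeps silent runs silent, which matters when the function is called inside scripts or parallel workers.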

whedon commented 3 years ago

:wave: @mnwright, please update us on how your review is going (this is an automated reminder).

whedon commented 3 years ago

:wave: @pdwaggoner, please update us on how your review is going (this is an automated reminder).

pdwaggoner commented 3 years ago

Finished mine a while ago (14 days). See above in this thread

Ping @ArkajyotiSaha and @fabian-s

fabian-s commented 3 years ago

@ArkajyotiSaha while we wait for @mnwright to start their review, could you please address @pdwaggoner's points/questions/remarks from their comment above?

ArkajyotiSaha commented 3 years ago

Sounds great! I am working on addressing @pdwaggoner's comments and will let @fabian-s and @pdwaggoner know once I am done with them!

mnwright commented 3 years ago

I think this is a very useful extension of random forests and a promising package. The examples where the method outperforms standard RF are quite impressive! I have a few general questions, some on the package and some on the paper:

General

From what I understand, there are two major differences to standard RF: The bootstrap procedure and the splitting rule. Why not take an existing RF package such as randomForest or ranger and make these changes instead of setting up a new package "borrowing some code from randomForest"?
Fitting RF-GLS is slower than standard RF. How much is it slower? How does it scale with the number of observations, covariates or other data or model parameters?
Is a real data example available? That would be of interest for the method itself (not the focus here) but also for the package to see how it scales and for which real data purpose it can be used.

Package

The C++ code is not documented/commented well and hard to understand.
The DESCRIPTION file still contains a link to arXiv, not the published paper.
The README is missing a link to the paper.
Typo in README: criterion.
Tests just run examples and check output types/sizes. That could be improved with more tests and tests that check for correct output.
Is any kind of continuous integration used? I think it is useful to at least run the tests with each commit/PR.
Maybe too late to change that, but I think the package name is not a great choice. For example, at first try, I typed "randomForestGLS", then capitalized to "RandomForestGLS" and finally corrected to "RandomForestsGLS". It's also quite long and you have to remember the capitalization.

Paper:

The JASA paper is called "Random Forests for Spatially Dependent Data", the software has the same name without "Spatially". Are additional types of dependencies covered by the software, not described in the original JASA paper? If yes, please detail in the JOSS paper.
line 8: Should be "in these models"
lines 14-15: "hence is not optimal in mixed-model approach". I don't understand this. Wouldn't RF be used as an alternative to the mixed model and not IN the mixed model approach?
lines 18-19: Avoid linebreak in package name
line 63: for or model correlation?
line 176: optimizing a cost function (missing a)
line 211: of the RF-GLS method (missing the)
line 216: "Efficient implementation thorough" should be through?
line 217: Maybe remove "clever"?
References: Datta et al. is in title case, others in sentence case.
In general, many spelling errors, missing articles, etc.

fabian-s commented 2 years ago

@ArkajyotiSaha what's your timeline for addressing our reviewers' comments?

ArkajyotiSaha commented 2 years ago

@fabian-s I am working on the revisions and am almost done with them. I plan to submit them by the end of the Thanksgiving weekend (29th Nov). Please let me know if the timeline works for you. Thanks!

fabian-s commented 2 years ago

great, thanks for the update.

ArkajyotiSaha commented 2 years ago

@fabian-s, @pdwaggoner, @mnwright We thank the Editor and the reviewers for their positive feedback and thoughtful comments, which have helped to improve the manuscript. We have tried to address all the reviewer comments in the software and the paper. Updated versions of the package and the paper are available in the associated GitHub repository (https://github.com/ArkajyotiSaha/RandomForestsGLS). A detailed point-by-point response letter is available at https://github.com/ArkajyotiSaha/RandomForestsGLS/blob/main/JOSS_authors_response_letter.pdf . The response letter is divided into two sections, with each section addressing the comments of one of the reviewers (Section 1: @pdwaggoner ; Section 2: @mnwright). Please let me know if I can provide any additional information. Thanks for your time and consideration!

fabian-s commented 2 years ago

@whedon generate pdf

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

pdwaggoner commented 2 years ago

Satisfied. Well done!

fabian-s commented 2 years ago

@mnwright please let us know if you see any remaining points that need to be addressed.

mnwright commented 2 years ago

Thanks for the extensive revision. It looks fine except one thing: I still cannot find the comments in the .cpp files.

ArkajyotiSaha commented 2 years ago

@mnwright Extremely sorry! Thanks so much for pointing this out. I somehow missed replacing RFGLS.cpp on GitHub with the commented version. I have now updated it, and the new comments can be found in the updated version of RFGLS.cpp. For your convenience, I am also adding a link to the latest edit history, which highlights the comments, at https://github.com/ArkajyotiSaha/RandomForestsGLS/commit/8e1b11236274844ec92be371d10d8d08dfd33844 . Please let me know if this works. Thanks again!

fabian-s commented 2 years ago

@whedon generate pdf

fabian-s commented 2 years ago

@whedon check references

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 2 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1080/01621459.2021.1950003 is OK
- 10.1007/bf00058655 is OK
- 10.1023/A:1010933404324 is OK
- 10.7717/peerj.5518 is OK
- 10.1080/10106049.2019.1595177 is OK
- 10.1016/j.najef.2018.06.013 is OK
- 10.1080/01621459.2015.1044091 is OK
- 10.1109/99.660313 is OK
- 10.1201/9781315139470 is OK

MISSING DOIs

- None

INVALID DOIs

- None

fabian-s commented 2 years ago

@ArkajyotiSaha It seems both reviewers are now satisfied with the software and the paper, thank you for taking on these revisions.

However, JOSS submissions are typically 500-1000 words, while your article currently has >3000 words. Please shorten as much as you can. Especially l. 180 - 260 contain much more technical and mathematical detail than we expect for a JOSS paper, please remove anything that is not a high-level summary of what the software offers from a user perspective as well as anything that is also contained in more detail in your package vignettes or your methodological papers.

ArkajyotiSaha commented 2 years ago

Thank you so much @fabian-s . I am working on the article to shorten it such that it doesn't take anything away from a high-level summary, but does away with the intricate technical and mathematical details by referring to existing associated literature in package vignettes/methodological paper/github repo description of the package. Will update you once I am done with the update, thanks!

ArkajyotiSaha commented 2 years ago

@fabian-s I have updated the text. We have moved most of lines 180 - 260 to the package README. Since the analysis of time series data is an important part of the package and is not highlighted in the methodological paper, we wanted to keep it in this article. To further reduce the length of the article, we have also moved the discussion of unknown parameter estimation to the README. Please let me know if this works, thanks!

fabian-s commented 2 years ago

@whedon generate pdf

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

fabian-s commented 2 years ago

Thanks @ArkajyotiSaha , this is much better already but we're still at ~2000 words.

I think by consistently cutting out all verbosity and rhetorical flourishes (e.g. by following some of the tips here: https://redwoodink.com/resources/10-tricks-to-reduce-your-word-count-in-academic-writing) you could easily reduce that by another 20-40% without losing any relevant content -- could I ask you to try to do so? I'm sorry for all this back and forth, I should have raised this problem much sooner during the review.

As an example of the kind of editing you could do, consider my attempt at making your Discussion section more concise:

Original (240 words):

In this package, we have developed an efficient, parallel implementation of the RF-GLS method proposed in @saha2021random. This accounts for the correlation in the data by incorporating it in the node splitting criteria and the node representative update rule. The package accounts for spatial correlation, modeled with Matérn GP and serial autocorrelation. More often than not, the model parameters are unknown to the users, hence the package has inbuilt parameter estimation corresponding to both the covariance structures. Efficient implementation through C/C++ takes advantage of the NNGP approximation and scalable node splitting update rules which reduces the execution time. For details regarding package features, unknown parameter estimation and parallelization, we refer the readers to the package README. Present implementation of RF-GLS is slower than that of RF due to the additional computational complexity associated with the split criteria evaluation. In RF-GLS, evaluation of cost function, corresponding to a potential split has $O(t^3)$ computational complexity, where $t$ is the number of leaf nodes at the present split. For very deep trees, this would lead to significant added computational burden over RF, where this step has $O(1)$ complexity. As a future extension of the package, implementation of covariate specific parallel optimization corresponding to each node splitting and improving overall computational complexity of the algorithm can be of independent research interest. We also plan to implement additional forms of time series dependency and perform thorough empirical validation, prior to making it available in the software.

Concise (135 words):

This package provides an efficient, parallel implementation of the RF-GLS method proposed in @saha2021random which accounts for correlated data with modified node splitting criteria and node representative update rule. The package accounts for spatial correlation via Matérn GP or serial autocorrelation. It provides parameter estimation for both covariance structures. Efficient implementation through C/C++ takes advantage of the NNGP approximation and scalable node splitting update rules which reduces the execution time. More details on package features, parameter estimation and parallelization can be found in the package README. Since the computational complexity of evaluating potential splits is cubic in the number of leaf nodes for RF-GLS, but constant for standard RF, improving the computational complexity of the algorithm is of independent research interest. We also plan to implement and validate additional forms of time series dependency.

ArkajyotiSaha commented 2 years ago

@fabian-s Thank you so much for the detailed example and the reference! I will take a second pass through the paper and reduce it significantly. Will update you once I am done. Thanks again!

fabian-s commented 2 years ago

@ArkajyotiSaha please let us know how much time you'll need for your edits.

ArkajyotiSaha commented 2 years ago

@fabian-s I am done with the initial edits (which have reduced the word count ~800 words as per your recommendations) and presently working on finalizing them. Hope to submit the final version by the end of the next week!

fabian-s commented 2 years ago

@ArkajyotiSaha please let us know about your progress

ArkajyotiSaha commented 2 years ago

@fabian-s I completed my edits and sent them to my coauthors for comments. I am waiting to hear back from them. I have also sent them a reminder. Hope to submit the final version by end of the week.

ArkajyotiSaha commented 2 years ago

@fabian-s I have updated the manuscript accordingly, which has resulted in a significant reduction in manuscript length. Please let me know your thoughts on this.

fabian-s commented 2 years ago

@whedon generate pdf

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

fabian-s commented 2 years ago

Thanks -- almost done here, finally! Please take a look at the typos / grammatical errors I fixed at https://github.com/ArkajyotiSaha/RandomForestsGLS/pull/1 and incorporate them as necessary/sensible. Then please run another spell check, proofread closely, and ping me again.

ArkajyotiSaha commented 2 years ago

@fabian-s I have incorporated your corrections. Thanks so much for them! I have also gone through the manuscript for another round of proofreading and made some minor corrections.

fabian-s commented 2 years ago

@whedon generate pdf

fabian-s commented 2 years ago

@whedon check references

whedon commented 2 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1080/01621459.2021.1950003 is OK
- 10.1007/bf00058655 is OK
- 10.1023/A:1010933404324 is OK
-  10.7717/peerj.5518 is OK
- 10.1080/10106049.2019.1595177 is OK
- 10.1016/j.najef.2018.06.013 is OK
- 10.1080/01621459.2015.1044091 is OK
- 10.1109/99.660313 is OK
- 10.1201/9781315139470 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

fabian-s commented 2 years ago

:partying_face: @ArkajyotiSaha

Wonderful -- at this point could you:

[x] Make a tagged release of your software, and list the version tag of the archived version here.
[x] Archive the reviewed software in Zenodo or a similar service (e.g., figshare, an institutional repository)
[ ] Check the archival deposit (e.g., in Zenodo) has the correct metadata. This includes the title (should match the paper title) and author list (make sure the list is correct and people who only made a small fix are not on it). You may also add the authors' ORCID.
[x] Please list the DOI of the archived version here.

I can then move forward with accepting the submission.

ArkajyotiSaha commented 2 years ago

I made minor updates in the package vignette and updated the package version.

The version tag of the tagged release in GitHub is v0.1.4 .
The archived software in Zenodo has the DOI 10.5281/zenodo.6192456 .

Please let me know if any additional information is needed.

fabian-s commented 2 years ago

@ArkajyotiSaha thanks, almost done,

please make sure the title of your Zenodo deposit matches the title of your JOSS paper -- it's

"ArkajyotiSaha/RandomForestsGLS: v0.1.4"

right now, it should be

"RandomForestsGLS: An R package for Random Forests for dependent data"
