Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type `@ropensci-review-bot help` for help. :rocket:
The following problem was found in your submission template:
:wave:
git hash: 43a9f0a2
Package License: MIT + file LICENSE
rOpenSci Statistical Standards (`srr` package)

This package is in the following category:
:heavy_check_mark: All applicable standards [v0.1.0] have been documented in this package
Click here to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report()
function from within a local clone of the repository.
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 11 files)
- 1 author
- 3 vignettes
- 8 internal data files
- 11 imported packages
- 4 exported functions (median 41 lines of code)
- 30 non-exported functions in R (median 7 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    11|       59.3|           |
|files_vignettes          |     2|       83.0|           |
|files_tests              |     8|       85.7|           |
|loc_R                    |   650|       52.9|           |
|loc_vignettes            |   162|       64.9|           |
|loc_tests                |   939|       83.5|           |
|num_vignettes            |     3|       93.1|           |
|data_size_total          | 66301|       81.2|           |
|data_size_median         |   509|       59.9|           |
|n_fns_r                  |    34|       35.6|           |
|n_fns_r_exported         |     4|       15.6|           |
|n_fns_r_not_exported     |    30|       42.6|           |
|n_fns_per_file_r         |     3|       42.7|           |
|num_params_per_fn        |     4|       67.6|           |
|loc_per_fn_r             |    10|       40.0|           |
|loc_per_fn_r_exp         |    41|       75.3|           |
|loc_per_fn_r_not_exp     |     7|       29.9|           |
|rel_whitespace_R         |    13|       43.9|           |
|rel_whitespace_vignettes |     0|        0.0|TRUE       |
|rel_whitespace_tests     |     7|       84.7|           |
|doclines_per_fn_exp      |    55|       68.2|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |     9|       22.0|           |

---
Interactive network visualisation of calls between objects in package can be viewed by clicking here
`goodpractice` and other checks

#### 3a. Continuous Integration Badges

[![github](https://github.com/joelnitta/canaper/workflows/R-CMD-check/badge.svg)](https://github.com/joelnitta/canaper/actions)

**GitHub Workflow Results**

|name            |conclusion |sha    |date       |
|:---------------|:----------|:------|:----------|
|pkgdown         |success    |43a9f0 |2021-10-26 |
|R-CMD-check     |success    |43a9f0 |2021-10-26 |
|Render codemeta |failure    |43a9f0 |2021-10-26 |
|test-coverage   |success    |43a9f0 |2021-10-26 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

rcmdcheck found no errors, warnings, or notes

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 100

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following functions have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
cpr_rand_test | 29
calc_biodiv_random | 16

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 353 potential issues:

message | number of times
--- | ---
Lines should not be more than 80 characters. | 353
|package  |version   |
|:--------|:---------|
|pkgstats |0.0.2.16  |
|pkgcheck |0.0.2.86  |
|srr      |0.0.1.120 |
This package is in top shape and may be passed on to a handling editor
@ropensci-review-bot seeking reviewers
I'm sorry @tdhock, I'm afraid I can't do that. That's something only editors are allowed to do.
@ropensci-review-bot assign @tdhock as editor
Assigned! @tdhock is now the editor
I forgot to mention, I plan to submit this as a paper to Methods in Ecology and Evolution after ROpenSci review (have also added this to the original description)
@ropensci-review-bot assign @KlausVigo as reviewer
@tdhock Only just realised the command there has the wrong syntax - it should be `add <@name> to reviewers`, like this example. Could you please try again? Nevertheless, good that you got this one wrong, because it made us realise we need the bot to tell you it didn't understand. Thanks!
@ropensci-review-bot add @KlausVigo to reviewers

By the way, it would be more user-friendly to have the add editor and add reviewer commands be consistent / analogous; I guess that is what you are doing in https://github.com/ropensci-org/buffy/pull/42?
sorry @tdhock, another bot glitch there - bot commands have to be kept as just that, with no additional comments. And yes, the point of the above-linked issue is to aid consistency between analogous commands, so that will hopefully improve soon. In the meantime, could you please try again, again, with just the command? With due apologies and gratitude! :bowing_man:
@ropensci-review-bot add @KlausVigo to reviewers
Can't assign reviewer because there is no editor assigned for this submission yet
@tdhock sorry for all the hiccups! The initial issue body had been edited after your username was added so your username was no longer in the string <!--editor--> @tdhock<!--end-editor-->
I edited it again so you can try again. Thanks for your patience!
@ropensci-review-bot add @KlausVigo to reviewers
@KlausVigo added to the reviewers list. Review due date is 2021-11-24. Thanks @KlausVigo for accepting to review! Please refer to our reviewer guide.
@KlausVigo: If you haven't done so, please fill this form for us to update our reviewers records.
@ropensci-review-bot add @luismurao to reviewers
@luismurao added to the reviewers list. Review due date is 2021-12-21. Thanks @luismurao for accepting to review! Please refer to our reviewer guide.
@luismurao: If you haven't done so, please fill this form for us to update our reviewers records.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation, including a DESCRIPTION with `URL`, `BugReports`, and `Maintainer` (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: ~9
[ ] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
The `canaper` package provides functions to infer the evolutionary processes underlying neoendemism and paleoendemism. The software is an R implementation of the Categorical Analysis of Neo- and Paleo-endemism (CANAPE), which was previously only available in the Biodiverse software.
I enjoyed reviewing the package, but I found some things that might improve the package.
Parallelization currently has to be set up by the user outside the package functions (e.g. `future::plan(future::multisession)`). I think it would be better to have parameters for parallelization in the functions that allow running the analysis in parallel (so the calls are done inside them).

Parallelization calls as in the current version of the package:
library(canaper)   # This package :)
library(future)    # For parallel computing
library(tictoc)    # For timing things
library(patchwork) # For composing multi-part plots
library(tidyverse) # For data wrangling and plotting

data(acacia)

ncores <- 2 # Number of parallel workers (not defined in the original snippet)
plan(multisession, workers = ncores)

set.seed(071421)

acacia_rand_res <- cpr_rand_test(
  acacia$comm, acacia$phy,
  null_model = "curveball",
  n_reps = 100, n_iterations = 100000,
  tbl_out = TRUE
)

plan(sequential)
I think this could be more appropriate and easier to use:
acacia_rand_res <- cpr_rand_test(
  acacia$comm, acacia$phy,
  null_model = "curveball",
  n_reps = 100, n_iterations = 100000,
  tbl_out = TRUE,
  parallel = TRUE,      # Logical flag to run processes in parallel
  nworkers = 3,         # Number of threads
  random_seed = 071421  # Random seed
)
Instead of suggesting the `future` package, I would import it.

There is a potential conflict in package dependencies: `canaper` depends on R (>= 3.5.0), but it imports `phangorn` version 2.8.1, which depends on R (>= 4.1.0). Please change `Depends: R (>= 3.5.0)` to `Depends: R (>= 4.1.0)` in the DESCRIPTION file, or provide a way to install an earlier version of `phangorn` (version 2.5.5 should work).
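For readers on R < 4.1 who would take the second option, a hedged sketch of installing the earlier phangorn release (this assumes the remotes package; it is not part of the review itself):

# Sketch only: install the earlier phangorn release mentioned above
install.packages("remotes")
remotes::install_version("phangorn", version = "2.5.5")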
`devtools::check()` takes more than six minutes. Building the vignette outputs takes a lot of time (5m 5.2s). Consider using fewer replicates in examples (i.e. `n_reps` of `canaper::cpr_rand_test()`).
══ Warnings ════════════════════════════════════════════════════════════════════
── Warning (test-utils.R:143:3): Functions copied from phyloregion work ────────
rgeos: versions of GEOS runtime 3.10.1-CAPI-1.16.0
and GEOS at installation 3.9.1-CAPI-1.14.2 differ
Backtrace:
1. testthat::expect_equal(dense2sparse(biod_example$comm), phyloregion::dense2sparse(biod_example$comm)) test-utils.R:143:2
4. base::loadNamespace(x)
7. base::loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]])
8. base:::runHook(".onLoad", env, package.lib, package)
13. rgeos:::fun(libname, pkgname)
`devtools::test()` finished with no fails and one warning:
────────────────────────────────────────────────────────────────────
Warning (test-utils.R:143:3): Functions copied from phyloregion work
rgeos: versions of GEOS runtime 3.10.1-CAPI-1.16.0
and GEOS at installation 3.9.1-CAPI-1.14.2 differ
Backtrace:
1. testthat::expect_equal(dense2sparse(biod_example$comm), phyloregion::dense2sparse(biod_example$comm)) test-utils.R:143:2
4. base::loadNamespace(x)
7. base::loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]])
8. base:::runHook(".onLoad", env, package.lib, package)
13. rgeos:::fun(libname, pkgname)
────────────────────────────────────────────────────────────────────
══ Results ═════════════════════════════════════════════════════════
Duration: 71.8 s
[ FAIL 0 | WARN 1 | SKIP 0 | PASS 255 ]
`devtools::spell_check()`: change "signficance" to "significance" in line 132 of canape.Rmd (`signficance  canape.Rmd:132`).
When benchmarking, I found a strange behavior: when using more than four threads, the program takes more time to finish a randomization test. Here is the code I used (could you please check it? I might be missing something).
library(canaper)   # For cpr_rand_test() and the acacia data
library(tidyverse) # Includes purrr and ggplot2
library(tictoc)    # For timing
library(future)    # For parallel computing

data(acacia)
acom <- acacia$comm
acom[acom > 0] <- 1 # Convert abundances to presence/absence

n_reps_vec <- c(100, 500, 1000, 1500, 2000)
nworkers <- 1:7

rand_times <- seq_along(n_reps_vec) %>% purrr::map_dfr(function(nrep) {
  rand_t <- nworkers %>% purrr::map_dfr(function(nwork) {
    if (nwork > 1) {
      plan(multisession, workers = nwork)
    }
    set.seed(071421)
    tictoc::tic()
    # Run randomization test
    acacia_rand_res <- canaper::cpr_rand_test(
      acom, acacia$phy,
      null_model = "curveball",
      n_reps = n_reps_vec[nrep],
      n_iterations = 100000,
      tbl_out = TRUE
    )
    aa <- tictoc::toc()
    r1 <- data.frame(
      n_reps = n_reps_vec[nrep],
      nworkers = nwork,
      time_secs = aa$toc - aa$tic
    )
    plan(sequential)
    return(r1)
  })
  return(rand_t)
})

ggplot(rand_times, aes(x = nworkers, y = time_secs, color = as.factor(n_reps))) +
  geom_line() +
  theme_classic()
I was wondering whether `canaper` already has all the methods of the CANAPE version of Biodiverse. Could you write something about it in the README?

@joelnitta please do not hesitate to contact me if you have any questions. Thanks for developing `canaper`.
@luismurao thanks so much for your review! I will try to address your comments soon.
@KlausVigo it would be great to have your comments as well so I can make changes incorporating both sets of reviews.
@KlausVigo, do you still intend to review this package?
If not, @tdhock can you please assign a new reviewer, as it is now multiple months past the due date.
Thanks!
Hi @joelnitta! I will finish it today.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation, including a DESCRIPTION with `URL`, `BugReports`, and `Maintainer` (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: 12
The canaper package provides functions to conduct categorical analysis of neo- and paleo-endemism (CANAPE). It is a reimplementation of the Biodiverse software, but there are some functions with a similar scope in the picante package (`ses.pd`) and in the PhyloMeasures package for Faith's phylogenetic diversity.
The canaper package builds on functions from the vegan package to resample (dense) community matrices and borrows biodiversity measures (Faith's phylogenetic diversity and phylogenetic endemism) from the phyloregion package. The picante package also has some functions for swapping (community) matrices similar to vegan. A comparison between the packages would be useful.
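To make the comparison concrete, here is a minimal standalone sketch (mine, not code from the review or from canaper) of the vegan null-model engine that canaper's randomizations build on; it assumes the acacia data shipped with canaper:

# Define a null model for a (binarized) community matrix and draw
# randomized matrices from it
library(canaper) # for the acacia example data
library(vegan)   # nullmodel() and commsim methods such as "curveball"

data(acacia)
comm <- as.matrix(acacia$comm)
comm[comm > 0] <- 1 # curveball is a binary (presence/absence) algorithm

nm <- nullmodel(comm, "curveball")
rand <- simulate(nm, nsim = 2, thin = 1) # array: sites x species x nsim
dim(rand)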
The functions from the phyloregion package that are used in canaper should be acknowledged properly. There is a reference in the README and in the source stating that the biodiversity measures are from the phyloregion package. However, tools like depsy, which trace back contributions, will likely check only the DESCRIPTION file for imported packages, contributions, etc. If functions are copied, the CRAN Repository Policy suggests acknowledging contributors (`ctb`) in the Author field. I opened an issue to improve the development guide (https://github.com/ropensci/dev_guide/issues/388) with some additional information. Copying functions can additionally cause copyright problems. An alternative is importing the functions from the package directly.
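As a purely illustrative sketch of the `ctb` suggestion, with placeholder names rather than canaper's actual author list, an `Authors@R` field could look like this:

# Placeholder names only; not a statement about canaper's DESCRIPTION
c(
  person("Package", "Author", role = c("aut", "cre"),
         email = "author@example.com"),
  person("Upstream", "Contributor", role = "ctb",
         comment = "author of functions adapted from phyloregion")
)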
In contrast to the other reviewer, I really like the use of the `future` package. Parallelization depends on the hardware, the operating system, and other things such as whether the functions run in a terminal or a GUI. To avoid crashes, this needs lots of internal checks and guessing, and often it will not lead to the best result. I prefer to give users the power to adapt the parallelization to their specific needs.
In my opinion, the main bottleneck for this kind of analysis is memory consumption, and only to a minor extent computing time. Dense community matrices are now frequently getting too large to hold in memory (RAM). Luckily, community matrices are usually extremely sparse, and this allows efficient coding. For example, the acacia data set, which comes with canaper, has less than 2% non-zero cells, and a sparse representation takes only 1/20 of the memory of the dense representation.
> library(canaper)
> library(Matrix)
> data("acacia")
> dim(acacia$comm)
[1] 3037 506
> comm <- as.matrix(acacia$comm)
> sum(comm > 0) / prod(dim(comm))
[1] 0.01898522
> object.size(comm)
12597112 bytes
> object.size(Matrix(comm, sparse=TRUE))
656576 bytes
The functions that compute the biodiversity measures make use of the sparsity, but the functions that sample the matrices do not. We developed phyloregion using sparse matrices, which resulted in far lower memory consumption and, as a benefit, turned out to be faster in many cases (https://phyloregion.com/articles/Benchmark.html). With larger data sets we get into a trade-off: we want to use more cores, but this means several copies of the community matrices in memory. Providing a swapping algorithm which can handle sparse matrices would be a huge improvement for the community.
I used the following code for profiling:
library(canaper)
data(acacia)

Rprof(tmp <- tempfile(), memory.profiling = TRUE)
acacia_rand_res <- cpr_rand_test(
  acacia$comm, acacia$phy,
  null_model = "curveball",
  n_reps = 100,
  n_iterations = 100000)
Rprof(NULL) # Turn profiling off
summaryRprof(tmp, memory = "both")
unlink(tmp)
I just highlight a few lines of the output I find interesting:
...
"calc_biodiv_random" 111.54 92.35 29477.9 0.02 0.02
"cpr_rand_comm" 99.56 82.43 22420.0 1.30 1.08
"simulate.nullmodel" 83.18 68.87 11655.6 1.24 1.03
...
"phylo_community" 5.72 4.74 2994.7 1.60 1.32
"phylo_endemism" 5.14 4.26 2876.1 0.02 0.02
"assertthat::assert_that" 4.66 3.86 1867.0 0.10 0.08
...
"PD" 2.56 2.12 1656.1 0.00 0.00
...
Not unexpectedly, the largest amount of time is spent re-sampling the community matrices (`simulate.nullmodel`).
The canaper package uses defensive programming extensively (calling `assertthat::assert_that`), which is generally a very good thing. However, there are some exceptions: the functions `calc_biodiv_random` and `cpr_rand_comm` are only called internally, and are called thousands of times inside the loop. It should be safe to remove all (or most) of the calls to `assertthat::assert_that` from these functions, as the parameters have been checked already and are safe to use! In fact, the algorithm spends a similar amount of time in `assertthat::assert_that` as in computing the biodiversity measures (`phylo_endemism`, `PD`).
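A schematic sketch of the pattern being suggested, using hypothetical function names rather than canaper's internals: validate once in the exported function and keep the hot loop free of assertions.

# Hypothetical functions, for illustration only
library(assertthat)

run_randomizations <- function(comm, n_reps) {
  # Checks run exactly once, up front
  assert_that(is.matrix(comm) || is.data.frame(comm))
  assert_that(is.count(n_reps))
  comm <- as.matrix(comm)
  # The hot loop calls an internal helper that skips re-checking its inputs
  lapply(seq_len(n_reps), function(i) calc_one_rep_unchecked(comm))
}

calc_one_rep_unchecked <- function(comm) {
  # No assert_that() calls here: this runs thousands of times per analysis
  colSums(comm > 0)
}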
When computing several biodiversity metrics at once, a lot of intermediate results are shared and could be reused. This is less important, though, and the current approach makes it easy to extend the `cpr_rand_test` function with additional biodiversity measures.
There are some instances where more than 2 cores are used, e.g. the vignette canape.Rmd, line 83:
plan(multisession, workers = 3)
This could fail on CRAN when submitting: there, the use of at most 2 cores is allowed, and I speak from experience, I got Ripleyed for this once! Again, the CRAN Repository Policy is a useful resource and gives some background information: "Checking the package should take as little CPU time as possible, as the CRAN check farm is a very limited resource and there are thousands of packages. Long-running tests and vignette code can be made optional for checking, but do ensure that the checks that are left do exercise all the features of the package. If running a package uses multiple threads/cores it must never use more than two simultaneously: the check farm is a shared resource and will typically be running many checks simultaneously. Examples should run for no more than a few seconds each: they are intended to exemplify to the would-be user how to use the functions in the package."
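One hedged way to respect that limit in vignette or example code (a sketch, not the current vignette code) is to cap the number of workers explicitly:

# Cap workers at 2 so vignette and example code stays within CRAN's limit
library(future)
n_workers <- min(2, availableCores())
plan(multisession, workers = n_workers)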
@KlausVigo thank you very much for the extensive and helpful review!
I just wanted to address one thing first: the reason I copied functions from phyloregion instead of importing them is because of a dependency of phyloregion (betapart), which has a package in `Suggests` (snow) that was resetting the random number generator, so `cpr_rand_test()` was failing to return identical results when using parallel computing, even when `set.seed()` was used. There is an issue describing the problem in canaper, and a related discussion in the future package.

In the end, I opted to copy the phyloregion functions under the AGPL-3 license, since that fixed the issue and resulted in fewer dependencies. As @HenrikBengtsson mentioned, though, it seems another way to fix this (and import phyloregion normally) would be to use `future.packages = "snow"` with `future.apply::future_lapply()`. Would you prefer that?
(By the way, it is possible that this behavior of snow may cause reproducibility issues with phyloregion as well. I suggest you look into it if you haven't already).
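For concreteness, a hedged sketch of that alternative (the `mean(rnorm(10))` computation is a stand-in, not a canaper internal): pre-load snow on each worker via `future.packages` so its startup code cannot disturb the RNG mid-analysis, and use `future.seed` for parallel-safe, reproducible random numbers.

library(future.apply)
future::plan(future::multisession, workers = 2)

res <- future_lapply(
  seq_len(100),
  function(i) mean(rnorm(10)),
  future.seed = TRUE,       # parallel-safe, reproducible RNG streams
  future.packages = "snow"  # load snow up front on each worker
)

future::plan(future::sequential)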
hi @KlausVigo thanks for your detailed review. can you please tell us approximately how many hours you spent on reviewing? (you left that field blank in your review comment)
@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/475#issuecomment-1002754119 time 9
Logged review for luismurao (hours: 9)
Hi @joelnitta,
I just made a commit to phyloregion, cb819fb. This might solve the problem; at least `sessionInfo()` suggests that `snow` is not loaded via a namespace any more.
Regards, Klaus
@KlausVigo thanks for making that change: I switched back to importing phyloregion, and it passes the tests for consistent results using the same seed in parallel.
I just have one request: would you consider bumping the phyloregion version so that I can require the version with this fix in `Imports`? That will prevent this bug from popping up for others.
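Once such a release exists, the minimum version could be recorded with, for example (the version number below is a placeholder for whichever release contains the fix):

# Records phyloregion as an Import with a minimum version in DESCRIPTION
usethis::use_package("phyloregion", min_version = "1.0.8")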
@joelnitta nice that this works and it seems to simplify things.
I wasn't aware that `importFrom(betapart, beta.pair)` and `betapart::beta.pair` made such a large difference with the NAMESPACE. I naively assumed they were about the same, and the documentation about these issues does not tell you otherwise. @darunabas is the maintainer of phyloregion and has to submit it to CRAN. We probably need a week or so to clean up all changes and get the error messages sorted out, as we haven't updated the package for a while.
@KlausVigo no problem: it is a very subtle error, and one that I would never have resolved without the help of @HenrikBengtsson. As he has mentioned, this is likely a much wider problem across many packages on CRAN.

If you and @darunabas are able to release a new version of phyloregion on CRAN in a reasonable amount of time, that would be great.
@KlausVigo - if you are updating phyloregion, then it would be good to also address https://github.com/darunabas/phyloregion/issues/3
I am wondering if we can move forward on this submission? @KlausVigo @luismurao have the comments/suggestions in your reviews been addressed?
@tdhock Sorry, I am the cause for the delay! Since the reviews came back I've gotten busy with other projects, and haven't been able to get back to this yet. I anticipate being occupied by other projects for another week or two. Please remind me again if you don't hear something soon after that.
Hi @joelnitta and @tdhock ! I'm currently the rotating editor in chief for rOpenSci submissions. I wanted to check in and see if you expect to have time to revisit the review feedback soon? Otherwise, we may want to put this review on hold status
Thanks for checking in @emilyriederer. Sorry for falling behind on this. I am still occupied with other things at the moment, but I should be able to respond to the reviews by the end of August if not sooner.
I am sorry for taking so long to respond to the reviewers. Thanks again to both reviewers for their helpful comments. Below I respond to each comment as appropriate.
[x] Concern about parallelization. I think it would be better to have parameters for parallelization in functions that allow running analysis in parallel (so calls are done inside them).
Response: The second reviewer, @KlausVigo disagreed with this point and was in favor of leaving the current approach to parallelization, so I am making no changes.
[x] Instead of suggesting the `future` package I would import it.
Response: I disagree with this. Parallelization is not required; it is an option to speed up analyses if desired. So I think it makes sense as `Suggests` instead of `Imports`. No changes made.
[x] Community guidelines. Response: These are located at `/.github/CONTRIBUTING.md`. To make it easier to find though, I have added a link to this document from the README. I also added a link to the rOpenSci code of conduct using the suggested wording in the developer guide (d8aaf8de5193166d52ab890d8267090ea2f8ef6a).

[x] Potential conflict in package dependencies (R >= 3.5.0 vs. phangorn requiring R >= 4.1.0). Response: Changed to `Depends: R (>= 4.1.0)` (0f334965fb5a3fe5034eb1bc682b5051548eddb1).

[x] `devtools::check()` takes more than six minutes. Building of vignette outputs takes a lot of time (5m 5.2s). Consider using fewer replicates in examples (i.e. `n_reps` of `canaper::cpr_rand_test`).
Response: I had to include a non-trivial number of replicates to produce reasonable results (e.g., in comparison with Mishler et al. 2014 for the CANAPE example vignette) and to demonstrate performance gains from parallelization. But I was able to decrease the number of replicates in the vignettes (f3af4671d679020aeaa97b1a038b3b5a96633f42). The vignettes now take 3m 40s on my computer to compile.
[x] I got one error after upgrading from GEOS 3.9 to 3.10 in Linux. The issue is related to the phyloregion dependency.

Response: Yes: phyloregion imports `rgeos`, which is going to be retired soon and needs fixing. Correct installation of external dependencies of phyloregion is outside the scope of `canaper`. No change made (for now).
[x] `devtools::test()` finished with no fails and one warning.

Response: On my computer, `devtools::test()` now finishes with no fails or warnings (at 93543d399fd34dd6546e56368986a8afa133f190).
[x] `devtools::spell_check()`: change "signficance" to "significance" in line 132.
Response: Fixed (b04e2e0a8397568bcc615e4aaeb5a0d9b02ad85e)
[x] I was wondering if canaper already has all methods of the CANAPE version of Biodiverse. Could you write something about it in the README?

Response: I have added a section in the README comparing `canaper` with other software (d8aaf8de5193166d52ab890d8267090ea2f8ef6a).
[x] Although the vignettes are very useful and well documented, it would be nice to have methods for plotting spatial data results.
Response: Thanks for the suggestion. I have already included color palettes for plotting results that should be accessible regardless of color vision type. IMHO there are too many customizations that users would want to make to try and implement a one-size-fits-all plotting method. Rather, the results are structured in a way that they should be easily plotted with e.g. ggplot2, and I demonstrate this in the vignettes.
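As a hedged illustration of that point (my sketch, not a vignette excerpt), the classified output can be summarised directly with ggplot2; this assumes `cpr_classify_endem()` adds an `endem_type` column to the `cpr_rand_test()` output:

# Run a small CANAPE analysis and plot counts of endemism categories
library(canaper)
library(dplyr)
library(ggplot2)

data(acacia)
set.seed(12345)

rand_res <- cpr_rand_test(
  acacia$comm, acacia$phy,
  null_model = "curveball",
  n_reps = 20, n_iterations = 1000,
  tbl_out = TRUE
)
canape_res <- cpr_classify_endem(rand_res)

canape_res %>%
  count(endem_type) %>%
  ggplot(aes(x = endem_type, y = n)) +
  geom_col() +
  labs(x = "CANAPE category", y = "Number of sites") +
  theme_minimal()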
[x] There are some functions in the picante package (`ses.pd`) and in the PhyloMeasures package for Faith's phylogenetic diversity with a similar scope.

Response: There are many R packages that calculate community diversity metrics, and many of them overlap in common metrics like Faith's PD. In the case of `canaper`, I need to calculate Faith's PD as part of the CANAPE calculation. I use `phyloregion::PD()` to do this as it is the fastest AFAIK. No changes needed.
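A hedged sketch of the kind of timing comparison behind that claim (not a benchmark reported in this review); it assumes `phyloregion::PD()` takes a sparse matrix made with `dense2sparse()` and that `picante::pd()` runs on the dense matrix:

library(canaper)        # acacia example data
library(phyloregion)    # PD(), dense2sparse()
library(picante)        # pd()
library(microbenchmark)

data(acacia)
comm_dense <- as.matrix(acacia$comm)
comm_sparse <- dense2sparse(comm_dense)

microbenchmark(
  phyloregion = PD(comm_sparse, acacia$phy),
  picante     = pd(comm_dense, acacia$phy),
  times = 5
)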
[x] The canaper package builds on functions from the vegan package to resample (dense) community matrices and borrows biodiversity measures (Faith's phylogenetic diversity and phylogenetic endemism) from the phyloregion package. The picante package also has some functions for swapping (community) matrices similar to vegan. A comparison between the packages would be useful.
Response: I have added a section to the README comparing `canaper` with other software (d8aaf8de5193166d52ab890d8267090ea2f8ef6a).
[x] The functions from the phyloregion package used in the package should be acknowledged properly... An alternative is importing the functions from the package directly.
Response: Please see the response above. Functions from the phyloregion package are now imported, not copied (7e8b24950912bfa2a5e10cf4dd9cfa8910d74e76). Please note that phyloregion is still at v1.0.7 as of writing (the same version as before @KlausVigo implemented the change that fixed the loading of `snow`), so I can't yet specify the version of phyloregion that would avoid canaper issue #2.
[x] In contrast to the other reviewer I really like the use of the `future` package.
Response: I agree, and have left parallelization as it is.
[x] Providing a swapping algorithm which can handle sparse matrices would be a huge improvement for the community.
Response: I agree implementing sparse matrix encoding for generating random matrices would be a huge benefit and will consider it as a feature to be added possibly in the future (any help would be appreciated!).
[x] The functions `calc_biodiv_random` and `cpr_rand_comm` are only called internally and are called thousands of times inside the loop. It should be safe to remove all (most) of the calls to `assertthat::assert_that` from these functions as the parameters have been checked already and are safe to use!

Response: Actually, `cpr_rand_comm()` is exported, but point well taken. I have deleted the redundant checks (bccd0278dd3de5853680d9979ca26c1aa38a7fb3).
Both reviewers left "Performance: Any performance claims of the software been confirmed" unchecked. I am not sure why this is the case, as I include tests of the increase in performance using parallel computing and demonstrate this feature in a vignette.
I have added @KlausVigo as a reviewer, but not @luismurao, according to their responses to "Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer" (6e9b3d55202e2294a6436280ca6b01db13d92974).
The longest running test is "Parallelization decreases calculation time". To make this test robust, I have to include a relatively high number of replicates; otherwise, by random chance, running in parallel sometimes won't be faster than running sequentially.
I have decided to follow 6.1.5.4 Extended tests of rOpenSci Statistical Software Peer Review and treat this as an extended test that can be optionally skipped (it is now skipped by default).
I have documented this in a comment as follows:
# This test is skipped by default because of its long run time.
# To include the test locally, first run
# withr::local_envvar(CANAPER_EXTENDED_TESTS = "true") # nolint
# To include the test in CI (github actions), include the phrase
# 'run-extended' in the commit
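For reference, a sketch of what that skip pattern can look like in testthat (the test body is omitted; only the environment variable name comes from the comment above):

test_that("Parallelization decreases calculation time", {
  skip_if_not(
    identical(Sys.getenv("CANAPER_EXTENDED_TESTS"), "true"),
    "Extended tests not enabled (set CANAPER_EXTENDED_TESTS=true to run)"
  )
  # ... long-running timing comparison ...
})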
`devtools::test()` on my computer now finishes in 36 s when skipping the extended test, and 181 s when including it.
@tdhock, @emilyriederer sorry again for the long delay, but this should be ready to move forward. Please let me know if there is anything you need from my end.
Also tagging @KlausVigo @luismurao . I know this review was a bit delayed and your availability may have changed. At your convenience, if you could please review @joelnitta 's responses and share whether these address your feedback (template available here) , that would be fantastic.
@emilyriederer If the reviewers have no further comments, can we proceed with this at your discretion as an editor?
Hi @joelnitta - let me follow up with @tdhock since he's the managing editor on this one
hi! Since the reviewers were tagged two weeks ago and gave no feedback, I think it is OK to move ahead with this submission (the reviewer response is very complete/convincing).
@ropensci-review-bot approve canaper
Approved! Thanks @joelnitta for submitting and @KlausVigo, @luismurao for your reviews! :grin:
To-dos:

- If the invitation to transfer the repo to the ropensci organization expires, write a comment `@ropensci-review-bot invite me to ropensci/<package-name>`, which will re-send an invitation.
- After transfer, write a comment `@ropensci-review-bot finalize transfer of <package-name>`, where `<package-name>` is the repo/package name. This will give you admin access back.
- If you already had a `pkgdown` website and are ok relying only on rOpenSci central docs building and branding, replace the current `pkgdown` website with a redirecting page to `https://docs.ropensci.org/package_name`, and include that docs link in the DESCRIPTION `URL` field alongside the link to the GitHub repository, e.g.: `URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar`.
- Run `codemetar::write_codemeta()` in the root of your package.
- Your package can now be installed with `install.packages("<package-name>", repos = "https://ropensci.r-universe.dev")` thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them `"rev"`-type contributors in the `Authors@R` field (with their consent).
Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.
We maintain an online book with our best practices and tips; this chapter starts the third section, which gives guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.
Last but not least, you can volunteer as a reviewer via filling a short form.
Thanks @tdhock!
@ropensci-review-bot finalize transfer of canaper
Date accepted: 2022-09-14
Submitting Author: @joelnitta
Other Package Authors: Shawn Laffan (@shawnlaffan)
Repository: https://github.com/joelnitta/canaper
Version submitted: 0.0.1
Submission type: Stats
Badge grade: silver
Editor: @tdhock
Reviewers: @KlausVigo, @luismurao
Due date for @KlausVigo: 2021-11-24
Due date for @luismurao: 2021-12-21
Archive: TBD
Version accepted: TBD
Pre-submission Inquiry
General Information
Who is the target audience and what are scientific applications of this package?
Paste your responses to our General Standard G1.1 here, describing whether your software is:
The first implementation within R of an algorithm which has previously been implemented in other languages or contexts
Please include hyperlinked references to all other relevant software: https://github.com/shawnlaffan/biodiverse
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
Note also there is a package website available: https://joelnitta.github.io/canaper/
Badging
What grade of badge are you aiming for? (bronze, silver, gold)
If aiming for silver or gold, describe which of the four aspects listed in the Guide for Authors chapter the package fulfils (at least one aspect for silver; three for gold)
Technical checks
Confirm each of the following by checking the box.
[x] I/we have read the guide for authors and rOpenSci packaging guide.
[x] I/we have read the Statistical Software Peer Review Guide for Authors.
[ ] I/we have run `autotest` checks on the package, and ensured no tests fail. (`autotest` hangs (doesn't produce any output after waiting ca. 20 minutes) on most functions. I was only able to use it on `cpr_classify_endem()`.)

[x] The `srr_stats_pre_submit()` function confirms this package may be submitted.

This package:
Publication options
[ ] Do you intend for this package to go on Bioconductor?
Code of conduct