birdsize: Estimate avian body size distributions

diazrenata commented 1 year ago

Date accepted: 2023-11-30

Submitting Author Name: Renata Diaz Submitting Author Github Handle: !--author1-->@diazrenata@maelle<!--end-editor-- Reviewers: @mstrimas, @qdread

Due date for @mstrimas: 2023-04-04 Due date for @qdread: 2023-04-11

Archive: TBD Version accepted: TBD Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: birdsize
Title: Estimate Avian Body Size Distributions
Version: 0.0.0.9000
Authors@R: 
    person("Renata", "Diaz", , "renata.diaz@weecology.org", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0003-0803-4734"))
Description: Generate estimated body size distributions for populations or communities of birds, given either species ID or species' mean body size. Designed to work naturally with the North American Breeding Bird Survey, or with any dataset of bird species, abundance, and/or mean size data.   
License: MIT + file LICENSE
URL: https://github.com/diazrenata/birdsize
BugReports: https://github.com/diazrenata/birdsize/issues
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
Suggests: 
    covr,
    ggplot2,
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends: 
    R (>= 2.10)
Imports: 
    rlang,
    dplyr,
    magrittr,
    purrr,
    stats
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [x] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

This package automates a workflow (seen in the wild e.g. here and here) of generating estimates of body size and basal metabolic rate, from the individual to ecosystem level, for birds. In my presubmission inquiry (#561, https://github.com/ropensci/software-review/issues/561), the editors determined this was best described as "field and lab reproducibility tools" because it's automating a workflow used by empirical ecologists to work with field data.

Who is the target audience and what are scientific applications of this package?

The target audience is ecologists/biodiversity scientists interested in studying the structure, function, and dynamics of bird populations and communities - specifically linking between abundances/population size and other dimensions of community function, like total biomass. Studying size-based and abundance-based properties of bird communities is key to understanding biodiversity and global change, but it is challenging for most ecologists because most survey methods do not collect size-related data. This package standardizes a computationally-intensive workaround for this challenge and makes it accessible to ecologists with relatively little computational training.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

I have not encountered another package that accomplishes this.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

N/A

If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

https://github.com/ropensci/software-review/issues/561

The editor who responded to me was @annakrystalli.

As part of that conversation, the question was raised of adding the authors of some of the datasets that this package draws on as "data contributors". I agree that this is an important consideration, and I wanted to go ahead and include a little more information here so we can make sure this is done in the most appropriate way.

This package uses two sources of "external" data:

First, the sd_table dataset included in the package includes (cleaned and selected) data values hand-entered from the CRC Handbook of Avian Body Masses (Dunning 2008; https://doi.org/10.1201/9781420064452). Neither Dunning, nor the authors of the studies cited in the CRC Handbook, were involved in this project. In the current iteration, I've followed the approach I would use for a paper - that is, the package and package documentation cite Dunning liberally, but I have not listed any additional authors as "data contributors" because I generally wouldn't list folks as co-authors without their knowledge and consent. In this context, would you encourage listing Dunning as a contributor, and/or reaching out to open that conversation?
Second, this package is designed to interface with the North American Breeding Bird Survey data (https://www.sciencebase.gov/catalog/item/5d65256ae4b09b198a26c1d7, doi:10.5066/P9HE8XYJ), but I have taken care not to redistribute any actual data from the Breeding Bird Survey in the package itself. The demo_route_raw and demo_route_clean data tables in birdsize are synthetic datasets that mimic data from the Breeding Bird Survey. That is, they have the same column names as BBS data, and valid AOU (species identifying codes) values, but the actual data values are simulated. The bbs-data vignette directs users to instructions for accessing the BBS data, and demonstrates using the functions in birdsize on BBS-like data using the demo routes. Again, the package cites the Breeding Bird Survey liberally, but stops short of redistributing data so as to encourage users to access and cite the creators directly.

For both of these, again, I'm happy to explore whatever approaches to citing/crediting the original data creators seems most appropriate! I'd appreciate any thoughts or guidance in this area.

Explain reasons for any pkgcheck items which your package is unable to pass.

N/A

Technical checks

Confirm each of the following by checking the box.

[x] I have read the rOpenSci packaging guide.
[x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage.

Publication options

[ ] Do you intend for this package to go on CRAN? tbd
[ ] Do you intend for this package to go on Bioconductor?
[X] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [x] The package is novel and will be of interest to the broad readership of the journal. - [x] The manuscript describing the package is no longer than 3000 words. - [x] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

ropensci-review-bot commented 1 year ago

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot commented 1 year ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 1 year ago

Checks for birdsize (v0.0.0.9000)

git hash: bbe4921e

:heavy_check_mark: Package name is available
:heavy_check_mark: has a 'codemeta.json' file.
:heavy_check_mark: has a 'contributing' file.
:heavy_check_mark: uses 'roxygen2'.
:heavy_check_mark: 'DESCRIPTION' has a URL field.
:heavy_check_mark: 'DESCRIPTION' has a BugReports field.
:heavy_check_mark: Package has at least one HTML vignette
:heavy_check_mark: All functions have examples.
:heavy_check_mark: Package has continuous integration checks.
:heavy_check_mark: Package coverage is 88.3%.
:heavy_check_mark: R CMD check found no errors.
:heavy_check_mark: R CMD check found no warnings.

Package License: MIT + file LICENSE

1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate. |type |package | ncalls| |:----------|:---------|------:| |internal |base | 124| |internal |birdsize | 26| |imports |magrittr | 40| |imports |dplyr | 16| |imports |stats | 12| |imports |purrr | 1| |imports |rlang | NA| |suggests |covr | NA| |suggests |ggplot2 | NA| |suggests |knitr | NA| |suggests |rmarkdown | NA| |suggests |testthat | NA| |linking_to |NA | NA| Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()', and examining the 'external_calls' table.

base

list (30), as.character (28), as.numeric (17), mean (7), c (5), nrow (5), colnames (4), sum (4), which (4), for (2), length (2), log (2), tolower (2), all (1), any (1), data.frame (1), exp (1), F (1), matrix (1), names (1), ncol (1), return (1), sqrt (1), suppressMessages (1), unique (1)

magrittr

%>% (40)

birdsize

add_estimated_sds (2), clean_sp_size_data (2), get_sd_parameters (2), get_sp_mean_size (2), ind_draw (2), individual_metabolic_rate (2), is_unidentified (2), community_generate (1), community_summarize (1), filter_bbs_survey (1), find_nontarget_species (1), find_unidentified_species (1), generate_sd_table (1), identify_richness_designator (1), pop_generate (1), pop_summarize (1), species_define (1), species_estimate_sd (1), species_lookup (1)

dplyr

mutate (5), filter (4), n (3), left_join (1), row_number (1), select (1), summarize (1)

stats

sd (7), lm (2), formula (1), rnorm (1), var (1)

purrr

pmap_dfr (1)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has: - code in R (100% in 15 files) and - 1 authors - 6 vignettes - 7 internal data files - 5 imported packages - 19 exported functions (median 13 lines of code) - 19 non-exported functions in R (median 22 lines of code) --- Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages The following terminology is used: - `loc` = "Lines of Code" - `fn` = "function" - `exp`/`not_exp` = exported / not exported All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html) The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile. |measure | value| percentile|noteworthy | |:------------------------|-----:|----------:|:----------| |files_R | 15| 73.0| | |files_vignettes | 6| 97.9| | |files_tests | 11| 91.7| | |loc_R | 384| 39.4| | |loc_vignettes | 398| 72.3| | |loc_tests | 330| 66.1| | |num_vignettes | 6| 98.7|TRUE | |data_size_total | 50278| 79.8| | |data_size_median | 5209| 74.7| | |n_fns_r | 38| 47.6| | |n_fns_r_exported | 19| 65.9| | |n_fns_r_not_exported | 19| 39.2| | |n_fns_per_file_r | 3| 47.0| | |num_params_per_fn | 1| 1.6|TRUE | |loc_per_fn_r | 16| 51.4| | |loc_per_fn_r_exp | 13| 30.5| | |loc_per_fn_r_not_exp | 22| 66.9| | |rel_whitespace_R | 36| 61.4| | |rel_whitespace_vignettes | 69| 90.5| | |rel_whitespace_tests | 59| 84.6| | |doclines_per_fn_exp | 20| 14.0| | |doclines_per_fn_not_exp | 0| 0.0|TRUE | |fn_call_network_size | 14| 39.0| | ---

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3. `goodpractice` and other checks

Details of goodpractice checks (click to open)

#### 3a. Continuous Integration Badges (There do not appear to be any) **GitHub Workflow Results** | id|name |conclusion |sha | run_number|date | |----------:|:--------------------------|:----------|:------|----------:|:----------| | 4263965118|check-coverage |success |999038 | 49|2023-02-24 | | 4263965032|pages build and deployment |success |999038 | 61|2023-02-24 | | 4263965125|pkgcheck |success |999038 | 56|2023-02-24 | | 4263965120|pkgdown |success |999038 | 59|2023-02-24 | | 4263965126|R-CMD-check |NA |999038 | 90|2023-02-24 | --- #### 3b. `goodpractice` results #### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/) rcmdcheck found no errors, warnings, or notes #### Test coverage with [covr](https://covr.r-lib.org/) Package coverage: 88.26 #### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp) No functions have cyclocomplexity >= 15 #### Static code analyses with [lintr](https://github.com/jimhester/lintr) [lintr](https://github.com/jimhester/lintr) found the following 241 potential issues: message | number of times --- | --- Avoid 1:nrow(...) expressions, use seq_len. | 2 Avoid library() and require() calls in packages | 13 Lines should not be more than 80 characters. | 224 Use <-, not =, for assignment. | 2

Package Versions

|package |version | |:--------|:--------| |pkgstats |0.1.3 | |pkgcheck |0.1.1.11 |

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

maurolepore commented 1 year ago

Thanks @diazrenata for your full submission!

Before I start the search for a handling editor can you please address these two minor yet important issues?

--

[ ] ml01. Please add "dtc" to DESCRIPTION

In the pre-submission I see:

One thing to note is that given the package contains external data, it would be appropriate to state the source in the DESCRIPTION file under authors using author type "dtc" which stands for "data contributor". -- @annakrystalli

But DESCRIPTION still shows role = c("aut", "cre") rather than role = c("aut", "cre", "dtc").

Can you please make that change or argue against it?

--

[x] ml02. Please make README standalone

On the package website I see a lot of documentation and examples, well done! However the landing page is README and I see very little there. an you please add the most important bits to README to help editors, reviewers, and users understand the package quickly?

The author's guide is a bit vague about whether you should duplicate in README documentation you already have elsewhere, but previous reviews and packages in the wild suggest a self-contained README is as important to a package as an abstract is to an academic paper.

A good guide for what to include in a helpful README is this one: https://devguide.ropensci.org/building.html#readme

diazrenata commented 1 year ago

Hi Mauro, thank you for the quick response! In response to your points....

[ ] ml01. Please add "dtc" to DESCRIPTION

I agree that this is an important conversation, and I want to be sure to handle this appropriately. I included a little more context on this in my initial submission, which I am quoting again below to make it easier to find:

As part of that conversation, the question was raised of adding the authors of some of the datasets that this package draws on as "data contributors". I agree that this is an important consideration, and I wanted to go ahead and include a little more information here so we can make sure this is done in the most appropriate way.

This package uses two sources of "external" data: First, the sd_table dataset included in the package includes (cleaned and selected) data values hand-entered from the CRC Handbook of Avian Body Masses (Dunning 2008; https://doi.org/10.1201/9781420064452). Neither Dunning, nor the authors of the studies cited in the CRC Handbook, were involved in this project. In the current iteration, I've followed the approach I would use for a paper - that is, the package and package documentation cite Dunning liberally, but I have not listed any additional authors as "data contributors" because I generally wouldn't list folks as co-authors without their knowledge and consent. In this context, would you encourage listing Dunning as a contributor, and/or reaching out to open that conversation?

Second, this package is designed to interface with the North American Breeding Bird Survey data (https://www.sciencebase.gov/catalog/item/5d65256ae4b09b198a26c1d7, doi:10.5066/P9HE8XYJ), but I have taken care not to redistribute any actual data from the Breeding Bird Survey in the package itself. The demo_route_raw and demo_route_clean data tables in birdsize are synthetic datasets that mimic data from the Breeding Bird Survey. That is, they have the same column names as BBS data, and valid AOU (species identifying codes) values, but the actual data values are simulated. The bbs-data vignette directs users to instructions for accessing the BBS data, and demonstrates using the functions in birdsize on BBS-like data using the demo routes. Again, the package cites the Breeding Bird Survey liberally, but stops short of redistributing data so as to encourage users to access and cite the creators directly.

For both of these, again, I'm happy to explore whatever approaches to citing/crediting the original data creators seems most appropriate! I'd appreciate any thoughts or guidance in this area.

[x] ml02. Please make README standalone

I have updated the README to be more comprehensive. Some of the text is borrowed from the Getting started vignette, and I am currently directing folks to the [community](https://diazrenata.github.io/birdsize/articles/community.html] vignette for worked examples. (If it would be preferable, and not too redundant, I can copy the contents of that vignette directly to the README).

I hope this helps provide context to get things started! Thank you again.

maurolepore commented 1 year ago

Thanks a lot @diazrenata for making your arguments about the data visible here. I'm more than happy with your careful consideration, and prefer for the handling editor or reviewers to follow up.

Also thanks for working on README. Anything that makes saves a bit of time to our reviewers will free them cognitive load to focus on the more interesting aspects of your work.

I'll start looking for a handling editor.

maurolepore commented 1 year ago

@ropensci-review-bot assign @maelle as editor

ropensci-review-bot commented 1 year ago

Assigned! @maelle is now the editor

maelle commented 1 year ago

Editor checks:

[x] Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
- [x] Is the case for the package well made?
- [x] Is the reference index page clear (grouped by topic if necessary)?
- [x] Are vignettes readable, sufficiently detailed and not just perfunctory?
[x] Fit: The package meets criteria for fit and overlap.
[x] Installation instructions: Are installation instructions clear enough for human users?
[x] Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
[x] Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
[x] License: The package has a CRAN or OSI accepted license.
[x] Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

Thanks for your submission @diazrenata! I'll start looking for reviewers. A few comments in the meantime:

I'd recommend running desc::desc_normalize(): it will order DESCRIPTION fields in a standard way and order dependencies alphabetically.
You can remove the {remotes} installation instructions and keep the {devtools} ones only as {devtools} calls {remotes} (might call {pak} in the future?) and is the interface for users.
It'd be nice to add grouping to the reference index https://pkgdown.r-lib.org/reference/build_reference.html#reference-index Given your package's naming scheme, you might want to use the starts_with() helper.
In the test filenames, why use numbers? If the R files and test files have the same basename, it's easier to navigate between the two in RStudio IDE. https://r-pkgs.org/testing-basics.html#test-organisation
In your test files and scripts I see a lot of vertical space, more than one empty lines between some elements. I'd recommend using one empty line only as it increases the amount of code one can see at a time on the screen. It's still nice to organize the code in paragraphs, I am not suggesting to remove all empty lines. :grin:
In the README instead of "Package summary" as a header you could use the same text as this issue "birdsize: Estimate avian body size distributions" which is more informative.
Regarding data yes it might make sense to contact the author of the included dataset?
For comments such as https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/R/simulate_community.R#L51 if you add 4 hyphens afterwards ----, the comment will appear in the script outline on the right in RStudio IDE (if you use that IDE), helping code navigation. https://blog.r-hub.io/2023/01/26/code-comments-self-explaining-code/#use-comments-for-the-scripts-outline
why have this "trivial" test file? https://github.com/diazrenata/birdsize/blob/main/tests/testthat/test-01_trivial.R
in test code you don't need to write birdsize::: as testthat loads your package code. Example https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/tests/testthat/test-02_included_data.R#L13
why are some expectations commented out? https://github.com/diazrenata/birdsize/blob/7469df457989a9016ecc3761b0dd125497a3be51/tests/testthat/test-07_simulate_community.R#L34

maelle commented 1 year ago

@ropensci-review-bot seeking reviewers

ropensci-review-bot commented 1 year ago

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/577_status.svg)](https://github.com/ropensci/software-review/issues/577)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

maelle commented 1 year ago

@ropensci-review-bot add @mstrimas to reviewers

ropensci-review-bot commented 1 year ago

@mstrimas added to the reviewers list. Review due date is 2023-03-23. Thanks @mstrimas for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 1 year ago

@mstrimas: If you haven't done so, please fill this form for us to update our reviewers records.

maelle commented 1 year ago

@ropensci-review-bot set due date for @mstrimas to 2023-04-04

ropensci-review-bot commented 1 year ago

Review due date for @mstrimas is now 04-April-2023

mstrimas commented 1 year ago

Hi @diazrenata, I'm looking forward to reviewing the package! I won't be able to get to looking at in detail until later in March, but I took a quick skim through the code and sent a small PR with a few quick fixes. Also, I thought I'd mention a few general things now ahead of my full review in April since they may take some time to address:

I notice some very long lines in the code, especially in roxygen documentation (for example here), but also in the functions (for example here). I think these would be worth shortening to the standard 80 characters to aid readability. For comments/roxygen documentation I usually use Code > Reflow Comment in RStudio to help with this.
There are a lot of blank lines in the source code, sometimes multiple in a row, that make the code a little hard to read. For example here it's hard to know what pieces of code belong together thematically. Typically chunks of code that belong together are best with no empty lines between them and you can then separate these chunks with a single blank line, often with a comment describring what each chunk does. Looks like Maëlle also mentioned this above.
I see you use the magrittr %>% a lot within package functions. I don't think there's any strict rules against that, but I typically avoid it since it can make debugging trickier by producing a confusing call stack trace if there's an error. See this comment on the rOpenSci forum. The response notes that the native pipe |> avoids the issue, but I personally think just dropping pipes altogether can be cleaner in package functions. This is personal preference though, so if you really prefer keeping the pipes that's all good :) Of course, using pipes in examples, vignettes, readme, etc. is perfectly fine and I think aids readability.

maelle commented 1 year ago

Thanks @mstrimas!

Adding a reference about code comments https://blog.r-hub.io/2023/01/26/code-comments-self-explaining-code/

diazrenata commented 1 year ago

Thank you both for your time and attention! I'll incorporate these changes as quickly as I can, probably early next week!

maelle commented 1 year ago

@ropensci-review-bot add @qdread to reviewers

ropensci-review-bot commented 1 year ago

@qdread added to the reviewers list. Review due date is 2023-04-11. Thanks @qdread for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 1 year ago

@qdread: If you haven't done so, please fill this form for us to update our reviewers records.

mstrimas commented 1 year ago

I just finished up my review. This is a concise, well-documented package and was a pleasure to review. Note that the suggestions I made in an earlier comment on this issue, and in the PR I submitted, still apply and I haven't duplicated them below.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
- The README points readers to a vignette for example usage. I think it would be useful to include at least one simple example directly in the vignette to illustrate what the package does.
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
- Code chunks at the start of the vignettes where packages are loaded are currently hidden. These need to be shown on the package website so users know what packages they'll need to load when working through the vignette.
- I've noticed you refer to functions in the vignettes using, e.g., community_generate rather than community_generate(). If you add the (), pkgdown will recognize it as a function and generate a link to the documentation for that function on the website.
- As with the package source code, there's a lot of unnecessary white space in the code chunks of the vignettes. For example, many of the code chunks have an empty line at the beginning and end. These should be removed.
- You have a lot of set.seed(22) calls throughout the vignettes. I think it would be better to set the seed once at the top of each vignette rather than repeatedly setting it.
- In some of the vignette figures, results are shown colored by species. In these figures the legend show's the AOU code. If possible showing species name would be nice.
- At the top of the community vignette, you confusingly have demo_route_raw <- demo_route_raw. I think what you may want here is data(demo_route_raw). However, I think even that has fallen out of favor and it's best to just use the object directly (See "Warning" box at https://r-pkgs.org/data.html#sec-data-data).
- I like the try(species_define(genus = "taylor", species = "swift")) example :)
- In the scaling vignette, it feels odd that you call the internal function birdsize:::species_estimate_sd. If you're going to use this function in a vignette, it should be exported.
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 7

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

This is a useful, well-documented package with a clear and well defined scope. The documentation and vignettes have plenty of examples that I think most users should easily be able to follow. In addition to suggestions I've made in an earlier comment, I have a few suggestions that I think would improve the package documentation and align the source code better with best practices.

Working through the five vignettes, I did wonder if the vignette ordering and organization should be re-worked. For me, I found the population vignette to be the most useful for understanding what the package is actually doing under the hood. After working through this vignette, the community vignette made more sense to me. The species vignette covers an intermediate-level topic, so probably belongs next, followed by scaling since it's a more advanced topic that many users won't require knowledge of. Both the README and the "Getting Started" vignette contain essentially no code. This feels like missed opportunity to shown some of the most basic functionality, e.g. simulate a population of a single species and plot a histogram and/or simulate a community. Finally, there is fair amount of duplication in the vignettes, which makes me wonder if some of them could be combined together into one? All this may just be personal preference though, and other users may find the current organization better, but something to think about.

This package relies heavily on BBS data in all examples and package functionality. Given its prominence, I think there could more description of what the BBS is and the structure of the dataset. I initially assumed the bbs-data vignette would cover this, but it mostly duplicates what's already found in the community vignette without providing additional explanation of the BBS. I think at the very least a brief description of the route/stop structure, sampling design, and spatial/temporal coverage of the BBS is warranted. Also, I see the fields of demo_route_raw are described in the help for that dataset, but I think you should point users to that help file or directly include a description of the fields in the bbs-data vignette. I don't think you need to get into extensive detail since all of this is explained in other places, which you've referenced, but you should provide at least some explanation.

Defining species in function arguments could be clarified. In the arguments to pop_generate() and species_define(), species can be identified either by AOU code or scientific name. Given that both genus and species are required, it feels more intuitive to me to use a single argument (e.g. scientific_name). The documentation also doesn't make it clear that species is the species' epithet and not the scientific or common name. As it stands, it feels like you should be able to call pop_generate(100, species = "Selasphorus calliope") or even pop_generate(100, species = "Calliope Hummingbird").

I wonder if the _summarize() functions are necessary since they're essentially just calling group_by() %>% summarize(). Personally I'd prefer to do this directly myself with dplyr so I know exactly what's going on and the vignettes could demonstrate exactly how users should do it. However, I can imagine that some users aren't as comfortable with dplyr so these convenience functions could be useful.

The internal function ind_draw() seems dangerous to me since it has a while loop to get rid of negative sizes with potential to run for a very long time. This is especially true because there appears to be no checks to ensure the mean size is positive. For example the following will run indefinitely:

pop_generate(1000, mean_size = -1000, sd_size = 0.001)

At the very least, please add a check to ensure mean_size and sd_size are positive and some method to ensure the while loop won't run forever, e.g. after a certain number of iterations maybe it should stop and raise an error. Even better would be re-writing this function to use a less brute force method to ensure the sizes generated aren't negative. It's not immediately clear what that would look like, since simple solutions like take the absolute value, won't preserve the desired normal distribution.

Some additional, specific comments about the code:

I believe the standard for documenting package datasets is to document them all in R/data.R rather than individual R files. Not a huge issue, but I think this would help anyone interested find the source for the documentation more easily. See https://r-pkgs.org/data.html#sec-documenting-data.
Related to the previous point, you might consider renaming or re-grouping some of the other source files. For example, the only exported function in R/simulate_populations.R is pop_generate(), so why not call that file R/pop_generate.R so it's more obvious what's in the file?
dplyr has deprecated or superseded a lot of the "scoped" verbs, e.g. group_by_at(). Please replace these as appropriate to future proof your package. See https://dplyr.tidyverse.org/reference/scoped.html
I see some inconsistency in code styling. For example, in some places you use if() and in other places if (). I recommend picking one method of formatting your code and sticking with it. See https://style.tidyverse.org/.
In a few places I see you're using as.numeric(NA) or is.character(NA) (e.g. https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L56). Note that there are NAs specifically for different data types, i.e. NA_real_ and NA_character_. It seems cleaner to use these rather than casting NA to different data types.
There are a handful of unnecessarily nested if statements (e.g. https://github.com/diazrenata/birdsize/blob/main/R/species_define.R#L54). Anywhere you have if (A) {if (B) {...}} I think it's cleaner to use if (A && B) {...}, but that may just be personal preference.

maelle commented 1 year ago

Thanks a lot @mstrimas for your thoughtful review!! :pray:

As a side note on source filenames, it's also nice to align them with test filenames so if you rename one, make sure you rename the other. See https://r-pkgs.org/testing-basics.html#create-a-test and devtools::test_active_file().

maelle commented 1 year ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/577#issuecomment-1478615356 time 7

ropensci-review-bot commented 1 year ago

Logged review for mstrimas (hours: 7)

qdread commented 1 year ago

Thanks for giving me the opportunity to review this cool package! I like bird body size a lot so it was enjoyable to review this. I think the package overall is good in terms of having a well-defined objective, meeting that objective, and documenting how it's done. I do have some suggestions for improvement. If anything needs clarification, don't hesitate to get in touch with me! I also want to say that I agree with essentially everything in Matt's review, so I've tried not to be too redundant with what he already said.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
On my Windows machine, I got a lot of warnings and one failure on the unit tests. The warnings seem to mostly be related to the use of .$data in the tidyselect statements. I think they can all be dealt with by deleting .data$ from in front of the column names wherever it appears. The error that I got was related to unzip() in test-10_direct_bbs_data.R. The unzipping did not work so none of the downstream code worked. I think it would be ideal to have the unit tests run without warnings or errors.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 5

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

I think the purpose of the package is well-defined. It does a few well-defined and related tasks, and does them well. It also has demonstrated use cases, which is nice. The vignettes were well-organized and do a good job of providing examples of all the functions. I've divided my comments into three categories: code style, documentation, and "science stuff."

Code style review

The DESCRIPTION file states that the package depends on R >=2.10 but the tidyverse dependencies require R >= 3.3.
I agree with Matt that the documentation of the package datasets should be in one data.R file which is a little easier to navigate (though this is not too relevant to users as they won't ever see that, but it would be helpful for contributors)
This is a tough one, but I generally try to shy away from having tidyverse dependencies in packages. For example the lines with dplyr::mutate() could be rewritten from

community <- community %>%
        dplyr::mutate(
          richnessSpecies = .data$aou,
          species_designator = "aou"
        )

to

community[['richnessSpecies']] <- community[['aou']]
community[['species_designator']] <- "aou"

I understand this would be a lot of work to implement because tidyverse is used in most of your functions. If you do not want to fully remove the tidyverse dependencies, the most urgent one to address would be to replace dplyr::group_by_at() with dplyr::group_by(dplyr::across()). group_by_at() has been deprecated and will likely be removed from future versions of dplyr. (I'm in agreement with Matt's advice on that point).

Also, having tidyverse dependencies greatly increases the number of downstream dependencies for the package which might not be desirable.
I also agree with Matt that it would be good to at least rewrite the code not to use the magrittr pipes %>% for the reasons he cited.
In filter_bbs_survey(), package data are loaded with unidentified_species <- unidentified_species. I am not sure that is the recommended way to internally use package data. I noticed Matt also brought this up so I would follow his advice there.
In simulate_populations(), the error checking routines that cause failure if things like mean and standard deviation aren't provided is one level down in the ind_draw() function. To me it would make more sense if the input is checked for errors right away instead of further down.
Line 33 of species_data_functions.R: the argument data to lm() should be named.

Documentation review

I thought the vignettes were well-organized and do a good job of providing examples of all the functions. One suggestion is that the vignettes could have more informative titles. Just the word community does not make it clear exactly what can be learned from going through the vignette.
I also liked the Taylor Swift example but shouldn't the species be Taylor and the genus Swift? :-D
A user copying and pasting code from the vignettes will not be able to run the code containing tidyverse functions including select() and pmap_df() because those packages are not explicitly loaded in the vignette.
The term "AOU" should be defined somewhere - it is known to North Americans but not otherwise. I understand this package is most intended for use with BBS so users will be familiar but it can't hurt to define so that other people can understand what it is (maybe not necessarily the user but someone looking over the code). A related minor point: in some of the functions where AOU is an argument, it is in lowercase aou but I think it would be more consistent to make it capitalized AOU.
The bar plot with total biomass by AOU in the community vignette, as Matt mentioned, would be more informative with species names instead of AOU codes. Also, because the fill color legend is redundant with the axis labels, it would be cleaner to just label the x axis with the species names and suppress the legend of fill colors.

Comments on science stuff

I would feel better about it if you explicitly incorporated uncertainty in the scaling relationships. The parameters for the relationship between body mass mean and standard deviation are average expectations, so you would end up underestimating the variation in body mass if you only use that mean expectation to impute missing standard deviations. Is there any way of incorporating uncertainty in those scaling parameters?
It would be nice to allow filter_bbs_survey() to take arguments so that the user could customize removing specific species groups. For instance it would be interesting if the user could remove only waterbirds and keep nocturnal birds, if they were so inclined.
The real heart of the body mass distribution simulation is in the file simulate_populations.R in ind_draw() at line 34:

population <- rnorm(n = species_abundance, mean = species_mean, sd = species_sd)

  while (any(population < 0)) {
    population[which(population < 0)] <- rnorm(n = sum(population < 0), mean = species_mean, sd = species_sd)
  }

This looks like you are doing rejection sampling to generate samples from a truncated normal distribution with lower truncation bound of 0. I think that is fine, but it should be made clear in the documentation that you are doing this. I might even recommend allowing the user to input a lower truncation bound instead of hard-coding it at 0 (the default could be 0 but you could allow the user to modify this). For example the user might want to ensure that all masses are greater than 2 grams (that is roughly the lowest value I got when I generated the body masses for a_hundred_hummingbirds). Not too many birds weigh less than the Calliope Hummingbird! :-) Actually, in general I don't think it is clear in your documentation that samples are drawn from a normal distribution. Yes, it is implied by the fact that the parameters are mean and standard deviation but I think it would be good to be explicit about it. I also wanted to address Matt's point that the while loop in the rejection sampling may run infinitely or nearly so if invalid input is provided such as negative body mass. It would be good for you to expand the error checking code to cause failure on body mass means that are not positive, avoiding the potential of an infinite (or almost infinite loop).

To further address the point about the rejection sampling method, Matt mentioned in his review that it would be good to use a "less brute force" method to ensure the values are >0. Actually, the method of rejection sampling that you are currently using is a well-accepted method. It is used by the R package truncnorm. For almost any realistic values of body mass mean and sd, you will get very few negative values on the initial sample, so the while loop would hardly ever even be necessary. However, if you are interested in a more direct method of sampling from a truncated normal without the while loop, I would check out this Stackoverflow thread especially this answer. There are also the existing functions msm::tnorm() and truncnorm::truncnorm() but you may not want to add a dependency to your package.

mstrimas commented 1 year ago

@qdread Thanks for expanding on my comments about the while loop! I hadn't heard of "rejection sampling" before. truncnorm seems quite lightweight, so maybe that's a good option. Or keep the while loop but put in some logic so it stops after some number of iterations. Also, checking the inputs so the mean isn't negative would help. I like the suggestion of having the minimum size being user defined. Anyway, I would follow whatever @qdread suggests on this since I don't really know anything about this topic.

Regarding the comment about .data$ throwing warnings during testing, the PR I submitted should resolve this https://github.com/diazrenata/birdsize/pull/67.

maelle commented 1 year ago

Thanks @qdread for your thoughtful review! :smile: :bird: You mean using tidyverse packages increases the number of upstream dependencies, not downstream, correct?

maelle commented 1 year ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/577#issuecomment-1494333794 time 5

ropensci-review-bot commented 1 year ago

Logged review for qdread (hours: 5)

qdread commented 1 year ago

@maelle Yes, upstream :-)

maelle commented 1 year ago

I thought that as a salmon I was maybe confused about the flow direction. :grin:

ropensci-review-bot commented 1 year ago

:calendar: @qdread you have 2 days left before the due date for your review (2023-04-11).

maelle commented 1 year ago

@qdread please ignore the comment above, sorry (we're investigating the bug :sweat_smile: ).

ropensci-review-bot commented 1 year ago

@diazrenata: please post your response with @ropensci-review-bot submit response <url to issue comment> if you haven't done so already (this is an automatic reminder).

Here's the author guide for response. https://devguide.ropensci.org/authors-guide.html

maelle commented 1 year ago

@diazrenata any update? :smile_cat:

diazrenata commented 1 year ago

@maelle (and everyone!) Thank you very much for the feedback! I'm in the progress of incorporating changes. I am dealing with some health issues at the moment and have slightly longer turnaround times than usual, for which I apologize!

maelle commented 1 year ago

@diazrenata no problem, thanks for the update, take care!

maelle commented 1 year ago

Note that if needed we can put the submission on hold https://devguide.ropensci.org/policies.html?q=hold#policiesreviewprocess

The author can choose to have their submission put on hold (editor applies the holding label). The holding status will be revisited every 3 months, and after one year the issue will be closed.

diazrenata commented 1 year ago

@maelle, thank you very much for pointing this out! I think putting this on hold for the next 3 months would be the ideal move here. I expect to be able to complete the revisions within those 3 months, so things should proceed smoothly from there!

maelle commented 1 year ago

@ropensci-review-bot put on hold

ropensci-review-bot commented 1 year ago

Submission on hold!

maelle commented 1 year ago

@diazrenata Done! Thank you and take care.

ropensci-review-bot commented 1 year ago

@maelle: Please review the holding status

maelle commented 1 year ago

@diazrenata just checking in :smile_cat:

maelle commented 1 year ago

@diazrenata any update? I hope you're ok! :smile_cat:

maelle commented 1 year ago

@diazrenata just checking in, I hope you're ok.

diazrenata commented 1 year ago

Hi @maelle, thank you for checking in! I apologize for the long hiatus. I am back online, and have just about completed revisions! I expect to resubmit them in the next week, and no later than the end of the month, if that is all right?

maelle commented 1 year ago

Ok, thank you, glad to read you're back online! Here's the author guidance (as a reminder): https://devdevguide.netlify.app/softwarereview_author#the-review-process

ropensci / software-review