dwctaxon: Tools for Working with Darwin Core Taxon Data

joelnitta commented 1 year ago

Date accepted: 2023-05-22
Submitting Author Name: Joel H. Nitta
Submitting Author Github Handle: @joelnitta
Editor: @noamross
Reviewers: @collinschwantes, @sformel-usgs

Archive: TBD
Version accepted: TBD
Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: dwctaxon
Title: Tools for Working with Darwin Core Taxon Data
Version: 1.0.0.9000
Authors@R: 
    c(
      person(given = "Joel H.",
           family = "Nitta",
           role = c("aut", "cre"),
           email = "joelnitta@gmail.com",
           comment = c(ORCID = "0000-0003-4719-7472")),
      person(given = "Wataru",
           family = "Iwasaki",
           role = c("ctb"),
           comment = c(ORCID = "0000-0002-9169-9245"))
    )
Description: Provides functions to create, manipulate, and validate taxonomic 
    data in compliance with Darwin Core standards 
    (Darwin Core "Taxon" class https://dwc.tdwg.org/terms/#taxon).
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(
    markdown = TRUE,
    roclets = c("collate", "namespace", "rd", "roxyglobals::global_roclet"))
RoxygenNote: 7.2.1
Imports: 
    assertr,
    assertthat,
    digest,
    dplyr,
    glue,
    purrr,
    rlang,
    settings,
    stringr,
    tibble,
    utils
Suggests: 
    testthat (>= 3.0.0),
    roxyglobals (>= 0.2.1),
    mockery,
    readr,
    usethis,
    knitr,
    rmarkdown,
    tidyverse,
    patrick,
    stringi
Remotes: 
    anthonynorth/roxyglobals
Depends: 
    R (>= 2.10)
Config/testthat/edition: 3
URL: https://joelnitta.github.io/dwctaxon/,
    https://github.com/joelnitta/dwctaxon
BugReports: https://github.com/joelnitta/dwctaxon/issues
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [x] data munging
- [ ] data deposition
- [x] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

dwctaxon facilitates manipulating and validating taxonomic data (of biological species) in R, in compliance with the widely used Darwin Core data standard.

Who is the target audience and what are scientific applications of this package?

Biologists, anybody working with taxonomic data in Darwin Core format.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

Not currently, to the best of my knowledge. The closest thing out there is the GBIF data validator (not an R package). The archived finch package had a function to call the GBIF data validator.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable

If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

No pre-submission inquiry

Explain reasons for any pkgcheck items which your package is unable to pass.

pkgcheck passes when I run it locally. I have observed an error in the CI, but I believe this is a problem with pkgcheck, and unrelated to my package.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the rOpenSci packaging guide.
[x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage.

Publication options

[x] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

ropensci-review-bot commented 1 year ago

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot commented 1 year ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 1 year ago

Checks for dwctaxon (v1.0.0.9000)

git hash: db71df7b

:heavy_check_mark: Package name is available
:heavy_check_mark: has a 'codemeta.json' file.
:heavy_check_mark: has a 'contributing' file.
:heavy_check_mark: uses 'roxygen2'.
:heavy_check_mark: 'DESCRIPTION' has a URL field.
:heavy_check_mark: 'DESCRIPTION' has a BugReports field.
:heavy_check_mark: Package has at least one HTML vignette
:heavy_check_mark: All functions have examples.
:heavy_check_mark: Package has continuous integration checks.
:heavy_check_mark: Package coverage is 96.8%.
:heavy_check_mark: R CMD check found no errors.
:heavy_check_mark: R CMD check found no warnings.

Package License: MIT + file LICENSE

1. Package Dependencies

Details of Package Dependency Usage

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type       |package     | ncalls|
|:----------|:-----------|------:|
|internal   |base        |     83|
|internal   |dwctaxon    |     28|
|internal   |graphics    |      1|
|internal   |stats       |      1|
|imports    |glue        |     48|
|imports    |settings    |     23|
|imports    |dplyr       |      9|
|imports    |stringr     |      6|
|imports    |utils       |      5|
|imports    |purrr       |      4|
|imports    |tibble      |      2|
|imports    |assertr     |      1|
|imports    |assertthat  |     NA|
|imports    |digest      |     NA|
|imports    |rlang       |     NA|
|suggests   |testthat    |     NA|
|suggests   |roxyglobals |     NA|
|suggests   |mockery     |     NA|
|suggests   |readr       |     NA|
|suggests   |usethis     |     NA|
|suggests   |knitr       |     NA|
|suggests   |rmarkdown   |     NA|
|suggests   |tidyverse   |     NA|
|suggests   |patrick     |     NA|
|suggests   |stringi     |     NA|
|linking_to |NA          |     NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()', and examining the 'external_calls' table.

base

list (12), is.na (9), paste (9), c (6), do.call (5), grepl (5), as.character (4), col (3), colnames (3), inherits (3), nrow (3), class (2), match (2), Sys.time (2), by (1), duplicated (1), for (1), formals (1), gsub (1), if (1), ifelse (1), is.null (1), isTRUE (1), lapply (1), seq_len (1), setdiff (1), strsplit (1), tryCatch (1), warning (1)

glue

glue (46), glue_collapse (1), identity_transformer (1)

dwctaxon

null_transformer (8), assert_that_d (2), any_not_true (1), assert_col (1), assert_dat (1), assert_that_uses_one_name (1), bind_rows_f (1), check_acc_id_has_tax_status (1), check_acc_id_valid_tax_status (1), check_accepted_map_to_nothing (1), check_col_names_p (1), check_fill_usage_id_name (1), check_mapping_exists (1), check_mapping_strict_status (1), check_mapping_to_self (1), check_sci_name_is_uniq (1), check_sci_name_not_na (1), dct_modify_row (1), dct_options (1), paste3 (1)

settings

inlist (22), options_manager (1)

dplyr

mutate (4), filter (3), anti_join (1), bind_rows (1)

stringr

fixed (4), str_detect (1), str_match (1)

utils

data (4), capture.output (1)

purrr

map_lgl (4)

tibble

tibble (2)

assertr

success_logical (1)

graphics

text (1)

stats

df (1)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties

The package has:

- code in R (100% in 17 files) and
- 1 authors
- 3 vignettes
- 2 internal data files
- 11 imported packages
- 9 exported functions (median 29 lines of code)
- 106 non-exported functions in R (median 11 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages

The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html)

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    17|       76.7|           |
|files_vignettes          |     3|       92.4|           |
|files_tests              |    14|       93.8|           |
|loc_R                    |  2343|       86.9|           |
|loc_vignettes            |   324|       66.2|           |
|loc_tests                |  2584|       95.6|TRUE       |
|num_vignettes            |     3|       94.2|           |
|data_size_total          | 28214|       76.8|           |
|data_size_median         | 14107|       81.1|           |
|n_fns_r                  |   115|       79.7|           |
|n_fns_r_exported         |     9|       42.0|           |
|n_fns_r_not_exported     |   106|       84.9|           |
|n_fns_per_file_r         |     5|       67.9|           |
|num_params_per_fn        |     5|       69.6|           |
|loc_per_fn_r             |    12|       36.1|           |
|loc_per_fn_r_exp         |    29|       61.6|           |
|loc_per_fn_r_not_exp     |    11|       35.4|           |
|rel_whitespace_R         |     7|       65.5|           |
|rel_whitespace_vignettes |    43|       75.5|           |
|rel_whitespace_tests     |     4|       74.5|           |
|doclines_per_fn_exp      |    65|       76.5|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |   213|       89.2|           |

---

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3. `goodpractice` and other checks

Details of goodpractice checks

#### 3a. Continuous Integration Badges

[![pkgcheck](https://github.com/joelnitta/dwctaxon/workflows/pkgcheck/badge.svg)](https://github.com/joelnitta/dwctaxon/actions)

**GitHub Workflow Results**

|         id|name                       |conclusion |sha    | run_number|date       |
|----------:|:--------------------------|:----------|:------|----------:|:----------|
| 4191806772|pages build and deployment |success    |9e0773 |         35|2023-02-16 |
| 4191789240|pkgcheck                   |NA         |db71df |          7|2023-02-16 |
| 4191789239|pkgdown                    |success    |db71df |         51|2023-02-16 |
| 4191789237|test-coverage              |success    |db71df |         26|2023-02-16 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following check_fail:

1. cyclocomp

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 96.81

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following functions have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
dct_modify_row_single | 103
dct_add_row | 32
dct_opts | 19
check_acc_id_has_tax_status | 18
check_acc_id_valid_tax_status | 18
check_accepted_map_to_nothing | 18
check_syn_map_to_acc | 18
check_variant_map_to_nonvar | 18
check_variant_map_to_something | 18
check_mapping_exists | 17
check_mapping_to_self | 17
dct_validate | 17

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 10 potential issues:

message | number of times
--- | ---
Avoid library() and require() calls in packages | 10

Package Versions

|package  |version  |
|:--------|:--------|
|pkgstats |0.1.3    |
|pkgcheck |0.1.1.11 |

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

maurolepore commented 1 year ago

@joelnitta thanks a lot for your submission! The checks are looking great. I'll check fit and overlap and come back to you asap.

maurolepore commented 1 year ago

@joelnitta thanks a lot for your patience.

I discussed with other editors and think this submission is in scope. I'll start looking for a handling editor.

joelnitta commented 1 year ago

Great, thanks @maurolepore!

maurolepore commented 1 year ago

@ropensci-review-bot assign @noamross as editor

ropensci-review-bot commented 1 year ago

Assigned! @noamross is now the editor

noamross commented 1 year ago

@ropensci-review-bot seeking reviewers

ropensci-review-bot commented 1 year ago

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/574_status.svg)](https://github.com/ropensci/software-review/issues/574)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

noamross commented 1 year ago

@ropensci-review-bot assign @collinschwantes as reviewer

ropensci-review-bot commented 1 year ago

@collinschwantes added to the reviewers list. Review due date is 2023-03-22. Thanks @collinschwantes for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 1 year ago

@collinschwantes: If you haven't done so, please fill this form for us to update our reviewers records.

noamross commented 1 year ago

Thanks @collinschwantes! Please be sure to look at the diagnostic report above. I note a few instances of packages imported for very few functions (assertr), high-cyclocomplexity functions that may benefit from breaking up, and some other goodpractice notes that should be examined as part of your review.

ropensci-review-bot commented 1 year ago

:calendar: @collinschwantes you have 2 days left before the due date for your review (2023-03-22).

noamross commented 1 year ago

@ropensci-review-bot assign @sformel as reviewer

ropensci-review-bot commented 1 year ago

@sformel added to the reviewers list. Review due date is 2023-04-18. Thanks @sformel for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 1 year ago

@sformel: If you haven't done so, please fill this form for us to update our reviewers records.

noamross commented 1 year ago

@ropensci-review-bot remove @sformel from reviewers

ropensci-review-bot commented 1 year ago

@sformel removed from the reviewers list!

noamross commented 1 year ago

@ropensci-review-bot assign @sformel-usgs as reviewer

ropensci-review-bot commented 1 year ago

@sformel-usgs added to the reviewers list. Review due date is 2023-04-18. Thanks @sformel-usgs for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 1 year ago

@sformel-usgs: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot commented 1 year ago

:calendar: @sformel-usgs you have 2 days left before the due date for your review (2023-04-18).

sformel-usgs commented 1 year ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors: None
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[ ] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 8

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

I found this package to be thoughtfully written and documented. Code appears to have been carefully constructed and functions well. However, I do have a few high level concerns:

TL;DR More context and example should be added to the statement of need.
- The statement of need explains what you can do with the package, but doesn’t do a good job justifying a need or specifying the use cases and gaps the package fills. My first reaction, and that of a colleague I shared the readme with, was that we couldn’t clearly understand why this package was needed from the readme alone. However, I found this talk given by the author to do an excellent job justifying the package creation and explaining the circumstances in which you would want to use it. It may be enough to link to the video as a ‘more detail’ resource. If not, more is needed to explain why/when the need for the package exists. To give a little more context, I am the type of user that is described at the end of the video (mostly referencing authorities like WoRMS, rarely building my own databases). That being said, after working through the vignettes, I was able to imagine how the package would be useful.
- I’m not sure why these functions shouldn’t be included as part of the taxastand package (or vice versa)? Again, the video referenced above gives some sense of why these tasks would be separated, but I think a little more description of how they relate could be useful. I’m not totally convinced that the modification/validation of taxonomic data (dwctaxon) should be separated from the standardization of species names from different sources (taxastand) as these tasks seem to have a lot of overlap.

Documentation

Generally, the instructions and documentation were clear, thoughtful and useful. It made good use of the principle of multiple points of entry without sprawling unnecessarily. I especially appreciate that dct_options is highlighted in the validation vignette, as I think the variety of outputs that can be returned by the functions is a strength of this package. I would also like to compliment the author on the “What is DWC?” vignette. It is a thorough and clear explanation of DwC with regard to taxonomy and will be helpful to novices. I do, however, have some suggestions below.

Throughout the documentation, but most noticeably in the ‘What is DWC?’ vignette, Darwin Core is abbreviated ‘DWC’. Generally speaking, I see/use ‘DwC’, and that is what is described on the Wikipedia page since 2011. Not critical, but I would be curious to hear whether the author has a strong opinion on the appropriate abbreviation. If not, I would prefer DwC.
In the ‘What is DWC?’ vignette, line 64 describes genus, family and order as “higher taxonomic levels”. I understand these to be “lower levels”, which seems consistent with the comment in the DwC taxonomic term higherClassification.
Line 64 of the editing vignette references ‘hash’ generically. I think it would be worth specifying that it is using MD5.
utils.R credits the paste3 function to a stackoverflow conversation. I’m not sure of the best way to give credit in this situation, but it seems like it would also be good to include the user [“IRTFM”] and userID [“1855677”], since it is a function that is clearly written by one person.

Logical programming API

Yes, however see suggestions below.
In the function, dct_add_row, I’m curious about the decision to only use the first 8 characters from the hash as the taxonID. As you note, this may result in duplicates, why not preserve the entire hash as taxonID?
I disagree with the use/creation of aliases as described in lines 72-89 of the editing vignette. This promotes confusion about the name of the DwC standard term. Especially in this case where the term is a function parameter and users can use tab to autocomplete, rather than typing out the names. I think this aspect of the package should be revised, and the vignette can point out tab autocomplete as a way to avoid typing out the rather long names of DwC.

automated tool review

Checking the initial package report generated by our @ropensci-review-bot.

As noted in the goodpractice checks, dct_modify_row and dct_add_row have high-cyclocomplexity. I don’t have any great suggestions for ways to break it up. But, if my suggestion, that aliases for DwC terms should be avoided, is accepted, then some of the checks for misuse of aliases could be removed.
The library and require calls described by lintr in goodpractices are acceptable. They all appear to be associated with vignettes, tests, or documentation of raw data creation.
It appears that the assertr package is imported for a single function in utilties.R. I don’t know of a simple way to replace this function and reduce the dependency.
The package dependency analysis did not identify calls for the imported packages assertthat, digest, and rlang. I have checked that these are necessary for the package to function.

Checking the package’s logs on its continuous integration services (GitHub Actions, Codecov, etc.)

One issue addressed by the author as a possible problem with pkgcheck when run remotely.

Running devtools::check() and devtools::test() on the package to find any errors that may be missed on the author’s system.

Passed without errors, warnings, or unusual skips.

joelnitta commented 1 year ago

Thanks @sformel-usgs for the thorough and helpful review! I will wait on the other review from @collinschwantes before addressing any comments in case any of them overlap and/or conflict.

collinschwantes commented 1 year ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[ ] A statement of need: clearly stating problems the software is designed to solve and its target audience in README ~
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: ~8

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer (“rev” role) in the package DESCRIPTION file.

Review Comments

Overall easy to use and well documented. I really enjoyed testing this package. That being said, the statement of need could more clearly define who the target audience is. Is it biologists who are not necessarily taxonomists, taxonomists, collections managers?

I tested the package using data from a project I worked on back in 2014. Once I had the taxa data in the proper form, all functions worked as expected. I did notice that certain functions create missing columns or provide errors for non-dct terms. It's not always obvious which functions will generate new columns and which won't. See dct_fill_col vs dct_add_row.

I would prefer that functions use DwC terms instead of abbreviations. I didn't find the abbreviations saved me that much time from a typing perspective, and I had to remember the abbreviation plus the actual term. Not a lot of mental effort but non-zero.
Since there is an "add_row" function, it may be worth including a drop_row?
dct_fill_col does not throw an error if the fill_from argument is a non-existent column that is a dct_term.
One error from a failing test when running devtools::check(): dct_add_row throws an error.
dct_modify_row_single has a lot of if statements that are nicely labelled in the function. Consider whether or not each of those sections of the function could be split into its own, smaller, easier to evaluate function.
You could also take advantage of rlang::empty to test if something is NA or NULL.
The dependency section in the packaging guidelines recommends using minimum version numbers for the package dependencies
Would be great (though potentially out of scope for the package) to have a vignette that shows how to retrieve taxon data and/or what you can do with taxon data once you have it properly formatted in Darwin core.
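The NA-or-NULL check suggested in the list above can be sketched in base R without relying on any particular rlang helper. The function name `is_null_or_na()` is hypothetical and does not come from dwctaxon or rlang; it only illustrates one way such a check could be written, treating zero-length input as missing too.

```r
# Hypothetical helper (not part of dwctaxon or rlang): returns TRUE when an
# argument carries no usable value, i.e. it is NULL, zero-length, or all NA.
is_null_or_na <- function(x) {
  is.null(x) || length(x) == 0 || all(is.na(x))
}

is_null_or_na(NULL)           # argument not supplied
is_null_or_na(NA_character_)  # explicitly missing value
is_null_or_na("accepted")     # a real value
```

A single predicate like this lets each argument be tested in one place instead of repeating separate `is.null()` and `is.na()` branches.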

joelnitta commented 1 year ago

Thanks @collinschwantes for the helpful review! I will try to post a revision addressing both reviewers' comments as soon as possible.

noamross commented 1 year ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/574#issuecomment-1521174676 time 8

ropensci-review-bot commented 1 year ago

Logged review for collinschwantes (hours: 8)

noamross commented 1 year ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/574#issuecomment-1512072208 time 8

ropensci-review-bot commented 1 year ago

Logged review for sformel-usgs (hours: 8)

ropensci-review-bot commented 1 year ago

@joelnitta: please post your response with @ropensci-review-bot submit response <url to issue comment> if you haven't done so already (this is an automatic reminder).

Here's the author guide for response. https://devguide.ropensci.org/authors-guide.html

joelnitta commented 1 year ago

@ropensci-review-bot submit response https://github.com/joelnitta/dwctaxon/commit/82aa4ea48834f3c31640d99f209f192b656d94fe

Thanks again to the reviewers for their helpful comments! Here are my responses.

Responses to @sformel-usgs

Review Comments

I found this package to be thoughtfully written and documented. Code appears to have been carefully constructed and functions well. However, I do have a few high level concerns:

TL;DR More context and example should be added to the statement of need.

The Statement of Need has been clarified in the README (https://github.com/joelnitta/dwctaxon/commit/c86673168bfacc6417da57d545f05e8197c8962b).

I’m not sure why these functions shouldn’t be included as part of the taxastand package (or vice versa)? Again, the video referenced above give some sense of why these tasks would be separated, but I think a little more description of how they relate could be useful. I’m not totally convinced that the modification/validation of taxonomic data (dwctaxon) should be separated from the standardization of species names from different sources (taxastand) as these tasks seem to have a lot of overlap.

Usage of dwctaxon is not limited to standardization of species names, as described in the revised Statement of Need and demonstrated in the newly added "Real World Data" vignette. Furthermore, different packages / software are available to do name standardization (e.g., U.Taxonstand in addition to taxastand). So in the case of name standardization, I think it makes more sense to provide dwctaxon as a separate tool that can be used to prepare the reference database; the user could then standardize names against the reference using the tool of their choice.

Documentation

Throughout the documentation, but most noticeably in the ‘What is DWC?’ vignette, Darwin Core is abbreviated ‘DWC’. Generally speaking, I see/use ‘DwC’, and that is what is described on the Wikipedia page since 2011. Not critical, but I would be curious to hear whether the author has a strong opinion on the appropriate abbreviation. If not, I would prefer DwC.

"DWC" has been changed to "DwC" throughout (https://github.com/joelnitta/dwctaxon/commit/fd1e14918b0e70f1711584c524e693fd3e1b746a).

In the ‘What is DWC?’ vignette, line 64 describes genus, family and order as “higher taxonomic levels”. I understand these to be “lower levels”, which seems consistent with the comment in the DwC taxonomic term higherClassification.

The wording has been changed to "taxonomic levels above species" (https://github.com/joelnitta/dwctaxon/commit/06964165f1248a2f326e84947be1f6bdafa202dd).

Line 64 of the editing vignette references ‘hash’ generically. I think it would be worth specifying that it is using MD5.

It is now specified that MD5 is used (https://github.com/joelnitta/dwctaxon/commit/ef7c3931f3830e5b06a82d5a7535271521f8aa1f).

utils.R credits the paste3 function to a stackoverflow conversation. I’m not sure of the best way to give credit in this situation, but it seems like it would also be good to include the user [“IRTFM”] and userID [“1855677”], since it is a function that is clearly written by one person.

paste3() is no longer used (https://github.com/joelnitta/dwctaxon/commit/d25180cfd6b23999971c3b8b60cc2aabcd4c41a2).

Logical programming API

Yes, however see suggestions below.

In the function dct_add_row, I’m curious about the decision to only use the first 8 characters from the hash as the taxonID. As you note, this may result in duplicates. Why not preserve the entire hash as taxonID?

Now the entire MD5 is used by default, and there is an option to use fewer characters (https://github.com/joelnitta/dwctaxon/commit/ef7c3931f3830e5b06a82d5a7535271521f8aa1f).
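
For readers following the discussion, here is a minimal sketch of the trade-off between a full and a truncated MD5 digest. It is not the package's actual implementation: the helper name make_taxon_id() and its n_char argument are hypothetical, and hashing the scientific name alone is an assumption made for illustration. It relies only on digest::digest() from the digest package, which dwctaxon already imports.

```r
library(digest)

# Hypothetical helper (not part of dwctaxon): derive an ID from a name by
# MD5 hashing. `n_char` keeps only the first n characters of the
# 32-character hexadecimal digest.
make_taxon_id <- function(scientific_name, n_char = 32) {
  stopifnot(n_char >= 1, n_char <= 32)
  hashes <- vapply(
    scientific_name,
    # serialize = FALSE hashes the string itself, not its R serialization
    function(x) digest(x, algo = "md5", serialize = FALSE),
    character(1),
    USE.NAMES = FALSE
  )
  substr(hashes, 1, n_char)
}

taxa <- c("Homo sapiens", "Pan troglodytes")
full_ids <- make_taxon_id(taxa)               # full 32-character digests
short_ids <- make_taxon_id(taxa, n_char = 8)  # first 8 characters only
```

Eight hexadecimal characters encode 32 bits, so by the birthday bound a collision becomes more likely than not at roughly 77,000 IDs. The full 128-bit digest makes accidental duplicates negligible at the cost of longer identifiers.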

I disagree with the use/creation of aliases as described in lines 72-89 of the editing vignette. This promotes confusion about the name of the DwC standard term. Especially in this case where the term is a function parameter and users can use tab to autocomplete, rather than typing out the names. I think this aspect of the package should be revised, and the vignette can point out tab autocomplete as a way to avoid typing out the rather long names of DwC.

These aliases are no longer used (https://github.com/joelnitta/dwctaxon/commit/b1196238489e1441c13176c424229dd3c44bf469).

automated tool review

Checking the initial package report generated by our @ropensci-review-bot.

As noted in the goodpractice checks, dct_modify_row and dct_add_row have high cyclomatic complexity. I don’t have any great suggestions for ways to break them up. But if my suggestion that aliases for DwC terms should be avoided is accepted, then some of the checks for misuse of aliases could be removed.

dct_modify_row_single() previously had the highest cyclomatic complexity (103). Numerous subfunctions have been split out from dct_modify_row_single(), and there are no longer any if() statements in the main function, greatly decreasing cyclomatic complexity to 14 (https://github.com/joelnitta/dwctaxon/commit/315bd1970cc5dfed9d60b10636735bbc911f6f4e). Unfortunately one of the subfunctions, create_new_row_by_modification(), still contains a large number of if() statements and has high cyclomatic complexity (56), but this is a significant improvement from 103. I would like to avoid further breaking up create_new_row_by_modification() because it has a well-defined purpose (creating a single row), and it is easier to understand the conditional relationships when they are all at the same level and immediately visible.

The library and require calls described by lintr in goodpractices are acceptable. They all appear to be associated with vignettes, tests, or documentation of raw data creation.

It appears that the assertr package is imported for a single function in utilties.R. I don’t know of a simple way to replace this function and reduce the dependency.

Thanks for catching this. That function is no longer needed and has been removed, along with the dependency on assertr (https://github.com/joelnitta/dwctaxon/commit/b60ddd8cb57b53218366beb50cf98cf95fa3e7de).

Responses to @collinschwantes

Review Comments

Overall easy to use and well documented. I really enjoyed testing this package. That being said, the statement of need could more clearly define who the target audience is. Is it biologists who are not necessarily taxonomists, taxonomists, or collections managers?

The target audience is anybody who needs to maintain DwC taxonomic data and uses R. The Statement of Need has been clarified and use-cases added (https://github.com/joelnitta/dwctaxon/commit/c86673168bfacc6417da57d545f05e8197c8962b).

I tested the package using data from a project I worked on back in 2014. Once I had the taxa data in the proper form, all functions worked as expected. I did notice that certain functions create missing columns or provide errors for non-dct terms. It's not always obvious which functions will generate new columns and which won't. See dct_fill_col vs dct_add_row.

Documentation has been added to clarify when new columns are added (https://github.com/joelnitta/dwctaxon/commit/9cb465ed9e65bbb054176f64152fcd1b10880e66). Non-dct terms are never added as new columns.

I would prefer that functions use DwC terms instead of abbreviations. I didn't find that the abbreviations saved me much time from a typing perspective, and I had to remember the abbreviation plus the actual term. Not a lot of mental effort, but non-zero.

These aliases are no longer used (https://github.com/joelnitta/dwctaxon/commit/b1196238489e1441c13176c424229dd3c44bf469).

Since there is an "add_row" function, it may be worth including a drop_row?

Thanks for the suggestion. This has been added as dct_drop_row() (https://github.com/joelnitta/dwctaxon/commit/54ba0419cc74f8cfeb5d0e401d06d4fd54443ca9).

dct_fill_col does not throw an error if the fill_from argument is a non-existent column that is a dct_term.

Thanks for pointing out this bug; it has been fixed (https://github.com/joelnitta/dwctaxon/commit/3f52a9c1c926a1a42636670f8b3f0c73d5bf2436).

One error for failing test when running devtools::check. dct_add_row throws an error.

It is not clear to me what error this was, but in its current state devtools::check() passes locally and on CI builds.

dct_modify_row_single has a lot of if statements that are nicely labelled in the function. Consider whether or not each of those sections of the function could be split into its own, smaller, easier to evaluate function.

Thanks for the suggestion. Several sections of dct_modify_row_single() have been split into subfunctions (https://github.com/joelnitta/dwctaxon/commit/315bd1970cc5dfed9d60b10636735bbc911f6f4e).

You could also take advantage of rlang::empty to test if something is NA or NULL.

I have been able to clean up this code by moving the call to is.null() as one of the conditionals checked with assert_that() instead of running assert_that() conditional on the result of is.null() (https://github.com/joelnitta/dwctaxon/commit/712d07ea73058ecc2951c3ec6d1c6a7e22a7be62).
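
As a rough illustration of the pattern described here, the sketch below folds the is.null() check into the assertion itself instead of wrapping assert_that() in an if() block. The function check_taxon_id() and its error message are hypothetical and are not taken from the dwctaxon source; assert_that(), its msg argument, and is.string() come from the assertthat package.

```r
library(assertthat)

# Hypothetical validator (not from dwctaxon): an argument may be NULL,
# or it must be a single string.
check_taxon_id <- function(taxon_id = NULL) {
  # Before: if (!is.null(taxon_id)) assert_that(is.string(taxon_id))
  # After: is.null() is one of the accepted conditions, so no if() is needed
  assert_that(
    is.null(taxon_id) || is.string(taxon_id),
    msg = "taxon_id must be NULL or a single string"
  )
  invisible(TRUE)
}

check_taxon_id(NULL)   # passes
check_taxon_id("abc")  # passes
```

A call such as check_taxon_id(1) fails with the custom message. Removing the enclosing if() is also one less branch counted toward the function's cyclomatic complexity.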

The dependency section in the packaging guidelines recommends using minimum version numbers for the package dependencies

The packaging guidelines only recommend using minimum version numbers if there is a known minimum version that would otherwise cause the package to break. I am not aware of any, so I did not include them.

Would be great (though potentially out of scope for the package) to have a vignette that shows how to retrieve taxon data and/or what you can do with taxon data once you have it properly formatted in Darwin core.

Thanks for the suggestion. This has been added as the "Real World Data" vignette (https://github.com/joelnitta/dwctaxon/commit/2af31e9fa82a2724b24ebad30c3fe86bfbe2e658).

sformel-usgs commented 1 year ago

Thanks @joelnitta, these are all satisfactory responses and revisions. I appreciate the changes you've made to the Statement of Need and the References in the readme. The Real World vignette was also a great addition for folks who might be starting out in this realm.

I rechecked the package and only came across one minor issue. On my Windows computer, the real-world data vignette fails to build because the download.file call results in a corrupt zip file. A solution is to add mode = "wb". But you should double check this, because I'm not 100% sure that it won't cause problems for non-Windows machines. This is something new that just started occurring for me (non-text files becoming corrupt). I'm not sure what changed, but apparently, it's not unusual.
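
A minimal sketch of the suggested fix follows. The wrapper name download_binary() and the example URL are placeholders and are not what the vignette uses. On Windows, mode = "wb" makes download.file() write the file in binary mode, so the bytes of a zip archive are not altered. Unix-like systems do not distinguish text and binary modes, so the argument should be harmless there.

```r
# Hypothetical wrapper (not from the vignette): always download in binary
# mode so that non-text files such as zip archives are not corrupted on
# Windows.
download_binary <- function(url, destfile) {
  utils::download.file(url, destfile = destfile, mode = "wb")
  invisible(destfile)
}

# Usage (placeholder URL, shown commented out because it needs a network
# connection):
# zip_path <- tempfile(fileext = ".zip")
# download_binary("https://example.org/taxon-data.zip", zip_path)
# utils::unzip(zip_path, list = TRUE)
```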

sformel-usgs commented 1 year ago

Thanks for handling that so quickly! Everything is working smoothly now. I have no more suggestions or concerns.

noamross commented 1 year ago

Thank you for the robust reply, @joelnitta, and your follow-up, @sformel-usgs. @collinschwantes, please look at the changes to the package and indicate if they address your review.

collinschwantes commented 1 year ago

@joelnitta Looks great!! The real world example is very helpful! No more suggestions or concerns.

noamross commented 1 year ago

@ropensci-review-bot approve dwctaxon

ropensci-review-bot commented 1 year ago

Approved! Thanks @joelnitta for submitting and @collinschwantes, @sformel-usgs for your reviews! :grin:

To-dos:

[ ] Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account. This invitation will expire after one week. If it happens write a comment @ropensci-review-bot invite me to ropensci/<package-name> which will re-send an invitation.
[ ] After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
[ ] Fix all links to the GitHub repo to point to the repo under the ropensci organization.
[ ] Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
[ ] If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
- deactivate the automatic deployment you might have set up
- remove styling tweaks from your pkgdown config but keep that config file
- replace the whole current pkgdown website with a redirecting page
- replace your package docs URL with https://docs.ropensci.org/package_name
- In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar
[ ] Skim the docs of the pkgdown automatic deployment, in particular if your website needs MathJax.
[ ] Fix any links in badges for CI and coverage to point to the new repository URL.
[ ] Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
[ ] We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative, see https://docs.ropensci.org/codemetar/ for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.
[ ] You can add this installation method to your package README install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practices and tips; this chapter starts the 3rd section, which is about guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer via filling a short form.

joelnitta commented 1 year ago

@ropensci-review-bot finalize transfer of dwctaxon

ropensci-review-bot commented 1 year ago

Transfer completed. The dwctaxon team is now owner of the repository and the author has been invited to the team

joelnitta commented 1 year ago

Thanks again to reviewers @sformel-usgs and @collinschwantes and editor @noamross !

I have completed all of the tasks above:

[x] Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account. This invitation will expire after one week. If it happens write a comment @ropensci-review-bot invite me to ropensci/<package-name> which will re-send an invitation.
[x] After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
[x] Fix all links to the GitHub repo to point to the repo under the ropensci organization.
[x] Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
[x] If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
- deactivate the automatic deployment you might have set up
- remove styling tweaks from your pkgdown config but keep that config file
- replace the whole current pkgdown website with a redirecting page
- replace your package docs URL with https://docs.ropensci.org/package_name
- In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar
[x] Skim the docs of the pkgdown automatic deployment, in particular if your website needs MathJax.
[x] Fix any links in badges for CI and coverage to point to the new repository URL.
[x] Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
[x] We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative, see https://docs.ropensci.org/codemetar/ for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.
[x] You can add this installation method to your package README install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

I'm happy this package now has a home at rOpenSci!
