[REVIEW]: covidregionaldata: Subnational data for COVID-19 epidemiology

whedon commented 3 years ago

Submitting author: @joseph-palmer (Joseph Palmer) Repository: https://github.com/epiforecasts/covidregionaldata Version: 0.9.2 Editor: @csoneson Reviewer: @mponce0, @federicomarini Archive: 10.5281/zenodo.5059988

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/dd6f7acdae3b7136a3ac373ce9a0655c"><img src="https://joss.theoj.org/papers/dd6f7acdae3b7136a3ac373ce9a0655c/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/dd6f7acdae3b7136a3ac373ce9a0655c/status.svg)](https://joss.theoj.org/papers/dd6f7acdae3b7136a3ac373ce9a0655c)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@mponce0 & @federicomarini, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @csoneson know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @mponce0

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@joseph-palmer) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @federicomarini

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@joseph-palmer) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @mponce0, @federicomarini it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 3 years ago

Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.07 s (1518.1 files/s, 128482.8 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
R                               60            365           1950           3904
YAML                            27            218             10           1167
Markdown                         6            162              0            501
TeX                              1             19              0            228
Rmd                              4            139            226            146
SVG                              8              0              0              8
Dockerfile                       1              3              6              4
-------------------------------------------------------------------------------
SUM:                           107            906           2192           5958
-------------------------------------------------------------------------------

Statistical information for the repository 'd8e76acfdd3c4932c26f5c27' was
gathered on 2021/05/17.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Hamish Gibbs                     1           159              5           22.97
seabbs                           2           198            352           77.03

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

csoneson commented 3 years ago

👋🏼 @joseph-palmer, @mponce0, @federicomarini - this is the review thread for the submission. All of our communications will happen here from now on.

Both reviewers have checklists at the top of this thread. As you go over the submission, please check any items that you feel have been satisfied. There are also links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues directly in the software repository. If you do so, please mention this thread so that a link is created (and I can keep an eye on what is happening). Please also feel free to comment and ask questions in this thread. It is often easier to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

Please feel free to ping me (@csoneson) if you have any questions or concerns. Thanks!

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.12688/wellcomeopenres.16006.2 is OK
- 10.5281/zenodo.3957539 is OK
- 10.21105/joss.02376 is OK
- 10.1126/science.abg3055 is OK
- 10.1016/S1473-3099(20)30120-1 is OK
- 10.12688/wellcomeopenres.15842.3 is OK
- 10.21105/joss.02376 is OK
- 10.1038/s41562-021-01079-8 is OK
- 10.1080/13876988.2021.1873703 is OK
- 10.1101/2020.10.18.20214585 is OK
- 10.21105/joss.01686 is OK

MISSING DOIs

- None

INVALID DOIs

- None

mponce0 commented 3 years ago

@csoneson @Joseph-palmer I'm done with my initial review, you will notice that some of the boxes on my list are not checked yet and this is related to some of the points I'm mentioning below. Please correct me if I'm mistaken in any of these points.

LICENSE file: there are indeed two files named license, LICENSE and LICENSE.md, which are not consistent. Ie. one states the copyright and the other a proper MIT License.
Versioning, the version number of the package appears to be arbitrary, shouldn't be ver 1.0?
There is a good number of authors, including a "working group", which makes not clear the individual contributions. Perhaps a few sentences clarifying this can be added to the package repository or github page.
API well documented.
It could be interesting to include some details about the data pipeline process, which appears to be done through github workflows.
Documentation is missing some elements, some of them they are present though in the manuscript:
- A statement of need: state what problems the software is designed to solve and who the target audience is
- Although there are tests included, these do not appear to be run automatically when the package is installed, i.e. when using install.package() in R. Also R-CMD-check is showing a Failing status.
- Although the community guidelines are included in the documentation, the title of the section is a little but misleading, I'd call it "Community Contributions" or something alike...
- Documentation could benefit from a few more examples, showing how to utilize more of the functions available in the package.
- I'd also emphasize the link to the documentation site from the README file, as it is right now it can be easily missed.
WRT the data sources: Maybe a few sentences clarifying the following points would be worth including in the manuscript.
- One thing is not quite clear to me is how scalable is the solution of including new additional data sets, i.e. as stated in the manuscript "data sources which are often collected and maintained by individuals or small teams."
  - Additionally, how robust are the data pipelines wrt to changes in or lost from the original sources. I.e. what would happen if a source link is broken or the data structure of the reported data changes, something that was quite frequent in some of the most popular data repository at the beginning of the pandemic.

csoneson commented 3 years ago

Thanks @mponce0!

joseph-palmer commented 3 years ago

Thanks @mponce0 for the helpful review comments - really nice to get some critical eyes on the package. Some detailed responses below.

LICENSE file: there are indeed two files named license, LICENSE and LICENSE.md, which are not consistent. Ie. one states the copyright and the other a proper MIT License.

The licence documentation follows the format of other packages, such as dplyr (https://github.com/tidyverse/dplyr). We are happy to update this is you have a link to better/newer practice that is CRAN compliant?

Versioning, the version number of the package appears to be arbitrary, shouldn't be ver 1.0?

We follow the convention of MAJOR.MINOR.PATCH (https://semver.org/), with the current version (on branch master) being 0.9.2. As you say we will look to move to version 1.0 in the near future but are hoping for sizable user contributions to expand our coverage for subnational data sources in Africa and to refine our optional processing support.

There is a good number of authors, including a "working group", which makes not clear the individual contributions. Perhaps a few sentences clarifying this can be added to the package repository or github page.

Contributions can be seen from the DESCRIPTION file and github commit logs (https://github.com/epiforecasts/covidregionaldata/pulse) which I believe shows clear evidence of code contributions. The working group refers to CMMID COVID-19 working group at LSHTM (https://cmmid.github.io/groups/ncov-group.html) and is referred to as members of the group made sizeable contributions to the planning and design of the package. @csoneson what are the current JOSS guidelines for showing contributions to the package? It is unusual to have this in the package readme etc?

It could be interesting to include some details about the data pipeline process, which appears to be done through github workflows.

The github workflows are used to check data integrity, such as that the raw data can be downloaded and that our inbuilt tests catch any changes, rather than to provide a processed database of data (unlike other packages that are similar). We felt that this was explained in the paper + docs but very happy to make changes to make this clearer?

A statement of need: state what problems the software is designed to solve and who the target audience is

We have included a statement of need section in the paper which addresses what problem the software solves and who it is intended for. As requested we have extended our initial description in the README to give more detail (https://github.com/epiforecasts/covidregionaldata/pull/362).

Although there are tests included, these do not appear to be run automatically when the package is installed, i.e. when using install.package() in R. Also R-CMD-check is showing a Failing status.

The package is passing checks both on CRAN and Github. For all packages (that we are aware of), tests run on package check rather than install. Users can check test status by either checking our data status page of the docs (https://epiforecasts.io/covidregionaldata/articles/dataset-status.html) or by running the tests directly (https://epiforecasts.io/covidregionaldata/articles/testing.html).

Although the community guidelines are included in the documentation, the title of the section is a little but misleading, I'd call it "Community Contributions" or something alike…

The document outlines our code of conduct and standards and is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. We are hesitant to use ‘Community Contributions’ as this implies a page listing user contributions.

Documentation could benefit from a few more examples, showing how to utilize more of the functions available in the package.

I agree, packages could always benefit from additional documentation but we have provided documentation for all of our functions, as well as vignettes outlining how to use the package. In an effort to expand our documentation we have currently open issues to develop our vignettes (https://github.com/epiforecasts/covidregionaldata/issues/316, https://github.com/epiforecasts/covidregionaldata/issues/314). If there is an area that you feel is critically lacking in documentation we can expand this but would need some more specific guidance.

I'd also emphasize the link to the documentation site from the README file, as it is right now it can be easily missed.

The docs are currently linked in the README, and the cran site (https://cran.r-project.org/web/packages/covidregionaldata/index.html). We are happy to include additional links but a pointer indicating how best to make this visible would be helpful.

One thing is not quite clear to me is how scalable is the solution of including new additional data sets, i.e. as stated in the manuscript "data sources which are often collected and maintained by individuals or small teams."

We have created guides to adding new data sources (https://github.com/epiforecasts/covidregionaldata/wiki) and data continues to be added to the package (https://github.com/epiforecasts/covidregionaldata/pull/358). We encourage our contributors to take ownership of the classes they add and monitor any changes which need addressing. Our automated data integrity checking allows us to identify datasets that are problematic without user reports. Of course if teams stop collating data then our processing will no longer be updated.

Additionally, how robust are the data pipelines wrt to changes in or lost from the original sources. I.e. what would happen if a source link is broken or the data structure of the reported data changes, something that was quite frequent in some of the most popular data repository at the beginning of the pandemic.

As noted in the manuscript and in the package documentation we test all processing code on all datasets each night using GitHub actions. Changes in underlying data structure would cause these tests to fail (in core variables only). This would then be broadcast to users via the data status vignette and our rendered documentation. At this point either a data owner or a user would submit a patch to be added to the dev build of the package. Our docs would then show the issue was fixed in the dev version and not CRAN until we pushed that version to CRAN. Users can also run these tests themselves when interacting with the package.

Once again, big thanks for taking the time with this!

federicomarini commented 3 years ago

Hi there @joseph-palmer, I'm going to start from some comments on the manuscript, while I continue with the code&package review.

Paper's well written and clearly describing the aims of covidregionaldata in the context of the need for fine-grained resolution data for localized health policy making for the COVID-19 pandemic.

An adequate survey of the current alternatives and state of the art is included, and the gap filled by this package is nicely defined.

A couple of minor comments:

Summary:
- maybe use a newer reference to the R core team?
- l 16-17, repetition of extensible/extend
- should links be kept also in their full form to be visible when printed? Might depend on the guidelines of JOSS, so I'm pinging @csoneson briefly
Statement of need:
- l.38: "changes to a dependency package" is not so clearly defined, do you mean something on the line of "changes in the subsequent processing steps" (for packages that use these sources)?
- l.53: "test and report... daily" -> You could specify this happens via GHA and their output is recorded on the project webpage
- while reading l.42-47: as a general idea: is there some way to track/record the origin or snapshot of the data for the files that are sourced? Something like an attribute could do it, but maybe some similar mechanism is already in place
References:
- Are the versions for sars2pack and covdata needed to ensure that this is the status quo at the moment of writing the manuscript? If dois are available, that could be even better
Footnote: is it correct to have a $ after Palmer as in Palmer$ et al? Maybe it is a feature of the build system

csoneson commented 3 years ago

Thanks @federicomarini! On this:

should links be kept also in their full form to be visible when printed? Might depend on the guidelines of JOSS, so I'm pinging @csoneson briefly

It's quite common in JOSS papers, so I'd say that's ok.

federicomarini commented 3 years ago

Here's the rest of my review - I second most points raised by @mponce0, thanks for your thoughtful review as well!

The main point I'd raise is that a package user (let's say "the average one", not only a developer) could benefit a big lot from these fully worked out examples (also raised as a point in the checklist-driven review). Maybe I am a little spoiled by how comprehensive some vignettes are in the Bioconductor project 😉

Functionality

Installation: The installation works smoothly, as documented!
Functionality: The package does deliver the functionality stated in the manuscript and is showcased in the vignettes
Performance: No particular performance claims are made, but it seems that the runtime is reasonable - optimizations with memoise are in place to avoid repetitions of operations

Documentation

A statement of need: These are elaborated in the vignettes. Maybe add a (more explicit) statement regarding what the target audience would be?
Installation instructions: The dependencies are all installed and correctly handled - maybe you can point out that to run the vignette examples you might need to use dependencies = TRUE in the installation command?
Example usage: The vignettes are present, and a very welcome piece of documentation. It might be overkill, but maybe some "fully worked out use cases" in one of the studies where the package has been used can add even more value? I realize there might be also some runtime/check-time constraints, so it is up to the authors to explore this point.
Functionality documentation: Well documented!
Automated tests: The test suite is well structured, and covers a wide portion of the code to retrieve and process the data. Maybe some additional tests might be desired to increase the overall coverage? Again, it might be a tradeoff between test runtimes and package coverage...
Community guidelines: A code of conduct is present, and the BugReports field of the DESCRIPTION correctly points to the repo issues.

Miscellaneous items

the help in some classes seems redundant, see ?Belgium (multiple same urls)
slowstart has a function that returns an error - probably country?
get_regional_data("Belgium", totals = TRUE) has an extra row - maybe worth run these chunks, or is it because of CRAN's policy?
maybe have the slowstart vignette evaluated? By having these chunks automatically generated, it would be robust to changes in the underlying data

federicomarini commented 3 years ago

Side note: massive, impressive work with the Github Actions to automate the operations! I'm in developer awe 💯

whedon commented 3 years ago

:wave: @federicomarini, please update us on how your review is going (this is an automated reminder).

whedon commented 3 years ago

:wave: @mponce0, please update us on how your review is going (this is an automated reminder).

federicomarini commented 3 years ago

👋 @federicomarini, please update us on how your review is going (this is an automated reminder).

All good in the hood dear @whedon 😉 My checklist is mostly complete, and the items that are left unchecked are addressed in the review points I reported.

mponce0 commented 3 years ago

Thanks @mponce0 for the helpful review comments - really nice to get some critical eyes on the package. Some detailed responses below.

Responding to some of the points, I have updated the ones we have agreed upon on the check list.

LICENSE file: there are indeed two files named license, LICENSE and LICENSE.md, which are not consistent. Ie. one states the copyright and the other a proper MIT License.

The licence documentation follows the format of other packages, such as dplyr (https://github.com/tidyverse/dplyr). We are happy to update this is you have a link to better/newer practice that is CRAN compliant?

OK, I usually prefer a simpler approach to licensing but it is OK this way too, just wanted to be sure that the two files wasn't a consequence of a "leftover".

Versioning, the version number of the package appears to be arbitrary, shouldn't be ver 1.0?

We follow the convention of MAJOR.MINOR.PATCH (https://semver.org/), with the current version (on branch master) being 0.9.2. As you say we will look to move to version 1.0 in the near future but are hoping for sizable user contributions to expand our coverage for subnational data sources in Africa and to refine our optional processing support.

I'm still not quite sure about this, I understand your reasoning but I'm not sure the version of the package must be tied to the data being incorporated... this will generate several versions as contributors start to add more and more data... I'm not saying I'm against just doubting here... Also this makes me think about what would happen when one of the data sources is for whatever reason not available anymore... will you prune it or perhaps an alternative approach is to create a copy of the original dataset and keep as "legacy" data? I'm still unsure about the versioning...

There is a good number of authors, including a "working group", which makes not clear the individual contributions. Perhaps a few sentences clarifying this can be added to the package repository or github page.

Contributions can be seen from the DESCRIPTION file and github commit logs (https://github.com/epiforecasts/covidregionaldata/pulse) which I believe shows clear evidence of code contributions. The working group refers to CMMID COVID-19 working group at LSHTM (https://cmmid.github.io/groups/ncov-group.html) and is referred to as members of the group made sizeable contributions to the planning and design of the package. @csoneson what are the current JOSS guidelines for showing contributions to the package? It is unusual to have this in the package readme etc?

Sure, I know that and I checked it, but this is requested from JOSS... if JOSS ( @csoneson ) is OK with it I don't have an issue... just following their guidelines. For instance, I have added a short paragraph in the web README in my own packages due to this request, just FYI.

It could be interesting to include some details about the data pipeline process, which appears to be done through github workflows.

The github workflows are used to check data integrity, such as that the raw data can be downloaded and that our inbuilt tests catch any changes, rather than to provide a processed database of data (unlike other packages that are similar). We felt that this was explained in the paper + docs but very happy to make changes to make this clearer?

Sure, just intrigued about the details and a quick overall top-down description.

A statement of need: state what problems the software is designed to solve and who the target audience is

We have included a statement of need section in the paper which addresses what problem the software solves and who it is intended for. As requested we have extended our initial description in the README to give more detail (epiforecasts/covidregionaldata#362).

Thanks.

Although there are tests included, these do not appear to be run automatically when the package is installed, i.e. when using install.package() in R. Also R-CMD-check is showing a Failing status.

The package is passing checks both on CRAN and Github. For all packages (that we are aware of), tests run on package check rather than install. Users can check test status by either checking our data status page of the docs (https://epiforecasts.io/covidregionaldata/articles/dataset-status.html) or by running the tests directly (https://epiforecasts.io/covidregionaldata/articles/testing.html).

OK, but in your README file I'm still seeing "R-CMD-check" Failing! is this because one of the data pipelines was failing or something else? In any case, it should be unentangled from the data verification process.

Although the community guidelines are included in the documentation, the title of the section is a little but misleading, I'd call it "Community Contributions" or something alike…

The document outlines our code of conduct and standards and is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. We are hesitant to use ‘Community Contributions’ as this implies a page listing user contributions.

Sure, that's OK, most of a suggestion than anything else... I still found the format of the links confusing, as they look more like "badges" or decorations to me than actual links which would invite me to click on them... no major issue here, just personal preference.

Documentation could benefit from a few more examples, showing how to utilize more of the functions available in the package.

I agree, packages could always benefit from additional documentation but we have provided documentation for all of our functions, as well as vignettes outlining how to use the package. In an effort to expand our documentation we have currently open issues to develop our vignettes (epiforecasts/covidregionaldata#316, epiforecasts/covidregionaldata#314). If there is an area that you feel is critically lacking in documentation we can expand this but would need some more specific guidance.

Sounds good, I just thought that there are a lot of interesting things that could have been shown, as for instance the mention to the use cases from your dev branch.

I'd also emphasize the link to the documentation site from the README file, as it is right now it can be easily missed.

The docs are currently linked in the README, and the cran site (https://cran.r-project.org/web/packages/covidregionaldata/index.html). We are happy to include additional links but a pointer indicating how best to make this visible would be helpful.

As mentioned above, it is more of a style thing.

One thing is not quite clear to me is how scalable is the solution of including new additional data sets, i.e. as stated in the manuscript "data sources which are often collected and maintained by individuals or small teams."

We have created guides to adding new data sources (https://github.com/epiforecasts/covidregionaldata/wiki) and data continues to be added to the package (epiforecasts/covidregionaldata#358). We encourage our contributors to take ownership of the classes they add and monitor any changes which need addressing. Our automated data integrity checking allows us to identify datasets that are problematic without user reports. Of course if teams stop collating data then our processing will no longer be updated.

OK

Additionally, how robust are the data pipelines wrt to changes in or lost from the original sources. I.e. what would happen if a source link is broken or the data structure of the reported data changes, something that was quite frequent in some of the most popular data repository at the beginning of the pandemic.

As noted in the manuscript and in the package documentation we test all processing code on all datasets each night using GitHub actions. Changes in underlying data structure would cause these tests to fail (in core variables only). This would then be broadcast to users via the data status vignette and our rendered documentation. At this point either a data owner or a user would submit a patch to be added to the dev build of the package. Our docs would then show the issue was fixed in the dev version and not CRAN until we pushed that version to CRAN. Users can also run these tests themselves when interacting with the package.

But what would happen to users who have already installed the package and try to access a dataset which is not longer available or failing? It will fail, right? No issues here but just be sure to provide an informative message to the user, recall that many R users (most probably) are not so "savvy" to be involved with dev versions, git-repos, etc.

Once again, big thanks for taking the time with this!

My absolute pleasure :) It is a great work and this package certainly contributes to the variety of R packages providing access to covid19 data.

csoneson commented 3 years ago

Thanks @federicomarini and @mponce0 for your quick reactions!

@joseph-palmer - over to you now, please let us know if you have any questions!

And sorry for missing the ping above re: contributions - in general, JOSS leaves it up to the authors to decide on the authorship of the JOSS submission (assuming that this has been discussed and agreed upon among contributors). On the software side itself I think the first step is indeed to include the contributors in the DESCRIPTION file together with the type of contribution, as is currently done. Any additional clarifications, acknowledgements etc would, in my opinion, fit well in the README. @joseph-palmer - I had just one question regarding the working group: you say above that it was included since "members of the group made sizeable contributions to the planning and design of the package". I also note that several of the named authors are indeed members of this working group. Could you clarify whether there were additional contributions made by the working group (beyond the named individuals)? I'm checking in with the larger editorial team as well, I don't think we have a lot of previous JOSS papers listing working groups or consortia among the authors :)

joseph-palmer commented 3 years ago

Apologies for the delay in getting back to you all, it has been a hectic couple of weeks! I have written individual responses to the points you all raised separately below. Once again I am very greatful for your thoughtful comments and advice with getting this package up to standard.

joseph-palmer commented 3 years ago

@federicomarini

maybe use a newer reference to the R core team?

We have updated the reference in the manuscript.

l 16-17, repetition of extensible/extend

We have changed this in the manuscript

l.38: "changes to a dependency package" is not so clearly defined, do you mean something on the line of "changes in the subsequent processing steps" (for packages that use these sources)?

What we mean by this is if a function used from another package (e.g. dplyr, a package on which this package depends on) changes then the function may not work the same way. As far we are aware this is standard terminology.

l.53: "test and report... daily" -> You could specify this happens via GHA and their output is recorded on the project webpage

Certainly, I have updated the manuscript to include this.

while reading l.42-47: as a general idea: is there some way to track/record the origin or snapshot of the data for the files that are sourced? Something like an attribute could do it, but maybe some similar mechanism is already in place

This is covered in the documentation, all the data is stored in the country class in the ‘data’ field, which contains ‘raw’ data (as is when downloaded) in addition to cleaned and processed data.

Are the versions for sars2pack and covdata needed to ensure that this is the status quo at the moment of writing the manuscript? If dois are available, that could be even better

The citations for these are taken from the package using the ‘citation()’ tool in R so are just as the authors report them.

Footnote: is it correct to have a $ after Palmer as in Palmer$ et al? Maybe it is a feature of the build system

No this is an error, good spot! I have updated this in the manuscript.

A statement of need: These are elaborated in the vignettes. Maybe add a (more explicit) statement regarding what the target audience would be?

I have expanded on this point in the README but have stopped short of an explicit statement to the tune of “this package is designed for ….”. This is just personal preference and I think the contents of the README now, along with the other documentation provides potential users with enough information to decide if this package is right for them.

Installation instructions: The dependencies are all installed and correctly handled - maybe you can point out that to run the vignette examples you might need to use dependencies = TRUE in the installation command?

We would generally advise individuals to check out the pkgdown documentation rather than build the vignette but we agree that it should be easy for users to run in R. As far as we are aware having library calls in the vignettes is standard practice for suggested packages required to build them that are not included as package imports. This would naturally support a workflow where the less technical user (who may not know about suggested packages) installs the packages that are not in place when they go to build the vignette.

Example usage: The vignettes are present, and a very welcome piece of documentation. It might be overkill, but maybe some "fully worked out use cases" in one of the studies where the package has been used can add even more value? I realize there might be also some runtime/check-time constraints, so it is up to the authors to explore this point.

Absolutely, we plan to expand the documentation vignettes as we move forwards with the package (https://github.com/epiforecasts/covidregionaldata/issues/316). Hopefully, you agree that the documentation as it stands is sufficient at this time until we can spare development time for additional documentation?

Automated tests: The test suite is well structured, and covers a wide portion of the code to retrieve and process the data. Maybe some additional tests might be desired to increase the overall coverage? Again, it might be a tradeoff between test runtimes and package coverage…

Completely agree and we have currently open PRs to increase test coverage (https://github.com/epiforecasts/covidregionaldata/pull/312) (https://github.com/epiforecasts/covidregionaldata/pull/307). We feel that the current testing framework probably has greater coverage than comparable tools but we are very keen to develop this further.

the help in some classes seems redundant, see ?Belgium (multiple same urls)

Thank you for spotting this, it somehow slipped by us! This was an error and we have now corrected this for Belgium and others with this bug.

maybe have the slowstart vignette evaluated? By having these chunks automatically generated, it would be robust to changes in the underlying data

This is a great idea and we have updated the vignette to run these chunks and use the code to render the tables. Complying with CRAN policy may be an issue here but if at submission time for this version we have issue we will alter the vignette processing pipeline so that CRAN is passed an md version of this Rmd but that we otherwise use an Rmd.

joseph-palmer commented 3 years ago

@mponce0

I'm still not quite sure about this, I understand your reasoning but I'm not sure the version of the package must be tied to the data being incorporated... this will generate several versions as contributors start to add more and more data... I'm not saying I'm against just doubting here...

I understand your concerns, however, versioning for most purposes just needs to be a way of marking a difference between releases. For our project, as new datasets get added we then issue a new release with these datasets (usually as a batch) to CRAN, so there must be some version difference. Having versioning take account of the data included suits our approach.

Also this makes me think about what would happen when one of the data sources is for whatever reason not available anymore... will you prune it or perhaps an alternative approach is to create a copy of the original dataset and keep as "legacy" data? I'm still unsure about the versioning...

At the moment we do not store data snapshots as the package is designed to be purely an interface to existing data sources. Should the datasets change source we would aim to update our package to use the new source instead of the old. If the source disappeared we would try and find the legacy source and offer this as an option, however it is not within the scope of this package to store data. An option we are considering is hosting a separate legacy source to use in these circumstances but we have not yet made much progress with these plans.

Sure, I know that and I checked it, but this is requested from JOSS... if JOSS ( @csoneson ) is OK with it I don't have an issue... just following their guidelines. For instance, I have added a short paragraph in the web README in my own packages due to this request, just FYI.

Yes I see this is not as clear as I thought, I have added more details of the working group to the README document. Your example was very helpful to follow, thank you!

OK, but in your README file I'm still seeing "R-CMD-check" Failing! is this because one of the data pipelines was failing or something else? In any case, it should be unentangled from the data verification process.

Data testing is split from other unit testing. By this we mean that downloading data is checked along with its integrity on a scheduled CI check each night. Our general check (and what is shown on the front page of the package) is a more traditional check that uses stored data to ensure the internal consistency of the package. This was a crucial design element as testing external data sources is error prone and would likely lead to our package not being CRAN stable at the very least. We had a general CI outage across projects that is now resolved and may have been what you saw.

But what would happen to users who have already installed the package and try to access a dataset which is not longer available or failing? It will fail, right? No issues here but just be sure to provide an informative message to the user, recall that many R users (most probably) are not so "savvy" to be involved with dev versions, git-repos, etc.

Yes, if a source is no longer supported we would provide a depreciation message but the user would need to install the latest version to get this. If using an older version which had code to collect data that no longer existed then this would just fail. Currently we only support data status updates via a vignette (though this is available in the downloaded version of the package) we are also considering if surfacing a user function for this check makes sense. We have recently had user feedback that the data status page is useful and currently no user feedback along the lines of is my data broken etc. which we had prior to the data status page being in place. Very much agree that this is something it is important to make easier to check so we will keep an eye on user feedback to see if we need to make changes.

joseph-palmer commented 3 years ago

@csoneson

On the software side itself I think the first step is indeed to include the contributors in the DESCRIPTION file together with the type of contribution, as is currently done. Any additional clarifications, acknowledgements etc would, in my opinion, fit well in the README.

Certainly, we have updated these to ensure it meets the guidance.

I had just one question regarding the working group: you say above that it was included since "members of the group made sizeable contributions to the planning and design of the package". I also note that several of the named authors are indeed members of this working group. Could you clarify whether there were additional contributions made by the working group (beyond the named individuals)?

Sure, individual members of the working group made very specific contributions hence their inclusion in the description and git commit histories. The package was discussed in the working group meetings at great length and whilst no code was actually written in these meetings and by some members, we wanted to acknowledge the role the working group had as a whole, including contributors listed and those not, in helping develop this package. I have included a statement in the README citing the working group along with a link to their site outlining what and who they are. Our inclusion of the working group is in line with our COVID-19 response group policy with the ideal of reflecting unstated contributions (such as finding data sources), often from those at an early career stage for whom it is most important, that would otherwise not be acknowledged.

csoneson commented 3 years ago

Thanks for this @joseph-palmer! This sounds sensible to me.

@mponce0, @federicomarini - could you take a look when you have a chance to see whether the responses above address your concerns? Thanks!

federicomarini commented 3 years ago

Hi @joseph-palmer !

@federicomarini

maybe use a newer reference to the R core team?

We have updated the reference in the manuscript.

l 16-17, repetition of extensible/extend

We have changed this in the manuscript

Good! 👍

l.38: "changes to a dependency package" is not so clearly defined, do you mean something on the line of "changes in the subsequent processing steps" (for packages that use these sources)?

What we mean by this is if a function used from another package (e.g. dplyr, a package on which this package depends on) changes then the function may not work the same way. As far we are aware this is standard terminology.

Thanks for clarifying this. Makes more sense the way you say it. Probably I was confused by the compactness of the sentence 😉

l.53: "test and report... daily" -> You could specify this happens via GHA and their output is recorded on the project webpage

Certainly, I have updated the manuscript to include this.

👍

while reading l.42-47: as a general idea: is there some way to track/record the origin or snapshot of the data for the files that are sourced? Something like an attribute could do it, but maybe some similar mechanism is already in place

This is covered in the documentation, all the data is stored in the country class in the ‘data’ field, which contains ‘raw’ data (as is when downloaded) in addition to cleaned and processed data.

Very good! Your choice if you think it is a feature worth mentioning - I would probably appreciate it quite a lot, given the issue we encounter in the bioinformatics world re: sources that are not so stable/well documented.

Are the versions for sars2pack and covdata needed to ensure that this is the status quo at the moment of writing the manuscript? If dois are available, that could be even better

The citations for these are taken from the package using the ‘citation()’ tool in R so are just as the authors report them.

Yep, this is quite common - I was just wondering whether JOSS does require doi to be used @csoneson ? Or maybe you can just add the version number by hand in the bibtex output by citation()?

Footnote: is it correct to have a $ after Palmer as in Palmer$ et al? Maybe it is a feature of the build system

No this is an error, good spot! I have updated this in the manuscript.

👍

A statement of need: These are elaborated in the vignettes. Maybe add a (more explicit) statement regarding what the target audience would be?

I have expanded on this point in the README but have stopped short of an explicit statement to the tune of “this package is designed for ….”. This is just personal preference and I think the contents of the README now, along with the other documentation provides potential users with enough information to decide if this package is right for them.

Fine for me. While sometimes the fixed formulation might help, I guess it is fair to allow some style shaping.

Installation instructions: The dependencies are all installed and correctly handled - maybe you can point out that to run the vignette examples you might need to use dependencies = TRUE in the installation command?

We would generally advise individuals to check out the pkgdown documentation rather than build the vignette but we agree that it should be easy for users to run in R. As far as we are aware having library calls in the vignettes is standard practice for suggested packages required to build them that are not included as package imports. This would naturally support a workflow where the less technical user (who may not know about suggested packages) installs the packages that are not in place when they go to build the vignette.

Agreeing on all the points above. My comment was more on the line of "normally, suggested packages are not installed, so if these are not already there, a user might not manage to run the vignette fully".

Example usage: The vignettes are present, and a very welcome piece of documentation. It might be overkill, but maybe some "fully worked out use cases" in one of the studies where the package has been used can add even more value? I realize there might be also some runtime/check-time constraints, so it is up to the authors to explore this point.

Absolutely, we plan to expand the documentation vignettes as we move forwards with the package (epiforecasts/covidregionaldata#316). Hopefully, you agree that the documentation as it stands is sufficient at this time until we can spare development time for additional documentation?

I agree there is a fair lot in the documentation, so it was not a mandatory point. If you can invest some more time (balanced vs CRAN constraints as well..), it would be really great.

Automated tests: The test suite is well structured, and covers a wide portion of the code to retrieve and process the data. Maybe some additional tests might be desired to increase the overall coverage? Again, it might be a tradeoff between test runtimes and package coverage…

Completely agree and we have currently open PRs to increase test coverage (epiforecasts/covidregionaldata#312) (epiforecasts/covidregionaldata#307). We feel that the current testing framework probably has greater coverage than comparable tools but we are very keen to develop this further.

Good steps, look forward to see the PRs merged in 😉

the help in some classes seems redundant, see ?Belgium (multiple same urls)

Thank you for spotting this, it somehow slipped by us! This was an error and we have now corrected this for Belgium and others with this bug.

👍

maybe have the slowstart vignette evaluated? By having these chunks automatically generated, it would be robust to changes in the underlying data

This is a great idea and we have updated the vignette to run these chunks and use the code to render the tables. Complying with CRAN policy may be an issue here but if at submission time for this version we have issue we will alter the vignette processing pipeline so that CRAN is passed an md version of this Rmd but that we otherwise use an Rmd.

If the workaround is allowed, it would be a good solution for ensuring that all chunks (eval'd locally)

All in all, @csoneson - The responses and edits addressed all my concerns! There's just a couple of items open, but none of them is major. Happy to see this work out in JOSS!

seabbs commented 3 years ago

Thanks for the great review and detailed feedback @federicomarini - very much appreciated. Agree we should flag data tracking a bit more as it's a nice feature to have and that we missed in other tools. For interest, this is the source we were basing the statement about Rmd -> md vignettes on: https://ropensci.org/blog/2019/12/08/precompute-vignettes/

csoneson commented 3 years ago

Thanks @federicomarini! Would you also update the checklist above for completeness?

mponce0 commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mponce0 commented 3 years ago

@joseph-palmer Thank you for the explanations and updates to the different documents in the package. Just recall to update your citation information to point towards the JOSS record when you get your final DOI.

@csoneson I'm good with the status of the manuscript and checked all the items in the check list, please let me know if there is anything else needed from my side.

csoneson commented 3 years ago

@mponce0 - thank you! That should be all from your side :)

@joseph-palmer - I will also take a look through the submission to see that everything is in order - I'll get back to you shortly with the next steps.

csoneson commented 3 years ago

👋🏻 @joseph-palmer - I just opened a PR with a couple of minor fixes to reference formatting. Please check and merge if you agree (and ask whedon to regenerate the pdf). The only other thing I noticed is the citation to Abbott et al (Zenodo), which doesn't have the same title as the actual Zenodo record.

Then, the next steps are as follows:

[ ] Make a tagged release of the software, and list the version tag of the archived version here.
[ ] Archive the reviewed software in Zenodo
[ ] Check that the archival deposit has the correct metadata. In particular, the title and author list should be identical to those of the paper. You can also add the authors' ORCID.
[ ] Please list the DOI of the archived version here.

I can then move forward with accepting the submission.

seabbs commented 3 years ago

@whedon generate pdf

Thanks @csoneson (standing in briefly for @joseph-palmer). I think we have dealt with the outstanding issues 😄.

Version tag: 0.9.2 Tagged release: https://github.com/epiforecasts/covidregionaldata/releases/tag/0.9.2 Archived DOI: 10.5281/zenodo.5055487

[x] Check metadata
[x] Fixed Zenodo reference in paper.

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

csoneson commented 3 years ago

Thanks @seabbs. The Zenodo archive looks good! However, I'm not sure about the citation to Palmer et al (previously Abbott et al) - the paper is basically citing itself, but with a URL pointing to Zenodo. I would suggest removing the citation (or make it a website citation, not pointing to the JOSS publication) - the link to Zenodo is still there (note though that currently the link does not point to the URL given in the citation), and the version above is also explicitly added also to the published record (there will be a link to a 'Software Archive' on the paper's landing page).

seabbs commented 3 years ago

Thanks @csoneson

I have removed the citation and link to Zenodo in the draft, deleted and recreated the release, and re-edited the zenodo release to share meta data with JOSS.

Zenodo: https://zenodo.org/record/5059988 Doi: 10.5281/zenodo.5059988

@whedon generate pdf

csoneson commented 3 years ago

@whedon generate pdf

(I think whedon will only recognize commands if they are the first line in the comment :))

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

csoneson commented 3 years ago

@whedon check references

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.12688/wellcomeopenres.16006.2 is OK
- 10.5281/zenodo.3957539 is OK
- 10.21105/joss.02376 is OK
- 10.1126/science.abg3055 is OK
- 10.1016/S1473-3099(20)30120-1 is OK
- 10.12688/wellcomeopenres.15842.3 is OK
- 10.21105/joss.02376 is OK
- 10.1038/s41562-021-01079-8 is OK
- 10.1080/13876988.2021.1873703 is OK
- 10.1101/2020.10.18.20214585 is OK
- 10.21105/joss.01686 is OK

MISSING DOIs

- None

INVALID DOIs

- None

csoneson commented 3 years ago

@whedon set 10.5281/zenodo.5059988 as archive

whedon commented 3 years ago

OK. 10.5281/zenodo.5059988 is the archive.

csoneson commented 3 years ago

@whedon set 0.9.2 as version

whedon commented 3 years ago

OK. 0.9.2 is the version.

csoneson commented 3 years ago

Looks good! I'm going to hand over now to the associate EiC on rotation for the final steps.

csoneson commented 3 years ago

@whedon recommend-accept

whedon commented 3 years ago

Attempting dry run of processing paper acceptance...

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.12688/wellcomeopenres.16006.2 is OK
- 10.5281/zenodo.3957539 is OK
- 10.21105/joss.02376 is OK
- 10.1126/science.abg3055 is OK
- 10.1016/S1473-3099(20)30120-1 is OK
- 10.12688/wellcomeopenres.15842.3 is OK
- 10.21105/joss.02376 is OK
- 10.1038/s41562-021-01079-8 is OK
- 10.1080/13876988.2021.1873703 is OK
- 10.1101/2020.10.18.20214585 is OK
- 10.21105/joss.01686 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 3 years ago

:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/2428

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/2428, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true

seabbs commented 3 years ago

Amazing,

Thanks for editing @csoneson and thanks again for reviewing @federicomarini and @mponce0

arfon commented 3 years ago

@joseph-palmer – please merge this change which fixes the footnotes for your paper https://github.com/epiforecasts/covidregionaldata/pull/395

arfon commented 3 years ago

@joseph-palmer – please merge this change which fixes the footnotes for your paper.

Note to @openjournals/joss-eics – this PR needs merging before publishing.

openjournals / joss-reviews