ropensci / software-review

rOpenSci Software Peer Review.


medrxivr: Accessing and searching medRxiv preprint data in R #380

Closed mcguinlu closed 4 years ago

mcguinlu commented 4 years ago

Submitting Author: Luke McGuinness (@mcguinlu)
Repository: https://github.com/mcguinlu/medrxivr
Version submitted: 0.0.2
Editor: @maurolepore
Reviewer 1: @tts
Reviewer 2: @njahn82
Archive: TBD
Version accepted: TBD

Paste the full DESCRIPTION file inside a code block below:

Package: medrxivr
Title: Access MedRxiv Preprint Data
Version: 0.0.2
Authors@R: c(
    person("Luke", "McGuinness",
           role = c("aut", "cre"),
           email = "luke.mcguinness@bristol.ac.uk",
           comment = c(ORCID = "0000-0001-8730-9761")),
    person("Lena", "Schmidt",
           role = "aut",
           comment = c(ORCID = "0000-0003-0709-8226")))
Description: The medRxiv <https://www.medrxiv.org/> repository is a free online
    archive and distribution server for complete but unpublished manuscripts 
    (preprints) in the medical, clinical, and related health sciences. medrxivr
    provides programmatic access to both medRxiv API <https://api.biorxiv.org/>
    and a static snapshot of database, which is updated daily. Users can then 
    search for relevant records using regular expressions and Boolean logic, and
    can easily download the full-text PDFs of preprints matching their search 
    criteria. 
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Language: en-US
URL: https://github.com/mcguinlu/medrxivr
BugReports: https://github.com/mcguinlu/medrxivr/issues
Imports: 
    rvest,
    methods,
    dplyr,
    xml2,
    curl,
    jsonlite,
    httr,
    stringr,
    rlang
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    covr,
    kableExtra
VignetteBuilder: 
    knitr, 
    rmarkdown    
RoxygenNote: 7.1.0

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [X] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [X] citation management and bibliometrics
- [ ] scientific software wrappers
- [X] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences): medrxivr allows users to programmatically access data from medRxiv, a preprint repository for papers in medical, clinical, and related health sciences. The package also allows users to readily perform and document reproducible literature searches of the medRxiv database.
Who is the target audience and what are scientific applications of this package?
The primary target of this package is systematic reviewers (i.e. me!), who frequently wish to use more complicated queries (e.g. regular expressions/Boolean combinations) when searching medRxiv than the official site currently allows for, and who also wish to be able to easily download the full-text PDFs of records matching their search. medrxivr helps with both of these challenges. However, anyone who wishes to find and retrieve relevant medRxiv records in R, for example to explore the distribution of preprints by subject area, will find the package useful.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? As far as I am aware, no other package allows users to access medRxiv data in R.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted. Issue: https://github.com/ropensci/software-review/issues/369 Editor: @annakrystalli

Technical checks

Confirm each of the following by checking the box.

[X] I have read the guide for authors and rOpenSci packaging guide.

This package:

[X] does not violate the Terms of Service of any service it interacts with.
[X] has a CRAN and OSI accepted license.
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions, created with roxygen2.
[X] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[X] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[X] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[X] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Options

- [X] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [X] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [X] The package is deposited in a long-term repository with the DOI: 10.5281/zenodo.3860024
- (*Do not submit your package separately to JOSS*)

[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Tagging my co-author @L-ENA for reference.

maurolepore commented 4 years ago

@mcguinlu and @L-ENA, thanks for your submission. I'll be the editor. As we move through the process I'll keep you posted. I welcome your questions any time.

maurolepore commented 4 years ago

Editor checks:

[x] Fit: The package meets criteria for fit and overlap
[x] Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
[x] License: The package has a CRAN or OSI accepted license
[x] Repository: The repository link resolves correctly
[ ] Archive (JOSS only, may be post-review): The repository DOI resolves correctly
[ ] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

@mcguinlu, thanks again for your submission. The editor checks flagged a few issues that need your attention; see them below.

Let's discuss the first two items (ml1 and ml2) before I search for reviewers; these two items refer to a potential overlap with existing packages.

[x] (ml1) @mcguinlu, I see you discussed this issue with @sckott (https://github.com/ropensci/fulltext/issues/213#issue-574066182); but this was a while ago and the issue remains unresolved and open. As of today, what do you think is the best way to move forward?

a. Extend the package fulltext (https://github.com/ropensci/fulltext).
b. Continue with this submission (please argue for the lack of overlap).
c. Something else (please explain).

[x] (ml2) The file DESCRIPTION links to https://api.biorxiv.org/. @mcguinlu, is there an overlap with the package biorxivr (https://cran.r-project.org/web/packages/biorxivr/index.html)? Why not?

The remaining items are important but not as urgent as the first two.

[ ] (ml3) Run spelling::spell_check_package(); then fix or update the list of valid words with spelling::update_wordlist().

> spelling::spell_check_package()
WORD              FOUND IN
api               mx_api_content.Rd:39,42
                  mx_api_doi.Rd:28,31
                  description:4
AppVeyor          README.md:14
                  README.Rmd:24
ation             building-complex-search-strategies.Rmd:110
biorxiv           description:4
capitalisation    building-complex-search-strategies.Rmd:125
... more lines

[ ] (ml4) Run goodpractice::gp() to identify lines of code that tests don't touch.

> goodpractice::gp()
... more lines
── GP medrxivr ─────────────────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in
    general. 77% of code lines are covered by test cases.

    R/mx_crosscheck.R:50:NA
    R/mx_download.R:25:NA
    R/mx_download.R:27:NA
    R/mx_download.R:28:NA
    R/mx_download.R:30:NA
    ... and 51 more lines

[ ] (ml5) Run covr::package_coverage() and try to test code in files with low % coverage.

> covr::package_coverage()
medrxivr Coverage: 77.60%
R/mx_download.R: 1.92%
R/mx_crosscheck.R: 96.15%
R/mx_search.R: 96.33%
R/mx_api.R: 100.00%
R/mx_info.R: 100.00%

[ ] (ml6) You may run an automated code-styler to make it easier for reviewers to read your code (https://styler.r-lib.org/reference/style_pkg.html).

> styler::style_pkg()
Styling  12  files:
 R/medrxivr.R                     ✓ 
 R/mx_api.R                       ℹ 
 R/mx_crosscheck.R                ℹ 
 R/mx_download.R                  ℹ 
 R/mx_info.R                      ℹ 
 R/mx_search.R                    ℹ 
 tests/testthat.R                 ✓ 
 tests/testthat/test-api.R        ℹ 
 tests/testthat/test-crosscheck.R ✓ 
 tests/testthat/test-download.R   ℹ 
 tests/testthat/test-info.R       ✓ 
 tests/testthat/test-search.R     ℹ 
────────────────────────────────────────
Status  Count   Legend 
✓   4   File unchanged.
ℹ   8   File changed.
x   0   Styling threw an error.
────────────────────────────────────────
Please review the changes carefully!

Reviewers: @tts and @njahn82 Due date: 2020-07-01

mcguinlu commented 4 years ago

Hi @maurolepore

Thanks for your initial review of our package. I've gone through it and tried to address each point below:

ml1: Overlap with fulltext Personally, I think that fulltext and medrxivr should continue to be two separate packages, but that there is the potential for integration between the two (as occurred with the aRxiv package). In all honesty, I completely forgot to reply to the issue on fulltext 🤦‍♂️- sorry @sckott! There are two reasons I think they should be separate:

In the first instance, the two packages take fundamentally different approaches to searching. At present, the medrxivr workflow is to create a local copy of the whole medRxiv repository via the API (or maintained static snapshot), and then search it locally using the mx_search() function to find relevant articles [i.e. all data -> local search -> results]. In comparison, fulltext takes the same approach to searching bioRxiv as the biorxivr package does (note: medRxiv and bioRxiv are very similar, so I am using fulltext's approach to bioRxiv for comparison here). They both paste the search (see this code line) on after the base URL (https://www.biorxiv.org/search) and scrape the resulting page(s), essentially mimicking what would happen if you performed a search on the site itself [i.e. remote search -> results]. The search functionality offered by this approach is dependent on what the site itself offers, and is therefore not as comprehensive as that offered by medrxivr (e.g. you can't use regexes to define capitalisation/alternative spellings, or use the NEAR operator). In addition, using the search functionality of the site itself (i.e. by pasting the query onto the search/ path) has been shown to not be very reproducible/transparent, which is what originally motivated the development of medrxivr.
Secondly, as far as I can tell (correct me if I am wrong @sckott), search strings for fulltext do not vary based on the database searched - for example, if you use ft_search() without specifying the source, it uses the same string to search every database. This would cause issues for anything beyond a simple search, as medrxivr allows for advanced search strings that would not be compatible with other data sources. Based on this, my argument is that medrxivr should be a standalone full package, and a simple restricted version of the medrxivr search could be implemented in fulltext, provided @sckott is happy to implement the medrxivr workflow [i.e. download data then search] within fulltext.
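The [all data -> local search -> results] workflow described above can be illustrated with a minimal base-R sketch. The `records` data frame and the `local_search()` helper are hypothetical and are not part of medrxivr; the sketch only shows why holding a local copy makes regular expressions and Boolean combinations straightforward.

```r
# Toy stand-in for a local copy of the repository metadata (hypothetical data).
records <- data.frame(
  id = c("rec1", "rec2", "rec3"),
  abstract = c(
    "Dementia risk and lipid levels in older adults",
    "Airborne transmission of coronavirus in hospitals",
    "Lipid-lowering drugs and dementia: a cohort study"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: terms within a topic are combined with OR,
# and topics are combined with AND.
local_search <- function(data, topics) {
  keep <- rep(TRUE, nrow(data))
  for (topic in topics) {
    pattern <- paste(topic, collapse = "|")       # OR within a topic
    keep <- keep & grepl(pattern, data$abstract)  # AND across topics
  }
  data[keep, , drop = FALSE]
}

# Character classes such as "[Dd]" match both capitalisations.
hits <- local_search(records, list("[Dd]ementia", c("[Ll]ipid", "statin")))
print(hits$id)
```

Because the filtering happens on the user's machine, the same query run against the same snapshot gives the same result, which is the reproducibility argument made above.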

ml2: Overlap with biorxivr Thanks for highlighting this. I did come across this package while developing medrxivr - however, the last work on this package took place 5 years ago, before the introduction of the API which you refer to in your query. In addition, while the base URL of the API contains "bioRxiv" (e.g. https://api.biorxiv.org/), this is only because the same organisation (Cold Spring Harbor Laboratory) is responsible for both repositories. The actual endpoint for the medRxiv API is https://api.biorxiv.org/details/medrxiv/[interval]/[cursor]/[format]. Finally, as mentioned above, biorxivr relies on the search functionality offered by the site (e.g. by pasting the query on after the search/ path) rather than performing the searches itself.
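To make the endpoint template concrete, here is a small sketch that fills in the [interval], [cursor] and [format] slots. The `build_details_url()` helper is hypothetical, and the interval format (two ISO dates separated by a slash) and the default values are assumptions for illustration rather than something stated in this thread.

```r
# Hypothetical helper: fills in the endpoint template quoted above.
build_details_url <- function(interval, cursor = 0, format = "json") {
  base <- "https://api.biorxiv.org/details/medrxiv"
  paste(base, interval, cursor, format, sep = "/")
}

# Assumed interval format: start date and end date separated by a slash.
url <- build_details_url("2020-06-01/2020-06-07")
print(url)
```

The sketch only builds the URL string; it does not contact the API.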

ml3: Spelling I've added this as an issue (https://github.com/mcguinlu/medrxivr/issues/4) and plan to address it soon.

ml4/ml5: Test coverage In hindsight, I should have highlighted this as a potential sticking point in my initial submission. The single file which is dragging down the average coverage contains mx_download(), which takes the dataframe of records identified by the user as relevant and downloads a PDF for each one. In fact, this function has a test suite (see https://github.com/mcguinlu/medrxivr/blob/master/tests/testthat/test-download.R), but because it manipulates files and folders on a user's machine, I have these tests set to skip_on_cran(), which in turn means that covr doesn't pick them up. This is something I was hoping to get feedback on during the review process, as I am not sure whether this is best practice or what a viable alternative would be.
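One possible alternative to skipping such tests is to have them write only inside a throwaway directory and clean up afterwards. The sketch below uses base R and a hypothetical `save_records()` stand-in that writes placeholder files; it is not mx_download(), it does not touch the network, and whether the same pattern would suit the real function is an open question. Inside a testthat file, the same body would sit in a test_that() block with expectations in place of the printed checks.

```r
# Hypothetical stand-in for a download function: writes one placeholder
# file per record instead of fetching a real PDF.
save_records <- function(ids, directory) {
  dir.create(directory, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(directory, paste0(ids, ".pdf"))
  for (path in paths) {
    writeLines("placeholder", path)
  }
  invisible(paths)
}

# Work inside a throwaway directory so nothing outside tempdir() is touched.
scratch <- tempfile("mx-download-test-")
paths <- save_records(c("rec1", "rec2"), scratch)

files_created <- all(file.exists(paths))
print(files_created)

# Clean up so the test leaves no trace on the machine.
unlink(scratch, recursive = TRUE)
cleaned_up <- !dir.exists(scratch)
print(cleaned_up)
```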

ml6: styler I've added this as an issue (https://github.com/mcguinlu/medrxivr/issues/5) and plan to address it soon.

Hopefully this addresses your initial concerns, but please do let me know if anything is unclear, if my responses are insufficient, or if you need further details!

maurolepore commented 4 years ago

Thanks @mcguinlu! I think {medrxivr} merits moving to the next stage. I'll now start searching for reviewers.

ml1 and ml2

Here is my conclusion. I base it on your answers above, and on this quote from rOpenSci's guidelines on overlap:

"An R package that replicates the functionality of an existing R package may be considered for inclusion in the rOpenSci suite if it significantly improves on alternatives in any repository (RO, CRAN, BioC) by being ... better in usability and performance".

I considered the packages {medrxivr}, {fulltext} and {biorxivr}. I see an overlap in what the packages aim to do but not in how they do it. This would not justify the overlap in general; but in this case I think it does.

Unlike the other packages, {medrxivr} searches locally. This ensures the results can be reproduced, and enables searching with regular expressions. The different approach to searching also means that integrating {medrxivr} into the other packages seems challenging. This might eventually be possible, but first {medrxivr} may need to mature independently.

ml3 to ml6

@mcguinlu, please let me know or check the boxes as you address these issues.

ml7

@mcguinlu, I see the positive aspects of the "local" approach to searching that {medrxivr} implements; but I understand that {medrxivr} downloads the entire database. I worry this may not scale up. Here are some questions I have; you may discuss them directly with the reviewers:

How big is the database? How fast does it grow? And how long does it take to download it in a range of reasonable conditions? What happens in a range of extreme conditions?
Is the process transparent and "polite" to the user?

Maybe you can avoid downloading the database and still provide flexible queries. For example, see how pkgsearch::advanced_search() does it (apparently it uses elastic, which supports regular expressions, wildcards, fuzziness, and more). If you implement something similar you might get some guidance from @maelle; she is one of the two developers of pkgsearch::advanced_search().

maurolepore commented 4 years ago

@mcguinlu, please do this (from these guidelines):

add a rOpenSci review badge to their README, via rodev::use_review_badge(), rodev::use_review_badge(<issue_number>). Badge URL is https://badges.ropensci.org/<issue_id>_status.svg. Full link should be:

[![](https://badges.ropensci.org/<issue_id>_status.svg)](https://github.com/ropensci/software-review/issues/<issue_id>)

maurolepore commented 4 years ago

(ml8) @mcguinlu have you suggested any reviewer?

(The editor guidelines suggest you might but I fail to find them here or at https://github.com/ropensci/software-review/issues/369.)

maurolepore commented 4 years ago

Thanks @tts and @njahn82 for agreeing to review this package. Your reviews are due on 2020-07-01 -- that is three weeks from today.

Let me know any questions you have.

tts commented 4 years ago

Package Review

Hi @maurolepore and @mcguinlu - here is my review. Thanks for this opportunity, and all the best for the package!

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[ ] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[ ] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 10

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

medRxiv has been accepting preprints for a year now. Their API does not offer any search capabilities, so clearly medrxivr has a function to fill. Regular expressions and Boolean logic are state-of-the-art ways to fine-tune search queries, so users of this package should be happy. In addition, you can download all searched and found preprints as PDF files, which is handy and helpful.

Although the target group and the goal of the package are clearly defined, it took me some time to understand the core functionality. I suppose the main reason for this is the varying terminology of data sources used in vignettes and help pages. The way I understand the logic looks like this:

[Figure: medrxiv (sketch of the package logic)]

In short, for a search target there are two options, the dataset I download myself from medRxiv, or the dataset provided by the GitHub repo. The former can be either all items or just a subset limited by date. The latter is all items. Technically speaking, my download uses the medRxiv API, but the dataset in the repo is built by scraping the medRxiv web site on a daily basis. My understanding is that the main reasons for the scraped dataset are to provide a reliable data source for those occasions when the API does not serve well or not at all, and lighten the burden of the API usage.

How long does it take to download all metadata from the API? I tested it from two physical locations with a differing bandwidth:

start_time <- Sys.time()
medrxiv_data <- mx_api_content()
end_time <- Sys.time()
end_time - start_time

Time difference of 1.434701 mins (1Gb/s line, work)
Time difference of 3.045589 mins (28Mb/s line, home)

So far this is not bad, especially if you run the function once a day. One minor thing: is there any way to gracefully stop the process if started by accident? When RStudio's red Stop button is hit, the following error is thrown:

Error in curl::curl_fetch_memory(url, handle = handle): Operation was aborted by an application callback
Request failed [ERROR]. Retrying in 1 seconds...

httr::RETRY is a new function to me. Thanks for this, I will definitely try to use it myself at some point. I wonder though if it allows a clean, user-friendly, forced exit and if yes, how should it be defined?
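As a starting point for the question above, this is a minimal base-R sketch of one general pattern for a clean exit: wrap the paging loop in tryCatch() with an `interrupt` handler and return whatever has been fetched so far. The `fetch_pages()` and `fake_fetch()` names are hypothetical, and the interrupt is simulated by signalling a condition of class "interrupt". Whether an interrupt raised while httr::RETRY() is running would actually reach such a handler (rather than being turned into a retried curl error, as in the output above) is not something this sketch verifies.

```r
# Hypothetical paging loop with a user-friendly exit on interrupt.
fetch_pages <- function(fetch_page, n_pages) {
  pages <- list()
  tryCatch(
    for (i in seq_len(n_pages)) {
      pages[[i]] <- fetch_page(i)
    },
    interrupt = function(cnd) {
      message("Download cancelled; returning ", length(pages),
              " page(s) fetched so far.")
    }
  )
  pages
}

# Simulated fetcher: behaves as if the user pressed Stop on page 3.
fake_fetch <- function(i) {
  if (i == 3) {
    stop(structure(
      class = c("interrupt", "condition"),
      list(message = "simulated interrupt", call = NULL)
    ))
  }
  paste("page", i)
}

partial <- fetch_pages(fake_fetch, n_pages = 5)
print(length(partial))
```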

How rapidly can we expect medRxiv to grow? Looking back, the number of submissions accelerated when the still very much prevailing COVID-19 pandemic began.

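library(tidyverse)
library(medrxivr)

# Download the full medRxiv metadata via the API
mx_data <- mx_api_content()

# Count submissions per day
stats <- mx_data %>% 
  mutate(date = as.Date(date)) %>% 
  count(date, name = "count")

# Plot daily submission counts (ggplot() replaces the deprecated qplot())
growth_plot <- ggplot(stats, aes(x = date, y = count)) +
  geom_point(na.rm = TRUE) +
  labs(title = "medRxiv item growth",
       x = "Date of submission", y = "Number of submissions")

# Save as a 20 cm x 20 cm PNG at 72 dpi
ggsave("medrxivstats.png", plot = growth_plot,
       width = 20, height = 20, units = "cm", dpi = 72)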

[Figure: medRxiv item growth (number of submissions by date of submission)]

Search is a key component of this package, and the vignettes help in building search queries. The medrxivr one shows how to use the mx_search function: either within a two-step process, or with a one-step or piped process. The examples are a little confusing though because the functions shown are not the same; the first example uses mx_api_content, the second one mx_api, which does not exist. I suppose mx_api is a typo, maybe the name from a former version? The vignette building-complex-search-strategies shows several strategies to filter data, and also how to use regular expressions. Very helpful. One minor thing about this example:

mx_results <- mx_search(query = "dementia",
                        NOT = "mild cognitive impairment")

The NOT argument does not match "Mild cognitive impairment", which is found in one abstract, so it is perhaps better to use the form [mM]ild cognitive impairment instead.
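The effect of capitalisation on an exclusion term can be seen in a few lines of base R. This is only an illustration with made-up abstracts and plain grepl(); it does not call mx_search() and makes no claim about how the NOT argument is implemented internally.

```r
# Made-up abstracts for illustration.
abstracts <- c(
  "Dementia and Mild cognitive impairment in primary care",
  "Dementia incidence in a national cohort",
  "Progression from mild cognitive impairment to dementia"
)
is_dementia <- grepl("[Dd]ementia", abstracts)

# Lower-case pattern: misses the capitalised "Mild" in the first abstract.
excluded_plain <- grepl("mild cognitive impairment", abstracts)
# Character class: catches both capitalisations.
excluded_class <- grepl("[mM]ild cognitive impairment", abstracts)

kept_plain <- abstracts[is_dementia & !excluded_plain]
kept_class <- abstracts[is_dementia & !excluded_class]
print(length(kept_plain))
print(length(kept_class))
```

With the lower-case pattern, the first abstract slips through the exclusion; with the character class, only the second abstract is kept.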

In mx_search, the data argument is important because it defines the target. Again, the example in the help file is slightly misleading because there is no mx_raw function. A leftover from a former version too, I presume?

When I ran mx_search with zero arguments, my first thought was that there are some issues with error handling. The query starts but clearly you need to include the search string too! However, after some time the error handling kicks in and correctly reminds me of the missing query argument. If I am not mistaken, the delay was caused by the latency of the default data source in the GitHub repository.
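One way to avoid this kind of delay is to validate the arguments before any data is loaded. The sketch below uses a hypothetical `search_records()` function and a stand-in `slow_loader()`; it is not the mx_search() implementation, and it assumes the delay comes from loading data before the check, as suggested above.

```r
# Hypothetical search function: the argument check runs before the
# (potentially slow) data load.
search_records <- function(query, load_data) {
  if (missing(query) || is.null(query)) {
    stop("Please provide a search term via `query`.", call. = FALSE)
  }
  data <- load_data()
  data[grepl(query, data$title), , drop = FALSE]
}

# Stand-in for fetching the snapshot; counts how often it is called.
loader_calls <- 0
slow_loader <- function() {
  loader_calls <<- loader_calls + 1
  data.frame(title = c("molecular epidemiology", "survey methods"),
             stringsAsFactors = FALSE)
}

# Missing query: the error is raised without the loader ever running.
result <- tryCatch(
  search_records(load_data = slow_loader),
  error = function(e) conditionMessage(e)
)
print(result)
calls_after_error <- loader_calls
print(calls_after_error)

# Valid query: the loader runs once and the search proceeds.
hits <- search_records("molecular", slow_loader)
print(nrow(hits))
```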

As of writing this, how long does it take to query the repo?

start_time <- Sys.time()
mx_results <- mx_search(query = "molecular")
end_time <- Sys.time()
(end_time - start_time)

Using medRxiv snapshot - 2020-06-27 06:01
Found 226 record(s) matching your search.
Time difference of 20.75107 secs

To me this is acceptable, but people of today tend to be impatient. Still, when the same search against my local copy of the medRxiv database takes only 0.5 secs, you begin to wonder which one to use. I noticed that the question of how to efficiently host and serve a dataset is something you and the editor have already discussed. Unfortunately, I cannot give any advice, but I am very much interested to learn about this topic too. I hope you will find a good solution.

Downloading PDFs works smoothly and as promised. Note: the mx_download help file example of mx_search uses a limit argument which is not defined.

The Shiny application that comes with the package is a beautiful piece of work, and the idea of delivering reproducible code is a nice one indeed. However, there are some issues with the code. Both the basic and advanced search codes throw an error when run in R.

Basic:

query <- "coronavirus"
mx_results <- mx_search(query)

Error: $ operator is invalid for atomic vectors

Advanced:

topic1 <- c("coronavirus")
topic2 <- c("airborne")
query <- list(topic1, topic2)
mx_results <- mx_search(query, from.date =20190101, to.date =20200628, NOT = c(""), deduplicate = TRUE)

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "list"

@maurolepore pointed out to me that the package also includes a short manuscript to be submitted to the Journal of Open Source Software. I found the manuscript in the inst directory, read it, and found it to be both clear and concise. Good luck!

mcguinlu commented 4 years ago

Hi @tts,

Just a short note to say thanks so much for your review. I've given it a quick skim, and it seems that everything you propose will be straightforward to implement. I'll go through your comments systematically soon, and post a response/list of changes. (@maurolepore, a process question - is it better for me to wait until the second reviewer has filed their review before beginning to make changes?)

Thanks in particular for spotting the discrepancies across the package (old function names in the examples, missing definitions for arguments, problems with the code from the app). You are correct that there is some hangover from an earlier version of the package/early versions of the package functions - I thought I had caught them all, but obviously not! When I started developing medrxivr, the medRxiv API didn't exist, meaning the data argument of mx_search() was not required. To note, this is also what's causing the reproducible code from the Shiny app to fail, as under the new version of the function, mx_search(query) is read as mx_search(data = query).
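The failure mode described here comes from R's positional argument matching, and can be reproduced with two toy functions. `search_old()` and `search_new()` are hypothetical signatures used only for illustration; they are not the actual mx_search() definitions.

```r
# Hypothetical signatures, for illustration only.
search_old <- function(query) {
  paste("query =", query)
}
search_new <- function(data, query = NULL) {
  paste("data =", data, "| query =", if (is.null(query)) "NULL" else query)
}

# Under the old signature, the first positional argument is the query.
old_call <- search_old("coronavirus")
# Under the new signature, the same call binds the string to `data`.
new_positional <- search_new("coronavirus")
# Naming the arguments removes the ambiguity.
new_named <- search_new(data = "snapshot", query = "coronavirus")

print(old_call)
print(new_positional)
print(new_named)
```

Generated code that names its arguments explicitly would be robust to this kind of change in argument order.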

One specific thing I wanted to follow up on was that the "Automated testing" item in the reviewer checklist is not marked as complete - did you have any specific issues with/recommendations for this area of the package?

maurolepore commented 4 years ago

@tts, thanks for your wonderful review!

@mcguinlu, RE

"Is it better for me to wait until the second reviewer has filed their review before beginning to make changes?"

Both reviewers should work on the exact same package. You may change the package in a separate branch, but please only merge it after both reviewers have submitted their reviews.

tts commented 4 years ago

"One specific thing I wanted to follow up on was that the "Automated testing" item in the reviewer checklist is not marked as complete - did you have any specific issues with/recommendations for this area of the package?"

@mcguinlu Sorry, my bad. Both devtools::check() and devtools::test() ran without errors. Checked that item in the list.

maurolepore commented 4 years ago

@njahn82, I hope you are well. Could you please update us about your review?

njahn82 commented 4 years ago

Sorry, I didn't meet my review deadline. Will submit it by Wednesday. Thanks for your patience!


njahn82 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions in R help
[X] Examples for all exported functions in R Help that run successfully locally
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[X] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software been confirmed.
[X] Performance: Any performance claims of the software been confirmed.
[X] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5 hours

[X] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

This is a very timely package that reflects not just the increasing popularity of open access preprints in the health sciences, but also the issues around finding and searching them. Although a growing suite of scholarly search engines makes medRxiv preprints available, there seems to be no standard way to retrieve data from medRxiv thoroughly and systematically. Finding full texts is also challenging, because medRxiv preprints are not made available via PubMed Central. Similarly, metadata from Crossref, medRxiv's DOI registration agency, lacks links to PDF full texts.

Before I share my code review, I want to disclose that I neither have an academic background in Health Sciences nor have I been involved in systematic reviews as a librarian. I will therefore focus on more formal aspects of the package and its design.

Overall Design

The package contains functions to retrieve metadata from medRxiv, applying complex search strategies on a metadata snapshot, and download pdf full-texts. However, the source code repository contains a considerable amount of other functionality as well, which is outside of the R directory and excluded from the package build in .Rbuildignore:

app comprises a nice-looking and useful Shiny app helping users to build queries and visually explore the results.
data-extraction has scripts and functions for fetching and validating data.

There's also a link to (daily updated) data in an external GitHub repo, https://github.com/mcguinlu/medrxivr-data/, which is used in an exported R function.

My main concern with this approach is that dependencies, which are not part of the package, are loaded, and in one case installed. The code outside of the R folder also lacks documentation using roxygen tags and tests, and there's some redundancy. I feel that R code not part of the {medrxivr} package build either needs to be factored out or should be moved into the R/ directory.

In the following, I will focus on the functionality, which is part of the package build.

README

The README is very helpful to get started with the package. A brief description of what medRxiv is and a link to the preprint server would make the README more informative.
Maybe the distinction between downloading a snapshot and searching the remote snapshot could be made a bit clearer. I first started to download the whole corpus, and then realised that there's already a snapshot that I can use instead.
I love @tts sketch of the overall design. Maybe it can be adapted and re-used?

Documentation

In the beginning, I followed the docs on https://mcguinlu.github.io/medrxivr/index.html and faced several issues. It took me a while to realise that the pkgdown website is outdated, because it was not rebuilt after code and documentation updates.
Documentation of functions could be expanded, particularly roxygen tags @import and @importFrom do not cover all external functions used.
Documentation of mx_search() refers to a function called mx_raw(), which is not part of the package.
Preprints are quite a new scholarly communication phenomenon in Health Science and not all health scientists publish preprints regularly. Moreover, some other preprint servers target health science publications. Therefore, I think it would be good to warn users that medRxiv only contains part of the Health Science (preprint) literature.

Vignette

There are three vignettes, which is great. Again, the general overview misses a sentence about what the preprint server medRxiv is about.
Not all code chunks are rendered. Some are introduced with a blank between the ticks and {r}. Is this intentional?

Functionality

There is a considerable duplication of code regarding the API call, which can make it hard to update the package in case of API changes. It would be good to have a single function for the API call.
URL paths are constructed using paste. httr::modify_url() and the query argument of httr::GET() allow passing arguments to an API. Furthermore, {httr} provides helpful functionality to capture API errors more systematically than in the current implementation.
mx_crosscheck() does web scraping, which is fine according to the robots.txt. However, the requested crawl delay of 7 sec has not been implemented yet.

Here's the checking using {polite}

polite::bow("https://www.medrxiv.org/archive", force = TRUE)
#> <polite session> https://www.medrxiv.org/archive
#>     User-agent: polite R package - https://github.com/dmi3kno/polite
#>     robots.txt: 68 rules are defined for 1 bots
#>    Crawl delay: 7 sec
#>   The path is scrapable for this user-agent

^{Created on 2020-07-08 by the reprex package (v0.3.0)}
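
One way to honour the 7-second crawl delay reported above, without depending on any particular scraping package, is to wrap the request function in a throttle. The following is only a minimal base R sketch under stated assumptions: `make_throttled()` and `fetch_page()` are hypothetical names used for illustration and are not part of {medrxivr} or {polite}.

```r
# Wrap any function so that successive calls are at least `delay` seconds apart.
make_throttled <- function(f, delay = 7) {
  last_call <- NULL
  function(...) {
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      # Wait out whatever remains of the delay since the previous call.
      if (elapsed < delay) Sys.sleep(delay - elapsed)
    }
    last_call <<- Sys.time()
    f(...)
  }
}

# Hypothetical stand-in for a function that requests a single page.
fetch_page <- function(url) paste("fetched", url)

# The first call runs immediately; later calls wait for the crawl delay.
polite_fetch <- make_throttled(fetch_page, delay = 7)
```

Keeping the delay in a wrapper rather than inside the scraping loop means the same request function can be reused with a different delay if the robots.txt changes.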

mx_search() returns a grouped tibble. Personally, I prefer to have an ungrouped tibble. The column date is of type double, not date.
Because of the downloading time, it is good to have feedback about the progress. Maybe re-using a progress bar functionality like from {progress} can lead to less code, while expanding the current feedback mechanism.
mx_search(): rOpenSci style guide recommends snake case for params (from.date and to.date)
Finally, I wonder if Europe PMC could be of use for searching medRxiv. Europe PMC search syntax is quite extensive and supports Boolean operators, wildcards and controlled vocabularies. What are the reasons for not using it for searching medRxiv? Is it an indexing lag, or lacking metadata?

Here's a reprex using the vignette example, which took less than 2 seconds.

library(tidyverse)
library(europepmc)
ep_q <-
  c('PUBLISHER:"medRxiv" AND (mendelian* AND (randomisation OR randomization))')
epmc_l <- europepmc::epmc_search(ep_q, "raw", limit = 10000)
#> 91 records found, returning 91

my_df <-
  purrr::map_dfr(epmc_l, `[`, c("doi", "title", "abstractText"))
my_df %>%
  filter_at(vars(abstractText, title), any_vars(
    grepl(
      "[Mm]endelian(\\s)([[:graph:]]+\\s){0,4}randomi([[:alpha:]])ation",
      .
    )))
#> # A tibble: 81 x 3
#>    doi           title                         abstractText                     
#>    <chr>         <chr>                         <chr>                            
#>  1 10.1101/2020… Cardiometabolic traits, seps… Objectives: To investigate wheth…
#>  2 10.1101/2020… The relationship between gly… Aims: To investigate the relatio…
#>  3 10.1101/2020… Modifiable lifestyle factors… Aims: Assessing whether modifiab…
#>  4 10.1101/2020… Influence of blood pressure … Objectives: To determine whether…
#>  5 10.1101/2020… Increased adiposity is prote… Background Breast and prostate c…
#>  6 10.1101/2020… Examining the association be… Background: We examined associat…
#>  7 10.1101/2020… Investigating the potential … Aim: Use Mendelian randomisation…
#>  8 10.1101/2020… Unhealthy Behaviours and Par… Objective: Tobacco smoking, alco…
#>  9 10.1101/2020… Exploring the causal effect … BACKGROUND: Hearing loss has bee…
#> 10 10.1101/2020… Genetically informed precisi… Impaired lung function is associ…
#> # … with 71 more rows

^{Created on 2020-07-08 by the reprex package (v0.3.0)}

(Disclaimer: I maintain the {europepmc} package and I am curious to learn more about potential shortcomings of using Europe PMC instead of a primary literature source. Because I also find it sometimes not very helpful when reviewers point to their own work, I do not expect you to consider this :-))

Testing

All tests passed, but it took a while. My duration was 1221.5 sec. However, I was connected to the internet via a cell phone connection during the review of the package.
I realised that a lot of skipping for CI platforms happens and I wonder why? Is it the run-time?

I think that's it from me! Thank you for making Health Science preprints more accessible and better discoverable! Happy to help further with the process!

maurolepore commented 4 years ago

Thanks @njahn82 for your review!

@mcguinlu, please aim to address the comments of both @tts and @njahn82 within the next 2 weeks.

njahn82 commented 4 years ago

Hi @mcguinlu. Sorry, while still playing with your app, I just realised that I was wrong and nothing is installed from the Shiny app. Please ignore this bit from the review.

mcguinlu commented 4 years ago

@njahn82 Thanks a million for your detailed review! At a quick skim, everything you flag/recommend is fixable/implementable, and will definitely help to improve the functionality. I'm also looking forward to examining europepmc further - to be honest, I was not aware that medRxiv preprints were indexed in Europe PMC.

@maurolepore Just confirming that I have seen this, and so am aiming to address the comments by 23rd July (at the latest).

mcguinlu commented 4 years ago

Hi all (esp @maurolepore)

A brief message to let you know that I have most of the changes requested made, but due to external circumstances, I haven't yet finished off the small number of outstanding items. I'm now aiming to have it ready for re-review by Thursday week (6th August) at the very latest.

Very sorry for the delay, and hope this is okay!

maurolepore commented 4 years ago

That's okay. Thanks for letting me know.

mcguinlu commented 4 years ago

Overview

To start, thanks once again to everyone for your constructive comments! I've gone through all the posts above and have tried to extract all feedback here so that I can address it systematically - please let me know if I have missed anything. Comments are divided by commenter and topic for ease of reference, and are presented in bold text, with the response immediately below.

Reviewer Acknowledgement

I have also acknowledged both @tts and @njahn82 as reviewers in the Description, as I feel like your comments have added a great deal to the functionality/user-friendliness of the package. Please check your details just to make sure I have them correct!

Some key points about new functionality:

[x] I've added a new function (mx_export()) to save search results as a .bib file for import to reference management software (this functionality already existed for the app but hadn't been copied across to the package).
[x] I have separated out importing the snapshot into a separate function (mx_snapshot()), rather than just using mx_search(data = NULL) to indicate that you want to use the snapshot. I think this will help users be very clear about which data source they are using, and was influenced by feedback from both reviewers about being confused re: my description of the data sources.
[x] Due to the fact that the API endpoints are identical for both medRxiv and bioRxiv and the fact that all interactions with the API have been centralised into a single function (thanks to reviewers' feedback), coupled with demand from others in my group/people emailing me about it, the mx_api_*() functions now contain a server argument, allowing users to specify which server (medRxiv or bioRxiv) they want to interact with. Downloading the whole bioRxiv database does take a while, but users seem happy to do so rather than having to perform the searches and download the results manually via the website (see here for an example). One thing I am unsure about is whether to keep this as a happy bonus of the package's restructure following the reviews above, or whether to publicise it further/make it a key element of the package. Any advice/comments from reviewers/editors would be appreciated here!

Editor (@maurolepore)

Tasks

These have all been completed:

[x] Add rOpenSci badge to README
[x] Fix spelling (ml3)
[x] Fix style (ml6)

Discussion points

[x] How big is the database? How fast does it grow? And how long does it take to download it in a range of reasonable conditions? What happens in a range of extreme conditions?
As mentioned by @tts, there has been a substantial surge in the size of the database thanks to the COVID-19 pandemic. @tts reported times of 1.4 and 3.1 mins to download the database. I am not sure what is meant by extreme circumstances, but happy to do more testing if needed!
[x] Is the process transparent and "polite" to the user?
I'm not 100% sure what was meant by this comment. Re: interaction with medRxiv, all information is now taken from the API (previously mx_crosscheck() used web-scraping, but this is no longer the case) - is this what was meant?

Reviewer 1 (@tts)

General comments

[x] Although the target group and the goal of the package are clearly defined, it took me some time to understand the core functionality. I suppose the main reason for this is the varying terminology of data sources used in vignettes and help pages.
In addition to adding your suggested diagram, I have tried to make the language used across the documentation more consistent, but please do point out anything that could be clearer!

API

[x] Is there any way to gracefully stop the process if started by accident? . . . httr::RETRY is a new function to me. Thanks for this, I will definitely try to use it myself at some point. I wonder though if it allows a clean, user-friendly, forced exit and if yes, how should it be defined?
This is a great question, and to be honest, I am not sure how to implement this. Just so I'm sure we're on the same page, the issue is that because httr::RETRY is nested within the larger function, hitting "Esc" or clicking "Stop" stops that iteration of httr::RETRY, which treats it like a failure and then retries the URL. This results in having to hit "Esc"/click "Stop" multiple times in order to actually get mx_api_content() to stop. Is this right? And maybe @njahn82 might have some clever ideas about this?

Snapshot

[x] As of writing this, how long does it take to query the repo [via the snapshot]?
The rate-limiting step when searching via the snapshot is how long it takes to read in the CSV file from the medrxivr-data repository. Thanks to the new set-up, which uses vroom::vroom() rather than read.csv(), this step is now substantially faster. Trying mx_search(query="molecular"), I got an average search time of ~1 second (vs ~20 seconds previously, as per your review).

start_time <- Sys.time()
mx_results <- mx_search(data = mx_snapshot(), query = "molecular")
end_time <- Sys.time()
(end_time - start_time)

Using medRxiv snapshot - 2020-08-05 06:02
Found 289 record(s) matching your search.
Time difference of 1.341 secs
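
Since the figure quoted above is an average over several runs, a small helper can make such timings repeatable. This is a minimal base R sketch, not the code used to produce the numbers in this thread: `mean_elapsed()` and `slow_task()` are hypothetical names, and the stand-in workload simply sleeps so that the example runs without network access.

```r
# Run a zero-argument function `n` times and return the mean elapsed seconds.
mean_elapsed <- function(f, n = 5) {
  times <- vapply(
    seq_len(n),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  mean(times)
}

# Stand-in workload. To time the snapshot search, one would instead pass
# something like: function() mx_search(data = mx_snapshot(), query = "molecular")
slow_task <- function() Sys.sleep(0.05)

avg_seconds <- mean_elapsed(slow_task, n = 3)
```

Averaging over a few runs smooths out variation in network latency, which is the dominant cost when the snapshot is read from GitHub.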

Vignette/README

[x] The examples are a little confusing though because the functions shown are not the same; the first example uses mx_api_content, the second one mx_api which does not exist. I suppose mx_api is a typo, maybe the name of a former version? // In mx_search , the data argument is important because it defines the target. Again, the example in the help file is slightly misleading because there is no mx_raw function. A former version this one too I presume?
This is completely my bad. I thought I caught all references to this old function name, but obviously didn't. All references to mx_api() and mx_raw() have been removed/replaced as necessary.
[x] One minor thing about this example . . . The NOT argument does not match to Mild cognitive impairment which is found in one abstract, so perhaps better to use the form of [mM]ild cognitive impairment instead.
Thanks for catching this - I've changed the example to reflect this.

Download

[x] Note: the mx_download help file example of mx_search uses a limit argument which is not defined.
Thanks for catching this - removed now!

Shiny app

[x] However, there are some issues with the code [produced by the app]. Both the basic and advanced search codes throw an error when run in R.
This was due to the fact that the data argument now comes first in order to make it compatible with piping, meaning that the example code from the app was trying to pass the search terms to the data argument. This has been corrected, and the reproducible code should now work.
[x] When I ran mx_search with zero arguments, my first thought was that there are some issues with error handling. The query starts but clearly you need to include the search string too! However, after some time the error handling kicks in and correctly reminds me of the missing query argument. If I am not mistaken, the delay was caused by the latency of the default data source in the GitHub repository.
I've added a check to make sure that the data/query arguments are not empty very early on (prior to the rate limiting step of reading data from the GitHub repo), meaning that it fails fast and gracefully if no data source/search terms are provided.

mx_search()

Error in mx_search() : 
  Please provide medRxiv data to search, accessed from either from either the mx_api_content(), or mx_snapshot() function.
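
The fail-fast behaviour described above can be sketched as follows. This is an illustration only, not the package's implementation: `search_sketch()` and the toy data frame are hypothetical, and the real mx_search() accepts richer query structures than the single string assumed here.

```r
# Validate the cheap-to-check arguments first, before any slow step
# (such as reading data from a remote repository) would run.
search_sketch <- function(data = NULL, query = NULL) {
  if (is.null(data)) {
    stop("Please provide data to search.", call. = FALSE)
  }
  if (!is.character(query) || length(query) != 1 || !nzchar(query)) {
    stop("Please provide a search term.", call. = FALSE)
  }
  # Only reached once both arguments have passed the checks.
  data[grepl(query, data$title), , drop = FALSE]
}

# Toy data, invented purely for this example.
toy <- data.frame(
  title = c("Molecular epidemiology", "Sleep and diet"),
  stringsAsFactors = FALSE
)

hits <- search_sketch(toy, "[Mm]olecular")
```

Calling `search_sketch()` with no arguments stops immediately with the first message, rather than after an expensive download.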

Reviewer 2 (@njahn82)

General comments

[x] My main concern with this approach is that dependencies, which are not part of the package, are loaded, and in one case installed. The code outside of the R folder also lacks documentation using roxygen tags and tests, and there's some redundancy. I feel that R code not part of the {medrxivr} package build either needs to be factored out or should be moved into the R/ directory.
In response to this comment, the elements of this package that are beyond the core functionality (namely the snapshot creation and the code for the web app) have been moved into their own individual repositories, and cross-linked within the README. See here for the snapshot and app.

README

[x] The README is very helpful to get started with the package. A brief description of what medRxiv is and a link to the preprint server would make the README more informative.
I've added a sentence to this effect and link to the repository to the opening paragraph of both the README and the main vignette. Please let me know if this doesn't cover it.
[x] Maybe the distinction between downloading a snapshot and searching the remote snapshot could be made a bit clearer. I first started to download the whole corpus, and then realised that there's already a snapshot that I can use instead. /// I love @tts sketch of the overall design. Maybe it can be adapted and re-used?
Hopefully by including the graphic @tts suggested and by cleaning the language used in the README/vignettes, this is addressed.

Vignette

[x] There are three vignettes, which is great. Again, the general overview misses a sentence about what the preprint server medRxiv is about.
Addressed in the README comment above.
[x] Not all code chunks are rendered. Some are introduced with a blank between the ticks and {r} Is this intentional?
To quote Ollivander¹:

No, no, definitely not.

Not exactly sure what happened here, but I think it is fixed now. Please let me know if this is not the case!

Functionality

[x] There is a considerable duplication of code regarding the API call, which can make it hard to update the package in case of API changes. It would be good to have a single function for the API call. // URL paths are constructed using paste. httr::modify_url() and the query argument of httr::GET() allow passing arguments to an API. Furthermore, {httr} provides helpful functionality to capture API errors more systematically than in the current implementation.
All interactions with the API have now been centralised in a collection of helper functions in R/helpers.R, which make use of httr::modify_url() and httr::GET(). Additionally, better/more informative handling of API errors, using httr::stop_for_status(), has also been implemented. Finally, due to the API being extremely unreliable over the past two weeks, an additional helper function, skip_on_api_message(), has been added.
[x] mx_crosscheck() does web scraping, which is fine according to the robots.txt. However, the requested crawl delay of 7 sec has not been implemented yet.
The mx_crosscheck() function has been updated to make use of the API interface rather than web scraping. This change was made as there is often a discrepancy between the API-provided total number of records and the total number on the website (which is what was originally used to compare against the snapshot). It also means that cross-checking between the snapshot and the live database is much faster, and can be used as an indicator of whether or not it is worth downloading your own copy via the API.
[x] mx_search() returns a grouped tibble. Personally, I prefer to have an ungrouped tibble. The column date is of type double, not date.
mx_search() now returns an ungrouped tibble. The date column is now of type Date.
[x] Because of the downloading time, it is good to have feedback about the progress. Maybe re-using a progress bar functionality like from {progress} can lead to less code, while expanding the current feedback mechanism.
This has been implemented for mx_api_content(), both to give users better feedback in terms of progress and to better estimate the remaining time needed to download a local copy of the database.
[x] mx_search(): rOpenSci style guide recommends snake case for params (from.date and to.date)
The argument names in both mx_search() and mx_api_content() have been updated to reflect this. In addition, I have made the format for specifying a date consistent between the two functions: "2020-06-01" (previously, mx_search() used a numeric format: 20200601)
[x] Finally, I wonder, if Europe PMC could be of use for searching medRxiv. Europe PMC search syntax is quite extensive and supports Boolean operator, wildcards and controlled vocabularies. What are the reasons not using it for searching medRxiv? Is it an indexing lag, or lacking metadata?
Full disclosure: I was not aware that medRxiv preprints were captured by Europe PMC. However, on investigation, there seem to be two differences between searching medRxiv directly vs via europepmc, using your example search.
- The first difference is expected: it takes a while for things to be indexed in PMC, and so searching the medRxiv repo directly means you are as up-to-date as possible. This can be seen in the example below, where europepmc gives 8994 records total, while medrxivr gives 9146.
- The second is not expected: comparing the output of your example search between the two packages shows that there are three records retrieved from medRxiv that are not present in PMC (110 medrxivr, 107 europepmc). Of these, one was published on the 01/06/2020, so it is maybe not surprising that it is not indexed yet (though preprints published on medRxiv after this date are), but one of the other records was registered on medRxiv in February (04/02/2020). I'm not sure why this has not been captured by PMC, but if you have any ideas, be great to hear them!

One last reason (and one of the initial motivating factors for developing the package) was that medrxivr allows you to search for/download multiple versions of the same preprint (mx_search(data, query, deduplicate = FALSE)), allowing for comparison between them. As far as I can see, this functionality is not implemented in europepmc (but please correct me if I am wrong!).

Code for comparison

```{r}
# Load packages -----------------------------------------------------------
library(tidyverse)
library(europepmc)
library(medrxivr)

# Compare total records returned ------------------------------------------

# Using europepmc gives 8994
ep_q <- c('PUBLISHER:"medRxiv"')
epmc_l <- europepmc::epmc_search(ep_q, "raw", limit = 10000)
pmc_all <- purrr::map_dfr(epmc_l, `[`, c("doi", "title", "abstractText"))

# Using medrxivr gives 9146
mx_all <- mx_snapshot() %>%
  mx_search(query = "*")

# Compare searches --------------------------------------------------------
pattern <- "[Mm]endelian(\\s)([[:graph:]]+\\s){0,4}[Rr]andomi([[:alpha:]])ation"

# Using europepmc gives 107 records
pmc_results <- pmc_all %>%
  filter_at(vars(abstractText, title), any_vars(
    grepl(
      pattern,
      .
    )))

# Using medrxivr gives 110 records
mx_results <- mx_snapshot() %>%
  mx_search(query = pattern, fields = c("title","abstract","doi"))

# Find records found by medrxivr but not europepmc -----------------------
'%notin%' <- Negate('%in%')

# Gives 3 records
discrepancy_df <- mx_results %>%
  filter(doi %notin% pmc_results$doi)
```
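
To make clearer what the search pattern in the comparison code accepts, here is a small self-contained check in base R. The example strings are invented for illustration and are not taken from either database.

```r
# Same pattern as in the comparison code: "mendelian", one whitespace
# character, up to four intervening words, then "randomisation" or
# "randomization" (with either case for the leading letters).
pattern <- "[Mm]endelian(\\s)([[:graph:]]+\\s){0,4}[Rr]andomi([[:alpha:]])ation"

examples <- c(
  "Mendelian randomisation",
  "mendelian two-sample Randomization",
  "Mendelian one two three four five randomization",
  "randomised controlled trial"
)

matches <- grepl(pattern, examples)
```

The first two strings match; the third does not, because five words separate the two terms and the pattern allows at most four; the fourth does not mention "mendelian" at all.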

Testing

[x] All tests passed, but it took a while. My duration was 1221.5 sec. However, I was connected to the internet via a cell phone connection during the review of the package.
This is likely due to the old way of reading in the data (i.e. via read.csv()). Now that the package is using vroom, testing should be a lot faster. I ran the testing a few times, and the average was ~ 140 s. From now on, the rate limiting step will be how fast it can download the copy from the database.
[x] I realised that a lot of skipping for CI platforms happens and I wonder why? Is it the run-time?
The skips were initially due to the fact that I assumed Travis/Appveyor wouldn't like it if I saved files to their systems, which is what the tests for mx_download() did. However, I've tried it, and all tests work as expected, so I have removed these skips from the updated version.

¹ Harry Potter and the Sorcerer's Stone (movie), 00:28:35.

maurolepore commented 4 years ago

@mcguinlu, thanks for your hard work. See my comments below.

@tts and @njahn82 please consider @mcguinlu 's changes and respond with either your approval or a suggestion for improvement.

Following up previous comments (editor checks)

ml3: spelling::spell_check_package() still shows some unknown words. Please update your words list and consider automating the process with usethis::use_spell_check().

spelling::spell_check_package()
#>   WORD      FOUND IN
#> bioRxiv   mx_api_content.Rd:5,35
#> Harbour   mx_api_content.Rd:5,35

ml4: goodpractice::gp() still suggests some improvements:

suppressWarnings(goodpractice::gp())
...
─────────────────────────────────────────────────────────────────
#>
#> It is good practice to
#>
#>   ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R
#>     users and developers are used it and it is easier to read your code
#>     for them if you use '<-'.
#>
#>     tests/testthat/test-helpers.R:4:7
#>     tests/testthat/test-helpers.R:20:7
#>
#>   ✖ avoid long code lines, it is bad for readability. Also, many people
#>     prefer editor windows that are about 80 characters wide. Try make
#>     your lines shorter than 80 characters
#>
#>     R/mx_crosscheck.R:24:1
#>     R/mx_download.R:7:1
#>     R/mx_search.R:37:1
#>     tests/testthat/test-export.R:9:1
#>
#> ────────────────────────────────────────────────────────────────────────────────

ml5: covr::package_coverage() shows greater coverage than before; thanks. The only file that's still a little low is R/mx_crosscheck.R. Please consider adding more tests or excluding code as necessary (https://github.com/r-lib/covr#exclusions).

# > covr::package_coverage()
# medrxivr Coverage: 91.04%
# R/mx_crosscheck.R: 69.23%
# R/helpers.R: 83.33%
# R/mx_search.R: 90.27%
# R/mx_download.R: 93.85%
# R/mx_api.R: 100.00%
# R/mx_export.R: 100.00%
# R/mx_info.R: 100.00%
# R/mx_snapshot.R: 100.00%

ml6: usethis::use_tidy_style() suggests some files could improve. Please run usethis::use_tidy_style() and consider committing the changes.

usethis::use_tidy_style()
#> ✔ Setting active project to '/home/mauro/git/medrxivr'
#> Styling  17  files:
#>  R/helpers.R                      ℹ
#>  R/medrxivr.R                     ✔
#>  R/mx_api.R                       ✔
#>  R/mx_crosscheck.R                ✔
#>  R/mx_download.R                  ✔
#>  R/mx_export.R                    ✔
#>  R/mx_info.R                      ✔
#>  R/mx_search.R                    ✔
#>  R/mx_snapshot.R                  ✔
#>  tests/testthat.R                 ✔
#>  tests/testthat/test-api.R        ✔
#>  tests/testthat/test-crosscheck.R ✔
#>  tests/testthat/test-download.R   ✔
#>  tests/testthat/test-export.R     ✔
#>  tests/testthat/test-helpers.R    ℹ
#>  tests/testthat/test-info.R       ✔
#>  tests/testthat/test-search.R     ✔
#> ────────────────────────────────────────
#> Status   Count   Legend
#> ✔    15  File unchanged.
#> ℹ    2   File changed.
#> ✖    0   Styling threw an error.
#> ────────────────────────────────────────
#> Please review the changes carefully!
#>
#> ✔ Styled project according to the tidyverse style guide

New comments (suggestions)

ml7: On the website, the Reference tab shows "All functions". Maybe you can help users navigate this reference by grouping functions in some meaningful way? (see https://pkgdown.r-lib.org/reference/build_reference.html).
ml8: You may want to consider setting up CI services for a wider range of environments. Here are two workflows you may use -- standard, and full.
ml9: I see three .Rmd files inside vignettes/ but only two in the Articles section of the website. Is this intentional? Also, vignettes are great, but they can make the installation heavier. Consider the difference between use_vignette() and use_article().
ml10: I recommend walking through the steps listed by use_release_issue() or devtools::release(). Even if you don't submit to CRAN, walking through the process can help you find details to improve.
ml11: The vignettes show code but not output. Reproducible examples are most useful when they include the output because readers can understand what the code does even if they choose not to run the code themselves. This is why reprex::reprex() prints output (https://reprex.tidyverse.org/).

tts commented 4 years ago

Hi @mcguinlu and thanks for your efforts! Below, I'll use your headings, and give my remarks to each of them.

General comments. In addition to adding your suggested diagram, I have tried to make the language used across the documentation more consistent, but please do point out anything that could be clearer!

Great, much better now. I cannot find anything more to complain about :)

API. having to hit "Esc"/click "Stop" multiple times in order to actually get mx_api_content() to stop. Is this right?

Yes, that's what I mean, and I understand your explanation. Several clicks do terminate the download, so I find this sufficient.

Snapshot. Trying mx_search(query="molecular")

Time difference of 6.319079 secs for me, which is not bad.

Vignette/README

Ok now.

Download

Ok.

Shiny app mx_search()

Reproducible code works now fine, and a missing data | query argument is caught right away. Good!

One new comment

mx_info(commit = "master")
Error in mx_info(commit = "master") : could not find function "mx_info"

Except this comment, I give my approval.

mcguinlu commented 4 years ago

Thanks for the further feedback both (and Happy Friday)! Please find my responses to your comments below:

Editor (@maurolepore)

ml3: spelling::spell_check_package() still shows some unknown words. Please update your words list and consider automating the process with usethis::use_spell_check(). I have automated the spellcheck now as recommended.
ml4: goodpractice::gp() still suggests some improvements. I have addressed all issues raised, and goodpractice::gp() now does not recommend any further improvements.
ml5: covr::package_coverage() shows greater coverage than before; thanks. The only file that's still a little low is R/mx_crosscheck.R. Please consider adding more tests or excluding code as necessary (https://github.com/r-lib/covr#exclusions). I have added more tests to increase the coverage, and where it is not possible to test the error handling behaviour (e.g. because it's not possible to simulate the user not having an internet connection or the API returning a specific message), I have excluded lines as needed. The skipped lines are all marked with a #nocov comment, so can be readily found for inspection. I've included the output of my local run of covr::package_coverage() below:

medrxivr Coverage: 100.00%
R/helpers.R: 100.00%
R/mx_api.R: 100.00%
R/mx_crosscheck.R: 100.00%
R/mx_download.R: 100.00%
R/mx_export.R: 100.00%
R/mx_info.R: 100.00%
R/mx_search.R: 100.00%
R/mx_snapshot.R: 100.00%
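
For readers unfamiliar with the exclusion comments mentioned above, the two forms that {covr} recognises look roughly like this. The function below is a hypothetical illustration and is not taken from the {medrxivr} source.

```r
# Hypothetical helper showing covr's exclusion comments.
fetch_or_stop <- function(online = TRUE, api_message = NULL) {
  if (!online) {
    stop("No internet connection detected.", call. = FALSE) # nocov
  }

  # nocov start
  if (!is.null(api_message)) {
    stop("The API returned a message: ", api_message, call. = FALSE)
  }
  # nocov end

  "ok"
}

result <- fetch_or_stop()
```

A trailing `# nocov` excludes a single line, while `# nocov start` and `# nocov end` exclude everything between them, so the reported percentage reflects only the lines that remain.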

ml6: usethis::use_tidy_style() suggests some files could improve. Please run usethis::use_tidy_style() and consider committing the changes. I have run this and committed the changes.
ml7: On the website, the Reference tab shows "All functions". Maybe you can help users navigate this reference by grouping functions in some meaningful way? (see https://pkgdown.r-lib.org/reference/build_reference.html). I had added keywords to the functions already, but hadn't realised that you needed to alter the _pkgdown.yml file in order to group the functions. This has now been implemented, and functions are grouped into three categories: "Accessing medRxiv/bioRxiv data", "Performing the search", and "Helper functions".
ml8: You may want to consider setting up CI services for a wider range of environments. Here are two workflows you may use -- standard, and full. Thanks for the recommendation - I have gone with the standard workflow, and R CMD check passes in all environments.
ml9: I see three .Rmd files inside vignettes/ but only two in the Articles section of the website. Is this intentional? Also, vignettes are great, but they can make the installation heavier. Consider the difference between use_vignette() and use_article(). Yes, this is intentional. When you include a .Rmd file with the same name as the package in the vignette/ folder, pkgdown treats this as a special type of vignette ("Get Started"). From the pkgdown website:

A vignette with the same name as the package (e.g., vignettes/pkgdown.Rmd or vignettes/articles/pkgdown.Rmd) automatically becomes a top-level "Get started" link, and will not appear in the articles drop-down. (If your package name includes a ., e.g. pack.down, use a - in the vignette name, e.g. pack-down.Rmd.)

I have also taken your advice and converted the two vignettes covering advanced topics to articles, and signposted to them in the final introductory vignette.
ml10: I recommend walking through the steps listed by use_release_issue() or devtools::release(). Even if you don't submit to CRAN, walking through the process can help you find details to improve. As a result of this process, the following changes were made:
- xml2 was removed from the DESCRIPTION as it is no longer needed now that the package does not perform any web-scraping.
- Titles of some of the functions were edited to be more comprehensive, so that the pkgdown function list is more useful.
- README.html was removed from the top level directory.
ml11: The vignettes show code but not output. Reproducible examples are most useful when they include the output because readers can understand what the code does even if they choose not to run the code themselves. This is why reprex::reprex() prints output (https://reprex.tidyverse.org/). Thanks for this feedback. I have decided not to produce output for the one remaining vignette, as the example code in this vignette calls the API via mx_api_content(). I am worried that enabling evaluation of the code in this vignette would mean that it would take a long time to render and make installing the package slow. However, for the two new articles (converted from vignettes as per ml9, and included only in the pkgdown website), the output is now shown.

Reviewer 1 (@tts)

Glad to hear things are a bit clearer now!

The reason mx_info() is not found is that it is an internal function (medrxivr:::mx_info()) and should not have been available in the function list on the pkgdown website. I had marked several internal functions with the "Internal" keyword, which should have hidden them, but it seems that pkgdown is case sensitive and the correct keyword is "internal". This has been corrected now and the internal functions no longer appear in the website's function list.
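
For illustration, a minimal sketch of the roxygen2 tagging involved is shown here. The function is a hypothetical stand-in rather than medrxivr's actual mx_info(); the relevant part is the lower-case `@keywords internal` tag, which keeps the help page but drops the topic from the pkgdown reference index.

```r
#' Build a status message
#'
#' Hypothetical internal helper, used here only for illustration.
#'
#' @param n Number of records found.
#' @return A character string.
#' @keywords internal
status_message <- function(n) {
  paste0("Found ", n, " records")
}
```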

Finally, just wanted to confirm that your details in the DESCRIPTION are correct?

tts commented 4 years ago

@mcguinlu Yes, my details in DESCRIPTION are correct.

maurolepore commented 4 years ago

@mcguinlu, just to let you know that I believe @njahn82 will respond to your changes next week.

mcguinlu commented 4 years ago

Great - thanks for letting me know!

njahn82 commented 4 years ago

Great job @mcguinlu, and thank you for the careful and thorough consideration of my review. I feel it is clearer now what the package does and how it relates to the Shiny app and the backup/dump mechanism.

Thank you also for cross-checking with Europe PMC and demonstrating the added value of the medrxivr package.

Although all my suggestions have been addressed, I have some final suggestions:

I wonder if the returned data frames from the mx_api_* family could be also represented as tibbles?
The package does a good job in parsing and cleaning preprint metadata. Unfortunately, I cannot find documentation or an example showcasing what is actually returned. Can you provide one reproducible example in the README and/or extend the documentation in the function docs?
In the function docs of mx_export(), it says Dataframe returned by mx_search(), but I realised that also data obtained from the mx_api_ family can be exported as bib file using mx_export().

mcguinlu commented 4 years ago

Thanks @njahn82. Just to note as well that I recently moved the snapshot functionality from relying on my local Task Scheduler to working from GitHub Actions, so it should now be a lot more robust (in the past, if my local PC experienced network issues, the snapshot would not be taken).

In response to your comments:

I wonder if the returned data frames from the mx_api_* family could be also represented as tibbles?

The package now returns tibbles across the board. I had never really understood the difference, but after a bit of research, I do prefer the printing defaults for tibble objects.
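
As a small sketch of what that change means in practice, the snippet below converts a toy data frame to a tibble. The column names and values are made up for illustration and are not real medrxivr output; the conversion is skipped if the tibble package is not installed.

```r
# Toy records with made-up values; the column names are assumptions.
records <- data.frame(
  doi = c("10.1101/example.1", "10.1101/example.2"),
  title = c("First example preprint", "Second example preprint"),
  stringsAsFactors = FALSE
)

# Convert only if tibble is installed, so the sketch runs either way.
if (requireNamespace("tibble", quietly = TRUE)) {
  records <- tibble::as_tibble(records)
}

# A tibble is still a data frame, so existing downstream code keeps working.
is.data.frame(records)
```

The main visible difference is in printing: a tibble shows only the rows and columns that fit on screen, along with each column's type.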

The package does a good job in parsing and cleaning preprint metadata. Unfortunately, I cannot find documentation or an example showcasing what is actually returned. Can you provide one reproducible example in the README and/or extend the documentation in the function docs?

Hoping I understood this ask correctly, there is now a section in the README that describes how to access the raw, uncleaned API data using the mx_api_*() functions, which also points to a section in the API article on the pkgdown website that provides more detail and an example of the uncleaned output. In addition, a clearer description of what the cleaning process entails has been included in the documentation of the mx_api_*() functions (e.g. here).

In the function docs of mx_export(), it says Dataframe returned by mx_search(), but I realised that also data obtained from the mx_api_ family can be exported as bib file using mx_export().

Thanks for this - I have updated the docs for the mx_export() function to read @param data Dataframe returned by mx_search() or mx_api_*() functions

@maurolepore, I have checked that these changes don't throw any new errors and that goodpractice doesn't recommend any changes. I've also re-run styler/spelling functions and committed any modifications.

Hoping we are nearly there!

maurolepore commented 4 years ago

Thanks @njahn82 and @mcguinlu,

@njahn82, once again, please consider @mcguinlu 's changes and respond with either your approval or further suggestions for improvement.

njahn82 commented 4 years ago

Thank you again @mcguinlu for your careful consideration of my review! All my suggestions have been addressed.

maurolepore commented 4 years ago

Approved! Thanks @mcguinlu for submitting and @tts and @njahn82 for your reviews! :smile:

To-dos:

[ ] Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. Soon you will be invited to a team that should allow you to do so. You'll be made admin once you do.
[ ] Fix all links to the GitHub repo to point to the repo under the ropensci organization.
[ ] If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
- deactivate the automatic deployment you might have set up
- remove styling tweaks from your pkgdown config but keep that config file
- replace the whole current pkgdown website with a redirecting page
- replace your package docs URL with https://docs.ropensci.org/package_name
- In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
[ ] Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).
[ ] We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

From https://github.com/ropensci/software-review/issues/380#issue-625738088 I see you wish to automatically submit to the Journal of Open Source Software? If so:

[ ] Activate Zenodo watching the repo
[ ] Tag and create a release so as to create a Zenodo version and DOI
[ ] Submit to JOSS at https://joss.theoj.org/papers/new, using the rOpenSci GitHub repo URL. When a JOSS "PRE REVIEW" issue is generated for your paper, add the comment: This package has been reviewed by rOpenSci: https://LINK.TO/THE/REVIEW/ISSUE, @ropensci/editors

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.
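
A minimal sketch of what such entries could look like follows, written as the R expression that goes in the Authors@R field. The reviewer names and comments are placeholders, not the reviewers' real details, and should only be filled in with their consent.

```r
# Placeholder reviewer entries for illustration; replace the names with the
# reviewers' real details once they have agreed to be listed.
authors <- c(
  person("Luke", "McGuinness",
         role = c("aut", "cre"),
         email = "luke.mcguinness@bristol.ac.uk"),
  person("First", "Reviewer",
         role = "rev",
         comment = "Reviewed the package for rOpenSci (placeholder entry)"),
  person("Second", "Reviewer",
         role = "rev",
         comment = "Reviewed the package for rOpenSci (placeholder entry)")
)

# Printing shows how the entries will be rendered, including the [rev] role.
print(authors)
```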

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

We've put together an online book with our best practice and tips, this chapter starts the third section that's about guidance for after onboarding. Please tell us what could be improved, the corresponding repo is here.

annakrystalli commented 4 years ago

Hello @mcguinlu! I've just invited you to the @ropensci/medrxivr team! You should now be allowed to transfer the repo. Once you do, just ping me here and I'll transfer full admin rights back to you 🙂👍

mcguinlu commented 4 years ago

Hi @annakrystalli have transferred across now. @maurolepore thanks for the checklist - I will work through it over the coming day. And finally, just flagging to @stefaniebutland that I would be interested in producing a blog post for this package!

Thanks again to @tts and @njahn82 for reviewing, and @maurolepore for herding us all through the process!

annakrystalli commented 4 years ago

Thanks @mcguinlu ! Full admin rights now returned 👍

danielskatz commented 4 years ago

Has this review been completed? (I'm asking as the editor of the corresponding JOSS submission)

mcguinlu commented 4 years ago

To-dos:

[x] Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. Soon you will be invited to a team that should allow you to do so. You'll be made admin once you do.

[x] Fix all links to the GitHub repo to point to the repo under the ropensci organization.

[x] If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
- deactivate the automatic deployment you might have set up
- remove styling tweaks from your pkgdown config but keep that config file
- replace the whole current pkgdown website with a redirecting page
- replace your package docs URL with https://docs.ropensci.org/package_name
- In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar

[x] Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).

[x] We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

From #380 (comment) I see you wish to automatically submit to the Journal of Open Source Software? If so:

[x] Activate Zenodo watching the repo

[x] Tag and create a release so as to create a Zenodo version and DOI

[x] Submit to JOSS at https://joss.theoj.org/papers/new, using the rOpenSci GitHub repo URL. When a JOSS "PRE REVIEW" issue is generated for your paper, add the comment: This package has been reviewed by rOpenSci: https://LINK.TO/THE/REVIEW/ISSUE, @ropensci/editors

Okay, I've completed all the steps now @maurolepore! Re: the JOSS review, please see @danielskatz's comment above.

The one thing I wasn't clear on was how to replace the old pkgdown website with a redirecting page - seeing as the repo has been transferred across, the old pkgdown website on GitHub Pages (https://mcguinlu.github.io/medrxivr/index.html) no longer exists (I think?!), so I wasn't clear on how to set the redirect.

maurolepore commented 4 years ago

@danielskatz, thanks for checking. Yes, as the guest editor of this submission, I confirm this review has been completed.

maurolepore commented 4 years ago

@mcguinlu,

RE:

The one thing I wasn't clear on was how to replace the old pkgdown website with a redirecting page - seeing as the repo has been transferred across, the old pkgdown website on GitHub Pages (https://mcguinlu.github.io/medrxivr/index.html) no longer exists (I think?!), so I wasn't clear on how to set the redirect.

I'm sorry this isn't clear for you or me. But as you say, the working website seems correct. I see no reason to worry.

Here are a few more comments from section 8.1.4 of https://devguide.ropensci.org/:

If you intend to submit to CRAN, see CRAN gotchas. I'm happy to provide support through this process. Let me know.

Please check these boxes to confirm you've done the following last steps:

[ ] Add a CodeMeta file by running codemetar::write_codemeta() (codemetar GitHub repo)
[ ] Change any needed links, such as those for CI badges
[ ] Re-activate CI services
- For Travis, activating the project in the ropensci account should be sufficient
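- For AppVeyor, tell the author to update the GitHub link in their badge, but do not transfer the project: AppVeyor projects should remain under the authors’ account. The badge is [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).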
- For Codecov, the webhook may need to be reset by the author.
[ ] If authors maintain a gitbook that is at least partly about their package, contact an rOpenSci staff member so they might contact the authors about transfer to the ropensci-books GitHub organisation.
[ ] Add a “peer-reviewed” topic to the repo (it seems I'm the one supposed to do this but I apparently lack the privileges to access the "topics" settings -- see if you can or let me know).

Ping me when this is done and I'll then close this issue.

Thanks!

maurolepore commented 4 years ago

@mcguinlu, I see you already mentioned Stefanie Butland above. To comply with https://devguide.ropensci.org/editorguide.html#after-review, I also mention @ropensci/blog-editors for follow-up about your willingness to write a blog post or tech note.

Finally, please see https://devguide.ropensci.org/editorguide.html#package-promotion

mcguinlu commented 4 years ago

So in response to the last few bits:

[x] Add a CodeMeta file by running codemetar::write_codemeta() (codemetar GitHub repo)

CodeMeta file added (see here)

[x] Change any needed links, such as those for CI badges

All CI badges updated to point to the ropensci endpoints (e.g. see here)

[x] Re-activate CI services

Done, and have triggered a build under the new set-up to ensure everything works, which was successful.

[ ] If authors maintain a gitbook that is at least partly about their package, contact an rOpenSci staff member so they might contact the authors about transfer to the ropensci-books GitHub organisation.

Not applicable to me.

[x] Add a “peer-reviewed” topic to the repo (it seems I'm the one supposed to do this but I apparently lack the privileges to access the "topics" settings -- see if you can or let me know).

Done!

Thanks also for the additional materials re: CRAN submission (I do intend to submit to CRAN in the near future) and promotion, and for looping in @ropensci/blog-editors.

And I think that's us!

maurolepore commented 4 years ago

@mcguinlu , thanks and congratulations! To the best of my knowledge, this completes the review process so I'll close now.

You may already know this. To prepare packages for CRAN, usethis::use_release_issue() is useful. And here are some aspects of the workflow that might help (including a link to some tweaks). Feel free to reach out with questions.

Are you in rOpenSci's Slack workspace? If not, I recommend you find someone who can add you. I have found friendly advice there that I wouldn't find anywhere else.

stefaniebutland commented 4 years ago

Hello @mcguinlu. We'd love to have a post about medrxivr.

Our Blog Guide has most of the information you should need, with both content and technical advice. For readers, it would be helpful to highlight how this package relates to similar ones and the specific niche that medrxivr fills. Once that's clear early in the post, your readers will give their attention.

Let me know when you'd like to submit a draft and I can suggest a publication date.

stefaniebutland commented 4 years ago

@mcguinlu Also let me know if you'd like a new invitation to rOpenSci Slack. We could move this discussion there for example.

mcguinlu commented 4 years ago

@stefaniebutland a new invite would be great! I thought I had activated the first one correctly but apparently not (I am still getting to grips with Slack) 🤦‍♂️ and happy to continue chatting about this there.