ropensci / software-review

rOpenSci Software Peer Review.
291 stars 104 forks source link

medrxivr: Accessing and searching medRxiv preprint data in R #380

Closed mcguinlu closed 4 years ago

mcguinlu commented 4 years ago

Submitting Author: Luke McGuinness (@mcguinlu) Repository: https://github.com/mcguinlu/medrxivr
Version submitted: 0.0.2 Editor: @maurolepore Reviewer 1: @tts Reviewer 2: @njahn82 Archive: TBD
Version accepted: TBD


Package: medrxivr
Title: Access MedRxiv Preprint Data
Version: 0.0.2
Authors@R: c(
    person("Luke", "McGuinness",
           role = c("aut", "cre"),
           email = "luke.mcguinness@bristol.ac.uk",
           comment = c(ORCID = "0000-0001-8730-9761")),
    person("Lena", "Schmidt",
           role = "aut",
           comment = c(ORCID = "0000-0003-0709-8226")))
Description: The medRxiv <https://www.medrxiv.org/> repository is a free online
    archive and distribution server for complete but unpublished manuscripts 
    (preprints) in the medical, clinical, and related health sciences. medrxivr
    provides programmatic access to both medRxiv API <https://api.biorxiv.org/>
    and a static snapshot of database, which is updated daily. Users can then 
    search for relevant records using regular expressions and Boolean logic, and
    can easily download the full-text PDFs of preprints matching their search 
    criteria. 
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Language: en-US
URL: https://github.com/mcguinlu/medrxivr
BugReports: https://github.com/mcguinlu/medrxivr/issues
Imports: 
    rvest,
    methods,
    dplyr,
    xml2,
    curl,
    jsonlite,
    httr,
    stringr,
    rlang
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    covr,
    kableExtra
VignetteBuilder: 
    knitr, 
    rmarkdown    
RoxygenNote: 7.1.0

Scope

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

JOSS Options - [X] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). - [X] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [X] The package is deposited in a long-term repository with the DOI: 10.5281/zenodo.3860024 - (*Do not submit your package separately to JOSS*)
MEE Options - [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

Tagging my co-author @L-ENA for reference.

maurolepore commented 4 years ago

@mcguinlu and @L-ENA, thanks for your submission. I'll be the editor. As we move though the process I'll keep you posted. I welcome your questions any time.

maurolepore commented 4 years ago

Editor checks:


Editor comments

\@mcguinlu, thanks again for your submission. The editor checks flagged a few issues that need your attention; see them below.

Let's discuss the first two items (ml1 and ml2) before I search for reviewers; these two items refer to a potential overlap with existing packages.

The remaining items are important but not as urgent as the first two.

> spelling::spell_check_package()
WORD              FOUND IN
api               mx_api_content.Rd:39,42
                  mx_api_doi.Rd:28,31
                  description:4
AppVeyor          README.md:14
                  README.Rmd:24
ation             building-complex-search-strategies.Rmd:110
biorxiv           description:4
capitalisation    building-complex-search-strategies.Rmd:125
... more lines
> goodpractice::gp()
... more lines
── GP medrxivr ─────────────────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in
    general. 77% of code lines are covered by test cases.

    R/mx_crosscheck.R:50:NA
    R/mx_download.R:25:NA
    R/mx_download.R:27:NA
    R/mx_download.R:28:NA
    R/mx_download.R:30:NA
    ... and 51 more lines
> covr::package_coverage()
medrxivr Coverage: 77.60%
R/mx_download.R: 1.92%
R/mx_crosscheck.R: 96.15%
R/mx_search.R: 96.33%
R/mx_api.R: 100.00%
R/mx_info.R: 100.00%
> styler::style_pkg()
Styling  12  files:
 R/medrxivr.R                     ✓ 
 R/mx_api.R                       ℹ 
 R/mx_crosscheck.R                ℹ 
 R/mx_download.R                  ℹ 
 R/mx_info.R                      ℹ 
 R/mx_search.R                    ℹ 
 tests/testthat.R                 ✓ 
 tests/testthat/test-api.R        ℹ 
 tests/testthat/test-crosscheck.R ✓ 
 tests/testthat/test-download.R   ℹ 
 tests/testthat/test-info.R       ✓ 
 tests/testthat/test-search.R     ℹ 
────────────────────────────────────────
Status  Count   Legend 
✓   4   File unchanged.
ℹ   8   File changed.
x   0   Styling threw an error.
────────────────────────────────────────
Please review the changes carefully!

Reviewers: @tts and @njahn82 Due date: 2020-07-01

mcguinlu commented 4 years ago

Hi @maurolepore

Thanks for your inital review of our package. I've gone through it and try to address each point below:

ml1: Overlap with fulltext Personally, I think that fulltext and medrxivr should continue to be two seperate packages, but that there is the potential for intergration between the two (as occured with the aRxiv package). In all honesty, I completely forgot to reply to the issue on fulltext 🤦‍♂️- sorry @sckott! The reason I think they should be seperate is two fold:

ml2: Overlap with biorxivr Thanks for highlighting this. I did come across this package while developing medrxivr - however, the last work on this package took place 5 years ago, before introduction of the API which you refer to in your query. In addition, while the base URL of the API contains "bioRxiv" (e.g. https://api.biorxiv.org/), this is only because the same organisation (Cold Spring Harbour) is responsible for both repositories. The actually endpoint for the medRxiv API is https://api.biorxiv.org/details/medrxiv/[interval]/[cursor]/[format]. Finally, as mentioned above, biorxivr relies on the search functionality offered by the site (e.g. by pasting the query on after the search/ path) rather than performing the searches itself.

ml3: Spelling I've added this as an issue (https://github.com/mcguinlu/medrxivr/issues/4) and plan to address it soon.

ml4/ml5: Test coverage In hindsight, I should have highlighted this as a potential sticking point in my inital submission. The single file which is dragging down the average coverage contains mx_download(), which takes the dataframe of records identified by the user as relevant and downloads a PDF for each one. In fact, this function has a test suite (see https://github.com/mcguinlu/medrxivr/blob/master/tests/testthat/test-download.R), but because it manipulates files and folders on a users machine, I have these tests set to skip_on_CRAN, which in turn means that covr doesn't pick them up. This is something I was hoping to get feedback on during the review process, as I am not sure if this is best practice or what a viable alternative would be?

ml6: styler I've added this as an issue (https://github.com/mcguinlu/medrxivr/issues/5) and plan to address it soon.

Hopefully this addresses your inital concerns, but please do let me know if anything is unclear, if my responses are insufficient, or if you need further details!

maurolepore commented 4 years ago

Thanks @mcguinlu! I think {medrxivr} merits to move to the next stage. I'll now start searching for reviewers.

ml1 and ml2

Here is my conclusion. I base it on your answers above, and on this quote from rOpenSci's guidelines on overlap:

"An R package that replicates the functionality of an existing R package may be considered for inclusion in the rOpenSci suite if it significantly improves on alternatives in any repository (RO, CRAN, BioC) by being ... better in usability and performance".

I considered the packages {medrxivr}, {fulltext} and {biorxivr}. I see an overlap in what the packages aim to do but not in how they do it. This would not justify the overlap in general; but in this case I think it does.

Compared to the other packages, {medrxivr} searches locally. This ensures the results can be reproduced; and enables searching with regular expressions. The different approach to searching also means that to integrate {medrxivr} into the other packages seems challenging. This might be eventually possible, but first {medrxivr} may need to mature independently.

ml3 to ml6

@mcguinlu, please let me know or check the boxes as you address these issues.

ml7

@mcguinlu, I see the positive aspects of the "local" approach to searching that {medrxivr} implements; but I understand that {medrxivr} downloads the entire database. I worry this may not scale up. Here are some questions I have; you may discuss them directly with the reviewers:

Maybe you can avoid downloading the database and still provide flexible queries. For example, see how pkgsearch::advanced_search() does it -- apparently it uses elastic, which supports regular expressions, wildcards, fuzziness, and more). If you implement something similar you might get some guidance from @maelle; she is one of the two developers of pkgsearch::advanced_search().

maurolepore commented 4 years ago

@mcguinlu, please do this (from these guidelines):

add a rOpenSci review badge to their README, via rodev::use_review_badge(), rodev::use_review_badge(<issue_number>). Badge URL is https://badges.ropensci.org/<issue_id>_status.svg. Full link should be:

[![](https://badges.ropensci.org/<issue_id>_status.svg)](https://github.com/ropensci/software-review/issues/<issue_id>)
maurolepore commented 4 years ago

(ml8) @mcguinlu have you suggested any reviewer?

(The editor guidelines suggest you might but I fail to find them here or at https://github.com/ropensci/software-review/issues/369.)

maurolepore commented 4 years ago

Thanks @tts and @njahn82 for accepting to review this package. Your reviews are due on 2020-07-01 -- that is three weeks from today.

Let me know any questions you have.

tts commented 4 years ago

Package Review

Hi @maurolepore and @mcguinlu - here is my review. Thanks for this opportunity, and all the best for the package!

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 10


Review Comments

medRxiv has been accepting preprints for a year now. Their API does not offer any search capabilities, so clearly medrxivr has a function to fill. Regular expressions and Boolean logic are state-of-the-art ways to fine-tune search queries, so users of this package should be happy. In addition, you can download all searched and found preprints as PDF files, which is handy and helpful.

Although the target group and the goal of the package are clearly defined, it took me some time to understand the core functionality. I suppose the main reason for this is the varying terminology of data sources used in vignettes and help pages. The way I understand the logic looks like this:

medrxiv

In short, for a search target there are two options, the dataset I download myself from medRxiv, or the dataset provided by the GitHub repo. The former can be either all items or just a subset limited by date. The latter is all items. Technically speaking, my download uses the medRxiv API, but the dataset in the repo is built by scraping the medRxiv web site on a daily basis. My understanding is that the main reasons for the scraped dataset are to provide a reliable data source for those occasions when the API does not serve well or not at all, and lighten the burden of the API usage.

How long does it take to download all metadata from the API? I tested it from two physical locations with a differing bandwidth:

start_time <- Sys.time()
medrxiv_data <- mx_api_content()
end_time <- Sys.time()
end_time - start_time
Time difference of 1.434701 mins (1Gb/s line, work)
Time difference of 3.045589 mins (28Mb/s line, home)

So far this is not bad, especially if you run the function once a day. One minor thing: is there any way to gracefully stop the process if started by accident? When the RStudio's red Stop button is hit, the following error is thrown

Error in curl::curl_fetch_memory(url, handle = handle): Operation was aborted by an application callback
Request failed [ERROR]. Retrying in 1 seconds...

httr::RETRY is a new function to me. Thanks for this, I will definitely try to use it myself at some point. I wonder though if it allows a clean, user-friendly, forced exit and if yes, how should it be defined?

How rapidly can we expect medRxiv to grow? Looking back, the amount of submissions accelerated when the still very much prevailing COVID-19 pandemic began.

library(tidyverse)
library(medrxivr)

mx_data <- mx_api_content()

stats <- mx_data %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(date) %>% 
  summarise(count = n())

png(filename="medrxivstats.png", 
    units="cm", 
    width=20, 
    height=20,
    pointsize=12, 
    res=72)

qplot(x=date, y=count,
      data=stats, na.rm=TRUE,
      main="medRxiv item growth",
      xlab="Date of submission", ylab="Number of submissions")

dev.off()

medrxivstats

Search is a key component of this package, and vignettes help in building search queries. The medrxivr one shows how to use the mx_search function: either within a two-step process, or with a one-step or piped process. The examples are a little confusing though because the functions shown are not the same; the first example uses mx_api_content, the second one mx_api which does not exists. I suppose mx_api is a typo, maybe the name of a former version? The vignette building-complex-search-strategies shows several strategies to filter data, and also how to use regular expressions. Very helpful. One minor thing about this example

mx_results <- mx_search(query = "dementia",
                        NOT = "mild cognitive impairment")

The NOT argument does not match to Mild cognitive impairment which is found in one abstract, so perhaps better to use the form of [mM]ild cognitive impairment instead.

In mx_search , the data argument is important because it defines the target. Again, the example in the help file is slightly misleading because there is no mx_raw function. A former version this one too I presume?

When I ran mx_search with zero arguments, my first thought was that there are some issues with error handling. The query starts but clearly you need to include the search string too! However, after some time the error handling kicks in and correctly reminds me of the missing query argument. If I am not mistaken, the delay was caused by the latency of the default data source in the GitHub repository.

As of writing this, how long does it take to query the repo?

start_time <- Sys.time()
mx_results <- mx_search(query = "molecular")
end_time <- Sys.time()
(end_time - start_time)
Using medRxiv snapshot - 2020-06-27 06:01
Found 226 record(s) matching your search.
Time difference of 20.75107 secs

To me this is acceptable, but people of today tend to be impatient. Still, when the same search against my local copy of the medRxiv database takes only 0.5 secs, you begin to wonder which one to use. I noticed that the question of how to efficiently host and serve a dataset is something you and the editor have already discussed about. Unfortunately, I cannot give any advice, but am very much interested to learn about this topic too. I hope you will find a good solution.

Downloading PDFs works smoothly and as promised. Note: the mx_download help file example of mx_search uses a limit argument which is not defined.

The Shiny application that comes with the package is a beautiful piece of work, and the idea of delivering reproducible code is a nice one indeed. However, there are some issues with the code. Both the basic and advanced search codes throw an error when run in R.

Basic:

query <- "coronavirus"
mx_results <- mx_search(query) 

Error: $ operator is invalid for atomic vectors

Advanced:

topic1 <- c("coronavirus")
topic2 <- c("airborne")
query <- list(topic1, topic2)
mx_results <- mx_search(query, from.date =20190101, to.date =20200628, NOT = c(""), deduplicate = TRUE) 

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "list"

I was noted by @maurolepore that the package includes also a short manuscript to be submitted to Journal of Open Source Software. I found the manuscript in the inst directory, read it, and found it to be both clear and concise. Good luck!

mcguinlu commented 4 years ago

Hi @tts,

Just a short note to say thanks so much for your review. I've given it a quick skim, and it seems that everything you propose will be straightforward to implement. I'll go through your comments systematically soon, and post a response/list of changes. (@maurolepore, a process question - is it better for me to wait until the second reviewer has filed their review before beginning to make changes?)

Thanks in particular for spotting the discrepancies across the package (old function names in the examples, missing definitions for arguments, problems with the code from the app). You are correct that there is some hangover from an earlier version of the package/early versions of the package functions - I thought I had caught them all, but obviously not! When I started developing medrxivr, the medRxiv API didn't exist, meaning the data argument of mx_search() was not required. To note, this is also what's causing the reproducible code from the Shiny app to fail, as under the new version of the function, mx_search(query) is read as mx_search(data = query).

One specific thing I wanted to follow-up on was that the "Automated testing" item in the reviewer checklist is not marked as complete - did you have any specific issues with/reccomendations for this area of the pacakge?

maurolepore commented 4 years ago

@tts, thanks for your wonderful review!

@mcguinlu, RE

"Is it better for me to wait until the second reviewer has filed their review before beginning to make changes?"

Both reviewers should work on the exact same package. You may change the package in a separate branch, but please only merge it after both reviewers submitted their review.

tts commented 4 years ago

One specific thing I wanted to follow-up on was that the "Automated testing" item in the reviewer checklist is not marked as complete - did you have any specific issues with/reccomendations for this area of the pacakge?

@mcguinlu Sorry, my bad. Both devtools::check() and devtools::test() ran without errors. Checked that item in the list.

maurolepore commented 4 years ago

@njahn82, I hope you are well. Could you please update us about your review?

njahn82 commented 4 years ago

Sorry, I didn't meet my review deadline. Will submit it by Wednesday. Thanks for your patience!

On Thu, 2 Jul 2020 at 03:28, Mauro Lepore notifications@github.com wrote:

@njahn82 https://github.com/njahn82, I hope you are well. Could you please update us about your review?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/software-review/issues/380#issuecomment-652726542, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAM7YRTEKAQTRMHY73MYBBDRZPPEVANCNFSM4NMF3HMQ .

njahn82 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 5 hours


Review Comments

This is very timely package that not just reflect the increasing popularity of open access preprints in Health Sciences, but also issues around finding and searching them. Although a growing suite of scholarly search engines make medRxiv preprints available, there seems to be no standard way to retrieve data from medRxiv thoroughly and systematically. Also finding full-texts is challenging, because medRxiv preprints are not made available via PubMed Central. Similiary, Crossref metadata, medRxiv's DOI registration agency, lack links to pdf full-texts.

Before I share my code review, I want to disclose that I neither have an academic background in Health Sciences nor have I been involved in systematic reviews as a librarian. I will therefore focus on more formal aspects of the package and its design.

Overall Design

The package contains functions to retrieve metadata from medRxiv, applying complex search strategies on a metadata snapshot, and download pdf full-texts. However, the source code repository contains a considerable amount of other functionality as well, which is outside of the R directory and excluded from the package build in .Rbuildignore:

There's also a link to (daily updated) data in an external GitHub repo, https://github.com/mcguinlu/medrxivr-data/, which is used in an exported R function.

My main concern with this approach is that dependencies, which are not part of the package, are loaded, and in one case installed. The code outside of the R folder also lacks documentation using roxygen tags and tests, and there's some redundancy. I feel that R code not part of the {medrxivr} package build either needs to be factored out should be moved into the R/ directory.

In the following, I will focus on the functionality, which is part of the package build.

README

Documentation

Vignette

Functionality

Here's the checking using {polite}

polite::bow("https://www.medrxiv.org/archive", force = TRUE)
#> <polite session> https://www.medrxiv.org/archive
#>     User-agent: polite R package - https://github.com/dmi3kno/polite
#>     robots.txt: 68 rules are defined for 1 bots
#>    Crawl delay: 7 sec
#>   The path is scrapable for this user-agent

Created on 2020-07-08 by the reprex package (v0.3.0)

Here's a reprex using the vignette example, which took less than 2 second.

library(tidyverse)
library(europepmc)
ep_q <-
  c('PUBLISHER:"medRxiv" AND (mendelian* AND (randomisation OR randomization))')
epmc_l <- europepmc::epmc_search(ep_q, "raw", limit = 10000)
#> 91 records found, returning 91

my_df <-
  purrr::map_dfr(epmc_l, `[`, c("doi", "title", "abstractText"))
my_df %>%
  filter_at(vars(abstractText, title), any_vars(
    grepl(
      "[Mm]endelian(\\s)([[:graph:]]+\\s){0,4}randomi([[:alpha:]])ation",
      .
    )))
#> # A tibble: 81 x 3
#>    doi           title                         abstractText                     
#>    <chr>         <chr>                         <chr>                            
#>  1 10.1101/2020… Cardiometabolic traits, seps… Objectives: To investigate wheth…
#>  2 10.1101/2020… The relationship between gly… Aims: To investigate the relatio…
#>  3 10.1101/2020… Modifiable lifestyle factors… Aims: Assessing whether modifiab…
#>  4 10.1101/2020… Influence of blood pressure … Objectives: To determine whether…
#>  5 10.1101/2020… Increased adiposity is prote… Background Breast and prostate c…
#>  6 10.1101/2020… Examining the association be… Background: We examined associat…
#>  7 10.1101/2020… Investigating the potential … Aim: Use Mendelian randomisation…
#>  8 10.1101/2020… Unhealthy Behaviours and Par… Objective: Tobacco smoking, alco…
#>  9 10.1101/2020… Exploring the causal effect … BACKGROUND: Hearing loss has bee…
#> 10 10.1101/2020… Genetically informed precisi… Impaired lung function is associ…
#> # … with 71 more rows

Created on 2020-07-08 by the reprex package (v0.3.0)

(Disclaimer: I maintain the {europepmc} package and I am curios to learn more about potential shortcomings using Europe PMC instead of a primary literature source. Because I also find it sometimes not very helpful when reviewers point to their own work, I do not expect you to consider this :-))

Testing

I think that's it from me! Thank you for making Health Science preprints more accessible and better discoverable! Happy to help further with the process!

maurolepore commented 4 years ago

Thanks @njahn82 for your review!

@mcguinlu, please aim to address the comments of both @tts and @njahn82 within the next 2 weeks.

njahn82 commented 4 years ago

Hi @mcguinlu. Sorry, while still playing with your app, I just realised that I was wrong and nothing is installed from the Shiny app. Please ignore this bit from the review.

mcguinlu commented 4 years ago

@njahn82 Thanks a million for your detailed review! At a quick skim, everything you flag/recommend is fixable/implementable, and will definitely help to improve the functionality. I'm also looking forward to examing europepmc further - to be honest, I was not aware that medRxiv preprints were indexed in Europe PMC.

@maurolepore Just confirming that I have seen this, and so am aiming to address the comments by 23rd July (at the latest).

mcguinlu commented 4 years ago

Hi all (esp @maurolepore)

A brief message to let you know that I have most of the changes requested made, but due to external circumstances, I haven't yet finished off the small number of outstanding items. I'm now aiming to have it ready for re-review by Thursday week (6th August) at the very latest.

Very sorry for the delay, and hope this is okay!

maurolepore commented 4 years ago

That's okay. Thanks for letting me know.

mcguinlu commented 4 years ago

Overview

To start, thanks once again to everyone for your constructive comments! I've gone through all the posts above and have tried to extract all feedback here so that I can address it systematically. - please let me know if I have missed anything. Comments are divided by commenter and topic for ease of reference, and are presented in bold text, with the response immediately below.

Reviewer Acknowledgement

I have also acknowledged both @tts and @njahn82 as reviewers in the Description, as I feel like your comments have added a great deal to the functionality/user-friendliness of the package. Please check your details just to make sure I have them correct!


Some key points about new functionality:

Editor (@maurolepore)

Tasks

These have all been completed:

Discussion points


Reviewer 1 (@tts)

General comments

API

Snapshot

start_time <- Sys.time()
mx_results <- mx_search(data = mx_snapshot(), query = "molecular")
end_time <- Sys.time()
(end_time - start_time)
Using medRxiv snapshot - 2020-08-05 06:02
Found 289 record(s) matching your search.
Time difference of 1.341 secs

Vignette/README

Download

Shiny app

mx_search()
Error in mx_search() : 
  Please provide medRxiv data to search, accessed from either from either the mx_api_content(), or mx_snapshot() function.

Reviewer 2 (@njahn82)

General comments

README

Vignette

Functionality

One last reason (and one of the inital motivating factors fordeveloping the package) was that medrxivr allows you to search for/download multiple versions of the same preprint (mx_search(data, query, deduplicate = FALSE)), allowing for comparison between them. As far as I can see, this functionality is not implemented in europepmc (but please correct me if I am wrong!).

Code for comparison

```{r} # Load packages ----------------------------------------------------------- library(tidyverse) library(europepmc) library(medrxivr) # Compare total records returned ------------------------------------------ # Using europepmc gives 8994 ep_q <- c('PUBLISHER:"medRxiv"') epmc_l <- europepmc::epmc_search(ep_q, "raw", limit = 10000) pmc_all <-purrr::map_dfr(epmc_l, `[`, c("doi", "title", "abstractText")) # Using medrxivr gives 9146 mx_all <- mx_snapshot() %>% mx_search(query = "*") # Compare searches -------------------------------------------------------- pattern <- "[Mm]endelian(\\s)([[:graph:]]+\\s){0,4}[Rr]andomi([[:alpha:]])ation" # Using europepmc gives 107 records pmc_results <- pmc_all %>% filter_at(vars(abstractText, title), any_vars( grepl( pattern, . ))) # Using medrxivr gives 110 records mx_results <- mx_snapshot() %>% mx_search(query = pattern, fields = c("title","abstract","doi")) # Find records found by medrxivr but not europepmc ----------------------- '%notin%' <- Negate('%in%') # Gives 3 records discrepancy_df <- mx_results %>% filter(doi %notin% pmc_results$doi) ```

Testing


1 Harry Potter and the Sorcerer's Stone (movie), 00:28:35.

maurolepore commented 4 years ago

@mcguinlu, thanks for your hard work. See my comments below.

@tts and @njahn82 please consider @mcguinlu 's changes and respond with either your approval or a suggestion for improvement.

--

Following up previous comments (editor checks)

spelling::spell_check_package()
#>   WORD      FOUND IN
#> bioRxiv   mx_api_content.Rd:5,35
#> Harbour   mx_api_content.Rd:5,35
suppressWarnings(goodpractice::gp())
...
─────────────────────────────────────────────────────────────────
#>
#> It is good practice to
#>
#>   ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R
#>     users and developers are used it and it is easier to read your code
#>     for them if you use '<-'.
#>
#>     tests/testthat/test-helpers.R:4:7
#>     tests/testthat/test-helpers.R:20:7
#>
#>   ✖ avoid long code lines, it is bad for readability. Also, many people
#>     prefer editor windows that are about 80 characters wide. Try make
#>     your lines shorter than 80 characters
#>
#>     R/mx_crosscheck.R:24:1
#>     R/mx_download.R:7:1
#>     R/mx_search.R:37:1
#>     tests/testthat/test-export.R:9:1
#>
#> ────────────────────────────────────────────────────────────────────────────────
# > covr::package_coverage()
# medrxivr Coverage: 91.04%
# R/mx_crosscheck.R: 69.23%
# R/helpers.R: 83.33%
# R/mx_search.R: 90.27%
# R/mx_download.R: 93.85%
# R/mx_api.R: 100.00%
# R/mx_export.R: 100.00%
# R/mx_info.R: 100.00%
# R/mx_snapshot.R: 100.00%
usethis::use_tidy_style()
#> ✔ Setting active project to '/home/mauro/git/medrxivr'
#> Styling  17  files:
#>  R/helpers.R                      ℹ
#>  R/medrxivr.R                     ✔
#>  R/mx_api.R                       ✔
#>  R/mx_crosscheck.R                ✔
#>  R/mx_download.R                  ✔
#>  R/mx_export.R                    ✔
#>  R/mx_info.R                      ✔
#>  R/mx_search.R                    ✔
#>  R/mx_snapshot.R                  ✔
#>  tests/testthat.R                 ✔
#>  tests/testthat/test-api.R        ✔
#>  tests/testthat/test-crosscheck.R ✔
#>  tests/testthat/test-download.R   ✔
#>  tests/testthat/test-export.R     ✔
#>  tests/testthat/test-helpers.R    ℹ
#>  tests/testthat/test-info.R       ✔
#>  tests/testthat/test-search.R     ✔
#> ────────────────────────────────────────
#> Status   Count   Legend
#> ✔    15  File unchanged.
#> ℹ    2   File changed.
#> ✖    0   Styling threw an error.
#> ────────────────────────────────────────
#> Please review the changes carefully!
#>
#> ✔ Styled project according to the tidyverse style guide

New comments (suggestions)

tts commented 4 years ago

Hi @mcguinlu and thanks for your efforts! Below, I'll use your headings, and give my remarks to each of them.

General comments. In addition to adding your suggested diagram, I have tried to make the language used across the documentation more consistent, but please do point out anything that could be clearer!

Great, much better now. I cannot find anything more to complain :)

API. having to hit "Esc"/click "Stop" multiple times in order to actually get mx_api_content() to stop. Is this right?

Yes, that's what I mean, and I understand your explanation. Several clicks do terminate the download, so I find this sufficient.

Snapshot. Trying mx_search(query="molecular")

Time difference of 6.319079 secs for me, which is not bad.

Vignette/README

Ok now.

Download

Ok.

Shiny app mx_search()

Reproducible code works now fine, and a missing data | query argument is caught right away. Good!

One new comment

mx_info(commit = "master") Error in mx_info(commit = "master") : could not find function "mx_info"

Except this comment, I give my approval.

mcguinlu commented 4 years ago

Thanks for the further feedback both (and Happy Friday)! Please find my responses to your comments below:

Editor (@maurolepore)

medrxivr Coverage: 100.00%
R/helpers.R: 100.00%
R/mx_api.R: 100.00%
R/mx_crosscheck.R: 100.00%
R/mx_download.R: 100.00%
R/mx_export.R: 100.00%
R/mx_info.R: 100.00%
R/mx_search.R: 100.00%
R/mx_snapshot.R: 100.00%

Reviewer 1 (@tts)

Glad to hear things are a bit clearer now!

The reason mx_info() is not found is that it is an internal function (medrxivr:::mx_info()) and should not have been available in the function list on the pkgdown website. I had marked several internal function with the "Internal" keyword, which should have hidden them, but it seems that pkgdown is case sensitive and the correct keyword is "internal". This has been corrected now and the internal functions now longer appear in the website's function list.

Finally, just wanted to confirm that your details in the DESCRIPTION are correct?

tts commented 4 years ago

@mcguinlu Yes, my details in DESCRIPTION are correct.

maurolepore commented 4 years ago

@mcguinlu, just to let you know that I believe @njahn82 will respond to your changes next week.

mcguinlu commented 4 years ago

Great - thanks for letting me know!

njahn82 commented 4 years ago

Great job @mcguinlu, and thank you for the careful and thorough consideration of my review. I feel, it is clearer now what the package does and how it relates to the Shiny app and the backup/dump mechanism.

Thank you also for cross-checking with Europe PMC and demonstrating the added value of the medrxivr package.

Although all my suggestions have been addressed, I have some final suggestions

mcguinlu commented 4 years ago

Thanks @njahn82. Just to note as well that I recently moved the snapshot functionality from relying on my local Task Scheduler to working from GitHub Actions, so it should now be a lot more robust (in the past, if my local PC experienced network issues, the snapshot would not be taken).

In response to your comments:

*I wonder if the returned data frames from the mxapi() family could be also represented as tibbles?**

The package does a good job in parsing and cleaning preprint metadata. Unfortunately, I cannot find documentation or an example showcasing what is actually returned. Can you provide one reproducible example in the README and/or extend the documentation in the function docs?

In the function docs of mx_export(), it says Dataframe returned by mx_search(), but I realised that also data obtained from the mxapi family can be exported as bib file using mx_export().

@maurolepore, I have checked that these changes don't throw any new errors and that goodpractice doesn't recommend any changes. I've also re-run styler/spelling functions and commited any modications.

Hoping we are nearly there!

maurolepore commented 4 years ago

Thanks @njahn82 and @mcguinlu,

@njahn82, once again, please consider @mcguinlu 's changes and respond with either your approval or further suggestions for improvement.

njahn82 commented 4 years ago

Thank you again @mcguinlu for your careful consideration of my review! All my suggestions have been addressed.

maurolepore commented 4 years ago

Approved! Thanks @mcguinlu for submitting and @tts and @njahn82 for your reviews! :smile:

To-dos:

From https://github.com/ropensci/software-review/issues/380#issue-625738088 I see you wish to automatically submit to the Journal of Open Source Software? If so:

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

We've put together an online book with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding. Please tell us what could be improved, the corresponding repo is here.

annakrystalli commented 4 years ago

Hello @mcguinlu! I've just invited you to the @ropensci/medrxivr team! You should now be allowed to transfer the repo. Once you do, just ping me here and I'll transfer full admin rights back to you 🙂👍

mcguinlu commented 4 years ago

Hi @annakrystalli have transferred across now. @maurolepore thanks for the checklist - I will work through it over the coming day. And finally, just flagging to @stefaniebutland that I would be interested in producing a blog post for this package!

Thanks again to @tts and @njahn82 for reviewing, and @maurolepore for herding us all through the process!

annakrystalli commented 4 years ago

Thanks @mcguinlu ! Full admin rights now returned 👍

danielskatz commented 4 years ago

Has this review been completed? (I'm asking as the editor of the corresponding JOSS submission)

mcguinlu commented 4 years ago

To-dos:

  • [x] Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. Soon you will be invited to a team that should allow you to do so. You'll be made admin once you do.
  • [x] Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • [x] If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,

    • deactivate the automatic deployment you might have set up
    • remove styling tweaks from your pkgdown config but keep that config file
    • replace the whole current pkgdown website with a redirecting page
    • replace your package docs URL with https://docs.ropensci.org/package_name
    • In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
  • [x] Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).
  • [x] We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

From #380 (comment) I see you wish to automatically submit to the Journal of Open Source Software? If so:

  • [x] Activate Zenodo watching the repo
  • [x] Tag and create a release so as to create a Zenodo version and DOI
  • [x] Submit to JOSS at https://joss.theoj.org/papers/new, using the rOpenSci GitHub repo URL. When a JOSS "PRE REVIEW" issue is generated for your paper, add the comment: This package has been reviewed by rOpenSci: https://LINK.TO/THE/REVIEW/ISSUE, @ropensci/editors

Okay, I've completed all the steps now @maurolepore! Re: the JOSS review, please see @danielskatz's comment above.

The one thing I wasn't clear on was how to replace the old pkgdown website with a redirecting page - seeing as the repo has been transferred across, the old pkgdown website on GitHub Pages (https://mcguinlu.github.io/medrxivr/index.html) no longer exists (I think?!), so I wasn't clear on how to set the redirect.

maurolepore commented 4 years ago

@danielskatz, thanks for checking. Yes, as the guest editor of this submission, I confirm this review has been completed.

maurolepore commented 4 years ago

@mcguinlu,

RE:

The one thing I wasn't clear on was how to replace the old pkgdown website with a redirecting page - seeing as the repo has been transferred across, the old pkgdown website on GitHub Pages (https://mcguinlu.github.io/medrxivr/index.html) no longer exists (I think?!), so I wasn't clear on how to set the redirect.

I'm sorry this isn't clear for you or me. But as you say, the working website seems correct. I see no reason to worry.

Here are a few more comments from section 8.1.4 of https://devguide.ropensci.org/:

Please check these boxes to confirm you've done the following last steps:

--

Ping me when this is done and I'll then close this issue.

Thanks!

maurolepore commented 4 years ago

@mcguinlu, I see you already mentioned Stephanie Butland above. To comply with https://devguide.ropensci.org/editorguide.html#after-review, I also mention @ropensci/blog-editors for follow-up about your willingness to write a blog post or tech note.

Finally, please see https://devguide.ropensci.org/editorguide.html#package-promotion

mcguinlu commented 4 years ago

So in response to the last few bits:

[x] Add a CodeMeta file by running codemetar::write_codemeta() (codemetar GitHub repo)

CodeMeta file added (see here)

[x] Change any needed links, such those for CI badges

All CI badges updated to point to the ropensci endpoints (e.g see here)

[x] Re-activate CI services

Done, and have triggered a build under the new set-up to ensure everything works, which was successful.

[ ] If authors maintain a gitbook that is at least partly about their package, contact an rOpenSci staff member so they might contact the authors about transfer to the ropensci-books GitHub organisation.

Not applicable to me.

[x] Add a “peer-reviewed” topic to the repo (it seems I'm the one supposed to do this but I apparently lack the privileges to access the "topics" settings -- see if you can or let me know).

Done!

Thanks also for the additional materials re: CRAN submission (I do intend to submit to CRAN in the near future) and promotion, and for looping in@ropensci/blog-editors.

And I think that's us!

maurolepore commented 4 years ago

@mcguinlu , thanks and congratulations! To the best of my knowledge, this completes the review process so I'll close now.

--

You may already know this. To prepare packages for CRAN, usethis::use_release_issue() is useful. And here are some aspects of the workflow that might help (including a link to some tweaks). Feel free to reach out with questions.

Are you in rOpenSci's Slack workspace? If not, I recommend you find someone who can add you. I have found friendly advice there that I wouldn't find anywhere else.

stefaniebutland commented 4 years ago

Hello @mcguinlu. We'd love to have a post about medrxivr.

Our Blog Guide has most of the information you should need, with both content and technical advice. For readers, it would be helpful to highlight how this package relates to similar ones and the specific niche that medrxivr fills. Once that's clear early in the post, your readers will give their attention.

Let me know when you'd like to submit a draft and I can suggest a publication date.

stefaniebutland commented 4 years ago

@mcguinlu Also let me know if you'd like a new invitation to rOpenSci Slack. We could move this discussion there for example.

mcguinlu commented 4 years ago

@stefaniebutland a new invite would be great! I thought I had activated the first one correctly but apparently not (I am still getting to grips with Slack) 🤦‍♂️ and happy to continue chatting about this there.