Presubmission inquiry: medrxivr

Submitting Author: Luke McGuinness (@mcguinlu)
Repository: https://github.com/mcguinlu/medrxivr

Paste the full DESCRIPTION file inside a code block below:

Package: medrxivr
Title: Access MedRxiv Data In R
Version: 0.0.1.900
Authors@R: c(
    person("Luke", "McGuinness",
           role = c("aut", "cre"),
           email = "luke.mcguinness@bristol.ac.uk",
           comment = c(ORCID = "0000-0001-8730-9761")),
    person("Lena", "Schmidt",
           role = "aut",
           comment = c(ORCID = "0000-0003-0709-8226")))
Description: medRxiv (https://www.medrxiv.org/) is a free online archive and 
    distribution server for complete but unpublished manuscripts (preprints) in
    the medical, clinical, and related health sciences. medrxivr provides 
    programmatic access to a snapshot of the preprints contained on medRxiv,
    which is updated daily. Users can search for relevant records using regular
    expressions and Boolean logic, and can easily download the full-text PDFs
    of preprints matching their search criteria. 
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Language: en-US
URL: https://github.com/mcguinlu/medrxivr
BugReports: https://github.com/mcguinlu/medrxivr/issues
Imports: 
    rvest,
    methods,
    dplyr,
    magrittr,
    stringr,
    xml2
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    covr,
    kableExtra
VignetteBuilder: 
    knitr, 
    rmarkdown    
RoxygenNote: 7.0.1

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below.)
- [x] data retrieval
- [ ] data extraction
- [ ] database access
- [ ] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [x] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of: medrxivr allows users to programmatically access and manipulate a snapshot of medRxiv, a preprint repository for papers in the medical, clinical, and related health sciences. The snapshot is automatically updated each morning (webscraping/cleaning script can be found here).
Who is the target audience and what are scientific applications of this package? The primary target of this package is systematic reviewers (i.e. me!), who frequently wish to use more complicated queries (e.g. regular expressions/Boolean combinations) when searching medRxiv than the official site currently allows, and who also wish to easily download the full-text PDFs of records matching their search. medrxivr helps with both of these challenges. However, anyone who wishes to find and retrieve relevant medRxiv records in R, for example to explore the distribution of preprints by subject area, will find the package useful.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? As far as I am aware, no other package allows users to access medRxiv data in R.
Any other questions or issues we should be aware of?: medrxivr is not yet ready for submission - there is some additional functionality that I would like to incorporate into the package and I need to fix one or two issues with the webscraping/data cleaning script. However, I did have two questions it would be great to get feedback on:
- Q1: medrxivr provides access to a static snapshot of the medRxiv data, which I also maintain, in addition to functions to search for and download relevant full text records. The reason this approach was taken, rather than providing tools for users to search the site dynamically, is that the robots.txt on medRxiv forbids the scraping of the search/ path. In this case, providing functions that will query this path multiple times seemed like a bad idea. However, I want to check that my setup - a package providing access to a snapshot of the data, both of which are maintained by the same person - is an acceptable setup for an rOpenSci package?
- Q2: In terms of integration with other rOpenSci packages, it seems like medrxivr could be a useful addition to fulltext. Is it worth contacting the maintainers of this package about integration before or after submitting medrxivr to the peer review process?

[Tagging my co-author @L-ENA for reference]
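
To illustrate the kind of search described under the target-audience question above (regular expressions combined with Boolean logic over a local snapshot), here is a minimal base-R sketch. It is not medrxivr's actual API: the `snapshot` data frame, its column names, and the `search_snapshot()` helper with its `must_match` and `match_any` arguments are all hypothetical stand-ins.

```r
# Hypothetical stand-in for the snapshot; the real column names may differ.
snapshot <- data.frame(
  title = c(
    "Dementia risk and statin use: a cohort study",
    "Lipid-lowering therapy in older adults",
    "Transmission dynamics of a novel coronavirus"
  ),
  abstract = c(
    "We examine whether statins alter the risk of Alzheimer's disease.",
    "Trial of ezetimibe in adults aged over 70.",
    "We estimate the basic reproduction number from early case data."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose title/abstract match every pattern in
# `must_match` (AND) and at least one pattern in `match_any` (OR).
search_snapshot <- function(data, must_match = character(), match_any = character()) {
  text <- paste(data$title, data$abstract)
  matches <- function(pattern) grepl(pattern, text, ignore.case = TRUE, perl = TRUE)

  keep <- rep(TRUE, nrow(data))
  for (pattern in must_match) {
    keep <- keep & matches(pattern)
  }
  if (length(match_any) > 0) {
    keep <- keep & Reduce(`|`, lapply(match_any, matches))
  }
  data[keep, , drop = FALSE]
}

# (dementia OR alzheimer) AND (statin/statins OR lipid-lowering)
results <- search_snapshot(
  snapshot,
  must_match = "dementia|alzheimer",
  match_any = c("statins?\\b", "lipid[- ]lowering")
)
print(results$title)
```

Because the search runs against a local data frame, it never queries the medRxiv search/ path, which is the point of the snapshot approach raised in Q1.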

Hello @mcguinlu and many thanks for your presubmission enquiry!

In general, we would consider the package in scope. However, there are a couple of questions we'd like a bit more info on:

Where is the data snapshot stored?
Are the medRxiv website maintainers aware of the package's use of their data? This matters especially because the data snapshot relies on web scraping and seems undocumented. Before we can accept the package, we would need confirmation that they are happy for it to use their data in this way.

Here are some initial comments too:

1. Given the package accesses a static snapshot of the data, the README is a bit misleading since it doesn't mention that at all.
2. We recommend you make contact with the fulltext authors to assess potential compatibility.

I look forward to some clarification in response to our queries.

Hi @annakrystalli, thanks for getting back to me, and for your initial comments!

Great to hear that the package is provisionally in scope - I've provided responses to the four queries you raised below:

The snapshot is currently stored as a CSV file in the same repository as the webscraping script (see here). I appreciate this is a basic solution, but this is one of the areas I was hoping to solicit reviewer feedback on, as I have never had to host a dataset before and am unsure what the best solution is.
As far as I know, the medRxiv maintainers are not currently aware of this package, though I have now sent an email querying whether they are happy for me to provide access to the data in this way. I really hope they are, as the original arXiv repository takes a strong view that making the data available programmatically is a cornerstone of open access publishing. I will let you know when I hear back.
I completely agree with this comment - the original README introduction was written before I had hammered out the functionality. I've gone through the package (README, documentation, and function messages) to ensure that it is clear medrxivr provides access to a static snapshot rather than dynamic access to the repository.
I have opened an issue on the fulltext repository to begin a discussion about the possibility of incorporating medrxivr as an additional data source.

Thanks again for your initial feedback, and do let me know if I can provide any further clarifications!

Thanks for the clarifications @mcguinlu !

As far as I know, the medRxiv maintainers are not currently aware of this package, though I have now sent an email querying whether they are happy for me to provide access to the data in this way. I really hope they are, as the original arXiv repository takes a strong view that making the data available programmatically is a cornerstone of open access publishing. I will let you know when I hear back.

Great, and yes, I imagine they will be fine with it. They may even be able to provide better access to the data than crawling their site.

The snapshot is currently stored as a CSV file in the same repository as the webscraping script (see here). I appreciate this is a basic solution, but this is one of the areas I was hoping to solicit reviewer feedback on, as I have never had to host a dataset before and am unsure what the best solution is.

Depending on the size of the files, I wonder if using pins might be a good solution for this problem. You could also consider getting feedback by posting on rOpenSci Discuss. It's quite an interesting topic.
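
As a rough illustration of the caching idea (and not of the pins API itself), the sketch below uses only base R to keep a local copy of a CSV snapshot and re-download it once it is older than a day. The helper name `read_snapshot_cached()`, its arguments, the file name, and the example URL are all hypothetical; the `download` argument exists only so the demo can run offline with a fake downloader.

```r
# Hypothetical helper: read a CSV snapshot, downloading it only when the
# cached copy is missing or older than `max_age_hours`.
read_snapshot_cached <- function(url,
                                 cache_dir = tempdir(),
                                 max_age_hours = 24,
                                 download = utils::download.file) {
  path <- file.path(cache_dir, "medrxiv_snapshot.csv")

  age_hours <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "hours"))
  stale <- !file.exists(path) || age_hours > max_age_hours
  if (stale) {
    download(url, path, quiet = TRUE)
  }
  utils::read.csv(path, stringsAsFactors = FALSE)
}

# Offline demo: a fake downloader that writes a tiny CSV and counts its calls.
calls <- 0
fake_download <- function(url, destfile, quiet = TRUE) {
  calls <<- calls + 1
  utils::write.csv(
    data.frame(id = 1:2, title = c("A", "B")),
    destfile,
    row.names = FALSE
  )
  invisible(0)
}

cache <- tempfile("snapshot-cache-")
dir.create(cache)

# The URL is a placeholder; the fake downloader never contacts it.
first <- read_snapshot_cached("https://example.org/snapshot.csv", cache,
                              download = fake_download)
second <- read_snapshot_cached("https://example.org/snapshot.csv", cache,
                               download = fake_download)
print(calls)
```

The second call reads the cached file instead of downloading again, which matches a snapshot that only changes once a day.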

I completely agree with this comment - the original README introduction was written before I had hammered out the functionality. I've gone through the package (README, documentation, and function messages) to ensure that it is clear medrxivr provides access to a static snapshot rather than dynamic access to the repository. 👍

I have opened an issue on the fulltext repository to begin a discussion about the possibility of incorporating medrxivr as an additional data source.

I saw that. Nice work! Will look out for any feedback.

We're pretty much satisfied with the responses to our comments at this stage, so let's see what the feedback is from your enquiries.

Hey @mcguinlu,

Was just wondering whether there were any updates on your contact with the medRxiv maintainers? I'm going to close the issue after that.

Also, regarding your data caching challenge, you might like to have a look at how the developers of ramlegacy approached it.

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

ropensci / software-review

Presubmission inquiry: medrxivr #369
