ropensci / software-review

rOpenSci Software Peer Review.

Presubmission inquiry: medrxivr #369

Closed mcguinlu closed 4 years ago

mcguinlu commented 4 years ago

Submitting Author: Luke McGuinness (@mcguinlu)
Repository: https://github.com/mcguinlu/medrxivr


Package: medrxivr
Title: Access MedRxiv Data In R
Version: 0.0.1.900
Authors@R: c(
    person("Luke", "McGuinness",
           role = c("aut", "cre"),
           email = "luke.mcguinness@bristol.ac.uk",
           comment = c(ORCID = "0000-0001-8730-9761")),
    person("Lena", "Schmidt",
           role = "aut",
           comment = c(ORCID = "0000-0003-0709-8226")))
Description: medRxiv (https://www.medrxiv.org/) is a free online archive and 
    distribution server for complete but unpublished manuscripts (preprints) in
    the medical, clinical, and related health sciences. medrxivr provides 
    programmatic access to a snapshot of the preprints contained on medRxiv,
    which is updated daily. Users can search for relevant records using regular
    expressions and Boolean logic, and can easily download the full-text PDFs
    of preprints matching their search criteria. 
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Language: en-US
URL: https://github.com/mcguinlu/medrxivr
BugReports: https://github.com/mcguinlu/medrxivr/issues
Imports: 
    rvest,
    methods,
    dplyr,
    magrittr,
    stringr,
    xml2
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    covr,
    kableExtra
VignetteBuilder: 
    knitr, 
    rmarkdown    
RoxygenNote: 7.0.1
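
To illustrate the kind of search the Description mentions (regular expressions combined with Boolean logic), here is a minimal base R sketch. It is not medrxivr's actual interface: the toy `snapshot` data frame, its column names, and the helper `matches_any()` are all assumptions made for illustration.

```r
# Toy stand-in for a snapshot of preprint records.
# The column names (title, abstract) are assumptions, not the real schema.
snapshot <- data.frame(
  title = c("Dementia risk and statin use",
            "Lipid levels in young adults",
            "Vaccine uptake in care homes"),
  abstract = c("We examine statins and Alzheimer's disease.",
               "A cohort study of cholesterol.",
               "Survey of dementia care staff."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each record where at least one regular
# expression in `terms` matches the title or abstract (terms joined with OR).
matches_any <- function(data, terms) {
  text <- paste(data$title, data$abstract)
  pattern <- paste(terms, collapse = "|")
  grepl(pattern, text, ignore.case = TRUE)
}

topic1 <- c("dementia", "alzheimer")
topic2 <- c("statins?", "lipid")

# Terms within a topic are combined with OR; the two topics with AND.
hits <- snapshot[matches_any(snapshot, topic1) & matches_any(snapshot, topic2), ]
print(hits$title)
```

Only records that match at least one term from each topic are kept, which mirrors the usual "(A OR B) AND (C OR D)" structure of a literature search.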

Scope

[Tagging my co-author @L-ENA for reference]

annakrystalli commented 4 years ago

Hello @mcguinlu and many thanks for your presubmission enquiry!

In general, we would consider the package in scope. However, there are a couple of questions we'd like a bit more info on:

  1. Where is the data snapshot stored?
  2. Are the medRxiv website maintainers aware of the package's use of their data? Especially given that the data snapshot relies on web scraping and seems undocumented. Before we can accept the package, we would need confirmation that they are happy for it to use their data in this way.

Here are some initial comments too:

  1. Given the package accesses a static snapshot of the data, the README is a bit misleading since it doesn't mention that at all.
  2. We recommend you make contact with the fulltext authors to assess potential compatibility.

I look forward to some clarification in response to our queries.

mcguinlu commented 4 years ago

Hi @annakrystalli, thanks for getting back to me, and for your initial comments!

Great to hear that the package is provisionally in scope - I've provided responses to the four queries you raised below:

  1. The snapshot is currently stored as a CSV file in the same repository as the webscraping script (see here). I appreciate this is a basic solution, but this is one of the areas I was hoping to solicit reviewer feedback on, as I have never had to host a dataset before and am unsure what the best solution is.

  2. As far as I know, the medRxiv maintainers are not currently aware of this package, though I have now sent an email querying whether they are happy for me to provide access to the data in this way. I really hope they are, as the original arXiv repository takes a strong view that making the data available programmatically is a cornerstone of open access publishing. I will let you know when I hear back.

  3. I completely agree with this comment - the original README introduction was written before I had hammered out the functionality. I've gone through the package (README, documentation, and function messages) to ensure that it is clear medrxivr provides access to a static snapshot rather than dynamic access to the repository.

  4. I have opened an issue on the fulltext repository to begin a discussion about the possibility of incorporating medrxivr as an additional data source.

Thanks again for your initial feedback, and do let me know if I can provide any further clarifications!

annakrystalli commented 4 years ago

Thanks for the clarifications @mcguinlu !

> As far as I know, the medRxiv maintainers are not currently aware of this package, though I have now sent an email querying whether they are happy for me to provide access to the data in this way. I really hope they are, as the original arXiv repository takes a strong view that making the data available programmatically is a cornerstone of open access publishing. I will let you know when I hear back.

Great, and yes, I imagine they will be fine with it. They may even be able to provide better access to the data than crawling their site.

> The snapshot is currently stored as a CSV file in the same repository as the webscraping script (see here). I appreciate this is a basic solution, but this is one of the areas I was hoping to solicit reviewer feedback on, as I have never had to host a dataset before and am unsure what the best solution is.

Depending on the size of the files, I wonder if using pins might be a good solution for this problem. You could also consider getting feedback by posting on rOpenSci Discuss. It's quite an interesting topic.

> I completely agree with this comment - the original README introduction was written before I had hammered out the functionality. I've gone through the package (README, documentation, and function messages) to ensure that it is clear medrxivr provides access to a static snapshot rather than dynamic access to the repository.

👍

> I have opened an issue on the fulltext repository to begin a discussion about the possibility of incorporating medrxivr as an additional data source.

I saw that. Nice work! Will look out for any feedback.

We're pretty much satisfied with your responses to our comments at this stage, so let's see what feedback comes back from your enquiries.

annakrystalli commented 4 years ago

Hey @mcguinlu,

Was just wondering whether there were any updates on your contact with the medRxiv maintainers? I'm going to close the issue after that.

Also, regarding your data caching challenge, you might like to have a look at how the developers of ramlegacy approached it.
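
As a rough illustration of one way to handle the caching question, here is a base R sketch that downloads a snapshot file once and reuses the local copy until it is older than a chosen age. It is not taken from ramlegacy or medrxivr: the helper name `cached_snapshot()`, its arguments, and the one-day default are assumptions made for illustration.

```r
# Hypothetical helper: return the path to a locally cached copy of a snapshot
# file, downloading it again only when the cached copy is missing or too old.
cached_snapshot <- function(url,
                            cache_dir = file.path(tempdir(), "snapshot-cache"),
                            max_age_days = 1) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))

  # A cached file counts as fresh if it exists and is recent enough.
  is_fresh <- file.exists(dest) &&
    as.numeric(difftime(Sys.time(), file.mtime(dest), units = "days")) <= max_age_days

  if (!is_fresh) {
    utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
  dest
}

# Offline demonstration: a local CSV served through a file:// URL stands in
# for the hosted snapshot, so no network access is needed.
src <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, title = c("a", "b")), src, row.names = FALSE)
demo_url <- paste0("file://", normalizePath(src))
demo_cache <- tempfile("cache")

first_path <- cached_snapshot(demo_url, cache_dir = demo_cache)
snapshot_data <- read.csv(first_path)

# The source is removed; the second call is served from the cache.
unlink(src)
second_path <- cached_snapshot(demo_url, cache_dir = demo_cache)
```

The default cache location here is inside `tempdir()`, so it only lasts for the R session. For a cache that persists between sessions, one option is a directory returned by `tools::R_user_dir()` (available from R 4.0.0).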

noamross commented 4 years ago

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️