Closed mcguinlu closed 4 years ago
Hello @mcguinlu and many thanks for your presubmission enquiry!
In general, we would consider the package in scope. However, there are a couple of questions we'd like a bit more info on:
Here are some initial comments too:
fulltext authors to assess potential compatibility
I look forward to some clarification in response to our queries.
Hi @annakrystalli, thanks for getting back to me, and for your initial comments!
Great to hear that the package is provisionally in scope - I've provided responses to the four queries you raised below:
The snapshot is currently stored as a CSV file in the same repository as the webscraping script (see here). I appreciate this is a basic solution, but this is one of the areas I was hoping to solicit reviewer feedback on, as I have never had to host a dataset before and am unsure what the best solution is.
As far as I know, the medRxiv maintainers are not currently aware of this package, though I have now sent an email querying whether they are happy for me to provide access to the data in this way. I really hope they are, as the original arXiv repository takes a strong view that making the data available programmatically is a cornerstone of open access publishing. I will let you know when I hear back.
I completely agree with this comment - the original README introduction was written before I had hammered out the functionality. I've gone through the package (README, documentation, and function messages) to ensure that it is clear medrxivr provides access to a static snapshot rather than dynamic access to the repository.
I have opened an issue on the fulltext repository to begin a discussion about the possibility of incorporating medrxivr as an additional data source.
Thanks again for your initial feedback, and do let me know if I can provide any further clarifications!
Thanks for the clarifications @mcguinlu !
As far as I know, the medRxiv maintainers are not currently aware of this package, though I have now sent an email querying whether they are happy for me to provide access to the data in this way. I really hope they are, as the original arXiv repository takes a strong view that making the data available programmatically is a cornerstone of open access publishing. I will let you know when I hear back.
Great, and yes, I imagine they will be fine with it. They may even be able to provide better access to the data than crawling their site.
The snapshot is currently stored as a CSV file in the same repository as the webscraping script (see here). I appreciate this is a basic solution, but this is one of the areas I was hoping to solicit reviewer feedback on, as I have never had to host a dataset before and am unsure what the best solution is.
Depending on the size of the files, I wonder if using pins might be a good solution for this problem. You could also consider getting feedback by posting on rOpenSci Discuss. It's quite an interesting topic.
I completely agree with this comment - the original README introduction was written before I had hammered out the functionality. I've gone through the package (README, documentation, and function messages) to ensure that it is clear medrxivr provides access to a static snapshot rather than dynamic access to the repository.
👍
I have opened an issue on the fulltext repository to begin a discussion about the possibility of incorporating medrxivr as an additional data source.
I saw that. Nice work! Will look out for any feedback.
We're pretty much satisfied with the responses to our comments at this stage, so let's see what feedback comes from your enquiries.
Hey @mcguinlu,
I was just wondering whether there were any updates on your contact with the medRxiv maintainers? I'll close the issue once we've heard about that.
Also, regarding your data caching challenge, you might like to have a look at how the developers of ramlegacy approached it.
⚠️⚠️⚠️⚠️⚠️
In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.
In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.
Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.
The rOpenSci Editorial Board
⚠️⚠️⚠️⚠️⚠️
Submitting Author: Luke McGuinness (@mcguinlu)
Repository: https://github.com/mcguinlu/medrxivr
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below.)
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:
medrxivr allows users to programmatically access and manipulate a snapshot of medRxiv, a preprint repository for papers in medical, clinical, and related health sciences. The snapshot is automatically updated each morning (webscraping/cleaning script can be found here).
Who is the target audience and what are scientific applications of this package?
The primary target of this package is systematic reviewers (i.e. me!), who frequently wish both to use more complicated queries (e.g. regular expressions/Boolean combinations) when searching medRxiv than the official site currently allows for, and to easily download the full text PDFs of records matching their search. medrxivr helps with both of these challenges. However, anyone who wishes to find and retrieve relevant medRxiv records in R, for example to explore the distribution of preprints by subject area, will find the package useful.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
As far as I am aware, no other package allows users to access medRxiv data in R.
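To make the "regular expressions/Boolean combinations" use case mentioned above concrete, here is a rough, language-agnostic sketch written in Python. It is not medrxivr's API: the `search` helper, the tiny inline CSV, and the `title`/`abstract` column names are all hypothetical stand-ins for whatever the real snapshot contains.

```python
import csv
import io
import re

# Hypothetical miniature snapshot; the real snapshot's columns may differ.
SNAPSHOT_CSV = """title,abstract
Dementia risk and statins,A cohort study of lipid-lowering drugs.
Influenza vaccine uptake,Survey of vaccination in older adults.
Statin use in stroke,Randomised trial of statins after stroke.
"""


def search(records, all_of=(), none_of=()):
    """Keep records matching every pattern in all_of (AND) and no pattern
    in none_of (NOT). OR is expressed inside a pattern with regex '|'."""

    def hit(pattern, rec):
        text = f"{rec['title']} {rec['abstract']}"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None

    return [
        rec
        for rec in records
        if all(hit(p, rec) for p in all_of)
        and not any(hit(p, rec) for p in none_of)
    ]


# Load the snapshot once, then query it locally as often as needed.
records = list(csv.DictReader(io.StringIO(SNAPSHOT_CSV)))

# (statin OR statins) AND (dementia OR stroke) NOT trial
results = search(
    records,
    all_of=[r"statins?", r"dementia|stroke"],
    none_of=[r"\btrial\b"],
)
titles = [rec["title"] for rec in results]
```

Because the query runs against a local copy of the data, it can be as complex as needed without sending any requests to the medRxiv site.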
Any other questions or issues we should be aware of?:
medrxivr is not yet ready for submission - there is some additional functionality that I would like to incorporate into the package and I need to fix one or two issues with the webscraping/data cleaning script. However, I did have two questions it would be great to get feedback on:
medrxivr provides access to a static snapshot of the medRxiv data, which I also maintain, in addition to functions to search for and download relevant full text records. The reason this approach was taken, rather than providing tools for users to search the site dynamically, is that the robots.txt on medRxiv forbids the scraping of the search/ path. In this case, providing functions that will query this path multiple times seemed like a bad idea. However, I want to check that my setup - a package providing access to a snapshot of the data, both of which are maintained by the same person - is an acceptable setup for an rOpenSci package?
medrxivr could be a useful addition to fulltext. Is it worth contacting the maintainers of this package about integration before or after submitting medrxivr to the peer review process?
[Tagging my co-author @L-ENA for reference]
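For anyone wanting to check the robots.txt concern raised in the first question, the sketch below shows how such a rule can be tested programmatically with Python's standard-library `urllib.robotparser`. The rules in `EXAMPLE_ROBOTS_TXT` are an assumption made up for illustration (they are not copied from medRxiv's actual file), and the `allowed` helper, the agent name, and the example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; NOT copied from medRxiv's real robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

# Parse the rules from a string, so no network request is made.
parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())


def allowed(path, agent="example-bot"):
    """Return True if the parsed rules permit `agent` to fetch `path`."""
    return parser.can_fetch(agent, "https://www.medrxiv.org" + path)


search_allowed = allowed("/search/dementia")        # under the disallowed prefix
content_allowed = allowed("/content/early/recent")  # not covered by any rule
```

Under these assumed rules, anything below the search path is refused while other paths remain fetchable, which is the situation that motivates serving a snapshot instead of querying the site's search directly.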