Closed jglev closed 7 years ago
Thank you for the feedback (and in advance for whatever additional comments are forthcoming)!
As of 8d32f2f3dd, I've switched to using f-strings, which I hadn't heard about (thank you for that, as well). I'll be making further commits and comments above to track my progress tomorrow.
@dhimmel, to confirm, is it ok to include the state-of-oa-dois dataset here? Do you have rights to release it CC0?
Certainly, the source Zenodo release is CC0. Now I don't think we should necessarily track state-of-oa-dois.tsv.xz. Instead, we can read it from its versioned URL:
https://github.com/greenelab/scihub/raw/4172526ac7433357b31790578ad6f59948b6db26/data/state-of-oa-dois.tsv.xz
In the above URL, you can switch between raw and blob (for webview).
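Reading the file from that pinned URL could look roughly like the sketch below. It uses only the standard library and assumes the file is a UTF-8, tab-separated table with a header row, which the thread does not state. The helper names `to_blob_url`, `parse_xz_tsv`, and `read_rows` are hypothetical and not taken from the PR.

```python
import csv
import io
import itertools
import lzma
import urllib.request

# Versioned URL quoted in the comment above.
RAW_URL = (
    "https://github.com/greenelab/scihub/raw/"
    "4172526ac7433357b31790578ad6f59948b6db26/data/state-of-oa-dois.tsv.xz"
)


def to_blob_url(raw_url):
    """Swap the first '/raw/' path segment for '/blob/' (the web view)."""
    return raw_url.replace("/raw/", "/blob/", 1)


def parse_xz_tsv(compressed, limit=None):
    """Decompress xz bytes and parse them as a TSV with a header row.

    Assumes UTF-8 text; returns at most `limit` rows (all rows if None).
    """
    text = lzma.decompress(compressed).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(itertools.islice(reader, limit))


def read_rows(url=RAW_URL, limit=None):
    """Download the compressed file from `url` and parse it."""
    with urllib.request.urlopen(url) as response:
        return parse_xz_tsv(response.read(), limit=limit)
```

Pinning the commit hash in the URL means every run reads the same bytes, so the file does not need to be tracked in the repository.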
As @dhimmel suggested here, I stopped tracking the database in ab6541ba50.
In f1d2d67, I added a new config variable to slice the list of DOIs. Previously, it was hard-coded to take just the first 10 (as a test set). With the variable (which, when set to None, downloads everything), the PR can now be used without going back and changing that slice in the code itself.
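The None behaviour falls out of ordinary Python slicing, since `some_list[:None]` returns the whole list. A minimal sketch, assuming a hypothetical setting called `max_dois` (the thread does not give the variable's real name):

```python
def select_dois(dois, max_dois=None):
    """Return the first `max_dois` DOIs, or every DOI when max_dois is None.

    `max_dois` is a hypothetical name standing in for the PR's config variable.
    """
    # Slicing with a stop of None is the same as omitting the stop.
    return dois[:max_dois]


# Illustrative placeholder values, not real DOIs from the dataset.
example_dois = ["10.1000/a", "10.1000/b", "10.1000/c"]
test_subset = select_dois(example_dois, max_dois=2)
full_run = select_dois(example_dois, max_dois=None)
```

With this shape, switching from a small test run to the full download is a one-line change in the configuration file.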
@publicus cool. Let me know when it's ready for my review, i.e. no more planned commits.
I've gone through the PR discussion above, and think that all of your comments to date have been answered at this point. Does that look right to you, too?
Thank you for all of your comments and review! I'll get this running today!
Make sure you start your subsequent work on top of the current greenelab:master. You need to be building off of the commit we just created by the squash merge.
Will do!
This is a work-in-progress PR. The code is all functioning, but, on @dhimmel's suggestion, I'd appreciate ongoing feedback, given the Lab's expectations for collaborators.
As soon as I get an in-progress thumbs-up on this, I'll start the UPenn downloader running on the dataset's DOIs.
What this PR currently includes
… conda env export, which doesn't specify them except at the top.
… r-base in this list. It isn't used yet, but is used in the Bayesian script I've written (and will add in a future commit or PR, once the downloader is underway).
… tsv dataset into the SQLite database.
… rerun_dois_that_are_already_in_database in the configuration file.
… state-of-oa-dois dataset from @dhimmel in this comment.
An example SQLite database resulting from running 10 of the closed-type DOIs from the dataset.
Here are three queries to quickly explore this SQLite database, using sqliteman or whatever else:
One important note about this dataset: It only indicates that full-text is available if it is digitally available. Put differently, it does not reflect whether a patron could physically go to a library shelf and get a hard copy of an article. This matches @dhimmel's original specification when we talked about it, but is nonetheless something I want to be clear about.
Things that still need to be done, following the Lab's checklist