greenelab / library-access

Collecting data on library access to scholarly literature
5 stars · 3 forks

Adding api query script and example database from 10 DOIs #7

Closed. jglev closed this pull request 6 years ago.

jglev commented 6 years ago

This is a work-in-progress PR. The code is all functioning, but, on @dhimmel's suggestion, I'd appreciate ongoing feedback, given the Lab's expectations for collaborators.

As soon as I get an in-progress thumbs-up on this, I'll start the UPenn downloader running on the dataset's DOIs.

What this PR currently includes

  1. Per @dhimmel in a comment, I've included and manually edited the environment YAML file exported by Conda. I haven't pinned channels for individual packages, because I pared down the output of conda env export, which only lists channels at the top of the file.
    1. I've included r-base in this list. It isn't used yet, but will be used by the Bayesian script I've written (which I'll add in a future commit or PR, once the downloader is underway).
  2. Working code, written in Python 3, for doing the following:
    1. Querying an OpenURL resolver for a given DOI.
      1. The URL and static parameters are easily configurable for each institution's unique OpenURL resolver.
      2. Similarly, the function for querying the API is split into its own file and function, allowing it to be more easily modified for a given institution's specific needs.
    2. Parsing the response XML for a Yes/No indication of whether full-text access is available for that DOI.
      1. As above, the function for parsing the API response is split into its own file and function, allowing it to be more easily modified for a given institution's specific needs.
    3. Saving the response and the Yes/No indication into an SQLite database.
  3. I expanded on the database model suggested by @dhimmel by creating a separate ID table for DOIs. This will make it easier for us (or others) to make the following two changes, if we ever want to:
    1. Import the original tsv dataset into the SQLite database.
    2. Rerun the same DOI more than once over time -- this is currently implemented (though we likely won't measure each DOI beyond one timepoint) and controlled by the variable rerun_dois_that_are_already_in_database in the configuration file.
  4. Everything is, I think, PEP8 compliant (it is according to the linter built into Spyder).
  5. The state-of-oa-dois dataset from @dhimmel in this comment.
  6. An example SQLite database resulting from running 10 of the closed-type DOIs from the dataset.

    1. Here are three queries to quickly explore this SQLite database, using sqliteman or whatever else:

      SELECT * FROM dois_table;  -- View the entire DOI table
      
      SELECT * FROM library_holdings_data;  -- View the entire DOI XML and full-text indicator table
      
      /* View the above tables, joined together */
      SELECT  
          dois_table.doi,  
          library_holdings_data.timestamp,
          library_holdings_data.xml_response,
          library_holdings_data.full_text_indicator
      FROM library_holdings_data
      JOIN dois_table
      ON library_holdings_data.doi_foreign_key = dois_table.database_id
    2. One important note about this dataset: It only indicates that full-text is available if it is digitally available. Put differently, it does not reflect whether a patron could physically go to a library shelf and get a hard copy of an article. This matches @dhimmel's original specification when we talked about it, but is nonetheless something I want to be clear about.
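The two-table layout described above can be sketched with Python's built-in sqlite3 module. The column names below are taken from the example queries; the types, constraints, and inserted values are my own illustrative guesses, not the PR's actual schema:

```python
import sqlite3

# In-memory database standing in for the PR's SQLite file.
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

cursor.execute("""
    CREATE TABLE dois_table (
        database_id INTEGER PRIMARY KEY,
        doi TEXT UNIQUE
    )""")
cursor.execute("""
    CREATE TABLE library_holdings_data (
        doi_foreign_key INTEGER REFERENCES dois_table (database_id),
        timestamp TEXT,
        xml_response TEXT,
        full_text_indicator INTEGER
    )""")

# Insert one DOI and one hypothetical API response for it.
cursor.execute("INSERT INTO dois_table (doi) VALUES (?)",
               ('10.1000/example',))
doi_id = cursor.lastrowid
cursor.execute(
    "INSERT INTO library_holdings_data VALUES (?, ?, ?, ?)",
    (doi_id, '2017-11-01T00:00:00', '<response>...</response>', 0))

# The join query from the PR description:
rows = cursor.execute("""
    SELECT
        dois_table.doi,
        library_holdings_data.timestamp,
        library_holdings_data.xml_response,
        library_holdings_data.full_text_indicator
    FROM library_holdings_data
    JOIN dois_table
    ON library_holdings_data.doi_foreign_key = dois_table.database_id
    """).fetchall()
print(rows)
```

The separate ID table means each (possibly long) DOI string is stored once, and repeated queries of the same DOI only add rows to library_holdings_data.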

Things that still need to be done, following the Lab's checklist

  1. Add metadata
    1. I haven't yet signed the files, because I'm unsure whether there's a standard way to do that in the Lab.
    2. I haven't yet added a commented docstring to each of the files.
    3. For the smaller functions, I haven't yet added a docstring.
  2. Probably more, based on feedback here.
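For the docstring items above, a minimal sketch of what module- and function-level docstrings might look like. The module purpose and the function below are hypothetical stand-ins for illustration, not the PR's actual code:

```python
"""Query an institution's OpenURL resolver for full-text access data.

(Hypothetical module docstring, illustrating the metadata step above.)
"""


def full_text_indicator_from_response(xml_string):
    """Return True if the XML response indicates full-text access.

    This is an illustrative stub, not the PR's actual parsing logic.
    """
    return '<full_text>yes</full_text>' in xml_string.lower()
```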
jglev commented 6 years ago

@dhimmel, to confirm, is it ok to include the state-of-oa-dois dataset here? Do you have rights to release it CC0?

jglev commented 6 years ago

Thank you for the feedback (and in advance for whatever additional comments are forthcoming)!

As of 8d32f2f3dd, I've switched to using f-strings, which I hadn't heard about (thank you for that, as well). I'll be making further commits and comments above to track my progress tomorrow.
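For context, f-strings (new in Python 3.6) embed expressions directly in string literals. A small sketch, using a hypothetical resolver URL and DOI:

```python
# Hypothetical values, for illustration only:
resolver_base_url = 'https://example.edu/openurl'
doi = '10.1000/example'

# str.format style:
old_style = '{base}?id=doi:{doi}'.format(base=resolver_base_url, doi=doi)
# f-string equivalent -- expressions go straight inside the braces:
new_style = f'{resolver_base_url}?id=doi:{doi}'

assert old_style == new_style
print(new_style)
```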

dhimmel commented 6 years ago

> @dhimmel, to confirm, is it ok to include the state-of-oa-dois dataset here? Do you have rights to release it CC0?

Certainly, the source Zenodo release is CC0. Now I don't think we should necessarily track state-of-oa-dois.tsv.xz. Instead, we can read it from its versioned URL:

https://github.com/greenelab/scihub/raw/4172526ac7433357b31790578ad6f59948b6db26/data/state-of-oa-dois.tsv.xz

In the above URL, you can switch between raw and blob (for the web view).
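Reading an xz-compressed TSV like that works with the standard library alone. A self-contained sketch using a tiny in-memory stand-in for the real file (the oadoi_color column name is illustrative, not confirmed against the actual dataset):

```python
import csv
import io
import lzma

# In practice the file could be fetched from the versioned URL above
# (e.g. with urllib.request.urlopen) and the response passed to
# lzma.open() the same way. An in-memory stand-in keeps this runnable:
tsv_bytes = lzma.compress(b'doi\toadoi_color\n10.1000/example\tclosed\n')

with lzma.open(io.BytesIO(tsv_bytes), mode='rt') as handle:
    rows = list(csv.DictReader(handle, delimiter='\t'))

print(rows)
```

Reading from the versioned raw URL, rather than tracking the file in this repo, keeps the dataset pinned to a specific commit of greenelab/scihub.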

jglev commented 6 years ago

As @dhimmel suggested here, I stopped tracking the database in ab6541ba50.

In f1d2d67, I added a new configuration variable to slice the list of DOIs -- previously, the code was hard-coded to take just the first 10 (as a test set). With that variable (which, when set to None, downloads everything), the PR can now be used without having to edit that slice in the code itself.
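A sketch of how such a slice can work, assuming a hypothetical variable name (not necessarily the one added in f1d2d67): Python slices treat None as "no bound", so a single expression covers both the test-set and download-everything cases:

```python
# Hypothetical config variable; None means "process every DOI".
number_of_dois_to_process = 10

dois = [f'10.1000/example-{i}' for i in range(25)]

# list[:None] returns the whole list, so one slice handles both cases:
dois_to_query = dois[:number_of_dois_to_process]
assert len(dois_to_query) == 10

number_of_dois_to_process = None
dois_to_query = dois[:number_of_dois_to_process]
assert len(dois_to_query) == 25
```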

dhimmel commented 6 years ago

@publicus cool. Let me know when it's ready for my review... i.e., no more planned commits.

jglev commented 6 years ago

I've gone through the PR discussion above, and think that all of your comments to date have been answered at this point. Does that look right to you, too?

jglev commented 6 years ago

Thank you for all of your comments and review! I'll get this running today!

dhimmel commented 6 years ago

> Thank you for all of your comments and review! I'll get this running today!

Make sure you start your subsequent work on top of the current greenelab:master. You need to be building off of the commit we just created by the squash merge.

jglev commented 6 years ago

Will do!