Closed: dhimmel closed this issue 6 years ago.
If you want an on-disk solution to make sure you don't repeat queries, you could use `requests_cache`, `sqlite3` directly, a `shelve`, or some XML database (like a MongoDB for XML, not sure if this exists).
In general, I think you will find it more manageable to have one big file with all the cached responses than separate files for each response.
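As a minimal sketch of the `shelve` option, a single on-disk file keyed by DOI would avoid repeat queries (the cache filename and the `fetch` function here are hypothetical stand-ins, not anything from this repo):

```python
import shelve

def cached_fetch(doi, fetch):
    """Return the XML response for a DOI, performing the API call only once.

    `fetch` is a placeholder for whatever function actually queries the
    resolver; 'responses.shelve' is an arbitrary cache filename.
    """
    with shelve.open('responses.shelve') as cache:
        if doi not in cache:
            cache[doi] = fetch(doi)
        return cache[doi]
```

A second call with the same DOI reads from the shelf and never touches the network.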
Thanks! I've not used `backoff`, `requests_cache`, or shelves before -- I appreciate the tips (and using this project to brush back up on Python, since R's been my daily language for a while). I'll look into those.
And if you prefer storing the XML in a database (I haven't read up on `shelve` yet, but I'd be comfortable using `sqlite3` immediately), I'm fine with implementing that.
I think it'll be important to have a persistent cache, so you can avoid repeating any API calls. A `sqlite3` table would definitely get the job done. I could imagine the following columns:

- `doi` (indexed)
- `api_url` (or not, if this should be private)
- `timestamp`
- `xml_response`
Then when we've completed this analysis, we can upload the database file or export all of the XML snippets to a compressed text document for long-term archiving. The only downside I see to an SQLite database is that it will probably grow quite large in size, but this shouldn't be a problem.
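A minimal sketch of that schema with the builtin `sqlite3` module (the database filename, table name, and inserted row are placeholders):

```python
import sqlite3

# 'cache.db' and 'responses' are placeholder names for the cache file and table.
connection = sqlite3.connect('cache.db')
connection.execute('''
    CREATE TABLE IF NOT EXISTS responses (
        doi TEXT PRIMARY KEY,
        api_url TEXT,
        timestamp TEXT,
        xml_response TEXT
    )
''')
# PRIMARY KEY on doi provides the index suggested above, and
# INSERT OR IGNORE makes repeated caching of the same DOI a no-op.
connection.execute(
    'INSERT OR IGNORE INTO responses VALUES (?, ?, ?, ?)',
    ('10.1000/example', None, '2017-01-01T00:00:00', '<response/>'),
)
connection.commit()
```

Exporting for long-term archiving would then just be a `SELECT` over `xml_response` written to a compressed text file.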
Yep, that schema sounds good to me!
Re: the API URLs, as you suggested, I would prefer that those not be included in the dataset for the particular API I'm using. This would/will change if I eventually get the OCLC API I mentioned working (I'm waiting to hear back from OCLC's support team to clarify a possible bug with that API, which is why I'm not currently using it).
From the reading and experimenting I've been doing on OpenURL, I think that it's safe to consider both of these APIs as two instances of "OpenURL resolvers" more generally. Any institutional OpenURL resolver will search a DOI and then return information that allows figuring out whether it's available in that institution's catalog. For the API I'm currently using, that means returning information about each subscription that covers a given DOI, the dates / volumes that each of those subscriptions cover, and a `full_text_available` key (for our purposes, I'll be able to use just that key to determine whether a DOI is available). For the OCLC API, it looks like the `full_text_available` key isn't returned, but the subscription date/volume information is, presented using a standardized vocabulary (see the "Coverage Fields" part of that OCLC API documentation).
Thus, my understanding is that each OpenURL resolver (across institutions) is likely to give basically the same information, but varying slightly in its presentation -- put differently, if the OCLC API doesn't end up being viable to use, a different institution will end up needing to supply its own query URL for its particular OpenURL resolver anyway, as well as possibly an extra function to do any extra processing on the returned XML.
So, what I'd like to do here is abstract the resolver-specific logic into its own function (for the API I'm currently using, that function would find the `full_text_available` key and parse it; for an OCLC implementation, it would, e.g., compare the current date/year to the years of each subscription to determine whether the full text is available; for some other institution's OpenURL resolver, it might look slightly different).
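As a sketch of that per-resolver abstraction (the XML element names below are assumptions for illustration, since the actual resolver responses aren't shown here), each resolver would supply its own parsing function returning a boolean:

```python
import datetime
import xml.etree.ElementTree as ET

def parse_full_text_key(xml_string):
    """For a resolver that returns a full_text_available element (assumed name)."""
    root = ET.fromstring(xml_string)
    element = root.find('.//full_text_available')
    return element is not None and element.text == 'true'

def parse_coverage_years(xml_string, today=None):
    """For an OCLC-style resolver: compare today's year against each
    subscription's coverage range (element names are assumptions)."""
    today = today or datetime.date.today()
    root = ET.fromstring(xml_string)
    for subscription in root.findall('.//subscription'):
        start = int(subscription.findtext('start_year'))
        # An open-ended subscription is assumed to cover the current year.
        end = int(subscription.findtext('end_year', default=str(today.year)))
        if start <= today.year <= end:
            return True
    return False
```

The rest of the pipeline (caching, iterating over DOIs) could then stay identical across institutions, with only the parsing function and query URL swapped out.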
@publicus and I were discussing implementing the API calls in Python.

Here are some helpful resources. First, most people use the `requests` package for API calls, which is not builtin, but has a nice interface. Since this repository has many similarities to our Crossref API usage, you may want to check out greenelab/crossref, particularly the Python and conda setup. There are also lots of additional packages that you may find useful, such as `backoff` or `requests_cache`. Basically, if you find yourself doing something highly involved, there may already be a Python package for that! Finally, if we ever want to do concurrent requests, which is probably not an immediate priority, I'd recommend `ThreadPoolExecutor` from `concurrent.futures`.
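A minimal sketch of the `ThreadPoolExecutor` approach (`fetch_doi` is a stand-in for whatever function performs the actual resolver call):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_doi(doi):
    # Stand-in for the real API call (e.g. a requests.get to the resolver).
    return f'<response doi="{doi}"/>'

dois = ['10.1000/a', '10.1000/b', '10.1000/c']

# Threads suit this workload because it is I/O-bound, and
# executor.map returns results in the same order as the input.
with ThreadPoolExecutor(max_workers=4) as executor:
    responses = list(executor.map(fetch_doi, dois))
```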