Closed: dhimmel closed this issue 6 years ago.
If you want an on-disk solution to make sure you don't repeat queries, you could use `requests_cache`, `sqlite3` directly, a `shelve`, or some XML database (like a MongoDB for XML, not sure if this exists).
In general, I think you will find it more manageable to have one big file with all the cached responses than separate files for each response.
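As a minimal sketch of the `shelve` option, a single on-disk file keyed by DOI would avoid repeat queries (the cache filename and the `fetch` function here are hypothetical stand-ins, not anything from this repo):

```python
import shelve

def cached_fetch(doi, fetch):
    """Return the XML response for a DOI, performing the API call only once.

    `fetch` is a placeholder for whatever function actually queries the
    resolver; 'responses.shelve' is an arbitrary cache filename.
    """
    with shelve.open('responses.shelve') as cache:
        if doi not in cache:
            cache[doi] = fetch(doi)
        return cache[doi]
```

A second call with the same DOI reads from the shelf and never touches the network.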
Thanks! I've not used `backoff`, `requests_cache`, or shelves before -- I appreciate the tips (and using this project to brush back up on Python, since R's been my daily language for a while). I'll look into those.
And if you prefer storing the XML in a database (I haven't read up on `shelve` yet, but I'd be comfortable using `sqlite3` immediately), I'm fine with implementing that.
I think it'll be important to have a persistent cache, so you can avoid repeating any API calls. A `sqlite3` table would definitely get the job done. I could imagine the following columns:

- `doi` (indexed)
- `api_url` (or not, if this should be private)
- `timestamp`
- `xml_response`
Then when we've completed this analysis, we can upload the database file or export all of the XML snippets to a compressed text document for long-term archiving. The only downside I see to an SQLite database is that it will probably grow quite large in size, but this shouldn't be a problem.
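A minimal sketch of that schema with the builtin `sqlite3` module (the database filename, table name, and inserted row are placeholders):

```python
import sqlite3

# 'cache.db' and 'responses' are placeholder names for the cache file and table.
connection = sqlite3.connect('cache.db')
connection.execute('''
    CREATE TABLE IF NOT EXISTS responses (
        doi TEXT PRIMARY KEY,
        api_url TEXT,
        timestamp TEXT,
        xml_response TEXT
    )
''')
# PRIMARY KEY on doi provides the index suggested above, and
# INSERT OR IGNORE makes repeated caching of the same DOI a no-op.
connection.execute(
    'INSERT OR IGNORE INTO responses VALUES (?, ?, ?, ?)',
    ('10.1000/example', None, '2017-01-01T00:00:00', '<response/>'),
)
connection.commit()
```

Exporting for long-term archiving would then just be a `SELECT` over `xml_response` written to a compressed text file.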
Yep, that schema sounds good to me!
Re: the API URLs, as you suggested, I would prefer that those not be included in the dataset for the particular API I'm using. This would/will change if I eventually get the OCLC API I mentioned working (I'm waiting to hear back from OCLC's support team to clarify a possible bug with that API, which is why I'm not currently using it).
From the reading and experimenting I've been doing on OpenURL, I think that it's safe to consider both of these APIs as two instances of "OpenURL resolvers" more generally. Any institutional OpenURL resolver will search a DOI and then return information that allows figuring out whether it's available in that institution's catalog. For the API I'm currently using, that means returning information about each subscription that covers a given DOI, the dates / volumes that each of those subscriptions cover, and a `full_text_available` key (for our purposes, I'll be able to use just that key to determine whether a DOI is available). For the OCLC API, it looks like the `full_text_available` key isn't returned, but the subscription date/volume information is, presented using a standardized vocabulary (see the "Coverage Fields" part of that OCLC API documentation).
Thus, my understanding is that each OpenURL resolver (across institutions) is likely to give basically the same information, but varying slightly in its presentation -- put differently, if the OCLC API doesn't end up being viable to use, a different institution will end up needing to supply its own query URL for its particular OpenURL resolver anyway, as well as possibly an extra function to do any extra processing on the returned XML.
So, what I'd like to do here is abstract the resolver-specific logic into its own function (for the API I'm currently using, that function would find the `full_text_available` key and parse it; for an OCLC implementation, it would, e.g., compare the current date/year to the years of each subscription to determine whether the full text is available; for some other institution's OpenURL resolver, it might look slightly different).
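As a sketch of that per-resolver abstraction (the XML element names below are assumptions for illustration, since the actual resolver responses aren't shown here), each resolver would supply its own parsing function returning a boolean:

```python
import datetime
import xml.etree.ElementTree as ET

def parse_full_text_key(xml_string):
    """For a resolver that returns a full_text_available element (assumed name)."""
    root = ET.fromstring(xml_string)
    element = root.find('.//full_text_available')
    return element is not None and element.text == 'true'

def parse_coverage_years(xml_string, today=None):
    """For an OCLC-style resolver: compare today's year against each
    subscription's coverage range (element names are assumptions)."""
    today = today or datetime.date.today()
    root = ET.fromstring(xml_string)
    for subscription in root.findall('.//subscription'):
        start = int(subscription.findtext('start_year'))
        # An open-ended subscription is assumed to cover the current year.
        end = int(subscription.findtext('end_year', default=str(today.year)))
        if start <= today.year <= end:
            return True
    return False
```

The rest of the pipeline (caching, iterating over DOIs) could then stay identical across institutions, with only the parsing function and query URL swapped out.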
@publicus and I were discussing implementing the API calls in Python.

Here are some helpful resources. First, most people use the `requests` package for API calls, which is not builtin, but has a nice interface. Since this repository has many similarities to our Crossref API usage, you may want to check out greenelab/crossref, particularly the Python and conda setup. There are also lots of additional packages that you may find useful, such as `backoff` or `requests_cache`. Basically, if you find yourself doing something highly involved, there may already be a Python package for that! Finally, if we ever want to do concurrent requests, which is probably not an immediate priority, I'd recommend `ThreadPoolExecutor` from `concurrent.futures`.
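A minimal sketch of the `ThreadPoolExecutor` approach (`fetch_doi` is a stand-in for whatever function performs the actual resolver call):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_doi(doi):
    # Stand-in for the real API call (e.g. a requests.get to the resolver).
    return f'<response doi="{doi}"/>'

dois = ['10.1000/a', '10.1000/b', '10.1000/c']

# Threads suit this workload because it is I/O-bound, and
# executor.map returns results in the same order as the input.
with ThreadPoolExecutor(max_workers=4) as executor:
    responses = list(executor.map(fetch_doi, dois))
```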