sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

only fetch UIDs and not full records from WoS #1660

Closed peetucket closed 10 months ago

peetucket commented 11 months ago

Why was this change made?

To start dealing with issues in #1642

It is very inefficient to request the entire record from WoS when we only need the WOS_UIDs to determine whether we have new publications. This PR alters the harvest logic to use a different endpoint provided by WoS that returns only UIDs, since each harvest only needs UIDs to check whether there are new publications to add.
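As a rough illustration of the UID-only check, here is a minimal sketch. `FakeUidClient`, `uids_for`, and `new_uids` are hypothetical names invented for this example; they are not the classes or methods in this PR, and the client is an in-memory stand-in rather than a real WoS call.

```ruby
require 'set'

# Hypothetical stand-in for a WoS client that returns only UIDs for a query.
# It is not the real API client; it just replays a fixed list.
class FakeUidClient
  def initialize(uids)
    @uids = uids
  end

  # Returns only the identifiers, never the full records.
  def uids_for(_query)
    @uids
  end
end

# Returns the harvested UIDs that are not already known locally.
def new_uids(client, query, known_uids)
  known = known_uids.to_set
  client.uids_for(query).reject { |uid| known.include?(uid) }
end

client = FakeUidClient.new(%w[WOS:0001 WOS:0002 WOS:0003])
puts new_uids(client, 'hypothetical author query', %w[WOS:0001 WOS:0003]).inspect
# prints ["WOS:0002"]
```

If the result is empty, the harvest can stop without requesting any full records.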

This won't completely fix the problem in #1642: if there are new publications to add, we still need to fetch the full records, and that request will still blow up if many publications in the set of 100 you can request at once have very many authors. However, it should greatly improve the performance of each harvest and also cut down on these exceptions, since they will only occur when we actually need to fetch new publications (instead of on every harvest, as they do now).
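The second step, fetching full records only for the new UIDs in sets of at most 100, might look roughly like the sketch below. `BATCH_SIZE`, `fetch_in_batches`, and the `fetcher` lambda are assumptions made for illustration; the lambda stands in for whatever actually requests full records from WoS.

```ruby
# Assumed batch limit, taken from the "set of 100" mentioned above.
BATCH_SIZE = 100

# Fetches full records only for the given UIDs, at most batch_size per request.
# `fetcher` is any callable that takes an array of UIDs and returns records.
def fetch_in_batches(uids, fetcher, batch_size: BATCH_SIZE)
  uids.each_slice(batch_size).flat_map { |batch| fetcher.call(batch) }
end

# Hypothetical fetcher that counts requests and returns placeholder records.
calls = 0
fetcher = lambda do |batch|
  calls += 1
  batch.map { |uid| { uid: uid } }
end

uids = (1..250).map { |i| "WOS:#{i}" }
records = fetch_in_batches(uids, fetcher)
puts "#{records.size} records in #{calls} requests"
# prints 250 records in 3 requests
```

With an empty list of new UIDs the fetcher is never called, which is the case where the exceptions described above can no longer occur.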

It also eliminates what I believe to be an unused class (WebOfScience::RestRetriever) and moves its spec to the WebOfScience::BaseRestRetriever class (which is very similar, and is subclassed and thus used). While the WebOfScience::RestRetriever class was added in #1612, the PR that switched APIs, it doesn't appear to be used or subclassed anywhere.

Finally, I removed all of the VCR cassettes for the WoS API (allowing them to be recreated) to be sure the tests would still pass given these code changes. Note that a few expectations had to be updated because of the new cassettes (e.g. publication counts, which have increased since the VCR cassettes were last created).

You may ask: why is this a new problem, given the switch to the REST-based API? The answer is that it is not entirely a new problem. The old SOAP-based API had similar issues when requesting full publication records for publications with very many authors. However, the SOAP-based API had an easier mechanism for requesting only UIDs when harvesting, and that's what we were doing. The new REST-based API takes a bit more work to replicate this functionality. So this PR makes the new REST-based API code work more like the older SOAP-based API code did: it requests only UIDs on each harvest instead of full publication records, and then goes back to fetch the full records only for the publications we actually need to add to our database. This should reduce the occurrence of the problem.

How was this change tested?

Existing specs (allowing VCR cassettes to be re-created)