sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

WoS REST API causes problems with some author searches #1642

Open peetucket opened 9 months ago

peetucket commented 9 months ago

The new WoS REST API seems to have issues with some authors, throwing errors.

https://app.honeybadger.io/projects/50046/faults/99684154/01HB323CCE7F43DYFQ5SBH0QJT?page=0

For example, this author fails to return results, raising an XML error, as you can see from the HB fault above.

You can reproduce the problem without attempting a harvest by just querying for the UIDs:

# In a Rails console: build the author query, print the name query, and
# request just the UIDs
author = Author.find(157210)
options = { load_time_span: '3W', relDate: '21' }
author_query = WebOfScience::QueryAuthor.new(author, options)
puts "WOS (by name): #{author_query.name_query.send(:name_query)}"
uids = author_query.uids # this is the call that hits the XML error

This is likely due to problematic responses from Clarivate (e.g. invalid XML documents) and needs to be investigated with them.

If you use their Swagger interface (https://developer.clarivate.com/apis/wos) to generate the actual REST call and execute it in a console, you will get a response:

curl -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'

To fetch just the response headers (which come back with a 200):

curl -I -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'

peetucket commented 9 months ago
  1. Perhaps add some additional logging around the XML parser in https://github.com/sul-dlss/sul_pub/blob/main/lib/web_of_science/xml_parser.rb#L25-L31 (or elsewhere in the records class) so we can identify the exact WoS ID of the record that is not parsing correctly, and provide this info to Clarivate.
  2. Ignore records that do not parse instead of blowing up the whole harvest for that author, so the remaining publications can still be added (see the sketch below).
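
A minimal sketch of both ideas, assuming the parser uses Nokogiri (the parse_record helper and its logging are hypothetical, not actual sul_pub code):

require 'logger'
require 'nokogiri'

# Hypothetical helper: parse one WoS record, logging and skipping invalid XML
# instead of raising and aborting the whole harvest for the author.
def parse_record(xml, logger: Logger.new($stdout))
  Nokogiri::XML(xml) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  # Log a snippet of the payload so the offending WoS UID can be identified
  # and reported to Clarivate.
  logger.error("Error processing XML record from WoS: #{e.message}; xml: #{xml.to_s[0, 200]}")
  nil
end

# Callers could then drop unparseable records rather than failing:
# records = raw_xml_strings.map { |xml| parse_record(xml) }.compact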
peetucket commented 9 months ago

I did a test with the author query below: request only the WOS UIDs in the first query (by asking for 0 results), take the queryId, and then request the UIDs via a separate call (using the Swagger API view), as described at the bottom of the page here: https://github.com/sul-dlss/sul_pub/wiki/Clarivate-APIs#web-of-sciences-expanded-api-notes
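
Roughly, the two-step REST flow looks like this (a sketch based on the wiki notes; the query string is elided, QUERY_ID stands for whatever queryId the first response returns, and the exact paths should be checked against the Swagger docs):

# Step 1: run the search with count=0 to get recordsFound and a queryId, no records
curl 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=...&count=0&firstRecord=1' -H 'X-ApiKey: API_KEY_REDACTED'
# Step 2: page through the stored result set by queryId
curl 'https://wos-api.clarivate.com/api/wos/query/QUERY_ID?count=100&firstRecord=1' -H 'X-ApiKey: API_KEY_REDACTED'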

This is fast. I then iterated over the resulting WOS UIDs, requesting one record at a time, and each record parsed successfully. So I wonder whether our current approach of requesting all of the records at once has issues with very large publications that produce enormous amounts of XML.

Query that I ran (with a loadTimeSpan of 3W):

AU=("Miller,D" OR "Miller,D,Craig" OR "Miller,D,C") AND AD=("stanford")

The WoS API returned:

["WOS:001021700000001",
 "WOS:001028170500007",
 "WOS:001037066800001",
 "WOS:001021392200001",
 "WOS:001022697000001",
 "WOS:001021461500001",
 "WOS:001022682600001",
 "WOS:000329880700016",
 "WOS:001035480600001",
 "WOS:001035434900001",
 "WOS:001030510600001",
 "WOS:001035476900001",
 "WOS:001022781200001",
 "WOS:001035251600001",
 "WOS:001023908100001",
 "WOS:001023760300001",
 "WOS:001035462900001",
 "WOS:001035458200001",
 "WOS:001035262600001",
 "WOS:001035431100001",
 "WOS:001035282400001",
 "WOS:001035569700001",
 "WOS:001035240700001",
 "WOS:001035251000001",
 "WOS:001035243000001"]
peetucket commented 9 months ago

The last WOS UID has 2900 authors and is a giant XML record. It works fine when processed singly, but plausibly causes issues when bundled with other publications:

wos_uid = 'WOS:001035243000001'
# Fetch the single record and count its authors (the trailing `; nil` just
# suppresses large console output)
results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a; nil
pub_hash = results[0].pub_hash; nil
puts pub_hash[:author].size
=> 2900
puts results[0].to_xml
peetucket commented 9 months ago

Looks like most of those publications have a lot of authors ... they are physics publications:

wos_uids = ["WOS:001021700000001",
 "WOS:001028170500007",
 "WOS:001037066800001",
 "WOS:001021392200001",
 "WOS:001022697000001",
 "WOS:001021461500001",
 "WOS:001022682600001",
 "WOS:000329880700016",
 "WOS:001035480600001",
 "WOS:001035434900001",
 "WOS:001030510600001",
 "WOS:001035476900001",
 "WOS:001022781200001",
 "WOS:001035251600001",
 "WOS:001023908100001",
 "WOS:001023760300001",
 "WOS:001035462900001",
 "WOS:001035458200001",
 "WOS:001035262600001",
 "WOS:001035431100001",
 "WOS:001035282400001",
 "WOS:001035569700001",
 "WOS:001035240700001",
 "WOS:001035251000001",
 "WOS:001035243000001"]

# Fetch each record singly and count its authors
resp = {}
wos_uids.each do |wos_uid|
  results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a
  pub_hash = results[0].pub_hash
  resp[wos_uid] = pub_hash[:author].size
end; nil
resp
 =>
{"WOS:001021700000001"=>2900,
 "WOS:001028170500007"=>43,
 "WOS:001037066800001"=>2900,
 "WOS:001021392200001"=>2900,
 "WOS:001022697000001"=>2900,
 "WOS:001021461500001"=>2900,
 "WOS:001022682600001"=>2864,
 "WOS:000329880700016"=>17,
 "WOS:001035480600001"=>2898,
 "WOS:001035434900001"=>2856,
 "WOS:001030510600001"=>2864,
 "WOS:001035476900001"=>2933,
 "WOS:001022781200001"=>2900,
 "WOS:001035251600001"=>2864,
 "WOS:001023908100001"=>2898,
 "WOS:001023760300001"=>2898,
 "WOS:001035462900001"=>2900,
 "WOS:001035458200001"=>2900,
 "WOS:001035262600001"=>2900,
 "WOS:001035431100001"=>2900,
 "WOS:001035282400001"=>2900,
 "WOS:001035569700001"=>2900,
 "WOS:001035240700001"=>2900,
 "WOS:001035251000001"=>2900,
 "WOS:001035243000001"=>2900}
peetucket commented 9 months ago

So it seems entirely plausible that a result set like this will blow up if we ask for it all in one go (which we currently do), rather than requesting just the UIDs and then iterating over them to fetch the records one (or a few) at a time.
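
A minimal sketch of that mitigation, using only calls that already appear in this thread (the slice size of 10 is arbitrary):

# Hypothetical: fetch the UIDs first, then retrieve full records in small
# slices instead of one giant batch.
uids = author_query.uids
records = uids.each_slice(10).flat_map do |slice|
  WebOfScience.queries.retrieve_by_id(slice).next_batch.to_a
end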

peetucket commented 9 months ago

Of note: the same problem happens in the current SOAP-based API. In that case, while we may initially fetch just the WOS UIDs when running the name query, we then pass all of the IDs in and try to batch-fetch many records at once. This also causes a failure:

wos_uids = ["WOS:001021700000001",
 "WOS:001028170500007",
 "WOS:001037066800001",
 "WOS:001021392200001",
 "WOS:001022697000001",
 "WOS:001021461500001",
 "WOS:001022682600001",
 "WOS:000329880700016",
 "WOS:001035480600001",
 "WOS:001035434900001",
 "WOS:001030510600001",
 "WOS:001035476900001",
 "WOS:001022781200001",
 "WOS:001035251600001",
 "WOS:001023908100001",
 "WOS:001023760300001",
 "WOS:001035462900001",
 "WOS:001035458200001",
 "WOS:001035262600001",
 "WOS:001035431100001",
 "WOS:001035282400001",
 "WOS:001035569700001",
 "WOS:001035240700001",
 "WOS:001035251000001",
 "WOS:001035243000001"]
results = WebOfScience.queries.retrieve_by_id(wos_uids).next_batch.to_a;
/opt/app/pub/sul-pub/shared/bundle/ruby/3.2.0/gems/savon-2.14.0/lib/savon/response.rb:132:in `raise_soap_and_http_errors!': (soap:Server) (WSE0002) Error processing your request. Reason: The (server-side) Web service could not create the call to a supporting server. Error processing results of query. Cause: [{0}]. Remedy: Call customer support. This is not a problem within your SOAP client.  : Java heap space (Savon::SOAPFault)
edsu commented 8 months ago

I'm confused about why the XML that is being logged is a fragment and not a complete document. For example, am I reading this HB alert correctly?

    {
      "message" => "Error processing XML record from WoS",
      "xml" => "izations>Med CtrChicagoILUSA",
      "encoded_xml" => nil
    }
peetucket commented 8 months ago

Good question - I am not 100% sure. I suspect the WoS response is not being returned in full because it is so large.
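
One way to test the truncation theory would be to compare the declared Content-Length with the bytes actually received (a sketch using plain Net::HTTP; the query string and API key handling are placeholders):

require 'net/http'
require 'uri'

# Hypothetical check: does the body we receive match the length the server
# declared? (Content-Length may be nil if the response is chunked.)
uri = URI('https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=...&count=100&firstRecord=1')
res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  req = Net::HTTP::Get.new(uri)
  req['X-ApiKey'] = ENV.fetch('WOS_API_KEY')
  req['Accept'] = 'application/xml'
  http.request(req)
end
puts "declared: #{res['Content-Length'].inspect}, received: #{res.body.bytesize} bytes"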

peetucket commented 8 months ago

Here is another particularly problematic example. This author has 2210 publications. (Previously, even asking for just the UIDs would blow up; now that we fetch UIDs more efficiently, that part works fine.) However, so many of the publications have so many authors that it will blow up if you pull the publication data in groups of 100 (the default). Below, we pull records one at a time and count the authors (even this takes a very long time), showing that many of these publications have hundreds or thousands of authors.

author = Author.find_by(cap_profile_id: 34047)
author_query = WebOfScience::QueryAuthor.new(author)
puts author_query.name_query.send(:name_query);

 => AU=("Wu,Sean" OR "Wu,Sean,M." OR "Wu,Sean,M" OR "Wu,Ming" OR "Wu,Ming,Ming-yuan" OR "Wu,Ming,M" OR "Wu,S" OR "Wu,S,M") AND AD=("stanford" OR "massachusetts general hospital")

wos_uids = author_query.uids;
wos_uids.size
 => 2210

# Fetch each record singly and count its authors
resp = {}
wos_uids.each do |wos_uid|
  results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a
  pub_hash = results[0].pub_hash
  resp[wos_uid] = pub_hash[:author].size
end; nil

 # wait a long long time

puts resp
 =>
{"WOS:001062395700001"=>2856,
 "WOS:001035476900001"=>2933,
 "WOS:001062376700001"=>2900,
 "WOS:001062550200002"=>2856,
 "WOS:001062421400002"=>2900,
 "WOS:001062420000001"=>2898,
 "WOS:001062395800001"=>2898,
 "WOS:001062395800002"=>2898,
 "WOS:001062554100001"=>2898,
 "MEDLINE:37955510"=>2871,
 "WOS:001022682600001"=>2864,
 "WOS:001063965200001"=>2876,
 "WOS:001062420100001"=>2928,
 "MEDLINE:37925689"=>2935,
 "WOS:001063985300002"=>7,
 "WOS:001066442000003"=>17,
 "WOS:001002149400001"=>14,
 "WOS:000952205900001"=>21,
 "MEDLINE:37931634"=>733,
 "WOS:001063420300001"=>2882,
 "WOS:001063751200019"=>11,
 "WOS:001062451200001"=>2898,
 "WOS:001035434900001"=>2856,
 "WOS:001035431100001"=>2900,
 "WOS:001055270000001"=>2913,
 "MEDLINE:37897746"=>2920,
 "WOS:001035462900001"=>2900,
 "WOS:001069745300005"=>2900,
 "WOS:001035251600001"=>2864,
 "WOS:001035240700001"=>2900,
 "WOS:001035251000001"=>2900,
 "WOS:001035243000001"=>2900,
 "WOS:001062397100001"=>2856,
 "WOS:001062421500007"=>2898,
 "WOS:001062396000001"=>2898,
 "WOS:001071193900001"=>2900,
 "WOS:001062398000001"=>2900,
 "WOS:001062454100002"=>2900,
 "WOS:001062454100001"=>2900,
 "WOS:001062376700002"=>2900,
 "WOS:001061847500001"=>2911,
 "WOS:001062358800001"=>2896,
 "MEDLINE:37897770"=>2933,
 "WOS:001058590400001"=>2898,
 "WOS:001079098400001"=>2900,
 "WOS:001061829700001"=>2856,
 "WOS:001069542700001"=>2898,
 "WOS:001060591500001"=>2898,
 "WOS:001063971600001"=>2900,
 "WOS:001063973100001"=>2900,
 "WOS:001063486300001"=>2900,
 "WOS:001061803400002"=>2900,
 "WOS:001035262600001"=>2900,
 "PPRN:42553004"=>106,
 "WOS:001074938400001"=>13,
 "WOS:001035480600001"=>2898,
 "WOS:001035458200001"=>2900,
 "WOS:001035569700001"=>2900,
 "WOS:000989629700723"=>2,
 "WOS:000989629702211"=>7,
 "WOS:000989629700722"=>8,
 "WOS:001061852200001"=>2900,
 "WOS:001068857600001"=>107,
 "WOS:001059027100001"=>29,
 "WOS:001061751900002"=>2898,
 "WOS:001061751900001"=>2898,
 "WOS:001061876200001=>2933,
etc etc etc