Open peetucket opened 9 months ago
I ran a test with the author query below: the first request asked for 0 results, so it returned only the QueryID and the matching WOS UIDs, and I then retrieved the UIDs via a separate call (using the Swagger API view), as described at the bottom of the page here: https://github.com/sul-dlss/sul_pub/wiki/Clarivate-APIs#web-of-sciences-expanded-api-notes
That step is fast. I then iterated over the resulting WOS UIDs, requesting one record at a time, and each record parsed successfully. So I wonder whether our current approach of requesting all of the records at once runs into trouble when very large publications produce giant amounts of XML.
The query I ran, with a loadTimeSpan of 3W:
AU=("Miller,D" OR "Miller,D,Craig" OR "Miller,D,C") AND AD=("stanford")
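The two-step flow described above can be sketched roughly as below. This is illustrative only: the helper names are mine, and the `query/{queryId}` path for step 2 is my recollection of the Swagger docs (verify against https://developer.clarivate.com/apis/wos before relying on it). The real calls also need an `X-ApiKey` header.

```ruby
require 'uri'

WOS_BASE = 'https://wos-api.clarivate.com/api/wos/'

# Step 1: run the author query but ask for zero records, so the
# response carries only the QueryID and result metadata (fast even
# for authors with thousands of matches).
def uid_only_query_uri(usr_query, load_time_span: '3W')
  params = {
    databaseId: 'WOK',
    usrQuery: usr_query,
    loadTimeSpan: load_time_span,
    count: 0,        # no full records -- just the QueryID and counts
    firstRecord: 1
  }
  URI.parse(WOS_BASE).tap { |u| u.query = URI.encode_www_form(params) }
end

# Step 2: page through the stored result set one record at a time
# using the QueryID from step 1, avoiding giant multi-record payloads.
def single_record_uri(query_id, first_record)
  params = { count: 1, firstRecord: first_record }
  URI.parse("#{WOS_BASE}query/#{query_id}").tap { |u| u.query = URI.encode_www_form(params) }
end

puts uid_only_query_uri('AU=("Miller,D") AND AD=("stanford")')
puts single_record_uri(12345, 1)
```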
The WoS API returned:
["WOS:001021700000001",
"WOS:001028170500007",
"WOS:001037066800001",
"WOS:001021392200001",
"WOS:001022697000001",
"WOS:001021461500001",
"WOS:001022682600001",
"WOS:000329880700016",
"WOS:001035480600001",
"WOS:001035434900001",
"WOS:001030510600001",
"WOS:001035476900001",
"WOS:001022781200001",
"WOS:001035251600001",
"WOS:001023908100001",
"WOS:001023760300001",
"WOS:001035462900001",
"WOS:001035458200001",
"WOS:001035262600001",
"WOS:001035431100001",
"WOS:001035282400001",
"WOS:001035569700001",
"WOS:001035240700001",
"WOS:001035251000001",
"WOS:001035243000001"]
The last WOS UID has 2900 authors and is a giant XML record. It processes fine on its own, but plausibly causes problems when bundled with other publications:
wos_uid = 'WOS:001035243000001'
results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a;
pub_hash = results[0].pub_hash;nil
puts pub_hash[:author].size
=> 2900
puts results[0].to_xml
Looks like most of those publications have a lot of authors ... they are physics publications:
wos_uids = ["WOS:001021700000001",
"WOS:001028170500007",
"WOS:001037066800001",
"WOS:001021392200001",
"WOS:001022697000001",
"WOS:001021461500001",
"WOS:001022682600001",
"WOS:000329880700016",
"WOS:001035480600001",
"WOS:001035434900001",
"WOS:001030510600001",
"WOS:001035476900001",
"WOS:001022781200001",
"WOS:001035251600001",
"WOS:001023908100001",
"WOS:001023760300001",
"WOS:001035462900001",
"WOS:001035458200001",
"WOS:001035262600001",
"WOS:001035431100001",
"WOS:001035282400001",
"WOS:001035569700001",
"WOS:001035240700001",
"WOS:001035251000001",
"WOS:001035243000001"]
resp = {}
wos_uids.each do |wos_uid|
  results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a
  pub_hash = results[0].pub_hash
  resp[wos_uid] = pub_hash[:author].size
end;nil
resp
=>
{"WOS:001021700000001"=>2900,
"WOS:001028170500007"=>43,
"WOS:001037066800001"=>2900,
"WOS:001021392200001"=>2900,
"WOS:001022697000001"=>2900,
"WOS:001021461500001"=>2900,
"WOS:001022682600001"=>2864,
"WOS:000329880700016"=>17,
"WOS:001035480600001"=>2898,
"WOS:001035434900001"=>2856,
"WOS:001030510600001"=>2864,
"WOS:001035476900001"=>2933,
"WOS:001022781200001"=>2900,
"WOS:001035251600001"=>2864,
"WOS:001023908100001"=>2898,
"WOS:001023760300001"=>2898,
"WOS:001035462900001"=>2900,
"WOS:001035458200001"=>2900,
"WOS:001035262600001"=>2900,
"WOS:001035431100001"=>2900,
"WOS:001035282400001"=>2900,
"WOS:001035569700001"=>2900,
"WOS:001035240700001"=>2900,
"WOS:001035251000001"=>2900,
"WOS:001035243000001"=>2900}
So it seems entirely plausible that a result set like this will blow up if we ask for it all in one go (which we currently do), instead of requesting just the UIDs and then iterating over them one at a time to fetch the records.
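A rough back-of-envelope calculation supports this. The per-author byte estimate below is my guess, not a measurement, but each author appears in the XML with name, address references, and organization strings, so even a conservative figure puts one 25-record batch in the tens of megabytes:

```ruby
# Author counts from the batch above (the values of the resp hash).
author_counts = [
  2900, 43, 2900, 2900, 2900, 2900, 2864, 17, 2898, 2856, 2864, 2933,
  2900, 2864, 2898, 2898, 2900, 2900, 2900, 2900, 2900, 2900, 2900,
  2900, 2900
]

total_authors = author_counts.sum
puts total_authors # => 66635

# Assume (very roughly) ~500 bytes of XML per author entry.
bytes_per_author = 500
estimated_mb = (total_authors * bytes_per_author / 1024.0 / 1024.0).round(1)
puts estimated_mb # => 31.8 -- and that's before parsing overhead
```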
Of note: the same problem occurs with the current SOAP-based API. There, while we may initially fetch only the WOS UIDs when running the name query, we then pass all of the IDs in and try to batch-fetch many records at once. This also fails:
wos_uids = ["WOS:001021700000001",
"WOS:001028170500007",
"WOS:001037066800001",
"WOS:001021392200001",
"WOS:001022697000001",
"WOS:001021461500001",
"WOS:001022682600001",
"WOS:000329880700016",
"WOS:001035480600001",
"WOS:001035434900001",
"WOS:001030510600001",
"WOS:001035476900001",
"WOS:001022781200001",
"WOS:001035251600001",
"WOS:001023908100001",
"WOS:001023760300001",
"WOS:001035462900001",
"WOS:001035458200001",
"WOS:001035262600001",
"WOS:001035431100001",
"WOS:001035282400001",
"WOS:001035569700001",
"WOS:001035240700001",
"WOS:001035251000001",
"WOS:001035243000001"]
results = WebOfScience.queries.retrieve_by_id(wos_uids).next_batch.to_a;
/opt/app/pub/sul-pub/shared/bundle/ruby/3.2.0/gems/savon-2.14.0/lib/savon/response.rb:132:in `raise_soap_and_http_errors!': (soap:Server) (WSE0002) Error processing your request. Reason: The (server-side) Web service could not create the call to a supporting server. Error processing results of query. Cause: [{0}]. Remedy: Call customer support. This is not a problem within your SOAP client. : Java heap space (Savon::SOAPFault)
I'm confused about why the XML being logged is a fragment rather than a complete document. For example, am I reading this HB alert correctly?
{
  "message" => "Error processing XML record from WoS",
  "xml" => "izations>Med CtrChicagoILUSA",
  "encoded_xml" => nil
}
Good question - I'm not 100% sure. I suspect the WoS response is not being fully returned because it is so large.
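If truncation is the culprit, one cheap guard would be to confirm the payload parses as a complete document before handing it to record processing. This is a sketch of my own suggestion (not something sul_pub does today), using Ruby's stdlib REXML:

```ruby
require 'rexml/document'

# Returns true when +xml+ parses as a well-formed document. A response
# cut off mid-stream (like the "izations>..." fragment in the HB alert)
# has unbalanced tags and will fail strict parsing.
def well_formed_xml?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end
```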
Here is another particularly problematic example. This author has 2210 publications. (Previously, even asking for just the UIDs would blow up; now that we fetch UIDs more efficiently, requesting only the UIDs works fine.) However, so many of the publications have so many authors that it will blow up if you pull the publication data in groups of 100 (the default). Below, we pull one record at a time and count the authors (even this takes a very long time), showing that many of these publications have hundreds or thousands of authors.
author=Author.find_by(cap_profile_id: 34047)
author_query = WebOfScience::QueryAuthor.new(author)
puts author_query.name_query.send(:name_query);
=> AU=("Wu,Sean" OR "Wu,Sean,M." OR "Wu,Sean,M" OR "Wu,Ming" OR "Wu,Ming,Ming-yuan" OR "Wu,Ming,M" OR "Wu,S" OR "Wu,S,M") AND AD=("stanford" OR "massachusetts general hospital")
wos_uids = author_query.uids;
wos_uids.size
=> 2210
resp = {}
wos_uids.each do |wos_uid|
  results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a
  pub_hash = results[0].pub_hash
  resp[wos_uid] = pub_hash[:author].size
end;nil
# wait a long long time
puts resp
=>
{"WOS:001062395700001"=>2856,
"WOS:001035476900001"=>2933,
"WOS:001062376700001"=>2900,
"WOS:001062550200002"=>2856,
"WOS:001062421400002"=>2900,
"WOS:001062420000001"=>2898,
"WOS:001062395800001"=>2898,
"WOS:001062395800002"=>2898,
"WOS:001062554100001"=>2898,
"MEDLINE:37955510"=>2871,
"WOS:001022682600001"=>2864,
"WOS:001063965200001"=>2876,
"WOS:001062420100001"=>2928,
"MEDLINE:37925689"=>2935,
"WOS:001063985300002"=>7,
"WOS:001066442000003"=>17,
"WOS:001002149400001"=>14,
"WOS:000952205900001"=>21,
"MEDLINE:37931634"=>733,
"WOS:001063420300001"=>2882,
"WOS:001063751200019"=>11,
"WOS:001062451200001"=>2898,
"WOS:001035434900001"=>2856,
"WOS:001035431100001"=>2900,
"WOS:001055270000001"=>2913,
"MEDLINE:37897746"=>2920,
"WOS:001035462900001"=>2900,
"WOS:001069745300005"=>2900,
"WOS:001035251600001"=>2864,
"WOS:001035240700001"=>2900,
"WOS:001035251000001"=>2900,
"WOS:001035243000001"=>2900,
"WOS:001062397100001"=>2856,
"WOS:001062421500007"=>2898,
"WOS:001062396000001"=>2898,
"WOS:001071193900001"=>2900,
"WOS:001062398000001"=>2900,
"WOS:001062454100002"=>2900,
"WOS:001062454100001"=>2900,
"WOS:001062376700002"=>2900,
"WOS:001061847500001"=>2911,
"WOS:001062358800001"=>2896,
"MEDLINE:37897770"=>2933,
"WOS:001058590400001"=>2898,
"WOS:001079098400001"=>2900,
"WOS:001061829700001"=>2856,
"WOS:001069542700001"=>2898,
"WOS:001060591500001"=>2898,
"WOS:001063971600001"=>2900,
"WOS:001063973100001"=>2900,
"WOS:001063486300001"=>2900,
"WOS:001061803400002"=>2900,
"WOS:001035262600001"=>2900,
"PPRN:42553004"=>106,
"WOS:001074938400001"=>13,
"WOS:001035480600001"=>2898,
"WOS:001035458200001"=>2900,
"WOS:001035569700001"=>2900,
"WOS:000989629700723"=>2,
"WOS:000989629702211"=>7,
"WOS:000989629700722"=>8,
"WOS:001061852200001"=>2900,
"WOS:001068857600001"=>107,
"WOS:001059027100001"=>29,
"WOS:001061751900002"=>2898,
"WOS:001061751900001"=>2898,
"WOS:001061876200001"=>2933,
... (output truncated)
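One possible mitigation, sketched below, is to keep batch fetching for the common case but split a failing batch in half and retry, bottoming out at single records for pathological publications. This is my suggestion, not current sul_pub behavior; `fetch` stands in for whatever callable actually retrieves records for a list of UIDs.

```ruby
# Fetch records for +uids+ via the supplied callable, halving the
# batch size on failure until we are down to single records. A single
# record that still fails is re-raised as a genuine error.
def fetch_with_fallback(uids, batch_size: 100, &fetch)
  results = []
  uids.each_slice(batch_size) do |batch|
    begin
      results.concat(fetch.call(batch))
    rescue StandardError
      raise if batch.size == 1
      results.concat(fetch_with_fallback(batch, batch_size: batch.size / 2, &fetch))
    end
  end
  results
end
```

With a real fetcher, most batches of well-behaved records still go through in one call, and only the giant-author-list records pay the one-at-a-time cost.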
The new WoS REST API seems to have issues with some authors, throwing errors.
https://app.honeybadger.io/projects/50046/faults/99684154/01HB323CCE7F43DYFQ5SBH0QJT?page=0
For example, this author fails to get results back, with an XML error, as you can see from the HB alert above.
You can reproduce the problem without attempting to harvest by just querying for the UIDs.
This is likely due to some problematic responses from Clarivate (like some invalid XML documents) and needs to be investigated with them.
If you use their Swagger interface (https://developer.clarivate.com/apis/wos) to generate the actual REST call and execute it on the console, you will get a response:
curl -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'
To fetch just the response headers (expecting a 200):
curl -I -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'
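For reference, the percent-encoded `usrQuery` value in these curl commands can be reproduced from the plain query string with Ruby's stdlib URI helpers (`encode_www_form_component` uses `+` for spaces, so we swap in `%20` to match the Swagger-generated URL exactly):

```ruby
require 'uri'

query = 'AU=("Miller,D" OR "Miller,D,Craig" OR "Miller,D,C") AND AD=("stanford")'

# Percent-encode the query for use as a single URL parameter value.
encoded = URI.encode_www_form_component(query).gsub('+', '%20')
puts encoded
# => AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29
```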