It appears that the front end here is overloading the server with requests. We'll need to increase backend capacity or modify the frontend so that it doesn't fire thousands of requests simultaneously. We could do this by capping the feature at a reasonable number of requests or by queueing them. If we did the latter, we'd need to add a fair amount of complexity to ensure the queue is drained before the user navigates away from the page.
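For illustration, here's a minimal sketch of the queueing option, assuming a browser fetch-based frontend; the MAX_IN_FLIGHT cap, class name, and drain() polling are all hypothetical, not the actual SearchWorks code:

```typescript
// Hypothetical sketch of a bounded-concurrency request queue; not the
// actual SearchWorks frontend code. MAX_IN_FLIGHT is an assumed cap.
const MAX_IN_FLIGHT = 6;

class RequestQueue {
  private pending: Array<() => Promise<void>> = [];
  private inFlight = 0;

  // Queue a request; it is only sent once a concurrency slot is free.
  enqueue(url: string, init?: RequestInit): Promise<Response> {
    return new Promise((resolve, reject) => {
      this.pending.push(async () => {
        try {
          resolve(await fetch(url, init));
        } catch (err) {
          reject(err);
        }
      });
      this.pump();
    });
  }

  // Start queued requests until the concurrency cap is reached.
  private pump(): void {
    while (this.inFlight < MAX_IN_FLIGHT && this.pending.length > 0) {
      const task = this.pending.shift()!;
      this.inFlight++;
      task().finally(() => {
        this.inFlight--;
        this.pump();
      });
    }
  }

  // The complexity mentioned above: something must await drain()
  // (e.g. before navigation) or queued requests are silently lost.
  drain(): Promise<void> {
    return new Promise((resolve) => {
      const check = () => {
        if (this.inFlight === 0 && this.pending.length === 0) resolve();
        else setTimeout(check, 50);
      };
      check();
    });
  }
}
```

The drain() hook is where the extra complexity lives: the page has to be held open (or the remaining work handed off, e.g. via a beforeunload handler) until the queue empties.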
I'm wondering if the load balancer has something to do with this problem. If I bypass the load balancer and go directly to one of the sw-webapp-* VMs, the select-all function works fine. There's a long discussion about connection issues with the load balancer on this issue, which could be related: https://github.com/sul-dlss/operations-tasks/issues/3519
If you can track down any of those 503s to a specific app server, it's unlikely that this is the same issue we're seeing with DSA - there, the app servers never even get the traffic. @jcoyne's thought that this is a capacity issue looks more likely to me absent other evidence.
I don't know how those app servers are configured, but there's likely headroom available just from a config change.
( although if 'select all' here is generating 1,187 unique GETs to the backend server from a single client, that seems less desirable from a scalability perspective. That's an arms race I don't think we can win with server configs. )
@julianmorley we determined the max number is actually 100 requests.
@jcoyne I just tried it - I see what you mean. I think your first assumption that this is blatting the app server is right, though. Those 503s came back fast; it looks like whichever server the LB sent this client to ran out of available connections.
@julianmorley @jcoyne this was simple enough to test and confirm: the POST requests resulting in the 503 response are never making it to any of the sw-webapp-* VMs.
I tailed each of the Apache request logs on the 5 sw-webapp-* VMs and clicked the select-all widget in SearchWorks. My browser recorded a total of 100 POST requests to /selections/*, which resulted in 76 successful requests with a 200 response and 24 failed requests with a 503 response.
At the same time, on each of the SearchWorks prod VMs I tailed the request logs: `tail -f SearchWorks_access_ssl.log | grep 'POST /selections/'`. Across the VMs there were 76 POST requests recorded with a 200 response. The other 24 requests sent by my browser were never recorded in the Apache request log on any of the VMs, and those 24 missing requests correspond to the requests that returned a 503.
I guess there could be conditions where a request makes it to the VM, but something goes wrong and it's never recorded in the request log. But I'm not sure how to determine that.
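For reference, the browser side of that test can be approximated with a small script like the one below; the /selections/:id path shape and the helper name are placeholders matching the description above, not real SearchWorks record IDs:

```typescript
// Hypothetical reproduction of the test described above: fire all the
// POSTs at once and tally how many of each status code come back.
async function tallySelectAll(ids: string[]): Promise<Map<number, number>> {
  const statuses = await Promise.all(
    ids.map((id) =>
      fetch(`/selections/${id}`, { method: "POST" }).then((r) => r.status)
    )
  );
  const counts = new Map<number, number>();
  for (const status of statuses) {
    counts.set(status, (counts.get(status) ?? 0) + 1);
  }
  return counts; // e.g. 200 => 76, 503 => 24 in the run described above
}
```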
Were the requests that got through spread evenly amongst the 5 VMs? The LB is set to balance via 'least connections', so I'd expect the 76 that got through to be on at least 2, preferably 4 or 5 VMs.
The 503s were definitely coming from the LB:

```
HTTP/1.0 503 Service Unavailable
Retry-After: 2
Server: BigIP
Connection: Keep-Alive
Content-Length: 0
```
Anyhow, I tried this out with a different search; select all worked fine with 20 and 50 items, but gave a 500 Internal Server Error (one that actually made it to the backend servers) with 100 items. That, plus the Retry-After response header from the BigIP, suggests you're hitting the load balancer's anti-DDOS protection: too many simultaneous connections from a single client. 20 is fine, 50 is probably fine, 100 is not. I don't know why my test search gets through to the app servers (where it fails, because of load) but the OP's does not.
( my search was https://searchworks.stanford.edu/?per_page=100&q=nelson&search_field=search )
Retries of the 503s worked fine, so that's more fuel for the "too many requests too fast" theory. I'd suggest trying to batch those requests client-side so you're not generating 100 simultaneous requests to the same URL.
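One possible shape for that batching, as a sketch only: the batch size and retry cap are illustrative guesses, and the Retry-After handling just mirrors the BigIP header shown above:

```typescript
// Sketch of client-side batching with a 503 retry; BATCH_SIZE and
// maxRetries are assumptions, not values tested against this LB.
const BATCH_SIZE = 20; // 20 simultaneous requests was fine in the tests above

async function postWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, { method: "POST" });
    if (res.status !== 503 || attempt >= maxRetries) return res;
    // The BigIP 503s carried "Retry-After: 2"; honor it when present.
    const retryAfter = Number(res.headers.get("Retry-After") ?? "2");
    await new Promise((r) => setTimeout(r, retryAfter * 1000));
  }
}

async function selectAllInBatches(ids: string[]): Promise<Response[]> {
  const results: Response[] = [];
  for (let i = 0; i < ids.length; i += BATCH_SIZE) {
    const batch = ids.slice(i, i + BATCH_SIZE);
    const responses = await Promise.all(
      batch.map((id) => postWithRetry(`/selections/${id}`))
    );
    results.push(...responses);
  }
  return results;
}
```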
EDIT: subsequent trial of my 'nelson' test case gave 503s, identical to OP. I'm thinking that's the LB's anti-DOS deciding that I'm a naughty person after initially allowing my tomfoolery.
Thanks @julianmorley that's helpful and makes complete sense that we're hitting some kind of anti-DOS protection at the load-balancer. And yes, the 76 successful POST requests were nicely distributed across the 5 VMs. With this info we'll look at changes in the app so we're not firing off so many requests for this feature.
We received a report from Charles Fosselman in East Asia that he tried to "Select all" for this search and received an error: https://searchworks.stanford.edu/?f%5Baccess_facet%5D%5B%5D=Online&per_page=100&q=Zhongguo+li+shi&search_field=search
Cory/Chris indicated in Gryphon Core that Access team members will need to look into it.
Noting two details:
Jira issue with original feedback: SW-4254