graybeal opened this issue 3 years ago
Jennifer reports:
These internal server errors appear to correspond with calls to the /users endpoint. It’s unusual (at least for me) to see the /users endpoint in the list of most time-consuming transactions in New Relic.
I downloaded the REST API log from ncboprod-rest1. It’s showing 1,098 occurrences of calls like this one:
172.20.222.51 - ncbobioportal [17/Sep/2020:05:59:23 -0700] "GET /users?include=all HTTP/1.0" 500 21 5.1589
The referrer is shown as the UI’s API key. This is odd because the users page is only visible to accounts with admin privileges (I re-tested this just now).
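To double-check a count like the 1,098 occurrences above, the log lines can be filtered programmatically. This is a hedged sketch: `count_failed_users_calls` is a hypothetical helper, and the field layout is assumed from the excerpt above, not from project tooling.

```ruby
# Hedged sketch: count 500-status calls to the /users endpoint in an
# Apache-style access log, assuming the field layout shown in the excerpt
# above ("GET /users?include=all HTTP/1.0" 500 21 5.1589).
def count_failed_users_calls(lines)
  lines.count do |line|
    parts   = line.split('"')       # [prefix, request line, status/size/time]
    request = parts[1].to_s
    status  = parts[2].to_s.split.first
    request.start_with?("GET /users") && status == "500"
  end
end

# Usage: count_failed_users_calls(File.foreach("/path/to/rest/access.log"))
```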
In discussion today, we realized the following:
Notes from previous troubleshooting yesterday:
13:53 John Graybeal demonstrating my ability to blunder about and break things, BioPortal isn't responding to me any more. Getting 500 errors
13:56 @channel anyone else having better luck accessing bioportal
13:57 Jennifer Vendetti:house_with_garden: No, it’s unresponsive for me
13:57 4store shows 0 running queries
13:59 All I see for the site is “This website is under heavy load (queue full)”
14:00 John Graybeal can you hit it or shall I?
14:00 Alex Skrenchuk works now
14:01 John Graybeal :+1::tada:
14:01 Alex Skrenchuk did anyone do anything for it?
14:02 Jennifer Vendetti:house_with_garden: Not me
14:03 John Graybeal Not me
14:13 Jennifer Vendetti:house_with_garden: BP is down for me again, and showing the “This website is under heavy load (queue full)” message
14:13 John Graybeal it was for me too. API working OK
14:14 was right after I tried editing the administrators/viewers of the ontology. probably a coincidence
14:15 retrying now that it seems to be back
14:16 hung it up again
14:20 Both times I've gotten this NewRelic alert (the one I set up): https://alerts.newrelic.com/accounts/1033463/incidents/162748641
14:24 ok, does anyone have the ball here? the radio silence makes me worry
14:24 Jennifer Vendetti:house_with_garden: I’m investigating, but I don’t see any obvious cause of this behavior
14:25 John Graybeal ah thanks for looking into it though
14:46 Jennifer Vendetti:house_with_garden: The UI is performant again, though I’m not certain what is causing the intermittent slowness and down time. I’ve tailed the REST logs and looked at New Relic. The only suspicious thing I see is that there are a large number of Faraday:TimeoutError messages reported in New Relic for the UI during the slowness. For example, there are 478 of these errors in the last 60 minutes or so. The timeouts are reported when trying to access data for ontologies from the REST API. They are not specific to any one ontology.
14:47 Here is a better formatted example of one of these error stack traces:
14:47 Faraday::TimeoutError: Problem retrieving:
http://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F763208000?include=prefLabel%2Cdefinition%2Csynonym%2Cobsolete%2Cproperties%2ChasChildren%2Cchildren
Error: connect_write timeout reached
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:285:in `select_with_timeout'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:140:in `rescue in block in connect'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:121:in `block in connect'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:111:in `each'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:111:in `connect'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:47:in `initialize'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:467:in `new'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:467:in `socket'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:118:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/newrelic_rpm-6.12.0.367/lib/new_relic/agent/instrumentation/excon/middleware.rb:35:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/mock.rb:57:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/instrumentor.rb:34:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/idempotent.rb:19:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/base.rb:22:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/base.rb:22:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:277:in `request'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/adapter/excon.rb:31:in `block in call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/adapter.rb:60:in `connection'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/adapter/excon.rb:31:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/request/url_encoded.rb:25:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/request/multipart.rb:25:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-object-cache.rb:63:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-last-updated.rb:16:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-slices.rb:16:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-user-apikey.rb:16:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/rack_builder.rb:153:in `build_response'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/connection.rb:492:in `run_request'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/connection.rb:198:in `get'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/http.rb:63:in `get'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ont
14:48 Alex Skrenchuk it would be very helpful if api would tell which particular backend is timing out. i.e is it 4store or redis or purl or solr (edited)
14:50 4store is most likely the culprit in this case but still... it would be nice to have that feedback
14:50 Jennifer Vendetti:house_with_garden: I’ve been monitoring 4store during these issues, and I’ve never seen it have any outstanding queries in the queue. That doesn’t mean it isn’t the issue though. Just an observation.
14:51 Alex Skrenchuk yeah... thats why it would be nice to have a definitive answer with those faraday timeouts
14:53 also, what is the default faraday timeout we have in place?
14:55 John Graybeal also it's still responding to API requests
14:56 Jennifer Vendetti:house_with_garden: I don’t know where we specify default timeouts for Faraday
15:05 Alex Skrenchuk https://github.com/ncbo/ontologies_api_ruby_client/blob/master/lib/ontologies_api_client/http.rb#L66
15:08 hmm... so it sounds like this error is not 4store, but the ontologies_api_client timing out when it’s accessing data.bioontology.org?
15:09 i mean someone needs to look at the ontologies_api and see what’s causing that timeout; which is probably 4store anyway (edited)
15:11 Jennifer Vendetti:house_with_garden: I don’t know what the underlying cause of this issue is. The problem seemed to happen between 2:12 and 2:31pm, as seen in New Relic.
15:21 Alex Skrenchuk "after 58K" requests?
15:22 Jennifer Vendetti:house_with_garden: I’ve seen these errors in New Relic for the REST API. There have been 7 of them in the last hour, and they don’t appear to be specific to any one type of REST call.
15:24 The referring app is listed as BioPortal UI.
15:43 Jennifer Vendetti:house_with_garden: I looked at one of these Net::HTTP::Persistent errors in particular (in New Relic for the REST API). I chose the one where the request URL is /ontologies/COVID-19/metrics. The message says “… too many connection resets (due to end of file reached - EOFError) after 43507 requests …“. I don’t understand where the 43.5K requests comes from. I tried to correlate this by looking at the REST logs on ncboprod-rest1, but I only see one instance of a call to that endpoint:
171.67.213.112 - cedar-public [16/Sep/2020:14:05:49 -0700] "GET /ontologies/COVID-19/metrics HTTP/1.0" 500 30 0.2737
… that took place at 14:05:49, just before the Net::HTTP::Persistent error was logged into New Relic at 14:06.
15:53 John Graybeal I wonder if this is happening at the networking layer of the stack, not at the application layer. The connection reset scenario might go like this: the request for metrics comes in; BP attempts to reply, but for some reason an EOF is prematurely sent by the sending process, so the original requesting port tries to reset the connection (likely succeeds) and BP's networking port then tries to resend the content, but the EOF is still there. The reset/resend loop does not result in any more requests to BioPortal, just network reset on one side and network resend on the other.
16:00 Jennifer Vendetti:house_with_garden: I also looked at the Rails log files on ncboprod-ui1, and only saw 2 requests accessing the COVID-19 ontology. There also doesn’t appear to be any issue with the /metrics endpoint for the COVID-19 ontology: http://data.bioontology.org/ontologies/COVID-19/metrics.
16:38 Alex Skrenchuk data flow: UI -> (LoadBalancer) -> API -> backend service (4store/redis/solr/etc). UI -> API: the faraday timeout error most likely means the API threw a 500 error, didn't respond in time, exceeded the retry count, or something else. This error is probably not the root cause of the issue (aside from perhaps the UI's inability to handle this condition more gracefully). API -> backend (persistence/caching) layer: the Net::HTTP::Persistent error is probably the root cause. I don't know the inner workings of Ruby's Net::HTTP::Persistent, i.e. no idea how request retries work, so I can't comment on John's comment. (edited)
16:41 John Graybeal that's helpful Alex. I should say, my comment is strictly abstract analysis plus experience-based hunch; it's based on no local knowledge.
New Relic internal server error this morning from 6 to 10:30 AM. (This was the same issue we had yesterday afternoon.) https://alerts.newrelic.com/accounts/1033463/incidents/162978050
Plot looked like this: https://gorgon.nr-assets.net/image/7a23978f-784f-4f32-801d-898a6293a551?height=280&width=540&config.legend.enabled=false