graybeal opened this issue 3 years ago
Jennifer reports:
These internal server errors appear to correspond with calls to the /users endpoint. It’s unusual (at least for me) to see the /users endpoint in the list of most time-consuming transactions in New Relic.
I downloaded the REST API log from ncboprod-rest1. It’s showing 1,098 occurrences of calls like this one:
172.20.222.51 - ncbobioportal [17/Sep/2020:05:59:23 -0700] "GET /users?include=all HTTP/1.0" 500 21 5.1589
The referrer is shown as the UI’s API key. This is odd because the users page is only visible to accounts with admin privileges (I re-tested this just now).
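To double-check a count like the 1,098 occurrences above, the log lines can be filtered programmatically. This is a hedged sketch: `count_failed_users_calls` is a hypothetical helper, and the field layout is assumed from the excerpt above, not from project tooling.

```ruby
# Hedged sketch: count 500-status calls to the /users endpoint in an
# Apache-style access log, assuming the field layout shown in the excerpt
# above ("GET /users?include=all HTTP/1.0" 500 21 5.1589).
def count_failed_users_calls(lines)
  lines.count do |line|
    parts   = line.split('"')       # [prefix, request line, status/size/time]
    request = parts[1].to_s
    status  = parts[2].to_s.split.first
    request.start_with?("GET /users") && status == "500"
  end
end

# Usage: count_failed_users_calls(File.foreach("/path/to/rest/access.log"))
```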
In discussion today, we realized the following:
Notes from previous troubleshooting yesterday:
13:53 John Graybeal demonstrating my ability to blunder about and break things, BioPortal isn't responding to me any more. Getting 500 errors
13:56 @channel anyone else having better luck accessing bioportal
13:57 Jennifer Vendetti:house_with_garden: No, it’s unresponsive for me
13:57 4store shows 0 running queries
13:59 All I see for the site is “This website is under heavy load (queue full)”
14:00 John Graybeal can you hit it or shall I?
14:00 Alex Skrenchuk works now
14:01 John Graybeal :+1::tada:
14:01 Alex Skrenchuk did anyone do anything for it?
14:02 Jennifer Vendetti:house_with_garden: Not me
14:03 John Graybeal Not me
14:13 Jennifer Vendetti:house_with_garden: BP is down for me again, and showing the “This website is under heavy load (queue full)” message
14:13 John Graybeal it was for me too. API working OK
14:14 was right after I tried editing the administrators/viewers of the ontology. probably a coincidence
14:15 retrying now that it seems to be back
14:16 hung it up again
14:20 Both times I've gotten this NewRelic alert (the one I set up): https://alerts.newrelic.com/accounts/1033463/incidents/162748641
14:24 ok, does anyone have the ball here? the radio silence makes me worry
14:24 Jennifer Vendetti:house_with_garden: I’m investigating, but I don’t see any obvious cause of this behavior
14:25 John Graybeal ah thanks for looking into it though
14:46 Jennifer Vendetti:house_with_garden: The UI is performant again, though I’m not certain what is causing the intermittent slowness and down time. I’ve tailed the REST logs and looked at New Relic. The only suspicious thing I see is that there are a large number of Faraday:TimeoutError messages reported in New Relic for the UI during the slowness. For example, there are 478 of these errors in the last 60 minutes or so. The timeouts are reported when trying to access data for ontologies from the REST API. They are not specific to any one ontology.
14:47 Here is a better formatted example of one of these error stack traces:
14:47 Faraday::TimeoutError: Problem retrieving:
http://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F763208000?include=prefLabel%2Cdefinition%2Csynonym%2Cobsolete%2Cproperties%2ChasChildren%2Cchildren
Error: connect_write timeout reached
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:285:in `select_with_timeout'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:140:in `rescue in block in connect'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:121:in `block in connect'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:111:in `each'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:111:in `connect'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/socket.rb:47:in `initialize'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:467:in `new'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:467:in `socket'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:118:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/newrelic_rpm-6.12.0.367/lib/new_relic/agent/instrumentation/excon/middleware.rb:35:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/mock.rb:57:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/instrumentor.rb:34:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/idempotent.rb:19:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/base.rb:22:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/middlewares/base.rb:22:in `request_call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/excon-0.76.0/lib/excon/connection.rb:277:in `request'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/adapter/excon.rb:31:in `block in call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/adapter.rb:60:in `connection'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/adapter/excon.rb:31:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/request/url_encoded.rb:25:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/request/multipart.rb:25:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-object-cache.rb:63:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-last-updated.rb:16:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-slices.rb:16:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/middleware/faraday-user-apikey.rb:16:in `call'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/rack_builder.rb:153:in `build_response'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/connection.rb:492:in `run_request'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/gems/faraday-1.0.1/lib/faraday/connection.rb:198:in `get'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ontologies_api_ruby_client-310d3fa49017/lib/ontologies_api_client/http.rb:63:in `get'
/srv/rails/bioportal_web_ui/shared/bundle/ruby/2.6.0/bundler/gems/ont
14:48 Alex Skrenchuk it would be very helpful if api would tell which particular backend is timing out. i.e is it 4store or redis or purl or solr (edited)
14:50 4store is most likely the culprit in this case but still... it would be nice to have that feedback
14:50 Jennifer Vendetti:house_with_garden: I’ve been monitoring 4store during these issues, and I’ve never seen it have any outstanding queries in the queue. That doesn’t mean it isn’t the issue though. Just an observation.
14:51 Alex Skrenchuk yeah... thats why it would be nice to have a definitive answer with those faraday timeouts
14:53 also, what is the default faraday timeout we have in place?
14:55 John Graybeal also it's still responding to API requests
14:56 Jennifer Vendetti:house_with_garden: I don’t know where we specify default timeouts for Faraday
15:05 Alex Skrenchuk https://github.com/ncbo/ontologies_api_ruby_client/blob/master/lib/ontologies_api_client/http.rb#L66
15:08 hmm... so it sounds like this error is not 4store, but the ontologies_api_client timing out when it’s accessing data.bioontology.org?
15:09 i mean someone needs to look at the ontologies_api and see what’s causing that timeout; which is probably 4store anyway (edited)
15:11 Jennifer Vendetti:house_with_garden: I don’t know what the underlying cause of this issue is. The problem seemed to happen between 2:12 and 2:31pm, as seen in New Relic.
15:21 Alex Skrenchuk "after 58K" requests?
15:22 Jennifer Vendetti:house_with_garden: I’ve seen these errors in New Relic for the REST API. There have been 7 of them in the last hour, and they don’t appear to be specific to any one type of REST call.
15:24 The referring app is listed as BioPortal UI.
15:43 Jennifer Vendetti:house_with_garden: I looked at one of these Net::HTTP::Persistent errors in particular (in New Relic for the REST API). I chose the one where the request URL is /ontologies/COVID-19/metrics. The message says “… too many connection resets (due to end of file reached - EOFError) after 43507 requests …“. I don’t understand where the 43.5K requests comes from. I tried to correlate this by looking at the REST logs on ncboprod-rest1, but I only see one instance of a call to that endpoint:
171.67.213.112 - cedar-public [16/Sep/2020:14:05:49 -0700] "GET /ontologies/COVID-19/metrics HTTP/1.0" 500 30 0.2737
… that took place at 14:05:49, just before the Net::HTTP::Persistent error was logged into New Relic at 14:06.
15:53 John Graybeal I wonder if this is happening at the networking layer of the stack, not at the application layer. The connection reset scenario might go like this: the request for metrics comes in; BP attempts to reply, but for some reason an EOF is prematurely sent by the sending process, so the original requesting port tries to reset the connection (likely succeeds) and BP's networking port then tries to resend the content, but the EOF is still there. The reset/resend loop does not result in any more requests to BioPortal, just network reset on one side and network resend on the other.
16:00 Jennifer Vendetti:house_with_garden: I also looked at the Rails log files on ncboprod-ui1, and only saw 2 requests accessing the COVID-19 ontology. There also doesn’t appear to be any issue with the /metrics endpoint for the COVID-19 ontology: http://data.bioontology.org/ontologies/COVID-19/metrics.
16:38 Alex Skrenchuk data flow: UI -> (LoadBalancer) -> API -> backend service (4store/redis/solr/etc). UI -> API: the faraday timeout error most likely means the API threw a 500 error, didn't respond in time, exceeded the retry count, or something else. This error is probably not the root cause of the issue (aside from perhaps the UI's inability to handle this condition more gracefully). API -> backend (persistence/caching) layer: the Net::HTTP::Persistent error is probably the root cause. I don't know the inner workings of Ruby's Net::HTTP::Persistent, i.e. no idea how request retries work, so I can't comment on John's comment. (edited)
16:41 John Graybeal that's helpful Alex. I should say, my comment is strictly abstract analysis plus experience-based hunch; it's based on no local knowledge.
New Relic internal server error this morning from 6 to 10:30 AM. (This was the same issue we had yesterday afternoon.) https://alerts.newrelic.com/accounts/1033463/incidents/162978050
Plot looked like this: https://gorgon.nr-assets.net/image/7a23978f-784f-4f32-801d-898a6293a551?height=280&width=540&config.legend.enabled=false