Started receiving several gateway timeout errors @ 17:04:41 in Sequencescape when connecting to LabWhere:
A RestClient::GatewayTimeout occurred in labwhere_receptions#create:
504 Gateway Timeout
lib/lab_where_client.rb:32:in `post'
These continue until 17:32.
17:22 via slack (edited for formatting):
steve 17:21 Looks like we have a few network issues. Lots of error messages from sequencescape and labwhere about 504 gateway timeouts.
emr 17:22 apparently memory is almost full in both :neutral_face: 1
emr 17:41 memory back to normal :+1: 1
Also, at 17:08, from Sequencescape:
A RestClient::Exceptions::ReadTimeout occurred in labwhere_receptions#create:
Timed out reading data from server
lib/lab_where_client.rb:32:in `post'
-------------------------------
Request:
-------------------------------
* URL : http://***Redacted***/labwhere_receptions
* HTTP Method: POST
* IP address : ***Redacted***
* Parameters : {"utf8"=>"✓", "labwhere_reception"=>{"user_code"=>"***Redacted***", "location_barcode"=>"postmanpat", "barcodes"=>["***Redacted***"]}, "asset_scan"=>"", "commit"=>"Update locations", "controller"=>"labwhere_receptions", "action"=>"create"}
* Timestamp : 2020-07-22 17:08:26 +0100
* Server : ***Redacted***
* Rails root : /var/www/sequencescape/releases/20200716092302
* Process: 20968
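As an aside on how these two errors differ: lab_where_client.rb isn't reproduced here, but assuming it wraps RestClient in the usual way, the two exception types above would surface roughly as in the sketch below. The method name, URL and timeout values are illustrative assumptions, not the real client code.

```ruby
# Illustrative sketch only: how a RestClient-based client distinguishes the
# two failures seen above. Method name, URL and timeout values are assumptions.
require 'rest-client'

def post_to_labwhere(path, payload)
  RestClient::Request.execute(
    method: :post,
    url: "http://labwhere.example/#{path}", # placeholder host
    payload: payload,
    open_timeout: 5,  # seconds allowed to establish the TCP connection
    read_timeout: 60  # seconds allowed to wait for a response
  )
rescue RestClient::Exceptions::ReadTimeout
  # No response arrived within read_timeout: the 17:08 "Timed out reading
  # data from server" exception.
  raise
rescue RestClient::GatewayTimeout
  # An intermediate proxy gave up and returned a 504 before LabWhere answered:
  # the 17:04 exceptions.
  raise
end
```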
Servers were restarted by @emrojo at 17:39.
Monit resource limit alert @ Jul 22 2020 17:04:28: loadavg(1min) of 6.9 matches resource limit [loadavg(1min) > 4.0]
This was followed by M/Monit not hearing from the LabWhere host.
The first exception I'm seeing from LabWhere is @ 17:06 (the actual email arrives at 17:33, although we get similar exceptions at 17:14):
A Mysql2::Error::ConnectionError occurred in labwares#show:
Unknown MySQL server host '***REDACTED***' (11)
app/models/labware.rb:29:in `find_by_code'
Host is as expected.
These errors continue until about 17:22
Timing-wise, it looks like the network errors occur after we log the excess load on the server, so chances are we're looking at a symptom rather than a cause.
Memory usage of both the LabWhere and Sequencescape servers was high at the time of the incident, although only LabWhere showed the CPU spike. In both cases memory usage had been climbing gradually over time: LabWhere shows a sawtooth pattern, and Sequencescape does shelve up a bit on the 20th, but we'll need more internal memory information to understand whether this represents a leak.
Looking at the kibana logs around the time the issues started:
July 22nd 2020, 17:02:19.702 | Jul 22 17:02:12 labw-prod systemd[1]: Starting Daily apt download activities...
That would explain the CPU spike. It might also explain the network disruption directly; after all, a load average of 6.9 isn't exactly at a level where you'd expect network disruption.
More related logs:
July 22nd 2020, 17:02:20.253 | Jul 22 17:02:13 labw-prod systemd-resolved[2523]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
July 22nd 2020, 17:03:16.767 | Jul 22 17:02:13 labw-prod systemd-resolved[2523]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Similar issues with Monit connecting to M/Monit were logged moments later:
Jul 22 17:05:30 labw-prod monit[964]: Cannot translate '***REDACTED***' to IP address -- Temporary failure in name resolution
cat /var/log/apt/history.log
And pulling out the relevant range suggests we are just downloading the packages at this point, not installing them. But I'll look into this further; it might be that they are being logged elsewhere.
Start-Date: 2020-07-22 06:53:17
Commandline: /usr/bin/unattended-upgrade
Upgrade: linux-libc-dev:amd64 (4.15.0-111.112, 4.15.0-112.113)
End-Date: 2020-07-22 06:53:18
Start-Date: 2020-07-22 06:53:20
Commandline: /usr/bin/unattended-upgrade
Install: linux-headers-4.15.0-112-generic:amd64 (4.15.0-112.113, automatic), linux-modules-4.15.0-112-generic:amd64 (4.15.0-112.113, automatic), linux-image-4.15.0-112-generic:amd64 (4.15.0-112.113, automatic), linux-headers-4.15.0-112:amd64 (4.15.0-112.113, automatic)
Upgrade: linux-headers-generic:amd64 (4.15.0.111.99, 4.15.0.112.100), linux-virtual:amd64 (4.15.0.111.99, 4.15.0.112.100), linux-headers-virtual:amd64 (4.15.0.111.99, 4.15.0.112.100), linux-image-virtual:amd64 (4.15.0.111.99, 4.15.0.112.100)
End-Date: 2020-07-22 06:53:35
Start-Date: 2020-07-23 06:19:16
Commandline: /usr/bin/unattended-upgrade
Remove: linux-image-4.15.0-109-generic:amd64 (4.15.0-109.110), linux-modules-4.15.0-109-generic:amd64 (4.15.0-109.110)
End-Date: 2020-07-23 06:19:18
Main points of focus:
1) What is triggering the daily apt download activities? Are they just downloads, and if not, where are the installed packages logged?
2) Is "Server returned error NXDOMAIN" a daily occurrence?
3) What exactly is "Server returned error NXDOMAIN"? Is the reduced feature level UDP a system-wide change, and is it the cause of our troubles?
"Starting Daily apt download activities" appears to happen about twice a day, but the timing is not consistent. Meanwhile, "Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001" doesn't always seem to occur, but happens pretty regularly. There was a big spike of it on the 22nd of July at the time of the incident, though.
So this implies that the alert can be triggered by any unresolved DNS issue (don't use the accepted solution there; the discussion reveals that it's wrong). So it's a symptom, not a cause...
Here's the actual issue that resulted in the addition of the alert: https://github.com/systemd/systemd/pull/8608
So it's looking like the "Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001" errors themselves are a bit misleading, but the spike in them alongside the issue is still revealing. So, next steps:
And then I'm even more out of my depth than I already am.
On the memory usage of LabWhere itself, Monit is only showing us memory allocated to the interpreter itself (Puma is the process with high memory usage), which doesn't necessarily indicate an internal memory leak. (There's quite a nice summary of the way Ruby allocates memory, and why it can be so greedy, here: https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html#ruby-memory-allocation-101)
cat /var/log/unattended-upgrades/unattended-upgrades.log (filtered to relevant timestamps)
2020-07-22 17:33:48,909 INFO Initial blacklisted packages:
2020-07-22 17:33:48,910 INFO Initial whitelisted packages:
2020-07-22 17:33:48,910 INFO Starting unattended upgrades script
2020-07-22 17:33:48,911 INFO Allowed origins are: o=Ubuntu,a=bionic, o=Ubuntu,a=bionic-security, o=UbuntuESM,a=bionic
It is interesting that this coincides with recovery.
Ugh, but it does look like this should just be doing an apt-get update and a download, not actually updating packages.
Hmm, found others who saw similar behaviour, but no solutions: https://www.reddit.com/r/linuxadmin/comments/a0jev4/where_can_i_see_what_aptdailyservice_actually_did/
Bit of a graph of memory usage.
Above you can see the point at which memory usage dropped at restart, and the logarithmic climb following restart.
Bit more focused around the restart.
This all smells more like fragmentation than a leak. (More reading; I've been doing a lot of reading today: https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html) We could always try setting MALLOC_ARENA_MAX=2 if we want long-term improvements, or even look at using jemalloc when building Ruby: https://bubblin.io/blog/jemalloc
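One way to test that hypothesis would be to compare OS-level RSS with Ruby's own heap accounting from inside the Puma process. This is only a rough sketch (standard MRI GC.stat keys, Linux-specific /proc path), not something currently wired into the app:

```ruby
# Rough diagnostic sketch: compare OS-level RSS with Ruby's internal heap
# accounting. A large and growing gap points at malloc fragmentation / arena
# growth rather than a Ruby-level leak, where live slots would climb too.
rss_kb = File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i
stats  = GC.stat

puts "RSS:                  #{rss_kb / 1024} MB"
puts "Heap pages allocated: #{stats[:heap_allocated_pages]}"
puts "Live slots:           #{stats[:heap_live_slots]}"
puts "Free slots:           #{stats[:heap_free_slots]}"
```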
Should probably add that most sources I've seen suggest that the memory usage improvements of jemalloc are seen with version 3, not 4 or 5. Version 3 is, at time of writing, still maintained.
Adding GC.stat data to a health endpoint would also help provide a bit of insight into the internal operation of the interpreter; this is pretty simple to achieve.
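A minimal sketch of what that could look like; the controller and route names here are hypothetical, not existing LabWhere code:

```ruby
# Hypothetical health controller exposing a subset of GC.stat as JSON.
# Route (also hypothetical): get '/health', to: 'health#show'
class HealthController < ApplicationController
  def show
    render json: {
      status: 'ok',
      gc: GC.stat.slice(
        :count,                  # total GC runs since boot
        :heap_allocated_pages,   # pages malloc'd for the object heap
        :heap_live_slots,        # slots currently holding live objects
        :heap_free_slots,        # allocated but unused slots
        :total_allocated_objects # cumulative allocations since boot
      )
    }
  end
end
```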
Actually, I might pull the health endpoint out into a small gem, to make it easy to drop into any of our apps.
Follow-on issues to potentially reduce overall memory, or enable further investigation of future memory usage: https://github.com/sanger/labwhere/issues/175
There's also one on psd-deployment on GitLab, which I'll link as soon as I'm back on the VPN.
Description
Had memory issues at around 17:30 on the 22nd of July. Discussed at stand-up on the 23rd; agreed to look at establishing the root cause. The first hypothesis is an issue with DB connections.
Who the primary contacts are for this work
Eduardo, Steve, James
Knowledge or Stakeholders
Additional context or information
LabWhere has become a key cog in LH picking; it is worth investing time in understanding what the root cause is.