sanger / labwhere

A tool for tracking uniquely barcoded labware
MIT License

GPL-583 Why is Labwhere getting high memory use #166

Closed rl15 closed 4 years ago

rl15 commented 4 years ago

Description: Had memory issues at ~17:30 on the 22nd of July. Discussed at stand-up on the 23rd; agreed to look at establishing the root cause. First hypothesis is an issue with DB connections.

Who the primary contacts are for this work: Eduardo, Steve, James

Knowledge or stakeholders:

Additional context or information: LabWhere has become a key cog in LH picking, so it is worth investing time in understanding what the root cause is.

JamesGlover commented 4 years ago

Started receiving several Gateway Timeout errors @ 17:04:41 in Sequencescape connecting to LabWhere:

A RestClient::GatewayTimeout occurred in labwhere_receptions#create:

 504 Gateway Timeout
 lib/lab_where_client.rb:32:in `post'

These continued until 17:32.

JamesGlover commented 4 years ago

17:22 via slack (edited for formatting):

steve 17:21 Looks like we have a few network issues. Lots of error messages from sequencescape and labwhere about 504 gateway timeouts.

emr 17:22 apparently memory is almost full in both :neutral_face: 1

emr 17:41 memory back to normal :+1: 1

JamesGlover commented 4 years ago

Also at 17:08, from Sequencescape:

A RestClient::Exceptions::ReadTimeout occurred in labwhere_receptions#create:

 Timed out reading data from server
 lib/lab_where_client.rb:32:in `post'

-------------------------------
Request:
-------------------------------

 * URL        : http://***Redacted***/labwhere_receptions
 * HTTP Method: POST
 * IP address : ***Redacted***
 * Parameters : {"utf8"=>"✓", "labwhere_reception"=>{"user_code"=>"***Redacted***", "location_barcode"=>"postmanpat", "barcodes"=>["***Redacted***"]}, "asset_scan"=>"", "commit"=>"Update locations", "controller"=>"labwhere_receptions", "action"=>"create"}
 * Timestamp  : 2020-07-22 17:08:26 +0100
 * Server : ***Redacted***
   * Rails root : /var/www/sequencescape/releases/20200716092302
 * Process: 20968
JamesGlover commented 4 years ago

Servers were restarted by @emrojo at 17:39.

JamesGlover commented 4 years ago

Monit resource limit alert @Jul 22 2020 17:04:28 loadavg(1min) of 6.9 matches resource limit [loadavg(1min) > 4.0]

JamesGlover commented 4 years ago

This was followed by mmonit not hearing from the labwhere host.

JamesGlover commented 4 years ago

First exception I'm seeing from LabWhere @ 17:06 (the actual email arrives at 17:33, although we get similar exceptions at 17:14):

A Mysql2::Error::ConnectionError occurred in labwares#show:

 Unknown MySQL server host '***REDACTED***' (11)
 app/models/labware.rb:29:in `find_by_code'

Host is as expected.

These errors continued until about 17:22.

JamesGlover commented 4 years ago

Timing-wise, it's looking like the network errors are occurring after we log the excess load on the server. So chances are we're looking at a symptom rather than the cause.

JamesGlover commented 4 years ago

Memory usage of both the LabWhere and Sequencescape servers was high at the time of the incident, although only LabWhere showed the CPU spike. In both cases memory usage had been climbing gradually over time: LabWhere shows a sawtooth pattern, and Sequencescape does shelve up a bit on the 20th, but we will need more internal memory information to understand whether this represents a leak.

JamesGlover commented 4 years ago

Looking at the kibana logs around the time the issues started:

July 22nd 2020, 17:02:19.702 | Jul 22 17:02:12 labw-prod systemd[1]: Starting Daily apt download activities...

That would explain the CPU spike. It might also explain the network disruption directly; after all, a load average of 6.9 isn't exactly at the level where you'd expect network disruption.

More related logs:


July 22nd 2020, 17:03:16.767 | Jul 22 17:02:13 labw-prod systemd-resolved[2523]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.

July 22nd 2020, 17:02:20.253 | Jul 22 17:02:13 labw-prod systemd-resolved[2523]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
JamesGlover commented 4 years ago

Similar issues with monit connecting to mmonit logged moments later:

Jul 22 17:05:30 labw-prod monit[964]: Cannot translate '***REDACTED***' to IP address -- Temporary failure in name resolution
JamesGlover commented 4 years ago

Running cat /var/log/apt/history.log and pulling out the relevant range suggests we are just downloading the packages at this point, not installing them. But I'll look into this further; it might be that they are being logged elsewhere.

Start-Date: 2020-07-22  06:53:17
Commandline: /usr/bin/unattended-upgrade
Upgrade: linux-libc-dev:amd64 (4.15.0-111.112, 4.15.0-112.113)
End-Date: 2020-07-22  06:53:18

Start-Date: 2020-07-22  06:53:20
Commandline: /usr/bin/unattended-upgrade
Install: linux-headers-4.15.0-112-generic:amd64 (4.15.0-112.113, automatic), linux-modules-4.15.0-112-generic:amd64 (4.15.0-112.113, automatic), linux-image-4.15.0-112-generic:amd64 (4.15.0-112.113, automatic), linux-headers-4.15.0-112:amd64 (4.15.0-112.113, automatic)
Upgrade: linux-headers-generic:amd64 (4.15.0.111.99, 4.15.0.112.100), linux-virtual:amd64 (4.15.0.111.99, 4.15.0.112.100), linux-headers-virtual:amd64 (4.15.0.111.99, 4.15.0.112.100), linux-image-virtual:amd64 (4.15.0.111.99, 4.15.0.112.100)
End-Date: 2020-07-22  06:53:35

Start-Date: 2020-07-23  06:19:16
Commandline: /usr/bin/unattended-upgrade
Remove: linux-image-4.15.0-109-generic:amd64 (4.15.0-109.110), linux-modules-4.15.0-109-generic:amd64 (4.15.0-109.110)
End-Date: 2020-07-23  06:19:18
JamesGlover commented 4 years ago

Main points of focus:
1) What is triggering the daily apt download activities? Are they just downloads, and if not, where are the installed packages logged?
2) Is "Server returned error NXDOMAIN" a daily occurrence?
3) What exactly is "Server returned error NXDOMAIN"? Is the reduced feature level UDP a system-wide change, and the cause of our troubles?

JamesGlover commented 4 years ago

"Starting Daily apt download activities" appears to happen about twice a day, but the timing is not consistent. Meanwhile "Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001" doesn't always occur, but happens pretty regularly. There was a big spike of it on the 22nd of July at the time of the incident, though.

JamesGlover commented 4 years ago

So this implies that the alert can be triggered by any unresolved DNS issue (don't use the accepted solution; the discussion reveals that it's wrong). So a symptom, not a cause...

JamesGlover commented 4 years ago

Here's the actual issue that resulted in the addition of the alert: https://github.com/systemd/systemd/pull/8608

JamesGlover commented 4 years ago

So it's looking like the "Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001" errors themselves are a bit misleading, but the spike in them alongside the issue is still revealing. So, next steps:

And then I'm even more out of my depth than I already am.

JamesGlover commented 4 years ago

On the memory usage of LabWhere itself, monit is only showing us memory allocated to the interpreter (puma is the one with high memory usage), which doesn't necessarily indicate an internal memory leak. (There's quite a nice summary of the way Ruby allocates memory, and why it can be so greedy, here: https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html#ruby-memory-allocation-101)
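
A minimal sketch (not LabWhere code) of how we could compare the resident set size monit sees against the Ruby heap the GC actually tracks. It assumes a 64-bit Linux host (/proc/self/status) and MRI's ~40-byte object slots:

def memory_snapshot
  # RSS as reported by the kernel: roughly what monit/host monitoring sees
  rss_kb = File.read('/proc/self/status')[/VmRSS:\s+(\d+)\s+kB/, 1].to_i
  stats = GC.stat
  {
    rss_mb: (rss_kb / 1024.0).round(1),
    # live object slots * ~40 bytes approximates the heap the interpreter is using
    ruby_heap_mb: (stats[:heap_live_slots] * 40 / 1024.0 / 1024.0).round(1),
    major_gc_count: stats[:major_gc_count]
  }
end

A large gap between rss_mb and ruby_heap_mb would point at memory that malloc is holding on to (bloat or fragmentation) rather than a leak of live Ruby objects.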

JamesGlover commented 4 years ago

cat /var/log/unattended-upgrades/unattended-upgrades.log (filtered to relevant timestamps)

2020-07-22 17:33:48,909 INFO Initial blacklisted packages:
2020-07-22 17:33:48,910 INFO Initial whitelisted packages:
2020-07-22 17:33:48,910 INFO Starting unattended upgrades script
2020-07-22 17:33:48,911 INFO Allowed origins are: o=Ubuntu,a=bionic, o=Ubuntu,a=bionic-security, o=UbuntuESM,a=bionic

It is interesting that this coincides with recovery.

JamesGlover commented 4 years ago

Ugh, but it does look like this should just be doing an apt-get update and a download, not actually updating packages.

JamesGlover commented 4 years ago

Hmm, found others who saw similar behaviour, but no solutions: https://www.reddit.com/r/linuxadmin/comments/a0jev4/where_can_i_see_what_aptdailyservice_actually_did/

JamesGlover commented 4 years ago

Bit of a graph of memory usage (screenshot: 2020-07-28 at 16:31:08).

JamesGlover commented 4 years ago

Above you can see the point at which memory usage dropped at restart, and the logarithmic climb following restart.

JamesGlover commented 4 years ago

Bit more focused around the restart (screenshot: 2020-07-28 at 16:37:37).

JamesGlover commented 4 years ago

This all smells more like fragmentation than a leak. (More reading; I've been doing a lot of reading today: https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html) We could always try setting MALLOC_ARENA_MAX=2 if we want long-term improvements, or even look at using jemalloc when building Ruby (https://bubblin.io/blog/jemalloc). A quick way to check what we're currently running with is sketched below.
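
To be clear about where that tweak would live: MALLOC_ARENA_MAX has to be in the environment before the Ruby process starts (glibc reads it early), so it belongs in the service/deployment configuration rather than in application code. A rough sketch, assuming MRI on glibc, of confirming from inside the app what we are actually running with (the RbConfig keys are where a jemalloc build usually shows up, though this varies by Ruby version):

require 'rbconfig'

# Was the process started with the arena cap, e.g. MALLOC_ARENA_MAX=2 set
# in the service definition or deployment environment?
puts "MALLOC_ARENA_MAX: #{ENV.fetch('MALLOC_ARENA_MAX', 'not set')}"

# Was this Ruby built against jemalloc? A --with-jemalloc build normally
# leaves -ljemalloc in the link flags.
link_flags = [RbConfig::CONFIG['LIBS'], RbConfig::CONFIG['MAINLIBS']].compact.join(' ')
puts(link_flags.include?('jemalloc') ? 'jemalloc build' : 'default glibc malloc')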

JamesGlover commented 4 years ago

Should probably add that most sources I've seen suggest that the memory usage improvements of jemalloc are seen with version 3, not 4 or 5. Version 3 is, at time of writing, still maintained.

JamesGlover commented 4 years ago

Adding GC.stat data to a health endpoint would also help provide a bit of insight into the internal operation of the interpreter; this is pretty simple to achieve (see the sketch below).
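
A rough sketch of what that endpoint could look like (the controller name, route and chosen GC.stat keys are illustrative rather than the actual implementation; Hash#slice needs Ruby 2.5+):

class HealthController < ApplicationController
  # GET /health - a basic liveness check plus a snapshot of GC internals
  def show
    render json: {
      status: 'ok',
      gc: GC.stat.slice(
        :heap_allocated_pages, :heap_live_slots, :heap_free_slots,
        :total_allocated_objects, :major_gc_count, :minor_gc_count
      )
    }
  end
end

# config/routes.rb (illustrative):
#   get '/health', to: 'health#show'

Scraping that periodically would give us the internal trend (live slots, GC counts) to sit alongside the monit RSS graphs.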

JamesGlover commented 4 years ago

Actually, I might pull the health endpoint out into a small gem, to make it easy to drop into any of our apps.

JamesGlover commented 4 years ago

Follow-on issues to potentially reduce overall memory usage, or to enable further investigation of future memory usage: https://github.com/sanger/labwhere/issues/175. And one on psd-deployment on GitLab, which I'll link as soon as I'm back on the VPN.