openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

[CRITICAL] all scrapers failing with error 255 #1169

Closed mlandauer closed 6 years ago

mlandauer commented 6 years ago

This is a problem that has surfaced recently. It started happening on January 5th 2018 and recurs every few days after the workaround fix (described below) is applied.

The symptoms are that every scraper run fails with error 255 and the dockerd process gets stuck using around 200% CPU.

This is an entirely new kind of bug that we've never seen before. Very little has changed in the morph.io codebase recently, so at first guess the code is unlikely to be the source of the problem. What's more likely is that something in the server setup has changed recently (the docker version or other supporting software). So, to get some clues we'll need to go through the logs and do some more digging.

Workaround

In the short term we can work around this problem by restarting the docker daemon with

service docker restart

You'll need to do some cleanup afterwards by running:

sudo su - deploy
cd /var/www/current
bundle exec rake app:emergency:show_queue_run_inconsistencies RAILS_ENV=production

You'll get some output that might look something like:

Getting information from sidekiq queue...
Getting information about current containers...
The following runs do not have jobs on the queue:
[]
Unfinished runs attached to scrapers that do not have jobs on the queue:
[1262377, 1265737]

Any run ids mentioned in the output will need to be re-queued:

bundle exec rails c production
2.3.1 :001 > [1262377, 1265737].each{|id| RunWorker.perform_async(id)}

If everything worked well you should see no stray runs when you run the rake task again:

2.3.1 :002 > quit
deploy@morph:/var/www/current$ bundle exec rake app:emergency:show_queue_run_inconsistencies RAILS_ENV=production
Getting information from sidekiq queue...
Getting information about current containers...
The following runs do not have jobs on the queue:
[]
Unfinished runs attached to scrapers that do not have jobs on the queue:
[]
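
For anyone following along, the consistency check conceptually boils down to comparing the run ids that have RunWorker jobs sitting on the sidekiq queue with the runs the database still thinks are in progress. A rough sketch only (not the actual rake task; the queue, model and column names here are illustrative):

require 'sidekiq/api'

# Run ids that currently have a RunWorker job waiting on the queue.
queued_run_ids = Sidekiq::Queue.new
                               .select { |job| job.klass == 'RunWorker' }
                               .map { |job| job.args.first }

# Runs the database still considers unfinished.
unfinished_run_ids = Run.where(finished_at: nil).pluck(:id)

# Unfinished runs with no corresponding job are the stray ones that need
# to be re-queued with RunWorker.perform_async(id).
stray_run_ids = unfinished_run_ids - queued_run_ids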
mlandauer commented 6 years ago

Here are the most recent package upgrades:

# cat /var/log/apt/history.log

Start-Date: 2018-01-02  06:08:54
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --autoremove
Upgrade: ca-certificates-java:amd64 (20160321, 20160321ubuntu1), linux-libc-dev:amd64 (4.4.0-103.126, 4.4.0-104.127), openssl:amd64 (1.0.2g-1ubuntu4.9, 1.0.2g-1ubuntu4.10), sysstat:amd64 (11.2.0-1ubuntu0.1, 11.2.0-1ubuntu0.2), initramfs-tools-bin:amd64 (0.122ubuntu8.9, 0.122ubuntu8.10), distro-info-data:amd64 (0.28ubuntu0.5, 0.28ubuntu0.6), libssl-dev:amd64 (1.0.2g-1ubuntu4.9, 1.0.2g-1ubuntu4.10), libssl-doc:amd64 (1.0.2g-1ubuntu4.9, 1.0.2g-1ubuntu4.10), libxml2:amd64 (2.9.3+dfsg1-1ubuntu0.4, 2.9.3+dfsg1-1ubuntu0.5), initramfs-tools-core:amd64 (0.122ubuntu8.9, 0.122ubuntu8.10), initramfs-tools:amd64 (0.122ubuntu8.9, 0.122ubuntu8.10), docker-ce:amd64 (17.09.1~ce-0~ubuntu, 17.12.0~ce-0~ubuntu), libssl1.0.0:amd64 (1.0.2g-1ubuntu4.9, 1.0.2g-1ubuntu4.10)
End-Date: 2018-01-02  06:10:28

Start-Date: 2018-01-03  18:36:38
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --autoremove
Upgrade: libseccomp2:amd64 (2.2.3-3ubuntu3, 2.3.1-2.1ubuntu2~16.04.1), grub-common:amd64 (2.02~beta2-36ubuntu3.14, 2.02~beta2-36ubuntu3.15), grub2-common:amd64 (2.02~beta2-36ubuntu3.14, 2.02~beta2-36ubuntu3.15), grub-pc:amd64 (2.02~beta2-36ubuntu3.14, 2.02~beta2-36ubuntu3.15), grub-pc-bin:amd64 (2.02~beta2-36ubuntu3.14, 2.02~beta2-36ubuntu3.15)
End-Date: 2018-01-03  18:36:46
mlandauer commented 6 years ago

The dates of those package upgrades vaguely fit the timeline. So, let's make sure all the packages are upgraded again. This gives us:

Start-Date: 2018-01-19  14:19:30
Install: libllvm5.0:amd64 (1:5.0-3~16.04.1, automatic), libdrm-common:amd64 (2.4.83-1~16.04.1, automatic)
Upgrade: libdns-export162:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), python2.7-dev:amd64 (2.7.12-1ubuntu0~16.04.2, 2.7.12-1ubuntu0~16.04.3), libisccfg140:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), libdrm-nouveau2:amd64 (2.4.76-1~ubuntu16.04.1, 2.4.83-1~16.04.1), linux-libc-dev:amd64 (4.4.0-104.127, 4.4.0-109.132), ruby2.3:amd64 (2.3.1-2~16.04.2, 2.3.1-2~16.04.5), libgdk-pixbuf2.0-0:amd64 (2.32.2-1ubuntu1.3, 2.32.2-1ubuntu1.4), libc6-dev:amd64 (2.23-0ubuntu9, 2.23-0ubuntu10), libglapi-mesa:amd64 (17.0.7-0ubuntu0.16.04.2, 17.2.4-0ubuntu1~16.04.4), python2.7-minimal:amd64 (2.7.12-1ubuntu0~16.04.2, 2.7.12-1ubuntu0~16.04.3), binutils:amd64 (2.26.1-1ubuntu1~16.04.5, 2.26.1-1ubuntu1~16.04.6), bind9-host:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), dnsutils:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), libpython2.7:amd64 (2.7.12-1ubuntu0~16.04.2, 2.7.12-1ubuntu0~16.04.3), python2.7:amd64 (2.7.12-1ubuntu0~16.04.2, 2.7.12-1ubuntu0~16.04.3), libc6:amd64 (2.23-0ubuntu9, 2.23-0ubuntu10), libisc160:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), libpython2.7-dev:amd64 (2.7.12-1ubuntu0~16.04.2, 2.7.12-1ubuntu0~16.04.3), locales:amd64 (2.23-0ubuntu9, 2.23-0ubuntu10), libegl1-mesa:amd64 (17.0.7-0ubuntu0.16.04.2, 17.2.4-0ubuntu1~16.04.4), libgbm1:amd64 (17.0.7-0ubuntu0.16.04.2, 17.2.4-0ubuntu1~16.04.4), libdrm-amdgpu1:amd64 (2.4.76-1~ubuntu16.04.1, 2.4.83-1~16.04.1), libisc-export160:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), libc-bin:amd64 (2.23-0ubuntu9, 2.23-0ubuntu10), libruby2.3:amd64 (2.3.1-2~16.04.2, 2.3.1-2~16.04.5), libwayland-egl1-mesa:amd64 (17.0.7-0ubuntu0.16.04.2, 17.2.4-0ubuntu1~16.04.4), liblwres141:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), distro-info-data:amd64 (0.28ubuntu0.6, 0.28ubuntu0.7), libdrm2:amd64 (2.4.76-1~ubuntu16.04.1, 2.4.83-1~16.04.1), iproute2:amd64 (4.3.0-1ubuntu3.16.04.2, 4.3.0-1ubuntu3.16.04.3), libgl1-mesa-dri:amd64 (17.0.7-0ubuntu0.16.04.2, 17.2.4-0ubuntu1~16.04.4), libgdk-pixbuf2.0-common:amd64 (2.32.2-1ubuntu1.3, 2.32.2-1ubuntu1.4), libdns162:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), libgl1-mesa-glx:amd64 (17.0.7-0ubuntu0.16.04.2, 17.2.4-0ubuntu1~16.04.4), libc-dev-bin:amd64 (2.23-0ubuntu9, 2.23-0ubuntu10), libpython2.7-minimal:amd64 (2.7.12-1ubuntu0~16.04.2, 2.7.12-1ubuntu0~16.04.3), multiarch-support:amd64 (2.23-0ubuntu9, 2.23-0ubuntu10), unattended-upgrades:amd64 (0.90ubuntu0.8, 0.90ubuntu0.9), libdrm-intel1:amd64 (2.4.76-1~ubuntu16.04.1, 2.4.83-1~16.04.1), libisccc140:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), libdrm-radeon1:amd64 (2.4.76-1~ubuntu16.04.1, 2.4.83-1~16.04.1), libbind9-140:amd64 (1:9.10.3.dfsg.P4-8ubuntu1.9, 1:9.10.3.dfsg.P4-8ubuntu1.10), libpython2.7-stdlib:amd64 (2.7.12-1ubuntu0~16.04.2, 2.7.12-1ubuntu0~16.04.3), ubuntu-mono:amd64 (14.04+16.04.20161024-0ubuntu1, 14.04+16.04.20171116-0ubuntu1)
Remove: libllvm4.0:amd64 (1:4.0-1ubuntu1~16.04.2)
End-Date: 2018-01-19  14:20:18
mlandauer commented 6 years ago

This just happened again, though this time it took 7 days for the problem to recur. So, unfortunately, it's not fixed.

rgieseke commented 6 years ago

Happened again for the last run of https://morph.io/openclimatedata/scraper today.

mlandauer commented 6 years ago

@rgieseke thanks! The docker daemon is using 200% CPU as before, so the problem has recurred.

Running strace on the dockerd process gives

root@morph:~# strace -p 5276
strace: Process 5276 attached
futex(0x3905bf0, FUTEX_WAIT, 0, NULL

Interestingly, if you start a new dockerd under strace:

root@morph:~# strace dockerd
...
openat(AT_FDCWD, "/var/run/docker.pid", O_RDONLY|O_CLOEXEC) = 5
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=203620016, u64=139840043810480}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 5, 0xc4204f54a4) = -1 EPERM (Operation not permitted)
fstat(5, {st_mode=S_IFREG|0644, st_size=4, ...}) = 0
read(5, "5276", 516)                    = 4
read(5, "", 512)                        = 0
close(5)                                = 0
stat("/proc/5276", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
write(2, "Error starting daemon: pid file "..., 98Error starting daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid
) = 98
exit_group(1)                           = ?
+++ exited with 1 +++

which looks a lot like another dockerd instance starting up and then bailing out because one is already running.

mlandauer commented 6 years ago

Tracing forked processes:

root@morph:~# strace -f -p 5276
...
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] <... futex resumed> )       = 0
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5434] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5434] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5434] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5434] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5434] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] <... futex resumed> )       = 0
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] <... futex resumed> )       = 0
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] <... futex resumed> )       = 0
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] <... futex resumed> )       = 0
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] <... futex resumed> )       = 0
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5281] pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
[pid  5434] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5434] <... futex resumed> )       = 0
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5281] futex(0xc420148110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  5284] futex(0xc420148110, FUTEX_WAKE, 1 <unfinished ...>
[pid  5281] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  5284] <... futex resumed> )       = 0
mlandauer commented 6 years ago

Using New Relic to plot dockerd CPU usage over the last seven days: [graph]

You can see that there are two times when the memory usage of the dockerd daemon goes up suddenly. This happens on 26 Jan 7:10pm and 28 Jan around 12:15am.

Zooming in on the first of those two events, showing the 24 hours preceding it, we can see that memory is gradually being used up. After the workaround above is applied, memory goes back down. [graph] It looks like the majority of this memory usage is due to ruby running under the deploy user. So, it's either the web process (but that wouldn't drop under the workaround above) or it's the workers on the sidekiq queue.

Zooming in on the second event, this time we see a much sharper, more defined spike in memory usage: [graph] Again this is due to ruby running under the deploy user.

mlandauer commented 6 years ago

You can also see that ruby is using quite a bit of CPU just before the dockerd CPU usage goes up: [graph]

mlandauer commented 6 years ago

It's happened again overnight, and once again it was preceded by a spike in ruby memory usage. Doing the workaround dance to get things back to normal.

mlandauer commented 6 years ago

This time I'm also restarting the sidekiq queue to see if that makes any difference.

rgieseke commented 6 years ago

Just happened again, after a couple of manual runs during development.

https://morph.io/openclimatedata/scraper

mlandauer commented 6 years ago

Thanks again @rgieseke for highlighting the recurrence of the problem. One positive thing is that, with the background queue restarted as well, it's taken a week to recur this time.

I've done the workaround again (including restarting the worker queue). So, you should be good to go in the short term.

gavinrozzi commented 6 years ago

Just happened to me today with this scraper https://morph.io/gavinrozzi/opramachine_requests

mlandauer commented 6 years ago

@gavinrozzi workaround done again. Sorry again about all this!

gavinrozzi commented 6 years ago

Thank you for applying the workaround, @mlandauer. My scraper appears to be back to normal now.

rgieseke commented 6 years ago

https://morph.io/openclimatedata/scraper failed again with 255 …

svict4 commented 6 years ago

Hi folks, 255 is happening to me on my brand spanking new scraper at https://morph.io/svict4/australian-flag-half-mast. Thanks for your heroic efforts @mlandauer

mlandauer commented 6 years ago

Sorry for the delay in attending to this @rgieseke @svict4 - I've been snowed under doing a migration of most of OpenAustralia's services to new hosting. I'll look at this now.

mlandauer commented 6 years ago

The workaround has been applied again. I'm sorry to say that I'm not going to have any significant time to work on finding a permanent fix at least until the server migration is mostly done. That's going to be at least another couple of weeks.

I'd like to investigate the memory usage more thoroughly. More importantly, I have been working on re-architecting the way the background jobs work and the way scraper output gets handled, which will make long-running background jobs a thing of the past and should fundamentally improve the situation. Unfortunately it's a pretty big change and is going to take quite a bit of work to do...

mlandauer commented 6 years ago

But of course if the problem recurs I will absolutely have time to apply the workaround again...

mlandauer commented 6 years ago

Restarting nginx (and the associated ruby web processes in passenger) drops the memory usage reported by passenger-memory-stats from a scary 8.3 GB down to a much healthier 800 MB.

------ Passenger processes -------
PID    VMSize     Private    Name
----------------------------------
3941   440.6 MB   0.3 MB     Passenger watchdog
3959   2622.7 MB  14.1 MB    Passenger core
4151   5109.0 MB  4531.3 MB  Passenger RubyApp: /var/www/current/public
4216   4148.1 MB  3620.7 MB  Passenger RubyApp: /var/www/current/public
13678  684.2 MB   194.1 MB   Passenger RubyApp: /var/www/current/public
### Processes: 5
### Total private dirty RSS: 8360.44 MB

becomes

----- Passenger processes ------
PID   VMSize     Private   Name
--------------------------------
6270  376.5 MB   1.2 MB    Passenger watchdog
6273  2229.7 MB  7.5 MB    Passenger core
6974  1010.6 MB  299.2 MB  Passenger RubyApp: /var/www/current/public
6990  686.5 MB   151.3 MB  Passenger RubyApp: /var/www/current/public
7000  686.5 MB   127.7 MB  Passenger RubyApp: /var/www/current/public
7012  620.6 MB   105.3 MB  Passenger RubyApp: /var/www/current/public
8064  356.4 MB   65.6 MB   Passenger AppPreloader: /var/www/current
8632  620.3 MB   69.7 MB   Passenger RubyApp: /var/www/current/public
### Processes: 8
### Total private dirty RSS: 827.55 MB

Now I would take a guess that there's a memory leak somewhere in the web app. We could work around it by setting passenger_max_requests. See https://www.phusionpassenger.com/library/config/nginx/reference/#passenger_max_requests
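
For example (illustrative values only, not what is actually deployed), in the nginx config for the app it would look something like:

server {
    listen 80;
    server_name morph.io;
    root /var/www/current/public;

    passenger_enabled on;
    # Recycle each application process after it has served 5000 requests,
    # so a slow leak can't grow without bound.
    passenger_max_requests 5000;
}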

If we can keep the web app memory use under control then we have more headroom for the more unpredictable memory use of the background queue and the scrapers.

We're still not limiting memory usage during the compile stage for scrapers, which also doesn't help matters.
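
As an aside, capping memory on the compile container could look roughly like this (a sketch only, assuming the docker-api gem; the image name, command and limit are all illustrative):

require 'docker'

# Create the compile container with a hard memory limit so a runaway
# build can't exhaust memory on the host.
container = Docker::Container.create(
  'Image'      => 'openaustralia/buildstep',        # illustrative image name
  'Cmd'        => ['/build/builder'],               # illustrative command
  'HostConfig' => { 'Memory' => 512 * 1024 * 1024 } # 512 MB cap
)
container.start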

mlandauer commented 6 years ago

Digging into New Relic, I discovered that one massive spike in memory usage of the web workers was caused by a request to the download data API. Looking deeper, it turned out that the API wasn't in fact streaming most of the data downloads (with the exception of the sqlite format). The API code was loading the entire dataset into memory, converting it into the requested format and only then delivering it to the web client. Frankly, I've no idea how this hokey way of doing things managed to survive for so long without being spotted.

Anyway, I managed to fix the code so that it streams. See #1174, #1175, 777977353b0e115467128fe130f2d3a810d5614b and 812a10c738763e7290d4d35c4e3e864fbfca8eec.

The last commit even involved digging into the sqlite3 gem, which turned out to have a bug where it was loading the result set completely into memory before processing it (but only in a very specific situation).
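
For illustration only (this is not the actual morph.io or gem code), the difference between the two approaches looks roughly like this with the sqlite3 gem:

require 'sqlite3'
require 'csv'

db = SQLite3::Database.new('data.sqlite')

# Non-streaming: the whole result set is materialised in memory before
# any of it is written out.
all_rows = db.execute('SELECT * FROM data')
body = all_rows.map(&:to_csv).join

# Streaming: each row is yielded as it is read, so memory use stays flat
# however big the table is. A Rack/Rails response body only needs to
# respond to #each, so an enumerator like this can be handed straight to
# the web server.
streaming_body = Enumerator.new do |yielder|
  db.execute('SELECT * FROM data') do |row|
    yielder << row.to_csv
  end
end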

So, the long and the short of it is that I hope I've found and fixed a big reason why we get huge spikes in memory usage, exhausting memory and then causing problems with the background queue and the docker daemon.

So, fingers (and toes) crossed we're going to be seeing less of this problem now.

mlandauer commented 6 years ago

This has now not recurred in a long time, so I think it's fair to say it's fixed. All the fixes to stop memory running out of control have done their magic.