hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad > 0.9.x causing locking contention on servers with > 100 containers and journald #6759

Closed 42wim closed 3 years ago

42wim commented 4 years ago

Nomad version

0.9.6

Operating system and Environment details

Linux 3.10.0-957.27.2.el7.x86_64 / centos 7.7

Issue

We noticed some very high load on some of our cluster nodes.

[image: monitoring graph showing very high load on the affected nodes]

Looking at top on such a system shows it is mostly idle with almost no wait states, but the load average keeps increasing:

top - 14:13:42 up 2 days, 19:55,  1 user,  load average: 62.34, 54.50, 55.67
Tasks: 2719 total,   2 running, 2715 sleeping,   0 stopped,   2 zombie
%Cpu(s):  3.5 us,  5.1 sy,  0.0 ni, 87.9 id,  3.3 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16334451+total, 10182440+free, 44146440 used, 17373664 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 11339288+avail Mem

Some further debugging led to the dockerd threads being identified as the culprit; their kernel stacks show them blocked in call_rwsem_down_read_failed and call_rwsem_down_write_failed:

# cat /proc/`pidof dockerd-current`/task/*/stack|grep -i call|sort|uniq -c
     51 [<ffffffffb83888d8>] call_rwsem_down_read_failed+0x18/0x30
     23 [<ffffffffb8388907>] call_rwsem_down_write_failed+0x17/0x30
    137 [<ffffffffb8776ddb>] system_call_fastpath+0x22/0x27

These are apparently minor page faults, which can also be seen with ps:

# ps -eo min_flt,maj_flt,cmd,args,uid,gid|sort -n| tail -n 3
50461617    1 /usr/sbin/journalbeat -c /e /usr/sbin/journalbeat -c /e     0     0
1383481165  7 /usr/lib/systemd/systemd-jo /usr/lib/systemd/systemd-jo     0     0
10429516026 2262 /usr/bin/dockerd-current -- /usr/bin/dockerd-current --  0     0

So Docker is causing lots of page faults together with systemd. After going down a few rabbit holes, this appears to be caused by the mmap() calls in systemd-journald:

305788 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 2401</var/log/journal/0e18c7d54b7f443da8457ca1cdfc066f/system@0af9ce60fd5b401da31bf70686f02374-0000000022ace036-0005979905d59b87.journal>, 0x4986000 <unfinished ...>
233349 mmap(NULL, 8388608, PROT_READ, MAP_SHARED, 6664</run/log/journal/0e18c7d54b7f443da8457ca1cdfc066f/system@d5a96e452e9842289ad7679c5f22ef9e-00000000002e0c5d-000597d3403943b9.journal>, 0xb4b000 <unfinished ...>

A first attempt at a solution was moving our journald logs from disk (persistent) to memory (volatile), but this didn't fix it.
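For reference, moving journald to volatile storage is just a setting in /etc/systemd/journald.conf (a minimal sketch; the RuntimeMaxUse value is only an example):

[Journal]
# Keep the journal in /run/log/journal (tmpfs) instead of /var/log/journal
Storage=volatile
# Optional cap on the in-memory journal size
RuntimeMaxUse=256M

followed by a systemctl restart systemd-journald.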

We also tried newer kernels up to 5.3.11, but fixing this issue in the kernel itself still seems to be a work in progress. For those who want to dive into a deeper rabbit hole: https://lwn.net/Articles/786105/

Since there doesn't seem to be a fix for the symptoms, we tried to fix the cause instead. The process causing this is Nomad's docker_logger, which constantly reads logs from every container.

We now have a quick workaround (disabling part of this plugin) by adding an early return on line 81 of docker_logger.go https://github.com/hashicorp/nomad/blob/6781d28ea5fc49adf41d162779829fa11490fb34/drivers/docker/docklog/docker_logger.go#L79-L82

go func() {
    defer close(d.doneCh)
    // added: log a message and return before any streams are opened
    d.logger.Info("no log streaming please")
    return
    stdout, stderr, err := d.openStreams(ctx, opts)

It seems we can't easily remove the whole logger / docker_logger plugin, because it's hardcoded.

Afterwards we get normal load averages again

[image: monitoring graph showing load averages back to normal levels]

Reproduction steps

Suggested solution

Add an option to make docker_logger optional. When you're using journalbeat to ship the journald logs to an ELK stack, there's no need for Nomad's own logging output (as in 0.8.x).

tgross commented 4 years ago

Hi @42wim! Thanks for submitting this issue!

I've seen this exact sort of behavior with dockerd before with other applications that tail the docker socket for the logs, but I agree it's a problem we don't want to perpetuate with Nomad. We're looking into some design improvements to how we manage logs with docker for this reason. I'm going to mark this as an enhancement and I'll try to update here when I have a better idea of what that's going to look like.

42wim commented 4 years ago

Seems like PR #6820 will fix this issue (haven't tested it yet)

cb22 commented 3 years ago

In case it's useful for anyone else: we avoid this problem in our cluster by running with Nomad's Docker log collection turned off, configuring all of our Docker containers to log via journald (in non-blocking mode, since the data is not critical and 99% of people wouldn't want a logging failure to cause a service interruption!), and running a simple system job that continuously streams filtered data off journald and stores it in the exact same files and locations Nomad expects.
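For the journald part, the Docker daemon config looks roughly like this in /etc/docker/daemon.json (a minimal sketch; the max-buffer-size value is only an example):

{
  "log-driver": "journald",
  "log-opts": {
    "mode": "non-blocking",
    "max-buffer-size": "4m"
  }
}

The daemon needs a restart for the default logging driver to apply to newly created containers.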

This works quite well, makes both nomad logs and the UI work, and we haven't encountered any issues with it in production yet.

42wim commented 3 years ago

I forgot about this: the "disable log collection" option from #6820 fixed this issue for me, so I'm going to close it.
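For anyone finding this later, the option goes in the Docker driver's plugin block in the client configuration, roughly like this (a minimal sketch; check the docker driver docs for your Nomad version):

plugin "docker" {
  config {
    # Turn off Nomad's docker_logger log streaming entirely
    disable_log_collection = true
  }
}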

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.