hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

High CPU usage #2169

Closed kak-tus closed 7 years ago

kak-tus commented 7 years ago

Nomad version

0.5.2

Operating system and Environment details

Ubuntu 16.04

Issue

High CPU usage with Nomad running (before 15:00 in the screenshots) and low CPU usage without Nomad (after 15:00). Around 15:00 Nomad was killed, but the tasks continued executing. May be related to #1995.

Load average: [screenshot]

CPU metrics: [screenshot]

Other server:

[screenshots]

Reproduction steps

Ran about 20 tasks (the more tasks per server, the more load).

dadgar commented 7 years ago

To be clear, these are the clients?

kak-tus commented 7 years ago

They run on the same nodes as the server nodes (it's a small 3-node cluster). I know that's not recommended in the docs, but I only have 3 nodes for the whole cluster.

dadgar commented 7 years ago

@kak-tus And when you say you are stopping Nomad, are you stopping both server and client? Can you show the CPU per pid?

kak-tus commented 7 years ago

@dadgar Yes, both client and server (they run in the same process). Unfortunately, I don't have per-process CPU graphs, only the graphs at the top: there was just one change at 15:00 - Nomad was killed and the managed jobs (Docker containers) continued running.

jtuthehien commented 7 years ago

Hi, I'm seeing a similar problem: the Nomad executor is taking too much CPU.

[screenshots attached]

diptanu commented 7 years ago

@jtuthehien Which driver are you using?

dadgar commented 7 years ago

@jtuthehien What version of Nomad are you on? Can you run the following command?

go tool pprof http://localhost:4646/debug/pprof/profile

It should output something like:

Fetching profile from http://localhost:4646/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/vagrant/pprof/pprof.nomad.localhost:4646.samples.cpu.002.pb.gz
Entering interactive mode (type "help" for commands)

Can you attach the resulting profile?
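
If it helps, here is a sketch of capturing the profiles to files for attaching (assuming the agent's HTTP API is on localhost:4646 and the debug endpoints are enabled):

# Capture a 30-second CPU profile and a heap snapshot (standard Go pprof endpoints)
curl -o cpu.prof 'http://localhost:4646/debug/pprof/profile?seconds=30'
curl -o heap.prof 'http://localhost:4646/debug/pprof/heap'

# Optionally inspect the hottest functions locally before attaching the files
# (older Go releases may also need the path to the nomad binary as the first argument)
go tool pprof -top cpu.prof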

jtuthehien commented 7 years ago

I'm on 0.4.6, Docker driver. I'll send the pprof output when I get it.

vietwow commented 7 years ago

Hi, I'm on 0.4.1 with the Docker driver (Docker 1.11.2).

This is my pprof debug output

rejeep commented 7 years ago

I saw similar issues with the exec driver on Nomad 0.4.1. I haven't used Nomad since, but will hopefully spend some more time on it. If I find some useful debug information I'll get back to you.

iconara commented 7 years ago

Sorry, @rejeep got the version wrong (my fault for confusing the Chef recipes); we were running 0.5.1 when we saw the issues.

danielbenzvi commented 7 years ago

Seeing the same issue here with Nomad 0.5.2.

dadgar commented 7 years ago

@danielbenzvi Are you seeing the 100% CPU usage as well? What are your nodes running?

danielbenzvi commented 7 years ago

@dadgar It was one node out of two in a cluster we are POCing - all the tasks were Docker images. The machine is CentOS 7.2 running on AWS (kernel 3.10.0-514.2.2.el7.x86_64), and the Docker version is 1.12.6.

During the time of the issue, the Nomad client was taking 600% CPU, attempting to start and stop Docker containers all the time, and was very slow to respond. Over 249 tasks also accumulated as "lost" and the node health was flapping (we normally have 34 tasks running in the cluster).

Nothing in the logs suggested the cause of the issue and the docker daemon responded fine.

Here are some graphs from this time period:

Load averages: [screenshot]

CPU usage: [screenshot]

dadgar commented 7 years ago

@danielbenzvi Hmm, did this just happen randomly, or can you get it into this state reproducibly?

danielbenzvi commented 7 years ago

@dadgar Randomly. We've been playing with Nomad for the past two weeks, so I'm guessing it will happen again if we choose to take Nomad further.

kak-tus commented 7 years ago

I also have an issue with frequent container restarts, but that is a different problem from the one in this topic. Here I have a high load average, but it is stable over time.

The frequent container restarts happen when the Nomad servers temporarily lose connection to each other. They begin to restart tasks, and once the connection is restored the tasks all start as normal. In Nomad 0.5-0.5.2 the restart sometimes did not succeed, and I used this workaround: a script that periodically runs "nomad run" for all tasks. In Nomad 0.5.4 the restart situation is better.
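
For illustration, that workaround can be as simple as a periodic loop over the job files (the path below is hypothetical):

# Re-submit every job spec so that any task Nomad failed to restart gets scheduled again
for job in /etc/nomad/jobs/*.nomad; do
  nomad run "$job"
done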

jippi commented 7 years ago

@danielbenzvi what is your file descriptor limit for the Nomad client? For me, too low an FD limit has created a great deal of issues over the last year or so; bumping it to 65536 made all of them go away :)
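
For anyone checking the same thing, raising the limit for an agent managed by systemd can look like this (unit name and paths are assumptions):

# /etc/systemd/system/nomad.service.d/limits.conf
[Service]
LimitNOFILE=65536

# Apply and verify (assumes a single nomad process)
systemctl daemon-reload && systemctl restart nomad
grep 'Max open files' /proc/$(pidof nomad)/limits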

danielbenzvi commented 7 years ago

@jippi our limit is 131072 open files and we're far below it...

danielbenzvi commented 7 years ago

Seeing this again... this is crazy. Nomad is starting and stopping images all the time with no clear explanation in the logs.

dadgar commented 7 years ago

@danielbenzvi A few questions:

1) Are you using service checks? If so, what type are they?
2) What is the behavior you are seeing for the jobs? Are the allocations dying and then being replaced by the scheduler, or are they just restarting locally, etc.? Maybe show nomad status <job> and nomad alloc-status <alloc> for some of the misbehaving allocs (example invocations below).
3) Could you share logs and the time period in which this happened?

I have not been able to replicate this.
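
For reference, the commands look like this, with a hypothetical job name and allocation ID:

nomad status redis-cache
nomad alloc-status 8d2f61a7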

OferE commented 7 years ago

Look at the strace of the threads that are causing Nomad to sit at 100% CPU in issue #2590, in @dovka's comment. Thanks.

OferE commented 7 years ago

This issue can be easily reproduced by creating a batch job that requires more resources than the cluster can handle at once. Once some allocations are queued, all of the workers' CPU is occupied by Nomad. In version 0.5.3 the CPU remains occupied even if the batch job is stopped. However, in version 0.5.6 the CPU usage goes to 0 after stopping the job (I'm guessing this has something to do with garbage collection, but I cannot be sure).

This is described in #2590.
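
A minimal sketch of the kind of batch job that reproduces it (image, counts, and sizes are just illustrative; anything that asks for more resources than the cluster has free should do):

job "overcommit-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    count = 100                # far more instances than the cluster can place at once

    task "busy" {
      driver = "docker"
      config {
        image   = "alpine:3.6"
        command = "sleep"
        args    = ["300"]
      }
      resources {
        cpu    = 2000          # deliberately oversized so most allocations stay queued
        memory = 1024
      }
    }
  }
}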

Edit - we limited the Nomad agent and its many child processes to one CPU using taskset; this made Nomad consume only one CPU. It didn't change the scheduling, which is still poor. After a while, at peak, Nomad utilizes just 1/3 of the cluster in a good scenario, and most of the time even less. This means that Nomad scheduling for batch jobs is really broken.
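
The pinning described in the edit can be done either when starting the agent or against an already-running one (the config path is just an example):

# Start the agent pinned to CPU 0 (child processes inherit the affinity)
taskset -c 0 nomad agent -config /etc/nomad.d

# Or pin a running agent by PID (does not affect executor processes that are already running)
taskset -pc 0 $(pidof nomad)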

burdandrei commented 7 years ago

[screenshot] Received high CPU and memory usage from the Nomad client 0.5.6 when running ~90 service groups on the machines. All tasks are Docker tasks running the same image.

burdandrei commented 7 years ago

@OferE https://github.com/hashicorp/nomad/pull/2771 - this should help you run Docker with the Docker driver instead of having to work around it with exec.

OferE commented 7 years ago

@burdandrei - thanks for sharing.

I started using raw-exec for this purpose and I am so happy that I chose it. You can make many adjustments to your infrastructure once you control the Docker launch yourself. I strongly recommend staying with raw-exec.

My exec script now handles timeouts for batch jobs, pulling containers from ECR, startup dependencies, logging to S3, clean destruction of containers, and reporting of Nomad anomalies.

This is priceless.

burdandrei commented 7 years ago

Strange, I like the Docker driver because I can stream logs with syslog to ELK; the only thing I needed was soft rather than hard memory limits.
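
For context, the syslog streaming mentioned here is the Docker driver's logging option; a rough sketch (the image and syslog endpoint are placeholders):

task "app" {
  driver = "docker"

  config {
    image = "example/app:latest"

    logging {
      type = "syslog"
      config {
        syslog-address = "tcp://logstash.example.internal:514"
        tag            = "app"
      }
    }
  }
}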

OferE commented 7 years ago

You'll need more once your setup gets complex. Also, soft memory limits are not enough; there is also IO. I'll give you an example of another thing: you spoke about Elasticsearch - how are you stopping it without losing data? Graceful shutdown of services is a must...

jzvelc commented 7 years ago

@OferE I use dumb-init for that, which allows me to rewrite signals (e.g. SIGINT -> SIGQUIT).
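
For reference, a Dockerfile fragment showing that kind of rewrite with dumb-init (SIGINT is signal 2, SIGQUIT is signal 3; the application path is a placeholder):

# Rewrite SIGINT (2) to SIGQUIT (3) before forwarding it to the application
ENTRYPOINT ["dumb-init", "--rewrite", "2:3"]
CMD ["/usr/local/bin/myapp"]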

burdandrei commented 7 years ago

@jzvelc you can add STOPSIGNAL to the Dockerfile and Docker will honor it. It works like a charm together with kill_timeout.
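
A sketch of that combination (values and image are just examples). In the Dockerfile:

# Tell Docker which signal "stop" should send
STOPSIGNAL SIGQUIT

And in the Nomad task:

task "app" {
  driver       = "docker"
  kill_timeout = "30s"           # how long Nomad waits after the stop signal before force-killing

  config {
    image = "example/app:latest"
  }
}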

dadgar commented 7 years ago

@kak-tus Wanted to bump this issue as 0.6 is now out! Let me know if this has been resolved.

burdandrei commented 7 years ago

I'll deploy 0.6 to our staging on Sunday and will let you know how it's running. As described in https://github.com/hashicorp/nomad/issues/2771, we have a machine with ~80 tasks running, and I can see both high CPU and memory usage there while running 0.5.6.

dadgar commented 7 years ago

@burdandrei Thank you! Would you mind setting enable_debug = true on the client so that we can do some perf inspection in case we want to investigate CPU/memory usage?

https://www.nomadproject.io/docs/agent/configuration/index.html#enable_debug
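
For reference, a client agent config fragment with the flag set (the file path is just an example):

# /etc/nomad.d/client.hcl
enable_debug = true              # exposes the agent's /debug/pprof endpoints

client {
  enabled = true
}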

burdandrei commented 7 years ago

[screenshot] It's the 0.6 client running 117 jobs. The virtual footprint is half of what we had in 0.5.6, and RSS is 1/5 - less than 1 GB.

kak-tus commented 7 years ago

@dadgar Thank you. In 0.6 the load average is better than in 0.5.2/0.5.4, but still not as low as with the Nomad server/client killed and the jobs left running. But maybe this CPU usage is normal for the Nomad process.

dadgar commented 7 years ago

@burdandrei Did CPU usage come down as well?

@kak-tus I apologize but I am having a hard time following your comment. Did CPU utilization go down?

burdandrei commented 7 years ago

It looks like yes.

dadgar commented 7 years ago

Nice! I am going to close this issue now! Thank you so much for testing!

kak-tus commented 7 years ago

@dadgar Same for me - yes.

burdandrei commented 7 years ago

@dadgar I have something to report: it looks like the high memory and CPU usage in my case was generated by a job that had 114 groups. I had the Nomad client v0.6.3 consuming 8-12 GB RSS. In the last few days I've been working on adding Replicator to our stack, and it had problems with this particular job. I split the job into 114 jobs, each with 1 group, and magically the Nomad agent is now at ~8 GB virtual but ~150 MB RSS.

The docs should state that a huge job definition with lots of groups is not desirable for the cluster. Just as consul-template throws a warning if it's watching more than 100 keys, maybe it's worth warning on 10 or more groups.
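
To illustrate the split: instead of one job with ~114 groups, each group becomes its own job with a single group (names, image, and sizes are illustrative):

job "svc-01" {
  datacenters = ["dc1"]

  group "svc-01" {
    count = 1

    task "svc-01" {
      driver = "docker"
      config {
        image = "example/service:latest"
      }
      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}

# ...repeated as svc-02, svc-03, and so on: one job file per former group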

jippi commented 7 years ago

@burdandrei did it also fix your hashi-ui memory issues?

burdandrei commented 7 years ago

@jippi, yes. When I open the cluster tab for the region without the 100-group job, memory usage jumps from 100 to 300. When I switch to the 100-group job it jumps to 2 GB in 20 seconds.

jippi commented 7 years ago

It must be due to the Nomad structs or something in the SDK/server, since 3rd parties like hashi-ui and Replicator also see the same behaviour :)

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.