hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

high memory usage in logmon #9858

Closed anastazya closed 2 years ago

anastazya commented 3 years ago

I have a cluster of 20 nodes, all running "raw-exec" tasks in PHP.

After a while, at random intervals, I get a lot of OOMs. I find the server at 100% swap usage, looking like this:

[Screenshot 2021-01-20 at 17 45 21]

If I restart the Nomad agent, it all goes back to normal for a while.

I also get this in the Nomad log:

2021-01-20T17:52:28.522+0200 [INFO] client.gc: garbage collection skipped because no terminal allocations: reason="number of allocations (89) is over the limit (50)"

That message is extremely ambiguous, since everything is running normally and Nomad had just been restarted.

bubejur commented 2 years ago

nomad-logmon-pprof-allocs-1963225446.txt nomad-logmon-pprof-heap-861159187.txt @anastazya @notnoop here we go

notnoop commented 2 years ago

Thanks @bubejur! The profiles you included highlight a memory leak: there were 7,395 instances of buffered writers, each 64 KB (accounting for 473.28 MB). I was able to reproduce the memory leak, and the fix is in https://github.com/hashicorp/nomad/pull/11261.
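As a rough check of the figure above, here is a minimal Go sketch of the arithmetic and of the general pattern the profile points at. It is not Nomad's logmon code: `retainedBytes` and `leakWriters` are hypothetical helpers, and the 64 KB buffer is assumed to mean 64,000 bytes, which is the reading that reproduces 473.28 MB.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

const (
	writerCount = 7395      // buffered writer instances reported in the heap profile
	bufferSize  = 64 * 1000 // 64 KB per writer; decimal units are an assumption
)

// retainedBytes returns the memory pinned by n live buffers of size bytes each.
func retainedBytes(n, size int) int {
	return n * size
}

// leakWriters is a hypothetical illustration, not Nomad code. It creates n
// buffered writers and keeps a reference to each one, so the garbage
// collector cannot reclaim their buffers.
func leakWriters(n, size int) []*bufio.Writer {
	writers := make([]*bufio.Writer, 0, n)
	for i := 0; i < n; i++ {
		// bufio.NewWriterSize allocates the whole buffer up front.
		writers = append(writers, bufio.NewWriterSize(io.Discard, size))
	}
	return writers
}

func main() {
	total := retainedBytes(writerCount, bufferSize)
	fmt.Printf("estimated retained memory: %.2f MB\n", float64(total)/1e6)

	// Simulate a small number of restarts that each leave a writer behind.
	leaked := leakWriters(100, bufferSize)
	held := 0
	for _, w := range leaked {
		held += w.Size()
	}
	fmt.Printf("%d leaked writers hold %d bytes\n", len(leaked), held)
}
```

If the buffers are really 64 KiB (65,536 bytes), the same count works out to about 484.6 MB, so the reported number is best read as an approximation. Either way, a buffer of this size that is allocated on every task restart or signal and never released adds up quickly.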

Can you confirm that you have some tasks that are getting restarted or signaled frequently? That may explain it. I'd be curious whether there is another cause or contributing factor.

bubejur commented 2 years ago

@notnoop yes, they are like cron tasks, but they run as a "service". Also, can you tell me how I can fix an error like this: [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=eddb6331-dcfe-92f8-c49e-eba54ffb68f6 driver=raw_exec task_name=worker-5 error="rpc error: code = Unavailable desc = transport is closing"

notnoop commented 2 years ago

Great. The fix went out in yesterday's 1.1.6 release. Give it a try and let us know if you have any questions.

Sadly, the "transport is closing" error messages are a nuisance and not actionable. We should handle that case more gracefully. It's tracked in https://github.com/hashicorp/nomad/issues/10814.
