Closed anastazya closed 2 years ago
nomad-logmon-pprof-allocs-1963225446.txt nomad-logmon-pprof-heap-861159187.txt @anastazya @notnoop here we go
Thanks @bubejur! The profiles you included highlight a memory leak: there were 7,395 instances of buffered writers, each 64 KB (accounting for 473.28 MB). I was able to reproduce the memory leak and have a fix in https://github.com/hashicorp/nomad/pull/11261 .
Can you confirm that you have some tasks that are getting restarted or signaled frequently? That may explain it. I'm curious whether there is another cause or contributing factor.
@notnoop yes, they are like cron tasks, but they run as a "service". Also, can you tell me how I can fix an error like this: [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=eddb6331-dcfe-92f8-c49e-eba54ffb68f6 driver=raw_exec task_name=worker-5 error="rpc error: code = Unavailable desc = transport is closing"
Great. The fix went out in yesterday's 1.1.6 release. Give it a try and let us know if you have any questions.
Sadly, the "transport is closing" error messages are a nuisance and not actionable. We should handle this case more gracefully. It's tracked in https://github.com/hashicorp/nomad/issues/10814 .
I have a cluster of 20 nodes, all running "raw-exec" tasks in PHP.
At random intervals, after running for a while, I get a lot of OOMs. When that happens I find the server at 100% swap usage.
If I restart the Nomad agent, everything goes back to normal for a while.
I also get this in the Nomad log: 2021-01-20T17:52:28.522+0200 [INFO] client.gc: garbage collection skipped because no terminal allocations: reason="number of allocations (89) is over the limit (50)". That message is extremely ambiguous, since everything is running normally and Nomad was just restarted.