Hi @anastazya! The htop output you have there is displaying all the threads for the Nomad agent process (all the green-colored lines), so that PID 1146 has only ~1GB RSS. Can you provide the rest of the memory stats? ps -eo pid,tid,class,rtprio,stat,vsz,rss,comm will give a reasonable table of the whole host.
I can give the stats for another host in the same state, because I restarted Nomad on the host that I took the screenshot from.
Ok, so the first thing I worried about was leaking Nomad processes. You've got 93 php processes (not counting the php-fpm) and 187 nomad processes. That's 2 nomad processes for each php process, plus 1 more for the Nomad agent. That's what I'd expect to see, as each task has a task executor and a logmon log collector. We'd generally expect that to add up to a couple of GB of RSS just because of the sheer number of processes.
If we do a bit of math we can see that ~7GB of RSS are dedicated to php, but a whopping 14.6GB is dedicated to nomad! I plotted a quickie histogram and (if we discard the outlier, which is the Nomad agent) we can see there's a group of ~20 processes that are eating up almost half of that total and are individually 100-300MB of RSS.
A couple things that would be useful to help debug this:

- Are those large processes the executor or the logmon processes?
- Which version of Nomad are you running (nomad --version)?

It might also be useful to see if there's a relationship between the etimes and the RSS of the Nomad process. Something like ps -eo pid,comm,args,rss,etimes would give us a table of that relationship, including the args that let us differentiate between executor and logmon, or if you have graphs from whatever you're sending those collectd metrics to. That would let us see if one of the specific processes has a memory leak.
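If it helps, a one-liner along these lines (an untested sketch, assuming procps-style ps) would pull out the largest nomad processes directly:

```sh
# list nomad-related processes sorted by resident memory, largest first;
# the bracketed pattern keeps grep from matching its own command line
ps -eo pid,rss,etimes,args --sort=-rss | grep '[n]omad' | head -20
```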
I've done some very quick tests here and I don't see any obvious leaks on the executor or logmon for a process that's doing frequent logging. But you may be encountering this issue more seriously because of the large number of tasks on the host. It looks like you're running a task-per-php, so if it's possible to spawn them behind a single php-fpm you might have efficiency gains that will help reduce the problem while we work on figuring this out.
(Also, I hope you don't mind but I edited your last response to wrap that ps dump in a <details> block for readability.)
Nomad v1.0.0 (cfca6405ad9b5f66dffc8843e3d16f92f3bedb43)
The rest of the details I will compile today, as this is somewhat of a production environment.
@anastazya I did some quick rummaging thru those numbers and it looks like there's no relationship between etime and memory (if anything, the executors seem to free up memory a tiny bit the older they are). But the logmon processes are grabbing a lot more RSS than I might expect, and there are two outlier logmon processes that are each eating up 1+GB of memory. The primary way I'd expect to see this is if the logmon processes have queued up a lot of messages that aren't yet committed to disk, although I'm surprised that wouldn't result in dropped messages first... I need to dig into that code. In any case, it might be worth looking at the host's disk IO utilization.
With that clue, I'll see if I can set up a reproduction that'll cause logmon's memory to expand unexpectedly.
The primary way I'd expect to see this is if the logmon processes have queued up a lot of messages that aren't yet committed to disk, although I'm surprised that wouldn't result in dropped messages first... I need to dig into that code.
Well that was totally off-base... we have a fifo with blocking reads with a small buffer, so that's just not something that would happen without also blocking the log writer of your application.
I've been trying to get a reproduction here by having a raw_exec task that sends "excessive" logs (~100MB/s worth). At that volume, I get the RSS of the logmon to slowly climb up to ~35MB and then it levels out. I've tried varying the line length of the logs to see if there was a bug in the way we try to span newlines for file rotation, and that hasn't panned out. I've tried varying the log file size via logs.max_file_size to see if more rapid log rotation can detect a leak we'd otherwise miss, but so far I've found nothing there either. I thought I had a promising lead for file handles leaking, but that turned out to be references to the same file handle for log rotation.
The total RSS does seem proportional to log volume though: if I tune the throughput of the logs by sending small web request logs instead of literally running cat shakespere_macbeth.txt in a loop, the RSS is considerably smaller. Given that the RSS of Go applications is effectively a "high water mark" because the runtime almost never gives memory back to the OS, that suggests the logmon is just creating the various buffers it needs to do Write and rotation, but that there's something unusual about this log workload that I haven't quite figured out yet.
With more information about your log load, I might be able to narrow things down. Do you have a logs configuration? Are the logs unusually high volume? Do they have long lines (maybe they're wide structured events)?
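For anyone who wants to poke at this themselves, a raw_exec job of roughly this shape generates the kind of sustained log volume described above (the job name, file path, and logs values here are illustrative, not the exact repro job):

```hcl
job "log-spammer" {
  datacenters = ["dc1"]

  group "spam" {
    task "spam" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        # loop over a large text file forever to produce a steady stream of stdout logs
        args = ["-c", "while true; do cat /tmp/macbeth.txt; done"]
      }

      # small values force frequent rotation; these were varied during testing
      logs {
        max_files     = 2
        max_file_size = 1
      }
    }
  }
}
```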
The logs config across all jobs is like this:
logs {
max_files = 1
max_file_size = 1
}
I will get back with the log volume, needs investigation.
That's very interesting... I'd expect that to result in very rapid log rotation if the volume of logs is reasonably high, which might be triggering buffering while the rotation is happening. That gives me something else to look at, at least, but the log volume would be interesting to see for sure.
May I ask why such a low value for these?
We thought in the beginning to log only the bare minimum and minimise the storage impact and IOPS usage. But if you say this has a great impact on log rotating, we will change those values. Any recommendation?
But if you say this has a great impact on log rotating, we will change those values. Any recommendation ?
That entirely depends on the log volume. If you're only sending a trickle of logs it doesn't matter. But if you're sending >1MB of logs a second then you're rotating the logs continuously. You almost certainly want the logs to be rotating less often than the time it takes to actually rotate them. We don't compress the logs, so it should be quick unless the disk is getting hammered (which of course it might be if you're writing that many logs!). It's definitely worth comparing the log volume to the overall IOPS usage.
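For reference, something closer to the defaults would rotate far less aggressively; these are, to the best of my knowledge, Nomad's documented defaults (max_file_size is in MB), but double-check them against the docs for your version:

```hcl
logs {
  max_files     = 10   # keep up to 10 rotated files per task
  max_file_size = 10   # rotate when a file reaches 10 MB
}
```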
I will be doing in-depth tests next week and will write here if I find something log-related. As far as I can see, the logging is not that aggressive.
I've encountered a scenario where Nomad consumed a lot of memory (over 5GB). I eventually found out it happened because of tasks that keep restarting forever due to insufficient resources or bad arguments. Maybe this will help you.
Edit: Just to elaborate, fixing the task-restarting issue helped decrease the memory usage to 1GB in some cases; in others it released some memory but it was still high. I can't say for sure that the faulty tasks are the trigger for this issue, but I didn't encounter the high memory usage when all of my tasks were valid.
I can assure you that all the tasks run well and don't restart with errors.
I looked at it a bit more; the issue I described refers only to when the Nomad agent uses high memory. I also have the same issue with "nomad logmon" without any errors in the Nomad monitor logs. I'm using Windows Server; if you need any additional details that will help you investigate, let me know. Nomad version is 0.12.7.
@Nickleby9 let's open that as a new issue then, so as not to confuse this issue.
After adding this modification:
logs {
  max_files = 10
  max_file_size = 10
}
Today I found half the Nomad nodes in the state shown in the picture. Besides restarting the Nomad agent, there was nothing I could do. After the Nomad agent restart, it started working normally again.
I still need to come back to review this, but just saw that go1.16 dropped with the following improvement (ref https://golang.org/doc/go1.16#runtime):
On Linux, the runtime now defaults to releasing memory to the operating system promptly (using MADV_DONTNEED), rather than lazily when the operating system is under memory pressure (using MADV_FREE). This means process-level memory statistics like RSS will more accurately reflect the amount of physical memory being used by Go processes. Systems that are currently using GODEBUG=madvdontneed=1 to improve memory monitoring behavior no longer need to set this environment variable.
I suspect this is going to help us out in Nomad quite a bit because logmon is in the allocation resource boundary. I'm not sure if we'll ship that in the upcoming Nomad 1.0.4, but certainly for Nomad 1.1.
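(As an aside: the release note above mentions GODEBUG=madvdontneed=1. For agents built with older Go, that variable can in principle be set on the Nomad service as a stopgap so that freed memory shows up in RSS; the sketch below is untested and assumes a systemd unit named nomad.service:)

```sh
# force MADV_DONTNEED for the agent and the child processes (executor/logmon) it spawns
sudo mkdir -p /etc/systemd/system/nomad.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/nomad.service.d/madvise.conf
[Service]
Environment=GODEBUG=madvdontneed=1
EOF
sudo systemctl daemon-reload && sudo systemctl restart nomad
```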
@tgross sorry, any updates? I can see this problem is still not solved in Nomad 1.1?
Hi @bubejur we just shipped Nomad 1.1 a couple days ago, so I'm sure @anastazya hasn't had a chance to try it out in their environment to verify whether MADV_DONTNEED helped us out here. If you were able to reliably reproduce the problem yourself on an earlier version of Nomad and can test it out, we'd love to hear about it.
Okay, as I see it: in previous versions (I tried everything from 1.0.0 to 1.0.6) there were problems with: [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=2eba8886-94a2-0afc-8e2b-24e6af270728 driver=raw_exec task_name=Group-1-worker-0 error="rpc error: code = Unavailable desc = transport is closing". And after some time (about 8h) a logmon process takes >5% memory usage, and after 1 day I really need to restart the Nomad process to free up node memory. I had high hopes for a fix in 1.1.0, but I still have this issue...
Ok, please open a new issue if you're having problems with log monitoring of your tasks rather than piling on this one.
Okay, here we go https://github.com/hashicorp/nomad/issues/10629
Do you need any additional details? Or can we just wait for a fix?
@bubejur if you have anything to contribute to help debug, you're welcome to (take a look through the extended discussion above). Otherwise please just wait for the original poster to reply with their tests with the new version.
Hello again, could you please tell me what you need exactly to debug this problem again?
Currently what we don't know is if the madvise changes in go1.16 helped the problem by letting logmon release unused memory back to the OS. So what we want to do is compare a client with workloads that was showing the problem described in this issue prior to Nomad 1.1.0 to the same workload running with Nomad 1.1.0+. The specific data we'd want is ps -eo pid,comm,args,rss,vsz,etimes taken 2-3 times on the Nomad client, several hours apart each time. Ideally before and after.
The tricky bit here is that you can't just update the client with the workloads in-place, because Nomad doesn't shut down workloads when it restarts and the logmon processes will be the existing ones. You'd need to update the client and reschedule all the workloads so that they have the logmon from the same version as the client.
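Something along these lines would capture the snapshots (the interval and output path are arbitrary):

```sh
# take three snapshots of the process table, a few hours apart
for i in 1 2 3; do
  ps -eo pid,comm,args,rss,vsz,etimes > "/tmp/nomad-ps-$(date +%s).txt"
  sleep 3h
done
```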
Oh, thanks, I'll update to 1.1.2 this weekend. See you later.
Some details:
nomad -version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)
Thanks @bubejur. I've edited your comment to wrap the ps snapshots in details blocks for legibility. We probably should have passed a flag here to ensure that the args weren't truncated, as I can't tell which nomad processes are the log runner vs the executor. I'll take a crack at analyzing these snapshots later today.
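For next time, adding the wide-output flag avoids the truncation (this assumes procps ps, where doubling -w means unlimited width):

```sh
# -ww disables output-width truncation so the full executor/logmon args are visible
ps -ww -eo pid,comm,args,rss,vsz,etimes
```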
@tgross sorry, never used ``` before :-)
Ok, I imported those snapshots into a spreadsheet and took a look. The ps snapshots you took are only a few seconds apart instead of a few hours apart, so that doesn't tell us much in terms of how things are changing over time. But there are two bits of evidence here that suggest that madvise didn't really help us out.
In this chart we see that most of the Nomad logmon/executor processes have RSS in the range we'd expect. Note this range is maybe a bit higher than we'd like, but that's out of scope for this issue, which is really about the outlier behavior. We have a handful of processes that are running up high RSS. (Also please excuse Google Sheets' janky histogram buckets here.)
In this chart we compare RSS with etime to make sure this isn't a case where the longer a process is running, the more memory it's using. We see this isn't the case, but we've got a weird function going on here where bunches of processes that started at the same time are all using about the same amount of memory. Any chance that those timestamps correlate to Nomad client restarts?
I think at this point it'd be worth digging into one of those outlier processes in more detail. Let's look at the mapped memory for a few different processes using pmap -xp $PID. It'd be helpful to see this for one of the processes from the highest RSS group, one from the middle RSS group (highest etime), and for the Nomad agent itself.
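Concretely, something like this would collect what I'm after (the three PID values are placeholders to fill in from the ps snapshots):

```sh
HIGH_RSS_PID=12345   # a logmon/executor from the highest RSS group
MID_RSS_PID=23456    # one from the middle RSS group (highest etime)
AGENT_PID=34567      # the Nomad agent itself
for pid in "$HIGH_RSS_PID" "$MID_RSS_PID" "$AGENT_PID"; do
  pmap -xp "$pid" > "/tmp/pmap-$pid.txt"
done
```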
Got it, I'm gonna try to grab more stats spread out over time.
In the meantime, I'm going to ping my colleague @notnoop on this one, as this kind of thing is right up his alley and I'm not going to be available to work on this one after this week.
@notnoop how you doing?
@notnoop @tgross Hi! Any updates?
@bubejur Hey. This ticket is on my list for this week to investigate and follow up. Thank you for your patience.
@notnoop @tgross any updates?
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)
After updating to v1.1.2, we still see the same issue with nomad logmon.
The only remedy seems to be restarting Nomad (systemctl restart nomad).
@notnoop @tgross any updates?
I implemented a cron job across all our Nomad clients that executes "systemctl restart nomad && swapoff -a && swapon -a" every 24 hours, at random minutes (not all at once).
So far I have managed to keep memory usage under the OOM limit, but this method might have side effects that we are not yet aware of.
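For context, the workaround is roughly the following as a cron.d entry; the minute is picked at random per host, and this is very much a stopgap rather than a recommendation:

```sh
# /etc/cron.d/nomad-daily-restart (hypothetical path), runs once a day as root
17 3 * * * root systemctl restart nomad && swapoff -a && swapon -a
```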
@anastazya I guess it might not be a production solution :-)
Desperate times call for desperate measures. I have far too many nodes and tasks to just wait. I needed a fast solution to this OOM problem, because tasks that fail when the OOM manifests lead to data loss, and that's a huge issue for my setup.
got same on 1.1.4 :-(
@notnoop @tgross Hi! Any updates?
@notnoop @tgross same on 1.1.5
@bubejur Thanks for your patience. I was OOO for a while and am catching up now. I have built a binary to instrument logmon to emit memory heap/allocs profiles in the b-logmon-inspect branch. If possible, can you try running with the binaries generated in https://app.circleci.com/pipelines/github/hashicorp/nomad/17688/workflows/351ab2f1-12c7-4930-b88b-c3eb4f875a70/jobs/175438/artifacts .
When you notice high memory usage of a logmon process, you can send it a SIGUSR2 signal (e.g. kill -SIGUSR2 <logmon_pid>), and two memory profiles prefixed with nomad-logmon-pprof- will be generated in the host temporary directory; the exact name should be noted in a log line.
With the pprof files, it'll be easier to identify the cause of high memory usage and test hypotheses for fixing it. Thank you so much again for your patience.
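Rough flow once a bloated logmon shows up (the PID and the profile filename below are placeholders; the real names come from the process table and the agent log line):

```sh
LOGMON_PID=12345                              # substitute the bloated logmon's PID
kill -SIGUSR2 "$LOGMON_PID"                   # triggers the instrumented profile dump
ls "${TMPDIR:-/tmp}"/nomad-logmon-pprof-*     # profiles land in the host temp directory
# inspect a profile; this filename is illustrative, use the one from the log line
go tool pprof -top /tmp/nomad-logmon-pprof-heap-1630000000
```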
I will be freeing up some time in the next few days and will run that binary against some of our servers to get some info.
I have a cluster of 20 nodes, all running "raw-exec" tasks in PHP.
At random intervals, after a while, I get a lot of OOMs. I find the server with 100% swap usage and looking like this:
If I restart the Nomad agent, it all goes back to normal for a while.
I also get this in the Nomad log: "2021-01-20T17:52:28.522+0200 [INFO] client.gc: garbage collection skipped because no terminal allocations: reason="number of allocations (89) is over the limit (50)" <--- that message is extremely ambiguous, as everything runs normally and Nomad was just restarted.
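(For what it's worth, the "limit (50)" in that message appears to correspond to the client's gc_max_allocs setting, which defaults to 50; raising it in the client stanza, as sketched below, should at least quiet that log line, though this is an assumption rather than something confirmed in this thread:)

```hcl
client {
  enabled       = true
  gc_max_allocs = 100   # assumption: the "(50)" above is the default gc_max_allocs
}
```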