Closed cforce closed 6 months ago
Pinging code owners for receiver/hostmetrics: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.
Sure thing. Actually we've been looking at this a few times. What do you intend to do? Do you want to open a new issue with gopsutil or link there to this? I am a bit fuzzy on what you're going for.
The removal of this cache has caused the OpenTelemetry Collector (otel collector), which relies on 'gopsutil', to repeatedly access BootTime for every process and host metric with a high frequency, leading to increased resource usage. To mitigate this issue, one potential solution is to reintroduce BootTime caching in the collector's code. This change is expected to greatly improve performance since hostmetric continually calls 'gopsutil' in an "endless loop."
You can find further details and discussions regarding this issue in the following links:
It is suggested that 'gopsutil' may need to support caching for such high-demand use cases, and the synchronization of NTP and BootTime should occur at defined intervals or be handled differently to address this specific scenario effectively.
We use dependabot and keep our dependencies up to date. You can check with go.mod on the components.
3.23.10 was released 2 days ago, we use 3.23.9 right now, and will update Monday as part of our dependabot updates: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/go.mod#L12
It's really best to make these changes in gopsutil itself, rather than on our end. My concern is breaking compatibility as time goes on if we make changes here, as well as maintainability, and clean code. I'd request that you push for changes there first to see if these fixes could get any traction. If not, it may even be worthwhile to submit fixes there instead of here.
We may end up having to make changes in this repository, but I think it should be the last resort. Someone else may have thoughts as well, so I'll leave it open for discussion.
@crobert-1, I'm concerned that optimizing gopsutil for otel collector scenarios, where goopsutil is called in an endless loop with timeouts, may not yield significant improvements without the loop's context.
Another potential solution could involve utilizing eBpf kernel hooks, which offer more resource-efficient callbacks from the kernel to user space. This approach might outperform conventional user space polling, such as in hostmetrics. I'm curious if there's any plan to include eBpf support on the roadmap, as I've only found it for I network purposes and not yet deeply integrated.
Interesting projects using in context of eBPF collecting host metrics https://github.com/iovisor/bpftrace https://docs.px.dev/tutorials/custom-data/distributed-bpftrace-deployment/ https://docs.px.dev/about-pixie/pixie-ebpf/ https://docs.px.dev/about-pixie/data-sources/#data-sources Added as proposal to otel eBPF- https://github.com/open-telemetry/opentelemetry-ebpf/issues/241
FYI @mtwo @atoulme
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers
. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Its not solved but closed :( goptsutil- how its used in otel collector needs tuning for lower cpu footprint
Took a look at this today.
I have opened a PR that adds a config field to enable caching boot times. Enabling these did show major improvements in CPU usage. I don't think we can have this on by default, because it would probably constitute a breaking change; the receiver being resilient to changing boot times (a relatively rare occurrence but a possibility) is something I would consider a user facing feature, and changing it by default would be breaking. It is possible this could be a featuregate instead, and then we eventually transition to caching by default and opting out instead.
Below, I did a bunch of additional general research around CPU usage of the receiver that isn't directly actionable but may be interesting to folks.
From my profiling, having to read /proc/<pid>/stat
and /proc/stat
so many times is the bulk of the cpu usage.
I decided to make a toy branch on my fork that copies over the code from gopsutil that is necessary to read data from stat, reads it to processMetadata
at scrape time, and uses the data from processMetadata
instead of using the processHandle
methods, which would all always read stat again.
This did reduce the CPU usage quite a bit.
There is still one extra wasted fillFromTidStatWithContext
on process handle creation because the createTime
is read when the process handle object is created. Nothing in our code seems to rely on that field, so building with a local copy of gopsutil
that comments that out yields even better results.
This last one is actually exactly pretty much half as many milliseconds as what exists today.
So overall what this demonstrates is that we're definitely victim of these wasted /proc/pid/stat
reads (which I don't think was actually a doubt in anyone's minds, but I thought it would be nice to demonstrate the impact).
It's possible we could come up with something like the above approach, where instead of using gopsutil
functionality for some things, we circumvent their functionality in certain spots so that we can avoid these extra reads. However, this implementation above is very quick and dirty, and made a ton of assumptions in an effort to quickly make up a solution. Doing this would probably be pretty challenging to maintain on our side. I've definitely had my frustrations in the past with how gopsutil
's needs and concerns clash with how we want to use it in this receiver, but I definitely understand why we use it. There are so many nasty edge cases and different ways to do this across different platforms that it is really nice to rely on a library that is quite well researched and put together in terms of the correctness in how it process the data.
I have some hopes for v4 of gopsutil
, as they have plans to address this sort of issue. Of particular interest here is that they seem to plan on allowing APIs between platforms to differ. This would be huge here, since Linux could just expose reading /proc/pid/stat
as a public function that we could make use of however we wanted. The Windows API for a lot of the stats that we read/re-read proc/pid/stat
for are actually properly separated actions in Windows, and so we can stick to that API for Windows and try to find a nice way to diverge on Linux.
There might be ways to fix this upstream directly, but it does look like the problem is more about the clash between how gopsutil
expects to be used and how we actually use it. The way we want this to work (scraping all info at the start of the scrape and recording all the metrics) is somewhat in contrast to the gopsutil
expectation of fetching new stats in real time. So ways to fix this upstream aren't really coming to mind right now.
@braydonk thank you for the summary! Let's see if gopsutil v4 will help us eventually.
the receiver being resilient to changing boot times (a relatively rare occurrence but a possibility)
Can you please elaborate on this? Under what circumstances would it change? If the chance is negligible and it's not expected to change in typical scenarios, I would lean toward a feature gate instead of providing this option to the end users
The most common scenario where the boot time can change is when using a ntp service of some kind to sync time over a network. This can cause the boot time to jump around.
Google Compute Engine VMs for example do bake in a service like this in their supported images called chrony
. I imagine other cloud providers have something similar, could be chrony
or systemd-timesyncd
but same idea. These services can adjust the boot time under the hood. I don't have much experience with them to say how common that sort of thing is though. I have to expect it's the kind of thing that would only change on something like a timezone switch, but it's also possible that something like a service outage causing a VM to lose connection to the central time server could result in strange behaviour. This is all theorizing though, if someone does have more experience with tracking this stuff please feel free to chime in.
An alternative strategy here would be that we could have it so gopsutil
actually reads the Boot time once with the cache disabled at the very beginning of fetching all processes, and after it has done it once we enable the cache. It would look pretty awkward in code but it seems doable. If that would work I'd be more comfortable changing that without worrying too much about breaking changes, as we'd reread the boot time once every scrape and then just have all future reads use the cached boot time. It would probably still be a pretty good performance win.
Related to my last comment, I tested this out by adding a function to gopsutil
that will manually reread the boot time and store the new value in the cache. With this, in the process/processes scrapers' scrape loops, I would refresh the boot time at the start, and all uses for the rest of the scrape would be the cached boot time. This yielded similarly positive results, and I now think this would actually be the best way to implement this.
Before:
After:
I submitted a PR to gopsutil with my changes: https://github.com/shirou/gopsutil/pull/1616
An alternative strategy here would be that we could have it so gopsutil actually reads the Boot time once with the cache disabled at the very beginning of fetching all processes, and after it has done it once we enable the cache.
Can you please clarify how this is different from https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31998? What boot time is cached if we go with https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31998?
In the PR I have open, the boot time is read and cached at the beginning of the first scrape, and for the entire lifetime of the collector process it will remain.
The alternative I proposed above is to re-read the boot time every scrape, but just once. Then that boot time that's read at the beginning of each scrape is cached and used for all subsequent boot time calls in the scrape. I think this is better because we do account for the rare case of boot time drifts or changes due to network time sync, but we only do the full process of parsing /proc/stat
to get the boot time once total, rather than 4 times per process. This is I think really unlikely to produce negative results, whereas using the boot time cached for the lifetime of the process has the rare chance to mess up some metrics.
@braydonk Tx so much - might be a game changer for embedded systems usage I would even think of switching off the boot time sync by feature flag for those systems which do not auto sync at all but anyway will rebooted once a day. In this case just read boot time at start and that’s it until next restart.
The drift in the boot time caused by the NTP resync potentially can just update the start time of the reported datapoints and nothing else, right? I cannot imagine how this can cause any significant issues. The drift is likely to be no more than a few milliseconds. And I would argue that keeping the start time of reported data points consistent over the lifespan of the data collection may be even more important than representing "true" clock time at that point in time.
The drift in the boot time caused by the NTP resync potentially can just update the start time of the reported datapoints and nothing else, right?
Yeah that is the main spot where the boot time is used. (For context for folks following along, this is the calculation of creation time which we use as the start time for each datapoint).
I think you're right about this after sleeping on it. I can't shake the feeling that it feels like a breaking change, but I can't really come up with any scenarios where this would negatively effect anything. So I propose a new plan:
With a featuregate that defaults on, since it seems low risk, the boot time cache is always enabled.
If we get an issue with some unforeseen problem caused by us cachine btime
, then we can instruct the user to turn the featuregate off. In the old behaviour the boot time is still read vastly more times than necessary, so with the featuregate off we can use https://github.com/shirou/gopsutil/pull/1616. It will be similar functionality as it exists today, but we just read the boot time once per scrape, which is all that was necessary. It means turning off the featuregate won't go back to the old performance.
We can keep the featuregate around for some amount of releases, and if there are no problems we can remove it and it can become the default. I'd say probably 4 or 5 releases would be long enough for someone to take issue.
Does that sound like a good plan?
Re @cforce's last comment, in this scenario the new default will become the boot time always being cached, with the featuregate as a backup. So there should be no concerns for the use cases you specify here, and really what we'll need to keep an eye on is cloud provided VM services or things that use a ntp
syncing service that might break something.
That turn out perfectly may solve the reported issue. I hope shirou will accept the MR soon https://github.com/shirou/gopsutil/pull/1616 . Voted - if that helps.
@braydonk sounds good to me! 👍 Please update the PR accordingly
https://github.com/shirou/gopsutil/pull/1616 is now closed since it turns out we can use host.BootTimeWithContext
to manually refresh it, it's an awkward API but it will work. Updating the PR now to reflect the suggested plan.
With #32126 merged, this will be released in version 0.99.0 of the collector. By default, the boot time will always be cached at the start for all receivers. This changes nothing for the other receivers, which all read boot time once and cached it anyway. This saves lots of CPU for processes
and even more for process
since now the cached boot time will be used instead of rereading it multiple times per-process.
The old behaviour of process
scraping meant the boot time would be refreshed every scrape, but it would do that by being reread 4 times per-process. Under the new behaviour, the boot time will be read once at start time and then used for the rest of the collector run. With the new hostmetrics.process.bootTimeCache
option disabled, it sort of replicates old behaviour by rereading boot time once per scrape, but then caches that boot time for the rest of the scrape. So the CPU savings should be there regardless of if the feature flag is switched on or off.
Resolved by #32126
Component(s)
receiver/hostmetricsreceiver
Is your feature request related to a problem? Please describe.
Hostmetrics process(es) metrics are retrieved using the go lib gopstutil. There are several known issues with CPU utilization peaks mainly connected to getting the "boot time each time" per process. See
A bit off-topic but relevant for hostmetrics "proc/net" network connection metrics is this one. The
system.network.connections
metric being disabled and not collecting the information from the host does waste less CPU cycles, especially through the flame graph query. It is found that gopsutil is used. This will scan files under the proc directory. The more links, the more CPU. The more resources are occupied, see if it can be adjusted. However, this needs otel collector >= v0.85.0 at least, as before there was a bug that disabling did still execute the scraping. See receiver/hostmetricsreceiver/internal/scraper/networkscraper/network_scraper.go (line 160, extra "if !s.config.Metrics.SystemNetworkConnections.Enabled {"). If you try around with load/perf on a Docker container env, be aware that you need to mount the host filesystem to simulate a proper /proc access.Describe the solution you'd like
I would like to get improvements and fixes related to the high CPU utilization issues caused by retrieving process metrics using gopstutil. Specifically, addressing the known issues mentioned in gopstutil Issue #1283, gopstutil Issue #842, and gopstutil Issue #1070.
Describe alternatives you've considered
Alternative solutions could involve optimizing the way process metrics are collected and exploring methods to reduce CPU utilization by less /not at all getting "boot time each time per process " or for metrics like
system.network.connections
. These alternatives may require changes to the gopstutil library or adjustments in the Otell Collector code.Additional context
Additionally, I suggest addressing the matter of the
system.network.connections
metric and its impact on CPU cycles, as described in gopstutil Issue #695.I have performed an analysis on CPU spikes for the OTEL collector by modifying the configuration and running a cputop script for approximately 5 minutes. The host metrics scrapers list used in OTEL collector includes CPU, memory, network, disk, load, paging, processes, and process metrics.
Based on the data collected, it is evident that certain metrics significantly impact CPU spikes while others do not. For example:
This information should guide our approach to optimizing the collector's performance and reducing CPU utilization in specific areas.