reduce cpu consumption of hostmetric receiver (process & processes)

cforce commented 1 year ago

Component(s)

receiver/hostmetricsreceiver

Is your feature request related to a problem? Please describe.

Hostmetrics process(es) metrics are retrieved using the go lib gopstutil. There are several known issues with CPU utilization peaks mainly connected to getting the "boot time each time" per process. See

A bit off-topic but relevant for hostmetrics "proc/net" network connection metrics is this one. The system.network.connections metric being disabled and not collecting the information from the host does waste less CPU cycles, especially through the flame graph query. It is found that gopsutil is used. This will scan files under the proc directory. The more links, the more CPU. The more resources are occupied, see if it can be adjusted. However, this needs otel collector >= v0.85.0 at least, as before there was a bug that disabling did still execute the scraping. See receiver/hostmetricsreceiver/internal/scraper/networkscraper/network_scraper.go (line 160, extra "if !s.config.Metrics.SystemNetworkConnections.Enabled {"). If you try around with load/perf on a Docker container env, be aware that you need to mount the host filesystem to simulate a proper /proc access.

gopstutil Issue #695

Describe the solution you'd like

I would like to get improvements and fixes related to the high CPU utilization issues caused by retrieving process metrics using gopstutil. Specifically, addressing the known issues mentioned in gopstutil Issue #1283, gopstutil Issue #842, and gopstutil Issue #1070.

Describe alternatives you've considered

Alternative solutions could involve optimizing the way process metrics are collected and exploring methods to reduce CPU utilization by less /not at all getting "boot time each time per process " or for metrics like system.network.connections. These alternatives may require changes to the gopstutil library or adjustments in the Otell Collector code.

Additional context

Additionally, I suggest addressing the matter of the system.network.connections metric and its impact on CPU cycles, as described in gopstutil Issue #695.

I have performed an analysis on CPU spikes for the OTEL collector by modifying the configuration and running a cputop script for approximately 5 minutes. The host metrics scrapers list used in OTEL collector includes CPU, memory, network, disk, load, paging, processes, and process metrics.

Based on the data collected, it is evident that certain metrics significantly impact CPU spikes while others do not. For example:

Process and Processes metrics create an impact on CPU spike.
CPU, memory, and paging metrics do not seem to have a significant effect on CPU spike.

This information should guide our approach to optimizing the collector's performance and reducing CPU utilization in specific areas.

github-actions[bot] commented 1 year ago

Pinging code owners for receiver/hostmetrics: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme commented 1 year ago

Sure thing. Actually we've been looking at this a few times. What do you intend to do? Do you want to open a new issue with gopsutil or link there to this? I am a bit fuzzy on what you're going for.

cforce commented 1 year ago

Does collector use latest gopsutil version v3.23.10 including this merges
- https://github.com/shirou/gopsutil/pull/1514
Improve usage of features from gops util - see A significant performance issue has been observed in the 'gopsutil' library on Linux when comparing versions 2.20.4 and 2.20.5. In version 2.20.4, a particular test case with approximately 280 processes exhibited a CPU usage of 14%. However, in version 2.20.5, the CPU usage skyrocketed to 195%. This performance degradation can be attributed to the removal of BootTime caching in order to address a specific scenario where a device initially reports an incorrect date (e.g., epoch or a more recent date set by a tool like 'fake_hwclock') and later corrects it using NTP. This results in 'gopsutil' reporting incorrect system uptime.

The removal of this cache has caused the OpenTelemetry Collector (otel collector), which relies on 'gopsutil', to repeatedly access BootTime for every process and host metric with a high frequency, leading to increased resource usage. To mitigate this issue, one potential solution is to reintroduce BootTime caching in the collector's code. This change is expected to greatly improve performance since hostmetric continually calls 'gopsutil' in an "endless loop."

You can find further details and discussions regarding this issue in the following links:

It is suggested that 'gopsutil' may need to support caching for such high-demand use cases, and the synchronization of NTP and BootTime should occur at defined intervals or be handled differently to address this specific scenario effectively.

atoulme commented 1 year ago

We use dependabot and keep our dependencies up to date. You can check with go.mod on the components.

3.23.10 was released 2 days ago, we use 3.23.9 right now, and will update Monday as part of our dependabot updates: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/go.mod#L12

crobert-1 commented 1 year ago

It's really best to make these changes in gopsutil itself, rather than on our end. My concern is breaking compatibility as time goes on if we make changes here, as well as maintainability, and clean code. I'd request that you push for changes there first to see if these fixes could get any traction. If not, it may even be worthwhile to submit fixes there instead of here.

We may end up having to make changes in this repository, but I think it should be the last resort. Someone else may have thoughts as well, so I'll leave it open for discussion.

cforce commented 1 year ago

@crobert-1, I'm concerned that optimizing gopsutil for otel collector scenarios, where goopsutil is called in an endless loop with timeouts, may not yield significant improvements without the loop's context.

Another potential solution could involve utilizing eBpf kernel hooks, which offer more resource-efficient callbacks from the kernel to user space. This approach might outperform conventional user space polling, such as in hostmetrics. I'm curious if there's any plan to include eBpf support on the roadmap, as I've only found it for I network purposes and not yet deeply integrated.

Interesting projects using in context of eBPF collecting host metrics https://github.com/iovisor/bpftrace https://docs.px.dev/tutorials/custom-data/distributed-bpftrace-deployment/ https://docs.px.dev/about-pixie/pixie-ebpf/ https://docs.px.dev/about-pixie/data-sources/#data-sources Added as proposal to otel eBPF- https://github.com/open-telemetry/opentelemetry-ebpf/issues/241

FYI @mtwo @atoulme

github-actions[bot] commented 9 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

receiver/hostmetrics: @dmitryax @braydonk

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 7 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

cforce commented 7 months ago

Its not solved but closed :( goptsutil- how its used in otel collector needs tuning for lower cpu footprint

braydonk commented 7 months ago

Took a look at this today.

I have opened a PR that adds a config field to enable caching boot times. Enabling these did show major improvements in CPU usage. I don't think we can have this on by default, because it would probably constitute a breaking change; the receiver being resilient to changing boot times (a relatively rare occurrence but a possibility) is something I would consider a user facing feature, and changing it by default would be breaking. It is possible this could be a featuregate instead, and then we eventually transition to caching by default and opting out instead.

Below, I did a bunch of additional general research around CPU usage of the receiver that isn't directly actionable but may be interesting to folks.

From my profiling, having to read /proc/<pid>/stat and /proc/stat so many times is the bulk of the cpu usage. Flamegraph of CPU profile using existing hostmetricsreceiver

I decided to make a toy branch on my fork that copies over the code from gopsutil that is necessary to read data from stat, reads it to processMetadata at scrape time, and uses the data from processMetadata instead of using the processHandle methods, which would all always read stat again.

This did reduce the CPU usage quite a bit. Flamegraph of CPU usage with new changes There is still one extra wasted fillFromTidStatWithContext on process handle creation because the createTime is read when the process handle object is created. Nothing in our code seems to rely on that field, so building with a local copy of gopsutil that comments that out yields even better results. Final flamegraph of CPU usage with the last change This last one is actually exactly pretty much half as many milliseconds as what exists today.

So overall what this demonstrates is that we're definitely victim of these wasted /proc/pid/stat reads (which I don't think was actually a doubt in anyone's minds, but I thought it would be nice to demonstrate the impact).

It's possible we could come up with something like the above approach, where instead of using gopsutil functionality for some things, we circumvent their functionality in certain spots so that we can avoid these extra reads. However, this implementation above is very quick and dirty, and made a ton of assumptions in an effort to quickly make up a solution. Doing this would probably be pretty challenging to maintain on our side. I've definitely had my frustrations in the past with how gopsutil's needs and concerns clash with how we want to use it in this receiver, but I definitely understand why we use it. There are so many nasty edge cases and different ways to do this across different platforms that it is really nice to rely on a library that is quite well researched and put together in terms of the correctness in how it process the data.

I have some hopes for v4 of gopsutil, as they have plans to address this sort of issue. Of particular interest here is that they seem to plan on allowing APIs between platforms to differ. This would be huge here, since Linux could just expose reading /proc/pid/stat as a public function that we could make use of however we wanted. The Windows API for a lot of the stats that we read/re-read proc/pid/stat for are actually properly separated actions in Windows, and so we can stick to that API for Windows and try to find a nice way to diverge on Linux.

There might be ways to fix this upstream directly, but it does look like the problem is more about the clash between how gopsutil expects to be used and how we actually use it. The way we want this to work (scraping all info at the start of the scrape and recording all the metrics) is somewhat in contrast to the gopsutil expectation of fetching new stats in real time. So ways to fix this upstream aren't really coming to mind right now.

dmitryax commented 7 months ago

@braydonk thank you for the summary! Let's see if gopsutil v4 will help us eventually.

the receiver being resilient to changing boot times (a relatively rare occurrence but a possibility)

Can you please elaborate on this? Under what circumstances would it change? If the chance is negligible and it's not expected to change in typical scenarios, I would lean toward a feature gate instead of providing this option to the end users

braydonk commented 7 months ago

The most common scenario where the boot time can change is when using a ntp service of some kind to sync time over a network. This can cause the boot time to jump around.

Google Compute Engine VMs for example do bake in a service like this in their supported images called chrony. I imagine other cloud providers have something similar, could be chrony or systemd-timesyncd but same idea. These services can adjust the boot time under the hood. I don't have much experience with them to say how common that sort of thing is though. I have to expect it's the kind of thing that would only change on something like a timezone switch, but it's also possible that something like a service outage causing a VM to lose connection to the central time server could result in strange behaviour. This is all theorizing though, if someone does have more experience with tracking this stuff please feel free to chime in.

An alternative strategy here would be that we could have it so gopsutil actually reads the Boot time once with the cache disabled at the very beginning of fetching all processes, and after it has done it once we enable the cache. It would look pretty awkward in code but it seems doable. If that would work I'd be more comfortable changing that without worrying too much about breaking changes, as we'd reread the boot time once every scrape and then just have all future reads use the cached boot time. It would probably still be a pretty good performance win.

braydonk commented 7 months ago

Related to my last comment, I tested this out by adding a function to gopsutil that will manually reread the boot time and store the new value in the cache. With this, in the process/processes scrapers' scrape loops, I would refresh the boot time at the start, and all uses for the rest of the scrape would be the cached boot time. This yielded similarly positive results, and I now think this would actually be the best way to implement this.

Before: flamegraph with hostmetrics receiver and gopsutil as they are today

After: flamegraph with the new gopsutil and changes in the scrapers

I submitted a PR to gopsutil with my changes: https://github.com/shirou/gopsutil/pull/1616

dmitryax commented 7 months ago

An alternative strategy here would be that we could have it so gopsutil actually reads the Boot time once with the cache disabled at the very beginning of fetching all processes, and after it has done it once we enable the cache.

Can you please clarify how this is different from https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31998? What boot time is cached if we go with https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31998?

braydonk commented 7 months ago

In the PR I have open, the boot time is read and cached at the beginning of the first scrape, and for the entire lifetime of the collector process it will remain.

The alternative I proposed above is to re-read the boot time every scrape, but just once. Then that boot time that's read at the beginning of each scrape is cached and used for all subsequent boot time calls in the scrape. I think this is better because we do account for the rare case of boot time drifts or changes due to network time sync, but we only do the full process of parsing /proc/stat to get the boot time once total, rather than 4 times per process. This is I think really unlikely to produce negative results, whereas using the boot time cached for the lifetime of the process has the rare chance to mess up some metrics.

cforce commented 7 months ago

@braydonk Tx so much - might be a game changer for embedded systems usage I would even think of switching off the boot time sync by feature flag for those systems which do not auto sync at all but anyway will rebooted once a day. In this case just read boot time at start and that’s it until next restart.

dmitryax commented 7 months ago

The drift in the boot time caused by the NTP resync potentially can just update the start time of the reported datapoints and nothing else, right? I cannot imagine how this can cause any significant issues. The drift is likely to be no more than a few milliseconds. And I would argue that keeping the start time of reported data points consistent over the lifespan of the data collection may be even more important than representing "true" clock time at that point in time.

braydonk commented 7 months ago

The drift in the boot time caused by the NTP resync potentially can just update the start time of the reported datapoints and nothing else, right?

Yeah that is the main spot where the boot time is used. (For context for folks following along, this is the calculation of creation time which we use as the start time for each datapoint).

I think you're right about this after sleeping on it. I can't shake the feeling that it feels like a breaking change, but I can't really come up with any scenarios where this would negatively effect anything. So I propose a new plan:

With a featuregate that defaults on, since it seems low risk, the boot time cache is always enabled.
If we get an issue with some unforeseen problem caused by us cachine btime, then we can instruct the user to turn the featuregate off. In the old behaviour the boot time is still read vastly more times than necessary, so with the featuregate off we can use https://github.com/shirou/gopsutil/pull/1616. It will be similar functionality as it exists today, but we just read the boot time once per scrape, which is all that was necessary. It means turning off the featuregate won't go back to the old performance.
We can keep the featuregate around for some amount of releases, and if there are no problems we can remove it and it can become the default. I'd say probably 4 or 5 releases would be long enough for someone to take issue.

Does that sound like a good plan?

Re @cforce's last comment, in this scenario the new default will become the boot time always being cached, with the featuregate as a backup. So there should be no concerns for the use cases you specify here, and really what we'll need to keep an eye on is cloud provided VM services or things that use a ntp syncing service that might break something.

cforce commented 7 months ago

That turn out perfectly may solve the reported issue. I hope shirou will accept the MR soon https://github.com/shirou/gopsutil/pull/1616 . Voted - if that helps.

dmitryax commented 7 months ago

@braydonk sounds good to me! 👍 Please update the PR accordingly

braydonk commented 7 months ago

https://github.com/shirou/gopsutil/pull/1616 is now closed since it turns out we can use host.BootTimeWithContext to manually refresh it, it's an awkward API but it will work. Updating the PR now to reflect the suggested plan.

braydonk commented 6 months ago

With #32126 merged, this will be released in version 0.99.0 of the collector. By default, the boot time will always be cached at the start for all receivers. This changes nothing for the other receivers, which all read boot time once and cached it anyway. This saves lots of CPU for processes and even more for process since now the cached boot time will be used instead of rereading it multiple times per-process.

The old behaviour of process scraping meant the boot time would be refreshed every scrape, but it would do that by being reread 4 times per-process. Under the new behaviour, the boot time will be read once at start time and then used for the rest of the collector run. With the new hostmetrics.process.bootTimeCache option disabled, it sort of replicates old behaviour by rereading boot time once per scrape, but then caches that boot time for the rest of the scrape. So the CPU savings should be there regardless of if the feature flag is switched on or off.

crobert-1 commented 6 months ago

Resolved by #32126

cforce commented 5 months ago

open-telemetry / opentelemetry-collector-contrib