Mario-Hofstaetter closed this issue 2 years ago.
After changing the config of your biggest instance to:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 768
  batch:
extensions:
  memory_ballast:
    size_mib: 512
```

it is still GC'ing at short intervals:
2022-05-08T22:40:03+02:00 {"level":"info","ts":1652042403.575912,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":159}
2022-05-08T22:40:03+02:00 {"level":"info","ts":1652042403.5041053,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":702}
2022-05-08T22:39:30+02:00 {"level":"info","ts":1652042370.570825,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":163}
2022-05-08T22:39:30+02:00 {"level":"info","ts":1652042370.489251,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":641}
2022-05-08T22:38:56+02:00 {"level":"info","ts":1652042336.586582,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":166}
2022-05-08T22:38:56+02:00 {"level":"info","ts":1652042336.5032008,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":680}
2022-05-08T22:38:21+02:00 {"level":"info","ts":1652042301.5798883,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":186}
2022-05-08T22:38:21+02:00 {"level":"info","ts":1652042301.5080392,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":677}
2022-05-08T22:37:48+02:00 {"level":"info","ts":1652042268.5550425,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":158}
2022-05-08T22:37:48+02:00 {"level":"info","ts":1652042268.492242,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":645}
2022-05-08T22:37:14+02:00 {"level":"info","ts":1652042234.574366,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":159}
2022-05-08T22:37:14+02:00 {"level":"info","ts":1652042234.4942825,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":643}
2022-05-08T22:36:40+02:00 {"level":"info","ts":1652042200.5750558,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":159}
2022-05-08T22:36:40+02:00 {"level":"info","ts":1652042200.501883,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":714}
2022-05-08T22:36:06+02:00 {"level":"info","ts":1652042166.5794554,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":169}
2022-05-08T22:36:06+02:00 {"level":"info","ts":1652042166.4941897,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":655}
2022-05-08T22:35:32+02:00 {"level":"info","ts":1652042132.5813339,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":160}
2022-05-08T22:35:32+02:00 {"level":"info","ts":1652042132.4957755,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":619}
2022-05-08T22:35:06+02:00 {"level":"info","ts":1652042106.5584857,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":170}
2022-05-08T22:35:06+02:00 {"level":"info","ts":1652042106.4969077,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":632}
2022-05-08T22:34:33+02:00 {"level":"info","ts":1652042073.5752532,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":148}
2022-05-08T22:34:33+02:00 {"level":"info","ts":1652042073.4907186,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":620}
2022-05-08T22:33:59+02:00 {"level":"info","ts":1652042039.5686462,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":135}
2022-05-08T22:33:59+02:00 {"level":"info","ts":1652042039.4970694,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":621}
2022-05-08T22:33:24+02:00 {"level":"info","ts":1652042004.566677,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":143}
2022-05-08T22:33:24+02:00 {"level":"info","ts":1652042004.5086606,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":631}
2022-05-08T22:32:50+02:00 {"level":"info","ts":1652041970.5390697,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":147}
2022-05-08T22:32:50+02:00 {"level":"info","ts":1652041970.490576,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":668}
2022-05-08T22:32:10+02:00 {"level":"info","ts":1652041930.5783658,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":112}
2022-05-08T22:32:10+02:00 {"level":"info","ts":1652041930.5014062,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":663}
Is this normal behavior? IMHO 700 MB of memory is a lot for scraping 8 MB of metrics.
I've been playing with the memory limiter for the last couple of days and am still anxious about memory usage. Using only a metrics pipeline and an (imho) reasonable amount of metrics, the otelcol logs are full of `info` and `warn` messages showing GC actions, and even `Dropping data`, while the process consumes > 600 MB of memory?
Currently the prometheus exporter exposes 34377 lines of metrics (4.39 MB). 600 MB of memory seems too much for this.
What am I doing wrong?
`pprof` outputs: since GitHub does not accept .txt or .gz at the moment, here is a link to a tar.gz with all files from /pprof: https://1drv.ms/u/s!AnvnX1Qo7mIHj6VrpUxcn95c8Ue7Tw?e=DfQS6J
cc @dashpole @Aneurysm9
Note that this happens with v0.48.0 and v0.51.0, so both before and after fixing #9278
Update: currently running `0.51.0` locally using the following config. I got rid of all processors (except `memory_limiter`), extensions, and the prometheus exporter. `scrape_interval: 10s` is now relatively low; otelcol is basically only logging to console and discarding all metrics.
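For reference, the stripped-down setup described here would look roughly like this; a sketch, with target endpoints taken from the scrape targets visible in the log excerpts, not the exact config:

```yaml
# Sketch of the minimal debug setup described above; the
# localmetrics targets are illustrative.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: localmetrics
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:18086', 'localhost:18087']
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 768
exporters:
  logging:
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [logging]
```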
Doing this, the process currently sits at around ~780 MB of memory, and seems to be slowly increasing:
Log output looks like this; two targets are currently unavailable:
2022-05-13T12:13:59.163+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 20}
2022-05-13T12:14:02.857+0200 warn internal/otlp_metricsbuilder.go:161 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1652436840494, "target_labels": "map[__name__:up app:AX.Process.AxCommunicationClients.PLCDevices instance:localhost:18087 instance_origin:localhost:18087 job:localmetrics]"}
2022-05-13T12:14:02.857+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 5}
2022-05-13T12:14:03.955+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 199}
2022-05-13T12:14:06.596+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 199}
2022-05-13T12:14:07.731+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 199}
2022-05-13T12:14:08.421+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 138}
2022-05-13T12:14:08.840+0200 warn internal/otlp_metricsbuilder.go:161 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1652436846490, "target_labels": "map[__name__:up app:AX.Server.Service instance:localhost:18086 instance_origin:localhost:18086 job:localmetrics]"}
2022-05-13T12:14:08.840+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 5}
2022-05-13T12:14:09.155+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 20}
2022-05-13T12:14:12.840+0200 warn internal/otlp_metricsbuilder.go:161 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1652436850489, "target_labels": "map[__name__:up app:AX.Process.AxCommunicationClients.PLCDevices instance:localhost:18087 instance_origin:localhost:18087 job:localmetrics]"}
2022-05-13T12:14:12.840+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 5}
2022-05-13T12:14:13.936+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 199}
2022-05-13T12:14:16.556+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 199}
2022-05-13T12:14:17.668+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 199}
2022-05-13T12:14:18.424+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 138}
2022-05-13T12:14:18.857+0200 warn internal/otlp_metricsbuilder.go:161 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1652436856492, "target_labels": "map[__name__:up app:AX.Server.Service instance:localhost:18086 instance_origin:localhost:18086 job:localmetrics]"}
2022-05-13T12:14:18.858+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 5}
2022-05-13T12:14:19.165+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 20}
2022-05-13T12:14:22.862+0200 warn internal/otlp_metricsbuilder.go:161 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "scrape_timestamp": 1652436860493, "target_labels": "map[__name__:up app:AX.Process.AxCommunicationClients.PLCDevices instance:localhost:18087 instance_origin:localhost:18087 job:localmetrics]"}
2022-05-13T12:14:22.862+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 5}
2022-05-13T12:14:23.951+0200 INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 199}
Things I will / could try next:

- `tls_config`
- stop scraping the `localhost:8888` `telemetry` metrics (going blind) and only scrape something else

Adding a few dozen failing prometheus scrape targets, while removing some working endpoints, did not increase memory usage but rather lowered it.
@dashpole Regarding https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/10546#issuecomment-1151113059
I'll give Release 0.53.0 a try and see if it improves memory consumption for my metrics usage.
The benchmarking I did (admittedly, a while ago) for the prometheus receiver was ~22 MB at idle + ~6KB / series at a 60s scrape interval. Assuming a 10s interval increases memory usage by 6x (which probably is an overestimate), that would predict ~ 1.4GB of total usage in your case, which isn't too far off.
A few things that would be helpful to try:
Thank you for the suggestions @dashpole
> Assuming a 10s interval increases memory usage by 6x (which probably is an overestimate)

But why, though? Shouldn't all previous samples be thrown away after a new successful scrape of targets? Only the most recent datapoints are exposed on the (prometheus-)exporter? Or is there a bigger buffer internally because other exporters (!= prometheus) may export a history of the last X values? 👀
> get a baseline for comparison using the Prometheus server (in agent mode)
Good idea.
> Try out your setup on Linux

That's not possible (and irrelevant) unfortunately, because (a) our environments are heavily on the Windows side, and (b) the actual endpoints are only reachable on localhost (firewalls).
> But why, though? Shouldn't all previous samples be thrown away after a new successful scrape of targets? Only the most recent datapoints are exposed on the (prometheus-)exporter? Or is there a bigger buffer internally because other exporters (!= prometheus) may export a history of the last X values? 👀
Other than the batch processor and the sending queue, nothing should be holding onto multiple scrapes of metrics, so you are probably right that lowering the scrape interval shouldn't matter too much. You could try a higher interval to see if it makes a big difference.
> You could try a higher interval to see if it makes a big difference

Will do. Also, how about not using the `batch` processor altogether? Not sure why I haven't tested that yet..
I configured it because it is recommended; it's running with the default settings, so that should be:

- `send_batch_size` (default = 8192): Number of spans, metric data points, or log records after which a batch will be sent regardless of the timeout.
- `timeout` (default = 200ms): Time duration after which a batch will be sent regardless of size.
- `send_batch_max_size` (default = 0): The upper limit of the batch size. 0 means no upper limit on the batch size. This property ensures that larger batches are split into smaller units. It must be greater than or equal to `send_batch_size`.
I am unsure if it makes much sense for a metrics pipeline that is strictly using the prometheus receiver and exporter? A `send_batch_size` of 8192 will be exceeded on every scrape for our bigger targets, and a timeout of 200ms does not seem like much. Or maybe our BIG metric endpoints are the problem, because a single scrape of the receiver exceeds the 8192 metric points? I will also try setting that to something like 100000, so that one scrape of our BIGGEST endpoints fits within one batch, if that makes any sense.
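A sketch of what that batch config could look like; the value is the one floated above, not a recommendation:

```yaml
processors:
  batch:
    # large enough that one scrape of the biggest endpoint fits in a
    # single batch (value from the discussion above)
    send_batch_size: 100000
    # default; a batch is sent after this much time regardless of size
    timeout: 200ms
```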
> Also, how about not using the batch processor altogether?

Definitely worth a try.

> I am unsure if it makes much sense for a metrics pipeline that is strictly using the prometheus receiver and exporter?

Agreed.
First observations: after updating from `0.48.0` to `0.53.0`, memory usage on our server with the most metrics dropped noticeably, but not by a huge amount, from a `memory_rss` mean of ~780 MiB to a mean of about 686 MiB; other memory metrics changed similarly. Removing the `batch` processor on the same machine had a negligible effect (< 30 MiB difference). Other tests will follow.
We also observed a memory leak from the prometheus receiver. For example, the following dashboard is for one instance that's pulling the kube-state-metrics for a k8s cluster; the metrics pipeline is configured as:

```yaml
metrics:
  receivers: [ prometheus ]
  processors: [ attributes/drop_labels, batch/metrics ]
  exporters: [ nop ]
```
Here are the heap profiles taken after running for 6 hours and ~1 day.
Update: the otel collector in the above case was running the latest prometheus receiver (cb16d48cf34b486aafa6aafe367208beac160665), which imports prometheus v0.36.2. To compare with how prometheus itself behaves, I've started a prometheus agent running v2.36.0 with the following config:

```yaml
scrape_configs:
  - job_name: 'prometheus'
    metrics_path: /metrics
    scrape_interval: 120s
    scrape_timeout: 90s
    static_configs:
      - targets: [ 'kube-state-metrics.kube-system:8080' ]
```
No memory leak was observed for more than 12 hrs.
First of all: when @Mario-Hofstaetter says there's a memory leak, I pay attention. I still have nightmares about https://github.com/jaegertracing/jaeger/issues/2638
Reading through this, I'm not sure yet there's a leak: apparently, the memory usage increases up to the threshold of the memory limiter and stabilizes there. I have seen something similar in Jaeger in the past and found out that Go won't release the memory to the OS despite not using it anymore (`GODEBUG=madvdontneed=1` used to have some effect; not sure that's still the case in newer Go versions). Your previous profiling data didn't show anything of interest to me, just that the biggest offender seems to be the memory ballast extension, which wasn't surprising. Is there anything in the pprof data that would suggest that we do indeed have a leak?
@jpkrohling sorry for letting this issue go stale.
From what I have seen so far, it does not entirely look like a memory leak, as memory usage does not increase indefinitely, or the memory limiter component is preventing the leak by its hard limit (?).
I have not yet tested prometheus in agent mode as "baseline" for memory usage for our metrics environment.
At the moment I barely have any time to contribute to this topic unfortunately, I am afraid. It looks like @newly12 is also more skilled at providing insights.
I had changes to disable the metrics adjuster. The metrics adjuster basically sets the startTimestamp for metrics; in order to do so, it caches 2 copies (initial, previous) of every metric, since the startTimestamp is only defined in the OpenTelemetry model, not the prometheus model. Given use cases like scraping prometheus metrics and publishing them to a prometheus-model storage (via the prometheus remote write exporter), I think it is a fair ask to provide an option to disable it, due to the significant memory consumption and potential leak. After the change, memory use is pretty stable.
@newly12 That's the memory usage of otelcol?! 👀 How many metric series are scraped on that instance?
@Mario-Hofstaetter ~5M metrics, prometheus agent consumes ~25G memory as well.
@newly12 That's quite a dramatic difference. I'm definitely supportive of being able to disable start time tracking in that case. cc @Aneurysm9.
Given that, I think we should consider disabling it by default in the future, or moving it to a separate processor or to exporterhelper. Speaking from Google's perspective, we've had to reimplement start time tracking in our exporter regardless, since not all receivers follow the spec for handling unknown start time that the prom receiver implements.
Still not entirely sure what we are hitting; maybe it's the start time calculation too. I have been running two configurations for some days now, memory was stable, and just today, when activity started on the servers, memory usage again made a considerable jump.
I am running prometheus (`2.36.2`) in agent mode too on the servers, with the same scrape configuration (but no working remote_write target, dunno if that matters), and it generally had a little lower memory consumption and remained stable today.

Server 002 is still running `0.48.0` barebones (no extensions, no ballast, no `batch`):
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
  batch:
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [prometheus]
```
Server 001 currently runs `0.54.0` with:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
  batch:
service:
  extensions: [health_check,zpages]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter,batch]
      exporters: [prometheus]
```
Exporters config on both is still:

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:7299
    metric_expiration: 5m # default = 5m
    send_timestamps: true
```
I suspected the `batch` processor, `pprof`, `zpages`... so I tried different combinations so far... the jump on server 002 ruled out all suspects so far (leaving the start time calculation?).
Looking at the time series count over time today on these servers, a (moderate) increase of unique series is visible, but not quite matching the jump in memory at ~ 13:00 and ~ 13:50 Local Time.
So my current thesis is that an increase in metric series count (cardinality) can lead to a sudden increase in memory usage (?).
- `metric_expiration` in the exporter?
- `send_timestamps: true` in the exporter?

`send_timestamps` probably won't have an impact. Lowering `metric_expiration` might have an impact, but make sure it is at least as high as your scrape interval (preferably at least twice your scrape interval).
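If lowering it, a config along these lines would respect that guidance for the 10s scrape interval used here (a sketch, not a recommendation):

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:7299
    # well above 2x the 10s scrape interval, so series are not
    # expired between successful scrapes, but stale series are
    # dropped much sooner than with the 5m default
    metric_expiration: 1m
    send_timestamps: true
```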
Updated to the fresh `0.55.0` release on both servers after it came out, and made some more config changes. It did not seem to matter.
It looks like, if the metrics of our application change (due to activity, i.e. new series because of new label variants and/or an application restart), memory may make a jump.
So at this point it seems useless to try more; maybe it's best to wait for #12215. The current plan is to use a minimal config (no ballast, no `batch`, no `pprof`) and set a reasonable `memory_limit`.
That should keep memory usage within acceptable bounds.
Memory on Server 002 is a bit lower, which may be due to `metric_expiration: 4m` instead of 5m, and/or this instance having a slightly lower metric series count. Prometheus agent memory did not change again.
Things left to try out:

- `telemetry`?
- make a build from https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/12215 and try it out? This is showing pretty stable in our env.
> make a build from #12215 and try it out? This is showing pretty stable in our env.
@RalphSu maybe on the weekend, gotta learn how to build a go app / otelcol first 🙈
hint: use the OpenTelemetry Collector Builder -- https://github.com/open-telemetry/opentelemetry-collector/tree/main/cmd/builder
@jpkrohling I tried building using the Collector Builder but failed miserably. I am sorry, I am unfamiliar with Go tooling.. I'd like to try out `disable_start_time`, because meanwhile I know how to provoke the memory increase. What's the error in this config? Thanks..
```yaml
exporters:
  - gomod: "github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter v0.55.0"
receivers:
  - gomod: "github.com/newly12/opentelemetry-collector-contrib/receiver/prometheusreceiver main"
processors:
  - import: go.opentelemetry.io/collector/processor/memorylimiterprocessor
    gomod: go.opentelemetry.io/collector v0.55.0
replaces:
  # a list of "replaces" directives that will be part of the resulting go.mod
  - github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver => github.com/newly12/opentelemetry-collector-contrib/receiver/prometheusreceiver main
```
Running with

```
.\ocb_0.55.0_windows_amd64.exe --config=./otelcol-builder.yaml --output-path=./tmp/
```

I am getting different variants of this error:

```
Error: failed to update go.mod: exit status 1. Output: "go: github.com/newly12/opentelemetry-collector-contrib/receiver/prometheusreceiver@v0.44.1-0.20220716201014-d4e2edcf6ea1: parsing go.mod:
	module declares its path as: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver
	but was required as: github.com/newly12/opentelemetry-collector-contrib/receiver/prometheusreceiver"
```
Using

```yaml
receivers:
  - gomod: "github.com/newly12/opentelemetry-collector-contrib/receiver/prometheusreceiver main"
    import: "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver"
```

only makes the error list longer.
Flipping the `replaces` statement got the compile to work, but I guess I was not using the fork then, because starting with `disable_start_time: true` failed. I may have time on the weekend to read up on the golang documentation.
I ran some more tests (with `0.55.0`) after realizing which actions cause new metric series, which in turn cause the memory increase. Within 30 minutes the otelcol process goes from < 400 MiB to > 2 GiB of memory (see below). After restarting otelcol, memory is again below 400 MiB.
I could provide `pprof` dumps of the different states, if anyone would be interested, or wait until #12215 is merged or I get the compile working... Is it probable that `disable_start_time` will fix this behavior, or is there something else going on in our metrics? @newly12 @dashpole
(sorry for being bothersome in this issue)
`memory_limiter` is currently running with `limit_mib: 3000` due to these tests.
After restarting our apps and otelcol, I ran about 880 "tasks" in our app, which resulted in the addition of some metrics. Also, some windows_exporter metrics come and go as process IDs change. Together that's an increase of a few thousand in metric count over ~30 minutes. I exported the prometheus exporter output every minute to a text file:
| Time | Metrics bytes | Metrics lines | otelcol_process_memory_rss ~ |
|---|---|---|---|
| 2022-07-19 13:22:59 | 11476774 | 44499 | 336 MiB |
| 2022-07-19 13:45:08 | 12925711 | 50688 | 1.26 GiB |
| 2022-07-19 14:08:25 | 14027869 | 54462 | 2.22 GiB |
pprof dumps would be useful with or without disable start time set. Thanks for your investigation!
> pprof dumps would be useful with or without disable start time set. Thanks for your investigation!
@dashpole for `disable_start_time` I would need assistance regarding https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/9998#issuecomment-1189033814
I have repeated my test running `pprof` with default settings, and saved everything I found 3 times:
Please let me know if this helps. I could also provide our raw metrics (in the prometheus text format), which could simplify running tests?
Since I do not fully understand the `memory_limiter` / garbage collection interaction, I re-ran the test with:

```yaml
memory_limiter:
  check_interval: 1s
  limit_mib: 1000
```
The process now peaks at around 1250 MiB, but the receiver is not happy (`data dropped due to high memory usage`) and fails to scrape its own telemetry metrics. So the `memory_limiter` is not a suitable solution unless very high limits are used, or none at all.
{"level":"warn","ts":1658245285.2995648,"caller":"memorylimiterprocessor/memorylimiter.go:309","msg":"Memory usage is above soft limit. Dropping data.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":875}
{"level":"info","ts":1658245279.445666,"caller":"memorylimiterprocessor/memorylimiter.go:295","msg":"Memory usage back within limits. Resuming normal operation.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":771}
{"level":"info","ts":1658245279.445666,"caller":"memorylimiterprocessor/memorylimiter.go:273","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":771}
{"level":"warn","ts":1658245279.2954862,"caller":"memorylimiterprocessor/memorylimiter.go:283","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":1069}
{"level":"error","ts":1658245278.462892,"caller":"scrape/scrape.go:1273","msg":"Scrape commit failed","kind":"receiver","name":"prometheus","pipeline":"metrics","scrape_pool":"localmetrics","target":"http://localhost:18087/metrics","error":"data dropped due to high memory usage","stacktrace":"github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1273\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1342\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1224"}
{"level":"error","ts":1658245278.429332,"caller":"scrape/scrape.go:1273","msg":"Scrape commit failed","kind":"receiver","name":"prometheus","pipeline":"metrics","scrape_pool":"localmetrics","target":"http://localhost:9080/metrics","error":"data dropped due to high memory usage","stacktrace":"github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1273\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1342\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1224"}
{"level":"error","ts":1658245273.3404717,"caller":"scrape/scrape.go:1273","msg":"Scrape commit failed","kind":"receiver","name":"prometheus","pipeline":"metrics","scrape_pool":"localmetrics","target":"http://localhost:9182/metrics","error":"data dropped due to high memory usage","stacktrace":"github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1273\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1342\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1224"}
{"level":"error","ts":1658245272.5216439,"caller":"scrape/scrape.go:1273","msg":"Scrape commit failed","kind":"receiver","name":"prometheus","pipeline":"metrics","scrape_pool":"localmetrics","target":"http://localhost:8888/metrics","error":"data dropped due to high memory usage","stacktrace":"github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1273\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1342\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1224"}
{"level":"warn","ts":1658245267.2991486,"caller":"memorylimiterprocessor/memorylimiter.go:309","msg":"Memory usage is above soft limit. Dropping data.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":814}
{"level":"info","ts":1658245261.4467704,"caller":"memorylimiterprocessor/memorylimiter.go:295","msg":"Memory usage back within limits. Resuming normal operation.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":795}
{"level":"info","ts":1658245261.4467704,"caller":"memorylimiterprocessor/memorylimiter.go:273","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":795}
{"level":"error","ts":1658245261.4117634,"caller":"scrape/scrape.go:1273","msg":"Scrape commit failed","kind":"receiver","name":"prometheus","pipeline":"metrics","scrape_pool":"localmetrics","target":"http://localhost:9080/metrics","error":"data dropped due to high memory usage","stacktrace":"github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1273\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1342\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1224"}
{"level":"error","ts":1658245261.365305,"caller":"scrape/scrape.go:1273","msg":"Scrape commit failed","kind":"receiver","name":"prometheus","pipeline":"metrics","scrape_pool":"localmetrics","target":"http://localhost:18087/metrics","error":"data dropped due to high memory usage","stacktrace":"github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1273\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1342\ngithub.com/prometheus/prometheus/scrape.(*scrapeLoop).run\n\tgithub.com/prometheus/prometheus@v0.36.2/scrape/scrape.go:1224"}
{"level":"warn","ts":1658245261.2909968,"caller":"memorylimiterprocessor/memorylimiter.go:283","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"metrics","cur_mem_mib":1023}
I've been tracking this for a couple of days and we're seeing the exact same behavior here; our instance is running in K8s so getting pprof dumps would be a major PITA, so I just wanted to chime in and thank @Mario-Hofstaetter profusely for putting in the work 👍
@dashpole, will you look into this, or should I put it on my queue?
If you have time, you are welcome to look into it. I have a strange dev setup, and was having trouble opening the pprof profiles earlier
If you have time
I have a few other items on my queue, but I think this might take precedence. Given that this seems related to the metrics part, you (or @gouthamve?) would probably find the problem faster than me, but if you can't, I can give it a try.
@Mario-Hofstaetter please try this builder config.
exporters:
  - gomod: "github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter v0.55.0"

receivers:
  - gomod: "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.55.0"

processors:
  - import: go.opentelemetry.io/collector/processor/memorylimiterprocessor
    gomod: go.opentelemetry.io/collector v0.55.0

replaces:
  # a list of "replaces" directives that will be part of the resulting go.mod
  - github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver => github.com/newly12/opentelemetry-collector-contrib/receiver/prometheusreceiver prom_receiver_mem
disable_start_time: true
Looks promising. Memory stayed < 400 MiB while running the actions in our application which cause new metrics. Memory usage of otelcol is now below that of prometheus in agent mode.

Memory footprint could also be especially low because the custom build only contains the used components? Will re-run the test with the custom binary but with disable_start_time set to false to double check.
Thank you @newly12 for the builder config, what a fail using the wrong branch in the replace for the fork.
Also thank you so much for your fix. If this proves to be stable without issues, we can run the custom build and solve this issue that troubled us for months now.
The compile worked despite forgetting to run
$ GO111MODULE=on go install go.opentelemetry.io/collector/cmd/builder@latest
... I have no idea what that environment variable does, but I will repeat the compile nevertheless.
I started the built custom otelcol binary after restarting our application and re-ran my actions from yesterday.
Not sure why otelcol_process_runtime_total_sys_memory_bytes still slightly increased; will watch over a period of days.
disable_start_time
Running the same build (now including the otlp / jaeger receiver / exporter and the batch processor, in case they are needed later), I re-ran the tests with disable_start_time: false and true.
Note: The amount of metrics was still increasing at the end, which is why the memory usage was still slowly rising.
I will now let this otelcol process run for several days without restarts.
Looking good.
Woohoo! Looking forward to seeing this released...
One other observation we've made regards Windows. We're still collecting profiles etc., but it looks like our collector running on Windows consistently uses more memory than on Linux. We're running in K8s, and since the profiles don't show significant differences, the current theory is that memory reporting on Windows is different: we see memory use growing over time and then leveling out, which looks a little like Windows not reclaiming memory until something else needs it. But so far I lack background on the Windows side of things to confirm anything, and we haven't prioritized it yet.
Any update on this issue? We tried to build our own otel collector and the fix in #12215 worked (ran over a week on a cluster with 60 nodes).
I commented in #12215 -- have we investigated why the garbage collection support in the metrics adjuster does not appear to work?
Support for start time is serious business. Prometheus has a heuristic that leaves it unable to correctly calculate rates across restarts. OTLP makes it possible to correctly calculate rates around restarts, but that will be broken by the proposed fix to disable the adjuster. See the comment here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/internal/otlp_metrics_adjuster.go#L28
@Mario-Hofstaetter, @holograph, @Doron-Bargo, could you give 0.57.2 a try? It was released last week.
@Mario-Hofstaetter, @holograph, @Doron-Bargo, could you give 0.57.2 a try? It was released last week.
@jpkrohling The release notes mention neither this issue (#9998) nor the PR (#12215)? Has the new option been released? Or have there been other changes that might reduce memory consumption?
Has anyone looked at the memory dumps yet? It does not appear so; therefore it is still unclear whether the memory usage (without disabling the start timestamp) is buggy.
My understanding of this problem is that the OTel collector maintains a duplicative map of every active counter/histogram/summary timeseries in order to establish the start time of each series.
This is incredibly wasteful.
The Prometheus scrape manager includes the necessary map already, and IMO a good solution would be to extend the Prometheus scrape manager to include a small amount of new information about each series. Specifically, when the scrape observes a reset it is required to use its local information about the reset time as the start time of the reset series. Without the receiver adding this information--which it has on hand--the consumer is forced to read their database in order to establish the meaning (i.e., a contributed rate interpretation) of the point being written, which is a major efficiency concern and the reason OpenTelemetry includes a start timestamp.
I'm afraid the other ways of fixing this problem require replacing the Prometheus scrape manager in the OTC Prometheus receiver.
@Mario-Hofstaetter I think @jpkrohling was referring to #12765, which seems promising. Would you mind giving it a try in your environment?
Well, I'll be damned. I ran my test scenario with release otelcol 0.57.2 vanilla, and it looks like (for our scenario at least) the memory consumption has IMPROVED DRAMATICALLY and is fixed, so to speak ✔
Big shoutout to @balintzs if #12765 was the golden change 👌🏻
On our biggest instance, after running my test, the otelcol process uses less memory than prometheus in agent mode.
Will try again with a non-minimalistic configuration and run it long-term, but this looks very promising.
Give it a try @Doron-Bargo @kwiesmueller @holograph @RalphSu
If memory stays stable, what should happen with this issue? Close as solved I guess after more feedback from the community? @newly12 @jpkrohling
Gonna open some beers as soon this PITA is closed 🍺
If memory stays stable, what should happen with this issue? Close as solved I guess after more feedback from the community?
I think @jmacd has concerns about the current way we do things, so I'd either open a new issue to address his specific concerns and close this, or keep this one here open until his point is addressed.
Coming to this very late, but could others also verify whether otelcol 0.57.2 fixes things for them as well? If yes, we can close this issue, as we are still doing start time tracking, just less buggily :)
Long-term though, I think @jmacd is right that we should be doing this in Prometheus itself: https://github.com/prometheus/prometheus/issues/10164#issuecomment-1215037396
I'll update this issue once I have some buy-in from the Prometheus maintainers.
@gouthamve This sounds great! Thank you for the link.
Ignore this, I'm retesting #12765.
~Patched https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/12765 into v0.55 and it's amazing. Cpu reduced from 100m to 7m, and memory reduced from 1GiB to 500MiB.~
Oh, I also updated the use_start_time_metric flag. Lemme double check.
I jumped straight from 0.55 to 0.58, been running for the last 5 hours or so. So far memory utilization went down by about 30% and I'm not observing an upward trend, however the slow creep upwards in memory utilization is only observable after a much longer period of time (24 hours or so) so I'm still tracking. Fingers crossed!
Well over 12 hours in, I'm ready to call this a win:
The yellow graph represents the memory utilization of the previous 0.55 instance, stabilizing around the 2GB mark and then slowly creeping to the 2.4GB range over the course of a few days. The green line is 0.58, stabilizing very quickly around the more-modest 1.5GB mark and so far maintaining consistent memory usage patterns. Well done @jpkrohling and everyone involved, and thanks again to @Mario-Hofstaetter for relentlessly pushing this issue, it's great news to bring to my customers :-)
Given the successful memory reductions, I think we can close this issue. I've opened https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/13451 to track lowering the memory used by caches in the prometheus receiver.
I think https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/13922 would cut the memory usage to about a third, because we no longer need to store the attributes for the initial and previous points.
Describe the bug
I am infrequently observing high memory usage of otelcol.exe on Windows systems. The memory is, however, within the set bounds of memory_limiter (see config below). But I am not sure if this memory usage is intended behavior, since there is negligible load on the systems. I am not experienced in golang, so please excuse my lack of knowledge.

My planned action to counteract this behavior is to reduce the configured limit_mib and fix the memory_ballast.

Steps to reproduce
?? Full config see below.
What did you expect to see?
otelcol should not consume "much" memory given the light workload, and/or memory should eventually be garbage collected?

What did you see instead?
The otelcol process uses (?) up to 2 GB of memory. Currently on my local machine it sits at ~1.5 GB. Last stdout log message from memory_limiter after the machine was started from hibernation:

Is this the intended memory usage behavior of the process?
What are recommended values for memory_limiter? The documentation is a little vague.

On my machine, the prometheus exporter currently emits ~19362 lines of metrics, less than 3 MB in size. On our biggest instance, the prometheus exporter has 37654 metric lines, ~8 MB. Is this a lot?

The documentation on that page uses limit_mib: 4000, which seems kinda HUGE for this kind of application?

After re-reading those docs just now, this line caught my attention:
So it actually is the expected behavior of otelcol to stay around limit_mib indefinitely?

My trace queue size is currently at zero:
I had the suspicion that the trace buffer was filling memory when the VPN was disconnected and the jaeger server not reachable, since the default queue size of 5000 (?) seemed rather high.

Looking at logs from last night, it has been the traces that were responsible for causing the memory increase:

If the sending_queue queue_size of 5000 does not fit within memory limits, what is going to happen? Are the oldest trace spans going to get dropped from the queue?

I did however also have one machine where otelcol used 2 GB of memory suddenly, and no traces were being queued. No warnings at the time; unfortunately I have no info logs from that date (otelcol was still logging to the windows application event log).
There are no apps emitting traces on that machine yet, so no idea what has happened there.
I have collected and attached various debug information from http://localhost:1777/debug/pprof/, in case the memory usage is not okay.
What version did you use?
What config did you use?
I am using two config files (currently 3, to debug) consisting of the following parts:
ConfigFile 1 (common and metrics):
I just noticed an error in my config... the memory_ballast extension is not used, it seems.

Second file, used optionally (if traces are required):
Third config file, currently used to add pprof:
Environment

OS: Windows 10 21H2, Windows Server 2019
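[Editor's note] The sizing questions raised above (what memory_limiter values to pick, and the ballast not being active) can be pulled together into one hedged sketch. The numbers are illustrative, not official recommendations; the one hard point is that the ballast only takes effect when listed under service.extensions:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 768        # hard limit; keep this below the real memory budget
    spike_limit_mib: 154  # soft limit = limit_mib - spike_limit_mib (~20% here)

extensions:
  memory_ballast:
    size_mib: 256         # often sized at roughly 1/3 of limit_mib

service:
  extensions: [memory_ballast]  # without this line the ballast is never created
  # pipelines omitted; memory_limiter should be the first processor in each one
```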