sevagh opened this issue 5 years ago
Looks like after the metric[] sample slice is constructed and encapsulated in the BulkIndexRequest, it's never freed/released?
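For reference, the write path I'm talking about (pkg/elasticsearch/write.go) looks roughly like the sketch below. This is a paraphrase, not the upstream code: the field names, the document shape, and the helper structure are my assumptions.

// Paraphrased sketch of the suspected write path (pkg/elasticsearch/write.go).
// Field names and the document layout are assumptions, not the upstream code.
package elasticsearch

import (
	"github.com/prometheus/prometheus/prompb"
	elastic "gopkg.in/olivere/elastic.v6"
)

type WriteService struct {
	index     string
	processor *elastic.BulkProcessor
}

// Write turns every remote-write sample into a document and queues a
// BulkIndexRequest. Until the bulk processor flushes, each queued request
// keeps its Doc payload (and everything it references) alive on the heap.
func (s *WriteService) Write(req []*prompb.TimeSeries) {
	for _, ts := range req {
		labels := make(map[string]string, len(ts.Labels))
		for _, l := range ts.Labels {
			labels[l.Name] = l.Value
		}
		for _, sample := range ts.Samples {
			doc := map[string]interface{}{
				"label":     labels,
				"value":     sample.Value,
				"timestamp": sample.Timestamp,
			}
			r := elastic.NewBulkIndexRequest().Index(s.index).Doc(doc)
			s.processor.Add(r) // retained in memory until the next flush
		}
	}
}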
Inuse_space good:
shanssian:prometheus-es-adapter $ go tool pprof -inuse_space -top http://bleep:9000/debug/pprof/heap | head -n10
Fetching profile over HTTP from http://bleep:9000/debug/pprof/heap
Saved profile in /home/shanssian/pprof/pprof.prometheus-es-adapter.alloc_objects.alloc_space.inuse_objects.inuse_space.039.pb.gz
File: prometheus-es-adapter
Type: inuse_space
Time: Feb 20, 2019 at 4:40am (PST)
Showing nodes accounting for 65.88MB, 100% of 65.88MB total
flat flat% sum% cum cum%
25.01MB 37.96% 37.96% 32.51MB 49.34% github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
20MB 30.36% 68.32% 20MB 30.36% github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
7.50MB 11.39% 79.71% 7.50MB 11.39% github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.NewBulkIndexRequest (inline)
7MB 10.63% 90.33% 7MB 10.63% github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*Label).Unmarshal
2.40MB 3.64% 93.97% 2.40MB 3.64% bytes.(*Buffer).String (inline)
Alloc_space bad:
File: prometheus-es-adapter
Type: alloc_space
Time: Feb 20, 2019 at 4:40am (PST)
Showing nodes accounting for 167229.67MB, 98.79% of 169269.66MB total
Dropped 264 nodes (cum <= 846.35MB)
flat flat% sum% cum cum%
45555.63MB 26.91% 26.91% 78240.27MB 46.22% github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
19662.38MB 11.62% 38.53% 19662.38MB 11.62% bytes.makeSlice
10465.85MB 6.18% 44.71% 16375.52MB 9.67% github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
7419.20MB 4.38% 49.10% 32684.64MB 19.31% encoding/json.Marshal
Png dump of alloc_space:
Inuse_space graph for good measure (since I may be misunderstanding alloc_space).
Also, runtime.MemStats:
# runtime.MemStats
# Alloc = 163905888
# TotalAlloc = 197824212744
# Sys = 1208749816
# Lookups = 0
# Mallocs = 2158845849
# Frees = 2157028671
# HeapAlloc = 163905888
# HeapSys = 1136525312
# HeapIdle = 778420224
# HeapInuse = 358105088
# HeapReleased = 0
# HeapObjects = 1817178
# Stack = 4325376 / 4325376
# MSpan = 8442536 / 15089664
# MCache = 55296 / 65536
# BuckHashSys = 1904911
# GCSys = 45088768
# OtherSys = 5750249
# NextGC = 188449376
# LastGC = 1550666769451819398
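The notable figures above are HeapIdle = 778420224 with HeapReleased = 0: the runtime is holding a lot of heap it isn't using but hasn't returned to the OS. A minimal sketch for reproducing those numbers programmatically with runtime.ReadMemStats:

// Minimal sketch: print the MemStats fields shown above, plus the derived
// "held but not returned to the OS" figure (HeapIdle - HeapReleased).
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc    = %d\n", m.HeapAlloc)
	fmt.Printf("HeapSys      = %d\n", m.HeapSys)
	fmt.Printf("HeapIdle     = %d\n", m.HeapIdle)
	fmt.Printf("HeapInuse    = %d\n", m.HeapInuse)
	fmt.Printf("HeapReleased = %d\n", m.HeapReleased)
	// In the dump above this is ~778MB: memory the runtime keeps reserved
	// but idle, which is why RSS stays high even though inuse_space is small.
	fmt.Printf("idle but unreleased = %d\n", m.HeapIdle-m.HeapReleased)
}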
I'm still monitoring the situation but it looks like adding ES_BATCH_MAX_AGE=30
is preventing this issue. Perhaps it's related to the GC running?
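One way to test the GC theory (an experiment, not something the adapter does today): force a GC and ask the runtime to return idle spans to the OS, then check whether RSS drops. If it does, the growth is retained-but-reclaimable heap rather than a true reference leak. A hypothetical debug handler, purely as a sketch:

// Hypothetical debug handler (not part of the adapter): forces a GC and
// returns freed memory to the OS. If RSS drops after hitting this endpoint,
// the growth is reclaimable garbage, not a reference leak.
package debugext

import (
	"net/http"
	"runtime/debug"
)

func init() {
	http.HandleFunc("/debug/free-os-memory", func(w http.ResponseWriter, r *http.Request) {
		debug.FreeOSMemory() // runs a GC and releases idle spans to the OS
		w.Write([]byte("ok\n"))
	})
}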
I've been trying to compare the code against my previously deployed version of this adapter (which still works fine). Since your wrapper around the BulkWriter is largely the same, just moved around to different files, the only significant change I can see is this commit, which switches the Elastic client: https://github.com/pwillie/prometheus-es-adapter/commit/616b927fcb3522db5ca517ad42465deee042a44d#diff-836546cc53507f6b2d581088903b1785 I'll file an issue there.
Thanks for the detailed report. Not sure when I will get to this but please keep me posted with any further developments.
Do you guys have any update on this? We've been using this adapter and it gets regularly evicted from Kubernetes as it's consuming too much RAM (10GB+ in under 24H)... Thanks for your support!
I've actually switched to using metricbeat and its Prometheus module. You can use metricbeat to scrape the same targets as Prometheus, rather than going through the remote_* adapters.
Thanks @sevagh for the workaround; sadly it won't work for us, as the Prometheus module doesn't seem to be compatible with AWS/Kubernetes service discovery...
@Kanshiroron, I haven't had a chance to dig into this one yet sorry.
Hey @pwillie, I'm actually testing a workaround; I'll keep you posted on the results. Thanks for following up.
Hey @pwillie, after tweaking the adapter configuration, I no longer see the issue. It has been running for 10 days now without any problems. I changed the default settings to the following:
ES_WORKERS: "20"
ES_BATCH_MAX_AGE: "30" # in seconds
ES_BATCH_MAX_DOCS: "5000"
ES_BATCH_MAX_SIZE: "16384" # in bytes
ES_INDEX_SHARDS: "1"
ES_INDEX_REPLICAS: "2"
I suspect the number of shards and replicas was causing the issue (we used to have 5 index shards and 3 replicas).
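For what it's worth, the batch settings above correspond to olivere/elastic's BulkProcessor thresholds. A hedged sketch of how I'd expect them to be wired up; the exact env-var-to-option mapping is my assumption, not taken from the adapter's code:

// Sketch (not the adapter's actual code): how the ES_* settings above would
// typically map onto olivere/elastic v6 BulkProcessor options.
package esbulk

import (
	"context"
	"time"

	elastic "gopkg.in/olivere/elastic.v6"
)

func newProcessor(client *elastic.Client) (*elastic.BulkProcessor, error) {
	return client.BulkProcessor().
		Name("prometheus-es-adapter").
		Workers(20).                     // ES_WORKERS
		BulkActions(5000).               // ES_BATCH_MAX_DOCS: flush after N queued docs
		BulkSize(16384).                 // ES_BATCH_MAX_SIZE: flush after N bytes
		FlushInterval(30 * time.Second). // ES_BATCH_MAX_AGE: flush at least this often
		Stats(true).                     // expose queue/flush counters
		Do(context.Background())
}

With a flush interval set, buffered BulkIndexRequests are drained at least every 30 seconds instead of only when the doc-count or byte thresholds fire, which bounds how long queued documents stay resident on the heap.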
Thanks for the update and great to hear of your experience.
You're welcome. What I think was happening is that the es-adapter was receiving more metrics than it was able to push to ElasticSearch. It would be nice to have an internal check to make sure the number of metrics in the buffer doesn't keep increasing.
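A possible shape for that check, as a sketch only: olivere/elastic's BulkProcessor exposes per-worker queue statistics when Stats(true) is enabled, so a watchdog could poll them. The function name, threshold, and interval below are illustrative, not part of the adapter.

// Hypothetical watchdog sketch: periodically read BulkProcessor stats and
// log when the queued-request count exceeds a limit. Threshold and interval
// are illustrative only.
package esbulk

import (
	"log"
	"time"

	elastic "gopkg.in/olivere/elastic.v6"
)

func watchBacklog(p *elastic.BulkProcessor, limit int64) {
	for range time.Tick(30 * time.Second) {
		var queued int64
		for _, w := range p.Stats().Workers {
			queued += w.Queued
		}
		if queued > limit {
			log.Printf("bulk backlog growing: %d requests queued (limit %d)", queued, limit)
		}
	}
}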
Hello.
I just upgraded to the latest version of this adapter (with Prometheus 2.6 pollers) and the following ES settings:
The adapter has a memory leak. I added the pprof debug endpoint. Here's the pprof/heap -top output:
Another one:
It's mostly from this method: https://github.com/pwillie/prometheus-es-adapter/blob/master/pkg/elasticsearch/write.go#L69
Moreover, the inuse_space doesn't seem to be the source of the leak; I think it's a GC problem (but I'm not sure how well I can continue debugging). When I use pprof -alloc_space, I think the bulk of the leak is evident there:
Running with pprof list Write:
Here's the Prometheus protobuf marshaling code:
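For context, a typical Prometheus remote-write decode path looks roughly like the sketch below; this is a paraphrase (the function name is illustrative), not necessarily the adapter's exact code. It is where the prompb.(*Label).Unmarshal allocations in the heap profile above would originate.

// Standard remote-write decode path (paraphrased): snappy-decompress the
// request body, then unmarshal it into a prompb.WriteRequest.
package remotewrite

import (
	"io/ioutil"
	"net/http"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func decodeWriteRequest(r *http.Request) (*prompb.WriteRequest, error) {
	compressed, err := ioutil.ReadAll(r.Body)
	if err != nil {
		return nil, err
	}
	raw, err := snappy.Decode(nil, compressed)
	if err != nil {
		return nil, err
	}
	var req prompb.WriteRequest
	if err := proto.Unmarshal(raw, &req); err != nil {
		return nil, err
	}
	return &req, nil
}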