pwillie / prometheus-es-adapter

Prometheus remote storage adapter for Elasticsearch
Apache License 2.0

Leaking connections/file descriptors #9

Open · sevagh opened 6 years ago

sevagh commented 6 years ago

Hello,

I'm running this adapter. Even with a high ulimit (131072), it seems to be leaking connections:

$ sudo lsof -p 2721 | wc -l
131080
May 11 03:33:40 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:40 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s
May 11 03:33:41 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:41 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s
May 11 03:33:42 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:42 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s
May 11 03:33:43 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:43 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s

I'm trying to find out where this is occurring: perhaps in the Elastic client you use, or perhaps in the HTTP server in this adapter.
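
One thing I may try is patching a small file-descriptor watcher into my local build so I can see the count climb from inside the process. This is only a hypothetical Linux-only helper I'd add myself, not anything that exists in the adapter today:

    package main

    import (
        "log"
        "os"
        "time"
    )

    // watchFDs logs the number of open file descriptors (entries in
    // /proc/self/fd) every interval; a leak shows up as a steadily rising count.
    func watchFDs(interval time.Duration) {
        for range time.Tick(interval) {
            entries, err := os.ReadDir("/proc/self/fd")
            if err != nil {
                log.Printf("fd watcher: %v", err)
                continue
            }
            log.Printf("open file descriptors: %d", len(entries))
        }
    }

    func main() {
        go watchFDs(10 * time.Second)
        select {} // stand-in for the adapter's real work
    }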

Have you seen any behavior like this?

pwillie commented 6 years ago

Hi @sevagh,

Thanks for the report. I'm afraid I haven't seen this before, but I'll dig into it when I get a chance. What config are you running it with?

sevagh commented 6 years ago

I was running it with the default settings. It seems the adapter hit a bottleneck with Elasticsearch, and the Prometheus remote storage queue created too many shards/goroutines.

Today I ran it with these settings:

Environment="ES_WORKERS=4"
Environment="ES_BATCH_COUNT=-1"
Environment="ES_BATCH_SIZE=-1"
Environment="ES_BATCH_INTERVAL=30"

Now there are no connections piling up. (Closed by accident and re-opened.)
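
For what it's worth, my mental model (an assumption on my part, not a copy of the adapter's code) is that these variables end up on olivere/elastic's bulk processor roughly like this, so with the count and size triggers set to -1 the only remaining flush trigger is the 30-second interval:

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/olivere/elastic"
    )

    func main() {
        client, err := elastic.NewClient(elastic.SetURL("https://myelastic:9200"))
        if err != nil {
            log.Fatal(err)
        }
        processor, err := client.BulkProcessor().
            Name("prometheus-es-adapter").
            Workers(4).                      // ES_WORKERS
            BulkActions(-1).                 // ES_BATCH_COUNT: -1 disables the count trigger
            BulkSize(-1).                    // ES_BATCH_SIZE: -1 disables the size trigger
            FlushInterval(30 * time.Second). // ES_BATCH_INTERVAL
            Do(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        defer processor.Close()
    }

Again, that's only a sketch of how I assume the settings are wired through, not the adapter's actual code.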

pwillie commented 6 years ago

What settings did you have when you experienced the issues?

sevagh commented 6 years ago

Before:

ExecStart=/usr/local/bin/prometheus-es-adapter \
        --es_url=https://myelastic:9200 \
        --es_user=local_prometheus-adapter \
        --listen 0.0.0.0:9201

I didn't modify the default values for:

Variable | Default | Description
ES_WORKERS | 0 | Number of batch workers
ES_BATCH_COUNT | 1000 | Max items for bulk Elasticsearch insert operation
ES_BATCH_SIZE | 4096 | Max size in bytes for bulk Elasticsearch insert operation
ES_BATCH_INTERVAL | 10 | Max period in seconds between bulk Elasticsearch insert operations

sevagh commented 6 years ago

I think this is a case of "I should not rely on defaults in production" - user error.

pwillie commented 6 years ago

I don't believe we should write this off as user error. Not sure when I will get to this, but a few thoughts:

  1. perhaps the defaults need tweaking?
  2. it really needs to degrade gracefully regardless of config
  3. it should apply back pressure (rough sketch below)
  4. it should surface this condition through metrics
  5. some combination of the above

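Something along these lines is what I have in mind for 3 and 4 (just a sketch on my side, nothing implemented yet, and the metric name is made up): cap the number of in-flight write requests and fail fast with a 503 instead of accepting connections until the process runs out of file descriptors, and count the rejections so the condition is visible in metrics.

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // rejectedWrites is a hypothetical metric, not one the adapter exports today.
    var rejectedWrites = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "adapter_write_requests_rejected_total",
        Help: "Write requests rejected because the in-flight limit was reached.",
    })

    // limitInFlight caps concurrent requests to a handler; overflow fails fast.
    func limitInFlight(max int, next http.Handler) http.Handler {
        sem := make(chan struct{}, max)
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            select {
            case sem <- struct{}{}:
                defer func() { <-sem }()
                next.ServeHTTP(w, r)
            default:
                rejectedWrites.Inc()
                http.Error(w, "too many in-flight writes", http.StatusServiceUnavailable)
            }
        })
    }

    func main() {
        prometheus.MustRegister(rejectedWrites)
        write := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK) // placeholder for the real remote-write handler
        })
        http.Handle("/write", limitInFlight(64, write))
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9201", nil))
    }
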
sevagh commented 6 years ago

I'll see about replicating the old bad settings and tracing the leak to a specific line of code here.

pwillie commented 5 years ago

@sevagh how did you get on?

sevagh commented 5 years ago

Oops, I never really revisited this. I promise I'll try to recreate it on Monday.

On the other hand, with the production configuration, this adapter has been running without a crash since basically May. Really great work here.