pwillie / prometheus-es-adapter

Prometheus remote storage adapter for Elasticsearch
Apache License 2.0

Leaking connections/file descriptors #9

Open · sevagh opened 6 years ago

sevagh commented 6 years ago

Hello,

I'm running this adapter. Even with a high ulimit (131072), it seems to be leaking connections:

$ sudo lsof -p 2721 | wc -l
131080
May 11 03:33:40 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:40 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s
May 11 03:33:41 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:41 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s
May 11 03:33:42 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:42 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s
May 11 03:33:43 bloop prometheus-es-adapter[2721]: 2018/05/11 10:33:43 http: Accept error: accept tcp [::]:9201: accept4: too many open files; retrying in 1s

I'm trying to find out where this is occurring: perhaps in the Elastic client you use, or perhaps in the HTTP server in this adapter.
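
One thing I may try is patching a small file-descriptor watcher into my local build so I can see the count climb from inside the process. This is only a hypothetical Linux-only helper I'd add myself, not anything that exists in the adapter today:

    package main

    import (
        "log"
        "os"
        "time"
    )

    // watchFDs logs the number of open file descriptors (entries in
    // /proc/self/fd) every interval; a leak shows up as a steadily rising count.
    func watchFDs(interval time.Duration) {
        for range time.Tick(interval) {
            entries, err := os.ReadDir("/proc/self/fd")
            if err != nil {
                log.Printf("fd watcher: %v", err)
                continue
            }
            log.Printf("open file descriptors: %d", len(entries))
        }
    }

    func main() {
        go watchFDs(10 * time.Second)
        select {} // stand-in for the adapter's real work
    }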

Have you seen any behavior like this?

pwillie commented 6 years ago

Hi @sevagh,

Thanks for the report. I'm afraid I haven't seen this before, but I'll dig into it when I get a chance. What config are you running it with?

sevagh commented 6 years ago

I was running it with the default settings. It seems the adapter hit a bottleneck with Elasticsearch, and the Prometheus remote storage queue created too many shards/goroutines.

Today I ran it with these settings:

Environment="ES_WORKERS=4"
Environment="ES_BATCH_COUNT=-1"
Environment="ES_BATCH_SIZE=-1"
Environment="ES_BATCH_INTERVAL=30"

Now there are no connections piling up. (Closed by accident and re-opened.)
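
For what it's worth, my mental model (an assumption on my part, not a copy of the adapter's code) is that these variables end up on olivere/elastic's bulk processor roughly like this, so with the count and size triggers set to -1 the only remaining flush trigger is the 30-second interval:

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/olivere/elastic"
    )

    func main() {
        client, err := elastic.NewClient(elastic.SetURL("https://myelastic:9200"))
        if err != nil {
            log.Fatal(err)
        }
        processor, err := client.BulkProcessor().
            Name("prometheus-es-adapter").
            Workers(4).                      // ES_WORKERS
            BulkActions(-1).                 // ES_BATCH_COUNT: -1 disables the count trigger
            BulkSize(-1).                    // ES_BATCH_SIZE: -1 disables the size trigger
            FlushInterval(30 * time.Second). // ES_BATCH_INTERVAL
            Do(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        defer processor.Close()
    }

Again, that's only a sketch of how I assume the settings are wired through, not the adapter's actual code.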

pwillie commented 6 years ago

What settings did you have when you experienced the issues?

sevagh commented 6 years ago

Before:

ExecStart=/usr/local/bin/prometheus-es-adapter \
        --es_url=https://myelastic:9200 \
        --es_user=local_prometheus-adapter \
        --listen 0.0.0.0:9201

I didn't modify the default values for:

Variable | Default | Description
ES_WORKERS | 0 | Number of batch workers
ES_BATCH_COUNT | 1000 | Max items for bulk Elasticsearch insert operation
ES_BATCH_SIZE | 4096 | Max size in bytes for bulk Elasticsearch insert operation
ES_BATCH_INTERVAL | 10 | Max period in seconds between bulk Elasticsearch insert operations

sevagh commented 6 years ago

I think this is a case of "I should not rely on defaults in production" - user error.

pwillie commented 6 years ago

I don't believe we should write this off as user error. Not sure when I will get to this, but a few thoughts:

  1. perhaps the defaults need tweaking?
  2. it really needs to degrade gracefully regardless of config
  3. it should apply back pressure (rough sketch below)
  4. it should surface this condition through metrics
  5. some combination of the above

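Something along these lines is what I have in mind for 3 and 4 (just a sketch on my side, nothing implemented yet, and the metric name is made up): cap the number of in-flight write requests and fail fast with a 503 instead of accepting connections until the process runs out of file descriptors, and count the rejections so the condition is visible in metrics.

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // rejectedWrites is a hypothetical metric, not one the adapter exports today.
    var rejectedWrites = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "adapter_write_requests_rejected_total",
        Help: "Write requests rejected because the in-flight limit was reached.",
    })

    // limitInFlight caps concurrent requests to a handler; overflow fails fast.
    func limitInFlight(max int, next http.Handler) http.Handler {
        sem := make(chan struct{}, max)
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            select {
            case sem <- struct{}{}:
                defer func() { <-sem }()
                next.ServeHTTP(w, r)
            default:
                rejectedWrites.Inc()
                http.Error(w, "too many in-flight writes", http.StatusServiceUnavailable)
            }
        })
    }

    func main() {
        prometheus.MustRegister(rejectedWrites)
        write := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK) // placeholder for the real remote-write handler
        })
        http.Handle("/write", limitInFlight(64, write))
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9201", nil))
    }
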
sevagh commented 6 years ago

I'll see about replicating the old bad settings and tracing the leak to a specific line of code here.

pwillie commented 5 years ago

@sevagh how did you get on?

sevagh commented 5 years ago

Oops, I never really revisited this. I promise I'll try to recreate it on Monday.

On the other hand, with the production configuration, this adapter has been running without a crash since basically May. Really great work here.