influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Metric corruption when using http2 nginx reverse proxy with influxdb output #2854

Closed. rlex closed this issue 6 years ago.

rlex commented 7 years ago

Bug report

Telegraf stopped working with nginx reverse proxy since 1.3 release

Relevant telegraf.conf:

# Telegraf configuration

[tags]

# Configuration for telegraf agent
[agent]
    interval = "10s"
    debug = false
    hostname = "my.host.fqdn"
    round_interval = true
    flush_interval = "10s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 1000
    metric_buffer_limit = 10000
    quiet = false

[[outputs.influxdb]]
    urls = ["https://host:8086"]
    database = "telegraf_metrics"
    username = "telegraf_metrics"
    password = "pass"
    retention_policy = ""
    write_consistency = "any"
    timeout = "5s"

[[inputs.cpu]]
    percpu = true
    totalcpu = true
    fielddrop = ["time_*"]
[[inputs.disk]]
    ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.system]]
[[inputs.swap]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.processes]]
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]
[[inputs.kernel_vmstat]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.nstat]]
    dump_zeros = true
[[inputs.conntrack]]
    files = ["ip_conntrack_count","ip_conntrack_max", "nf_conntrack_count","nf_conntrack_max"]
    dirs = ["/proc/sys/net/ipv4/netfilter","/proc/sys/net/netfilter"]

System info:

Telegraf 1.3.0-1, InfluxDB 1.2.4-1 , nginx 1.13.0-1~xenial

Steps to reproduce:

  1. Install telegraf+influxdb+nginx
  2. Make a simple reverse proxy ( location / { proxy_pass http://influxdb:8086; } ); a fuller server block sketch is shown after this list
  3. Start pumping metrics from telegraf (telegraf -> nginx -> influxdb)
  4. From this point, influxdb will be filled with broken, unreadable metrics. They look like partially-written metrics, something like ??????st=some.of.host.i.monitor, or ???sk=sda1.
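
A minimal server block matching the setup described above might look like the following sketch; the certificate paths and server_name are placeholders, and the http2 flag on the listen line is the detail that later turns out to matter:

server {
    listen 443 ssl http2;                        # http2 here is the detail under suspicion
    server_name metrics.example.org;             # placeholder
    ssl_certificate     /etc/nginx/cert.pem;     # placeholder paths
    ssl_certificate_key /etc/nginx/key.pem;

    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://influxdb:8086;
    }
}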

Expected behavior:

Metrics should be arriving properly.

Actual behavior:

Lots of broken metrics with strange unicode (?) symbols.

Additional info:

Sometimes i saw lines like "2017-05-21T03:33:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '.ru timst=rails-2 total=4387237888i,used=2909868032i,free=1477369856i,used_percent=66.32574084845248 1495337590000000000': invalid boolean]"

I suspect this is somehow related to https://github.com/influxdata/telegraf/pull/2251

danielnelson commented 7 years ago

I tried this out on Debian, although with fewer input plugins, and was unable to reproduce.

ii  telegraf                                  1.3.0-1                              amd64
ii  influxdb                                  1.2.4-1                              amd64
ii  nginx-full                                1.10.3-1                             amd64
[[outputs.influxdb]]
  urls = ["https://loaner:443"]
  database = 'telegraf_nginx'
  retention_policy = ""
  write_consistency = "any"
  username = "dbn"
  password = "howdy"
  timeout = "5s"
  ssl_ca = "/home/dbn/.ssl/ca/cacert.pem"
$ cat /etc/nginx/sites-available/default
upstream backend {
        server 127.0.0.1:8086 fail_timeout=0;
}

server {
        listen 443 default_server;

        ssl    on;
        ssl_certificate      /etc/nginx/cert.pem;
        ssl_certificate_key  /etc/nginx/key.pem;
        server_name _;
        root /var/www;

        location / {
                proxy_set_header Host $http_host;
                proxy_redirect off;
                proxy_pass http://backend;
        }
}

Does it work if you take nginx out? What is the output of locale?

rlex commented 7 years ago

It works if I take nginx out; that's how it works right now. It also works with the reverse proxy if I downgrade telegraf back to 1.2.1. I also observed this problem with a self-built version (looking at my clone of the repo, that was c66e2896c658cc2d3ccf9fdec3be8072e87162a2), but that was on a test box and I forgot to post an issue for it.

When upgrading to 1.3 I mostly left the config intact except for adding 2 new inputs:

[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]

Locale output:

root@heimdall:~# locale
LANG=en_US.utf8
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

Telegraf envs, ubuntu 16.04 LTS:

root@heimdall:~# cat /proc/22480/environ
LANG=en_US.utf8PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binHOME=/etc/telegrafLOGNAME=telegrafUSER=telegrafSHELL=/bin/false

Debian 8:

LANG=en_US.UTF-8LC_CTYPE=en_US.UTF-8PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binHOME=/etc/telegrafLOGNAME=telegrafUSER=telegrafSHELL=/bin/false

Here are the full nginx configs. nginx.conf:

#Ansible managed, do not touch
user nginx  nginx;

worker_processes 4;

pid /var/run/nginx.pid;

worker_rlimit_nofile 1024;

include /etc/nginx/modules-enabled/*.conf;

events {
        worker_connections 512;
        use epoll;
        multi_accept on;
}

http {
        include /etc/nginx/mime.types;
        default_type application/octet-stream;
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log error;
        server_tokens off;
        types_hash_max_size 2048;

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}

Corresponding vhost:

#Ansible managed, do not touch
server {
   listen snip:80;
   listen snip:443 ssl http2;
   listen [snip]:443 ssl http2;
   listen [snip]:80;
   server_name snip;
   ssl_certificate /etc/nginx/ssl/snip/fullchain.pem;
   ssl_certificate_key /etc/nginx/ssl/snip/privkey.pem;
   ssl_dhparam /etc/nginx/ssl/dhparam.pem;
   client_max_body_size 10M;
   if ($scheme = http) {
       return 301 https://$server_name$request_uri;

   }
   location /.well-known/acme-challenge/ {
       alias /var/www/letsencrypt/;

   }
   location / {
       proxy_pass http://localhost:3000;
       proxy_set_header Host $host;
       proxy_set_header X-Real-IP $remote_addr;
       proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
       proxy_set_header X-Forwarded-Proto $scheme;

   }
}

Nginx also auto-includes following settings (in http {} context) Gzip:

#Ansible managed, do not touch
gzip on;
gzip_comp_level 5;
gzip_min_length 256;
gzip_http_version 1.1;
gzip_buffers 16 8k;
gzip_proxied any;
gzip_vary on;
gzip_disable "MSIE [1-6]\.(?!.*SV1)";
gzip_types application/atom+xml application/javascript application/json application/rss+xml application/vnd.ms-fontobject application/x-font-ttf application/x-javascript application/x-web-app-manifest+json application/xhtml+xml application/xml font/opentype image/svg+xml image/x-icon text/css text/plain text/x-component text/javascript;

Open file cache:

#Ansible managed, do not touch
open_file_cache max=200000 inactive=300s;
open_file_cache_valid 300s;
open_file_cache_min_uses 2;
open_file_cache_errors on;

SSL settings:

#Ansible managed, do not touch
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_prefer_server_ciphers on;
ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH";
ssl_ecdh_curve secp384r1;
ssl_session_cache shared:SSL:10m;
ssl_session_tickets off;
ssl_stapling on;
ssl_stapling_verify on;
add_header Strict-Transport-Security "max-age=63072000";

rlex commented 7 years ago

Still testing. I created a second influxdb output using the same config I posted in the previous comment, and started getting broken metrics again. For example, one measurement from an agent with a hostname like "heimdall.srv.example.org" was:

heimdalerface=eth0

In another case, there was a measurement from the rabbitmq plugin (rabbit name rabbit@php-4):

bit@php-4
bit@php-4,auto_delete=false,durable=true (related series)

There was also a measurement with the name

 ƒ�����vmstat

And things like a measurement named =SPU with the series =SPU,type=Spurious\ interrupts.

What's interesting is that the test database I created started getting metrics which should not be in that DB (the test DB is p3, and I see broken metrics with hosts that should be in database p1). I tried disabling all my tweaks (ssl, gzip, open file cache) with no luck.

UPD: looks like I have found the issue. It was http/2. Still testing.

rlex commented 7 years ago

Two hours with http/2 disabled and no broken metrics. No other parameters were changed. I bet on http/2.

danielnelson commented 7 years ago

So the workaround was changing listen [snip]:443 ssl http2; to listen [snip]:443 ssl;?
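
In other words, the change under discussion is just removing the http2 flag from the listen directives in the vhost:

# before: HTTP/2 negotiated with TLS clients, corruption observed
listen [snip]:443 ssl http2;

# after: HTTP/1.1 only, no corruption
listen [snip]:443 ssl;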

rlex commented 7 years ago

Yes. But I would like to have it working both ways, since it's not possible to disable http2 for just one vhost: at least on nginx 1.13.0, if I enable http2 on one vhost (on the same IP) it will automatically be enabled for all other vhosts on that IP.

danielnelson commented 7 years ago

Yeah, for sure this needs to be fixed.

danielnelson commented 7 years ago

Could be related to #2862

danielnelson commented 7 years ago

Unfortunately, I don't think that fix will help, so I'm going to reopen.

danielnelson commented 7 years ago

@gavinzhou are you also using http2?

danielnelson commented 7 years ago

I tried to duplicate this again with http2 but still no luck. Here is my nginx configuration:

[+]dbn@debian-stretch:~$ cat /etc/nginx/nginx.conf 
user www-data;
worker_processes 4;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
        worker_connections 768;
        use epoll;
        multi_accept on;
}

http {
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        types_hash_max_size 2048;

        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
        ssl_prefer_server_ciphers on;

        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log;

        gzip on;
        gzip_disable "msie6";

        gzip_vary on;
        gzip_proxied any;
        gzip_comp_level 6;
        gzip_buffers 16 8k;
        gzip_http_version 1.1;
        gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}
[+]dbn@debian-stretch:~$ cat /etc/nginx/sites-enabled/default 
upstream backend {
        server 127.0.0.1:8086 fail_timeout=0;
}

server {
        #listen 80 default_server;
        #listen [::]:80 default_server;

        listen 443 ssl http2;

        ssl_certificate /etc/nginx/cert.pem;
        ssl_certificate_key /etc/nginx/key.pem;
        root /var/www/html;

        server_name _;

        location /status {
                stub_status on;
        }

        location / {
                proxy_set_header Host $http_host;
                proxy_redirect off;
                proxy_pass http://backend;
        }
}

I also added a query string to the output as hinted by @gavinzhou:

[[outputs.influxdb]]
  urls = ["https://debian-stretch?test=testtest"]
  database = "http2"
  username = "dbn"
  password = "dbn"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  insecure_skip_verify = true

My inputs are:

2017-06-06T18:30:43Z I! Loaded inputs: inputs.cpu inputs.disk inputs.diskio inputs.internal inputs.interrupts inputs.linux_sysctl_fs inputs.netstat inputs.nstat inputs.system

Versions:

ii  nginx                                 1.10.3-1                    all
ii  influxdb                              1.2.4-1                     amd64

No sign of bad metrics. How long does it usually take for corrupt metrics to appear?

rlex commented 7 years ago

Around 10-30 minutes. Maybe multiple clients cause it?

rlex commented 7 years ago

(I'm actually pretty sure it's caused by multiple clients: I have several databases for different projects, and I saw broken metrics from, let's say, db "staging" in db "testing" and vice versa.)

danielnelson commented 7 years ago

Okay, I only ran it for about 10 minutes. I'll try to leave it running longer and with multiple clients.

rlex commented 7 years ago

I can probably forward some of my personal stuff to your server if that helps.

danielnelson commented 7 years ago

I'll let you know if I need it. About how many telegrafs are you running?

rlex commented 7 years ago

Around 30, I believe, counting the non-personal ones. ~400 points/sec.

gavinzhou commented 7 years ago

Hi @danielnelson, thanks for your test.

telegraf 1.3 -> nginx (http) -> influxdb with auth is working.

However, Telegraf 1.3 removes the query part of the outputs.influxdb URLs: http://127.0.0.1/?test=testtest -> http://127.0.0.1/

1.2 has no such problem.

nginx.conf

user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
include /usr/share/nginx/modules/*.conf;
events {
    worker_connections 1024;
}
http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
    access_log  /var/log/nginx/access.log  main;
    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;
    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;
    include /etc/nginx/conf.d/*.conf;
    index   index.html index.htm;
    upstream backend {
            server 127.0.0.1:8086 fail_timeout=0;
    }
    server {
        listen       80 default_server;
        listen       [::]:80 default_server;
        server_name  localhost;
        root         /usr/share/nginx/html;
        include /etc/nginx/default.d/*.conf;
        location / {
                proxy_set_header Host $http_host;
                proxy_redirect off;
                proxy_pass http://backend;
        }

    }
}

telegraf.conf

[global_tags]

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "1s"
  flush_jitter = "0s"
  precision = ""
  debug = true
  logfile = ""
  hostname = "tele13-verup"
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://127.0.0.1/?test=testtest"] # UDP endpoint example
  database = "test"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "_write"
  password = "_write"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

danielnelson commented 7 years ago

@gavinzhou On the other issue you mentioned you were getting corrupted measurements, is this correct?

danielnelson commented 7 years ago

@rlex Just a status update, I'm still unable to reproduce the issue after running overnight with 10 telegrafs spamming InfluxDB with an assortment of inputs at 1s intervals.

gavinzhou commented 7 years ago

Hi @danielnelson, for the other issue, should we reopen #2891?

danielnelson commented 7 years ago

Whoops, sorry I was confused.

danielnelson commented 7 years ago

@rlex In the 1.3.4 release there are a couple of fixes which I think could potentially fix this problem; do you think you could retest?

rlex commented 7 years ago

Sorry for the long reply time; testing it now.

MattMacGillivray commented 7 years ago

I am seeing the same problem (broken metrics with strange unicode chars). I have influxdb behind jwilder/nginx-proxy, which is terminating ssl. The default config uses http2, as noted above. 3 nodes are running telegraf and sending metrics: 2x ubuntu, 1x armhf linux (raspberry pi).

Trying to consistently recreate the issue. The one thing I noticed was that the locale on raspberry pi was en_GB.UTF8 while the ubuntu machines (including influxdb) were en_US.UTF8.

Using telegraf 1.3.5

MattMacGillivray commented 7 years ago

Possibly related: nginx trouble with http2 ( via https://github.com/jwilder/nginx-proxy/issues/543 )

rlex commented 7 years ago

As for my tests, the recent patches haven't fixed it, sadly.

bobmshannon commented 7 years ago

Does setting the following environment variable at telegraf runtime work around the issue?

GODEBUG=http2client=0 
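
If telegraf runs under systemd, one way to apply this (a sketch, assuming the packaged unit is named telegraf.service; the drop-in path is hypothetical) is an override file:

# /etc/systemd/system/telegraf.service.d/disable-http2.conf
[Service]
Environment="GODEBUG=http2client=0"

followed by systemctl daemon-reload && systemctl restart telegraf. Packages that ship an EnvironmentFile such as /etc/default/telegraf could set the variable there instead.
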
MattMacGillivray commented 7 years ago

So far (6hrs of monitoring on 4 hosts), the GODEBUG=http2client=0 suggestion seems to be working.

bobmshannon commented 7 years ago

@qmnonic Cool. My initial reaction is that disabling HTTP/2 on either the server or client side as a workaround for this is an interesting result. As far as I know nginx uses HTTP/1.0 or HTTP/1.1 (see proxy_http_version) when proxying traffic to the upstream server (in this case, InfluxDB) regardless of whether HTTP/2 is enabled or not.

Can we try turning HTTP/2 back on and then disabling gzip to see if it has any effect? Something like this:

   location / {
       gzip off;
       proxy_pass http://localhost:3000;
       proxy_set_header Host $host;
       proxy_set_header X-Real-IP $remote_addr;
       proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
       proxy_set_header X-Forwarded-Proto $scheme;

   }

What version of InfluxDB are you running?

rlex commented 7 years ago

I tried disabling gzip and it hasn't helped. Can we maybe have a nohttp2 option or something similar in telegraf.conf?

danielnelson commented 7 years ago

I don't want to add an option to the configuration because then it becomes part of the stable interface. In the meantime you can use GODEBUG=http2client=0 to disable it.
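
For context on why that environment variable works: Go's net/http client negotiates HTTP/2 automatically over TLS, and a program can opt out by installing a non-nil, empty TLSNextProto map on its Transport. A minimal sketch of that approach (not telegraf's actual output code) looks like:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newHTTP1OnlyClient returns an *http.Client that never negotiates HTTP/2:
// a non-nil, empty TLSNextProto map tells net/http to skip the automatic
// h2 upgrade and stay on HTTP/1.1 over TLS.
func newHTTP1OnlyClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSNextProto: make(map[string]func(string, *tls.Conn) http.RoundTripper),
		},
	}
}

func main() {
	client := newHTTP1OnlyClient()
	resp, err := client.Get("https://example.com/ping") // placeholder URL
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("negotiated protocol:", resp.Proto) // HTTP/1.1 even if the server offers h2
}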

danielnelson commented 7 years ago

Could be interesting to see if this problem remains with go 1.9, though I have no indication that it would be fixed.

rlex commented 7 years ago

The thing that caught my eye:

HTTP/2 now uses the priority write scheduler by default. Frames are scheduled by following HTTP/2 priorities as described in RFC 7540 Section 5.3.

rlex commented 7 years ago

@danielnelson will 1.4.0 be compiled with go1.9?

danielnelson commented 7 years ago

The official packages will be built with go1.8.

sebastianarena commented 7 years ago

Hi. Here's my take on this, coming from a previously reported issue, #3157. I tested for several days, changing lots of things. At the end of the day, removing Nginx completely solved all my problems. There must be some buffering occurring on Nginx's side that causes tons of problems for influx. Removing HTTP2 helped a lot, but after a few days invalid measurements would reappear. It's been 48 hours now testing directly against influx, with 0 invalid measurements. Hope it helps!

danielnelson commented 7 years ago

I'm looking through the code again, and I think it may be possible for the Content-Length header to be incorrect. Can someone, if it has not already been done, test with the content_encoding = "gzip" option? This would disable the Content-Length header and use chunked transfer encoding.
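
For anyone testing, the option goes in the influxdb output section of telegraf.conf; a sketch based on the configs posted earlier in this thread (URL, database, and credentials are placeholders):

[[outputs.influxdb]]
  urls = ["https://host:8086"]
  database = "telegraf_metrics"
  username = "telegraf_metrics"
  password = "pass"
  timeout = "5s"
  content_encoding = "gzip"   # compresses the body and switches to chunked transfer encoding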

danielnelson commented 7 years ago

@rlex Nightly builds are now built with Go 1.9, do you think you could also test with content_encoding = "gzip"?

danielnelson commented 7 years ago

@scanterog has been testing with content_encoding = "gzip" for about a day without corruption, so I think it is the Content-Length. I will get it fixed.

scanterog commented 7 years ago

Hi guys. It has been 5 days for me now and all looks ok. No errors anymore. So I can say content_encoding = "gzip" actually fixes it.

danielnelson commented 7 years ago

Great news! I'll work on fixing the Content-Length header which I think will also allow it to work when gzip is not enabled.

LeJav commented 6 years ago

I have exactly the same problem with telegraf 1.4.4:

# telegraf --version
Telegraf v1.4.4 (git: release-1.4 ddcb93188f3fb5393d811cdd742bfe6ec799eba9)

I have installed telegraf with the official release from https://dl.influxdata.com/telegraf/releases/telegraf_1.4.4-1_amd64.deb

bobmshannon commented 6 years ago

@LeJav Are you also using nginx as an HTTP/2 reverse proxy? Is telegraf's content_encoding set to gzip?

The reason I ask is that I wonder whether this is related to this nginx issue, which describes a bug where the client request body may be corrupted when streaming data using HTTP/2.

If the answer to the above is yes, it would be interesting to try the following to see if it helps at all (a config sketch follows the list):

  1. Disable proxy_request_buffering.
  2. Upgrade nginx to 1.13.6, which includes a fix for this issue.
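
A sketch of what the first suggestion looks like in the proxy vhost (proxy_request_buffering and proxy_http_version are standard nginx directives; "backend" refers to the upstream block defined earlier in this thread):

location / {
    proxy_request_buffering off;    # stream the request body to the upstream instead of buffering it first
    proxy_http_version 1.1;         # nginx speaks HTTP/1.1 to InfluxDB regardless of client-side HTTP/2
    proxy_set_header Host $http_host;
    proxy_pass http://backend;
}
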
LeJav commented 6 years ago

@bobmshannon

Many thanks for your help. I think that you are right and the problem seems related to the nginx bug.

For the nginx reverse proxy, I have not specified proxy_request_buffering, so it is on by default (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_request_buffering)

I have had content_encoding set to gzip in the telegraf config for more than 2 days and have had no more problems since.

I am using nginx 1.10.0

Is it worth trying proxy_request_buffering off and removing content_encoding from the telegraf config?

danielnelson commented 6 years ago

Is it worth trying proxy_request_buffering off and removing content_encoding from the telegraf config?

@LeJav If you can do this it would be very much appreciated.

LeJav commented 6 years ago

@danielnelson: I will try it next weekend.

LeJav commented 6 years ago

Sorry for this long delay. I have:

  - set proxy_request_buffering to off
  - removed content_encoding=gzip

nginx is 1.10.0.

I will check within a couple of days if it is OK or if new corrupted measurements have been created.

LeJav commented 6 years ago

16 hours later, I have several corrupted measurements:

ăÇ�����face=eth0
ăÇ�����az-telegraf-net-int
ăÇ�����face=eth0
ăÇ���¾az-telegraf-net-int
ăÇ���¾face=eth0
ăÇ�����az-telegraf-net-int
ăÇ�����face=eth0
...

So this does not work. I will re-add content_encoding=gzip, which is working fine.

danielnelson commented 6 years ago

The InfluxDB output has been rewritten for 1.6.0, as well as the metric serializer. It would be very much appreciated if someone could retest this issue.