Closed: rlex closed this issue 6 years ago
I tried this out on Debian, although with fewer input plugins, but was unable to reproduce.
ii telegraf 1.3.0-1 amd64
ii influxdb 1.2.4-1 amd64
ii nginx-full 1.10.3-1 amd64
[[outputs.influxdb]]
urls = ["https://loaner:443"]
database = 'telegraf_nginx'
retention_policy = ""
write_consistency = "any"
username = "dbn"
password = "howdy"
timeout = "5s"
ssl_ca = "/home/dbn/.ssl/ca/cacert.pem"
$ cat /etc/nginx/sites-available/default
upstream backend {
server 127.0.0.1:8086 fail_timeout=0;
}
server {
listen 443 default_server;
ssl on;
ssl_certificate /etc/nginx/cert.pem;
ssl_certificate_key /etc/nginx/key.pem;
server_name _;
root /var/www;
location / {
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://backend;
}
}
Does it work if you take nginx out? What is the output of locale?
It works if I take nginx out - that's how it's running right now. It also works with the reverse proxy if I downgrade Telegraf back to 1.2.1. I also observed this problem with a self-built version (looking at my clone of the repo, that was c66e2896c658cc2d3ccf9fdec3be8072e87162a2), but that was on a test box and I forgot to file an issue for it.
When upgrading to 1.3 I mostly left the config intact, except for adding two new inputs:
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]
Locale output:
root@heimdall:~# locale
LANG=en_US.utf8
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
Telegraf environment, Ubuntu 16.04 LTS:
root@heimdall:~# cat /proc/22480/environ
LANG=en_US.utf8PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binHOME=/etc/telegrafLOGNAME=telegrafUSER=telegrafSHELL=/bin/false
Debian 8:
LANG=en_US.UTF-8LC_CTYPE=en_US.UTF-8PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binHOME=/etc/telegrafLOGNAME=telegrafUSER=telegrafSHELL=/bin/false
Here are the full nginx configs. nginx.conf:
#Ansible managed, do not touch
user nginx nginx;
worker_processes 4;
pid /var/run/nginx.pid;
worker_rlimit_nofile 1024;
include /etc/nginx/modules-enabled/*.conf;
events {
worker_connections 512;
use epoll;
multi_accept on;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log error;
server_tokens off;
types_hash_max_size 2048;
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
}
Corresponding vhost:
#Ansible managed, do not touch
server {
listen snip:80;
listen snip:443 ssl http2;
listen [snip]:443 ssl http2;
listen [snip]:80;
server_name snip;
ssl_certificate /etc/nginx/ssl/snip/fullchain.pem;
ssl_certificate_key /etc/nginx/ssl/snip/privkey.pem;
ssl_dhparam /etc/nginx/ssl/dhparam.pem;
client_max_body_size 10M;
if ($scheme = http) {
return 301 https://$server_name$request_uri;
}
location /.well-known/acme-challenge/ {
alias /var/www/letsencrypt/;
}
location / {
proxy_pass http://localhost:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
Nginx also auto-includes the following settings (in the http {} context). Gzip:
#Ansible managed, do not touch
gzip on;
gzip_comp_level 5;
gzip_min_length 256;
gzip_http_version 1.1;
gzip_buffers 16 8k;
gzip_proxied any;
gzip_vary on;
gzip_disable "MSIE [1-6]\.(?!.*SV1)";
gzip_types application/atom+xml application/javascript application/json application/rss+xml application/vnd.ms-fontobject application/x-font-ttf application/x-javascript application/x-web-app-manifest+json application/xhtml+xml application/xml font/opentype image/svg+xml image/x-icon text/css text/plain text/x-component text/javascript;
Open file cache:
#Ansible managed, do not touch
open_file_cache max=200000 inactive=300s;
open_file_cache_valid 300s;
open_file_cache_min_uses 2;
open_file_cache_errors on;
SSL settings:
#Ansible managed, do not touch
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_prefer_server_ciphers on;
ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH";
ssl_ecdh_curve secp384r1;
ssl_session_cache shared:SSL:10m;
ssl_session_tickets off;
ssl_stapling on;
ssl_stapling_verify on;
add_header Strict-Transport-Security "max-age=63072000";
Still testing. I created a second influxdb output using the same config I posted in the previous comment, and started getting broken metrics again. For example, one measurement from an agent with a hostname like "heimdall.srv.example.org" was:
heimdalerface=eth0
In another case, there was a measurement from the rabbitmq plugin (rabbit name rabbit@php-4):
bit@php-4
bit@php-4,auto_delete=false,durable=true (related series)
There was also a measurement named
�����vmstat
And entries like measurement =SPU with series =SPU,type=Spurious\ interrupts.
What's interesting is that the test database I created started getting metrics which should not be in that DB (the test database is p3, and I see broken metrics from hosts that should be in database p1). I tried disabling all my tweaks (ssl, gzip, open file cache) with no luck.
UPD: looks like I have found the issue. It was HTTP/2. Still testing.
Two hours with HTTP/2 disabled and no broken metrics. No other parameters were changed. I'm betting on HTTP/2.
So the workaround was changing listen [snip]:443 ssl http2; to listen [snip]:443 ssl;?
Yes. But I would like to have it working both ways, since it's not possible to disable HTTP/2 for just one vhost - at least on nginx 1.13.0, if I enable HTTP/2 on one vhost (on the same IP) it is automatically enabled for all other vhosts.
Yeah, for sure this needs to be fixed.
Could be related to #2862
Unfortunately, I don't think that fix will help, so I'm going to reopen.
@gavinzhou are you also using http2?
I tried to duplicate this again with http2 but still no luck. Here is my nginx configuration:
[+]dbn@debian-stretch:~$ cat /etc/nginx/nginx.conf
user www-data;
worker_processes 4;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
worker_connections 768;
use epoll;
multi_accept on;
}
http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
include /etc/nginx/mime.types;
default_type application/octet-stream;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
ssl_prefer_server_ciphers on;
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log;
gzip on;
gzip_disable "msie6";
gzip_vary on;
gzip_proxied any;
gzip_comp_level 6;
gzip_buffers 16 8k;
gzip_http_version 1.1;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
}
[+]dbn@debian-stretch:~$ cat /etc/nginx/sites-enabled/default
upstream backend {
server 127.0.0.1:8086 fail_timeout=0;
}
server {
#listen 80 default_server;
#listen [::]:80 default_server;
listen 443 ssl http2;
ssl_certificate /etc/nginx/cert.pem;
ssl_certificate_key /etc/nginx/key.pem;
root /var/www/html;
server_name _;
location /status {
stub_status on;
}
location / {
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://backend;
}
}
I also added a query string to the output as hinted by @gavinzhou:
[[outputs.influxdb]]
urls = ["https://debian-stretch?test=testtest"]
database = "http2"
username = "dbn"
password = "dbn"
retention_policy = ""
write_consistency = "any"
timeout = "5s"
insecure_skip_verify = true
My inputs are:
2017-06-06T18:30:43Z I! Loaded inputs: inputs.cpu inputs.disk inputs.diskio inputs.internal inputs.interrupts inputs.linux_sysctl_fs inputs.netstat inputs.nstat inputs.system
Versions:
ii nginx 1.10.3-1 all
ii influxdb 1.2.4-1 amd64
No sign of bad metrics. How long does it usually take for corrupt metrics to appear?
Around 10-30 minutes. Maybe multiple clients cause it?
(I'm actually pretty sure it's caused by multiple clients - I have several databases for different projects, and I saw broken metrics from, say, db "staging" in db "testing" and vice versa.)
Okay, I only ran it for about 10 minutes. I'll try to leave it running longer and with multiple clients.
I can probably forward some of my personal stuff to your server if that helps.
I'll let you know if I need it. About how many Telegraf instances are you running?
Around 30, I believe, counting the non-personal ones. ~400 points/sec.
Hi @danielnelson, thanks for your test.
telegraf 1.3 -> nginx (http) -> influxdb with auth is working.
Telegraf 1.3 removes the query part of the outputs.influxdb URLs:
http://127.0.0.1/?test=testtest
-> http://127.0.0.1/
1.2 has no problem.
Here are my nginx and telegraf configs:
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
include /usr/share/nginx/modules/*.conf;
events {
worker_connections 1024;
}
http {
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
include /etc/nginx/mime.types;
default_type application/octet-stream;
include /etc/nginx/conf.d/*.conf;
index index.html index.htm;
upstream backend {
server 127.0.0.1:8086 fail_timeout=0;
}
server {
listen 80 default_server;
listen [::]:80 default_server;
server_name localhost;
root /usr/share/nginx/html;
include /etc/nginx/default.d/*.conf;
location / {
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://backend;
}
}
}
[global_tags]
[agent]
interval = "1s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "1s"
flush_jitter = "0s"
precision = ""
debug = true
logfile = ""
hostname = "tele13-verup"
omit_hostname = false
[[outputs.influxdb]]
urls = ["http://127.0.0.1/?test=testtest"] # UDP endpoint example
database = "test"
retention_policy = ""
write_consistency = "any"
timeout = "5s"
username = "_write"
password = "_write"
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
@gavinzhou On the other issue you mentioned you were getting corrupted measurements, is this correct?
@rlex Just a status update, I'm still unable to reproduce the issue after running overnight with 10 telegrafs spamming InfluxDB with an assortment of inputs at 1s intervals.
@danielnelson Hi, do you mean the other issue, #2891? Should it be reopened?
Whoops, sorry I was confused.
@rlex In the 1.3.4 release there are a couple fixes which I think could potentially fix this problem, do you think you could retest?
Sorry for the long reply time - testing it now.
I am seeing the same problem (broken metrics with strange unicode chars). I have influxdb behind jwilder/nginx-proxy, which is terminating SSL. The default config uses HTTP/2, as noted above. 3 nodes are running telegraf and sending metrics - 2x Ubuntu, 1x armhf Linux (Raspberry Pi).
I'm trying to consistently recreate the issue. The one thing I noticed is that the locale on the Raspberry Pi was en_GB.UTF8 while the Ubuntu machines (including the influxdb one) were en_US.UTF8.
Using telegraf 1.3.5
Possibly related: nginx trouble with http2 (via https://github.com/jwilder/nginx-proxy/issues/543)
As for my tests, the recent patches haven't fixed it, sadly.
Does setting the following environment variable at telegraf runtime work around the issue?
GODEBUG=http2client=0
So far (6 hrs of monitoring on 4 hosts), the GODEBUG=http2client=0 suggestion seems to be working.
@qmnonic Cool. My initial reaction is that disabling HTTP/2 on either the server or client side as a workaround for this is an interesting result. As far as I know, nginx uses HTTP/1.0 or HTTP/1.1 (see proxy_http_version) when proxying traffic to the upstream server (in this case, InfluxDB) regardless of whether HTTP/2 is enabled or not.
Can we try turning HTTP/2 back on and then disabling gzip to see if it has any effect? Something like this:
location / {
gzip off;
proxy_pass http://localhost:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
What version of InfluxDB are you running?
I tried disabling gzip and it hasn't helped. Could we maybe have a nohttp2 option or something similar in telegraf.conf?
I don't want to add an option to the configuration because then it becomes part of the stable interface. In the meantime you can use GODEBUG=http2client=0 to disable it (one way to set this for the service is sketched below).
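For anyone who wants to try this, a sketch of one way to set it, assuming the telegraf service runs under the stock systemd unit on these distros (the drop-in path below is just an example):

# /etc/systemd/system/telegraf.service.d/no-http2.conf  (hypothetical drop-in)
[Service]
Environment="GODEBUG=http2client=0"

Then run systemctl daemon-reload && systemctl restart telegraf, and check /proc/<telegraf pid>/environ (as shown earlier in the thread) to confirm the variable reached the running process.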
Could be interesting to see if this problem remains with go 1.9, though I have no indication that it would be.
The thing that caught my eye:
HTTP/2 now uses the priority write scheduler by default. Frames are scheduled by following HTTP/2 priorities as described in RFC 7540 Section 5.3.
@danielnelson Will 1.4.0 be compiled with go1.9?
The official packages will be built with go1.8.
Hi. Here's my take on this, coming from a previously reported issue, #3157. I tested for several days, changing lots of things. At the end of the day, removing nginx completely solved all my problems. There must be some buffering happening on nginx's side that causes tons of problems for influx. Removing HTTP/2 helped a lot, but after a few days invalid measurements would reappear. It's been 48 hours now writing directly to influx, with 0 invalid measurements. Hope it helps!
I'm looking through the code again, and I think it may be possible for the Content-Length header to be incorrect. Can someone, if it has not already been done, test with the content_encoding = "gzip" option (see the snippet below)? This would disable the Content-Length header and use chunked transfer encoding.
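For anyone testing, a minimal sketch based on the output config posted near the top of this thread (only content_encoding is new; URL, database, and credentials are placeholders from that earlier example):

[[outputs.influxdb]]
  urls = ["https://loaner:443"]
  database = "telegraf_nginx"
  username = "dbn"
  password = "howdy"
  timeout = "5s"
  ## compress the request body with gzip; the write is then sent with
  ## chunked transfer encoding instead of a Content-Length header
  content_encoding = "gzip"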
@rlex Nightly builds are now built with Go 1.9; do you think you could also test with content_encoding = "gzip"?
@scanterog has been testing with content_encoding = "gzip" for about a day without corruption, so I think it is the Content-Length. I will get it fixed.
Hi guys. It has been 5 days for me now and all looks ok. No errors anymore. So I can say content_encoding = "gzip" actually fixes it.
Great news! I'll work on fixing the Content-Length header which I think will also allow it to work when gzip is not enabled.
I have exactly the same problem with telegraf 1.4.4:
# telegraf --version
Telegraf v1.4.4 (git: release-1.4 ddcb93188f3fb5393d811cdd742bfe6ec799eba9)
I installed telegraf from the official release at:
https://dl.influxdata.com/telegraf/releases/telegraf_1.4.4-1_amd64.deb
@LeJav Are you also using nginx as an HTTP/2 reverse proxy? Is telegraf's content_encoding set to gzip?
The reason I ask is that I wonder whether this nginx issue, which describes a bug where the client request body may be corrupted when streaming data using HTTP/2, is related.
If the answer to the above is yes, it would be interesting to try turning off proxy_request_buffering, along the lines of the sketch below, to see if it helps at all.
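Something like the following, adapted from the proxy location posted earlier in this thread (assumption: the directive belongs in the same location block that proxies to InfluxDB; it is available since nginx 1.7.11):

location / {
    # stream the request body to the upstream as it arrives instead of buffering it first
    proxy_request_buffering off;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    proxy_pass http://backend;
}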
@bobmshannon Many thanks for your help. I think that you are right and the problem seems related to the nginx bug.
For the nginx reverse proxy, I have not specified proxy_request_buffering, so it is on by default (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_request_buffering).
I have had content_encoding set to gzip in the telegraf config for more than 2 days and I have had no more problems since.
I am using nginx 1.10.0
Is it worth trying proxy_request_buffering set to off and removing content_encoding from the telegraf config?
@LeJav If you can try that, it would be very much appreciated.
@danielnelson: I will try it next weekend.
Sorry for the long delay. I have set proxy_request_buffering to off and removed content_encoding=gzip. nginx is 1.10.0.
I will check within a couple of days if it is OK or if new corrupted measurements have been created.
16 hours later, I have several corrupted measurements: ăÇ�����face=eth0 ăÇ�����az-telegraf-net-int ăÇ�����face=eth0 ăÇ���¾az-telegraf-net-int ăÇ���¾face=eth0 ăÇ�����az-telegraf-net-int ăÇ�����face=eth0 ...
So this does not work. I will go back to content_encoding=gzip, which is working fine.
The InfluxDB output has been rewritten for 1.6.0, as well as the metric serializer. It would be very much appreciated if someone could retest this issue.
Bug report
Telegraf stopped working with an nginx reverse proxy since the 1.3 release.
Relevant telegraf.conf:
System info:
Telegraf 1.3.0-1, InfluxDB 1.2.4-1 , nginx 1.13.0-1~xenial
Steps to reproduce:
Expected behavior:
Metrics should be arriving properly.
Actual behavior:
Lots of broken metrics with strange unicode (?) symbols.
Additional info:
Sometimes I saw lines like "2017-05-21T03:33:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '.ru timst=rails-2 total=4387237888i,used=2909868032i,free=1477369856i,used_percent=66.32574084845248 1495337590000000000': invalid boolean]"
I suspect this is somehow related to https://github.com/influxdata/telegraf/pull/2251