veepee-oss / influxdb-relay

Service to replicate InfluxDB data for high availability.
MIT License

influxdb-relay intermittently stops sending data to InfluxDB #8

Closed egeexyz closed 5 years ago

egeexyz commented 5 years ago

Use Case:

I have 5 Telegraf instances sending data to a single influxdb-relay. Each instance is on a different machine (including influxdb-relay and influxdb itself). My single Influxdb-relay instance is forwarding data to two InfluxDB servers.

Issue:

From my testing, Influxdb-relay appears to handle up to 3 Telegraf streams just fine. Once a 4th or 5th is added, the input streams simply stop. If I restart the Telegraf instances that stop showing up in InfluxDB, the streams begin again. However, another stream will eventually stop.

I've attached two screenshots from Chronograf of this occurring.

[Screenshots: lost01, lost02]

rockyluke commented 5 years ago

Could you please try the latest release. We have fixed a huge memory leak.

egeexyz commented 5 years ago

So it looks like the memory leak fix may have improved things slightly as I was able to run about 5 Telegraf instances sending data to influxdb-relay before data was lost.

As you can see from the attached image, the issue is still happening even with the latest code.

[Screenshot: stillbroken]

rockyluke commented 5 years ago

Thanks for the report. We didn't encounter the issue on our side with around 500 Telegraf instances.

Please find attached a diagram of our infrastructure; maybe it can help us understand the issue.

[Diagram: influxdb - v2 0]

camskkz commented 5 years ago

Hello,

Do you have any error logs that could help us debug this issue? As @rockyluke said, we have hundreds of Telegraf instances sending metrics to the relay without issues.

egeexyz commented 5 years ago

Not at the moment, I've moved past this issue. The only error logs I can think of would be from Telegraf (which reported that it could not connect to the relay) and from the relay itself, which simply says it has received a request. I'm not aware of any other logs.

Could it be platform related? The instances with Telegraf are running Ubuntu 16.04, while the instance running the relay was running Amazon Linux 2, though I don't recall which version of Go it had.

camskkz commented 5 years ago

It might, but I am not sure. Our Telegraf instances run on multiple different platforms (Ubuntu, Debian, CentOS, Windows). Our influxdb-relay is running on Debian 8.11 with Go 1.7.4 at the moment. Did you have this issue only with our fork, or with the official version as well?

egeexyz commented 5 years ago

I will be revisiting this issue either tomorrow or early next week and I'll be able to provide more data then.

We require a relay like this for our data replication process so I'd very much like to get this working! :smile:

egeexyz commented 5 years ago

Well, we're off to an interesting start. I've got the relay set up on a remote machine where it forwards data to two other remote machines. I used the sample config as my base config, and it looks like this:

# -*- toml -*-

# InfluxDB
[[http]]
name = "example-http-influxdb"
bind-addr = "0.0.0.0:9096"
output = [
  { name="master", location = "34.240.14.15:8086/write", timeout="10s", type="influxdb" },
  { name="slave", location = "34.254.170.17:8086/write", timeout="10s", type="influxdb" },
]
# EOF

I've got several remote Telegraf instances pointing at it, and I'm receiving this error:

2018/11/13 09:12:40 Problem posting to relay "example-http-influxdb" backend "master": parse 34.240.14.15:8086/write: first path segment in URL cannot contain colon
2018/11/13 09:12:40 Problem posting to relay "example-http-influxdb" backend "slave": parse 34.254.170.17:8086/write: first path segment in URL cannot contain colon

I'm unsure if I am missing something obvious here because I'm effectively using the sample conf with my IPs added to it.

Oursin commented 5 years ago

location = "34.240.14.15:8086/write"

I think your problem here is that you didn't specify the URL scheme (http://) at the start of the locations. Could you try adding the scheme and tell us if that changes anything for you?
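The message in the log above ("first path segment in URL cannot contain colon") has the shape of an error from Go's net/url package, which rejects a string like 34.240.14.15:8086/write because it carries no scheme. The sketch below reproduces that check outside the relay. It is only an illustration: checkLocation is a hypothetical helper, not a function from influxdb-relay, and the relay itself may validate locations differently or only at request time.

```go
package main

import (
	"fmt"
	"net/url"
)

// checkLocation reports whether a backend location parses as an absolute
// http(s) URL. Hypothetical helper for illustration; not part of the relay.
func checkLocation(location string) error {
	u, err := url.Parse(location)
	if err != nil {
		// A scheme-less "host:port/path" string fails here.
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("location %q has no http:// or https:// scheme", location)
	}
	if u.Host == "" {
		return fmt.Errorf("location %q has no host", location)
	}
	return nil
}

func main() {
	// The first entry is the value from the config in this thread;
	// the second adds the scheme as suggested.
	locations := []string{
		"34.240.14.15:8086/write",
		"http://34.240.14.15:8086/write",
	}
	for _, loc := range locations {
		if err := checkLocation(loc); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", loc)
	}
}
```

Running a check like this against every location in the config before starting the relay would surface a missing scheme immediately, rather than on the first forwarded write.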

egeexyz commented 5 years ago

Looks good so far. I'll slowly increase the number of Telegraf instances pointing to it and let you know how it goes

egeexyz commented 5 years ago

I'm pleased to report that I have about 25 Telegraf instances plugged into influxdb-relay and everything appears to be working fine.

Previously, the connections would drop very quickly, within minutes of adding the 5th or 6th connection. I've been running for well over an hour and everything looks great.

I'll close this issue and re-open it if connections drop again.

egeexyz commented 5 years ago

It looks like this issue has resurfaced and is worse than before. I'm only able to get 5 instances to show up in InfluxDB and we have about 20 sending data to the relay. The connections just slowly drop off for no apparent reason or pattern.

[Screenshot]

Oursin commented 5 years ago

Hello! Did you try with the official InfluxDB-Relay? Do you experience the same problems? We don't really understand why you have this problem either, so it would help to see whether it is linked to our modifications. Thanks!

egeexyz commented 5 years ago

I just tried it with the official relay and I have the same problem with it. I think this issue might be related to https://github.com/vente-privee/influxdb-relay/issues/5 because the logs for Telegraf are filled with the "could not create db" error, even though the database exists in InfluxDB.

What's also strange is that even if I set skip_database_creation = true in the Telegraf config, it still throws that error. I upgraded to the latest Telegraf, 1.9.0, and the issue persists.

egeexyz commented 5 years ago

Are you guys using Telegraf with your influxdb-relay? I've noticed that Telegraf returns a 404 if it cannot reach a service. So if Influxdb-relay is not running, it returns a 404. If influxdb-relay is running and it has a problem connecting, it again returns a 404. This makes it nearly impossible to determine what is actually going wrong.

For testing purposes, I have a Telegraf daemon running on the same instance as influxdb-relay. This is the influxdb output in the toml:

[[outputs.influxdb]]
  urls = ["http://localhost:9096"]

  database = "ancillary"

  skip_database_creation = true
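
One way to separate "relay not reachable" from "relay reachable but answering with an error status" is to post a single point to the relay directly, without Telegraf in between. The sketch below assumes the relay accepts InfluxDB line protocol on a /write endpoint at the bind address used in this thread (localhost:9096) and the database name from the config above. The measurement name relay_probe and the helper names are hypothetical, and a successful run writes one real point into that database.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"strings"
	"time"
)

// writeURL builds the /write endpoint URL for a base address and database.
// Hypothetical helper; it rejects base addresses without a scheme or host.
func writeURL(base, db string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("base URL %q needs a scheme and a host", base)
	}
	u.Path = "/write"
	u.RawQuery = url.Values{"db": {db}}.Encode()
	return u.String(), nil
}

// probe posts one line-protocol point and returns the HTTP status code.
// A non-nil error means the request never got an HTTP response.
func probe(base, db string) (int, error) {
	target, err := writeURL(base, db)
	if err != nil {
		return 0, err
	}
	client := &http.Client{Timeout: 5 * time.Second}
	body := strings.NewReader("relay_probe value=1\n") // hypothetical measurement
	resp, err := client.Post(target, "text/plain", body)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	status, err := probe("http://localhost:9096", "ancillary")
	if err != nil {
		fmt.Fprintln(os.Stderr, "relay unreachable:", err)
		os.Exit(1)
	}
	fmt.Println("relay answered with HTTP status", status)
}
```

A connection error here points at the network or the relay process, while an HTTP status code shows that the relay is up and lets its answer be compared with what Telegraf reports.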

camskkz commented 5 years ago

Hello,

We are indeed using Telegraf with influxdb-relay, but we have an nginx in front of it. I have checked our logs (from InfluxDB and nginx) and I don't see any 404s, and we have hundreds of Telegraf instances running in various environments. We also have a Telegraf running on the same server as the relay, and we do not have any issues with it.

As for #5, Telegraf does indeed try to create the DB when it starts, but that's it as far as I know.