Telegraf plugin on Opnsense stops suddenly

Pecadis commented 6 years ago

Hi,

Firstly, thank you for implementing Telegraf as a plugin for Opnsense. It is a great benefit to usability in terms of configuration. Unfortunately, I am currently scratching my head because the Telegraf plugin on 2 of my 3 Opnsense firewalls (FW01 and FW02) always stops after a certain time. ~On the 3rd Opnsense (FW03), it is working flawlessly.~ It also happens on FW03, but the interesting part is that all other machines (OtherVM in the graph) in the same network as FW01 and FW02 are sending their metrics without any issues.

FW03 has a firewall rule that allows connections to the InfluxDB from any source on port 8086.

Here is a brief overview of my network.

FW01 ----> FW03 --> Influxdb
FW02 ----^
OtherVM -^

And here are the Telegraf settings of the firewalls. All firewalls are set up identically and have the latest update (OPNsense 18.1.11-amd64, FreeBSD 11.1-RELEASE-p11, OpenSSL 1.0.2o 27 Mar 2018).

Interval: 1
Round Interval: true
Metric Batch Size: 10000
Metric Buffer Limit: 100000
Collection Jitter: 0
Flush Interval: 2
Flush Jitter: 1

Activated Inputs: CPU, Per-CPU, Total CPU, Disk, Disk IO, Memory, Processes, Swap, System, Network

Output Settings:

InfluxURL: http//ip.of.influxdb:8086
database: monitor
Timeout: 5

Playing around with different Batch Sizes, Bufferlimits and Timeouts doesn't work either.
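For a rough sense of what these numbers mean: Telegraf holds unsent metrics in a buffer of at most Metric Buffer Limit entries and drops the oldest ones once that buffer is full. The sketch below estimates how long a write outage such a buffer could absorb. The figure of 150 metrics per collection interval is a made-up assumption for illustration, not a value measured on these firewalls, and both function names are hypothetical.

```python
import math


def outage_seconds_covered(buffer_limit: int, metrics_per_interval: int, interval_s: float) -> float:
    """Seconds of failed writes the buffer can absorb before old metrics are dropped."""
    if metrics_per_interval <= 0:
        raise ValueError("metrics_per_interval must be positive")
    # Each collection interval adds metrics_per_interval entries to the buffer.
    intervals_until_full = buffer_limit / metrics_per_interval
    return intervals_until_full * interval_s


def batches_needed(buffered_metrics: int, batch_size: int) -> int:
    """Number of write requests of at most batch_size metrics needed to send the backlog."""
    return math.ceil(buffered_metrics / batch_size)


if __name__ == "__main__":
    # Buffer limit, batch size and interval come from the settings above;
    # 150 metrics per interval is an assumed value.
    print(outage_seconds_covered(100000, 150, 1.0))
    print(batches_needed(100000, 10000))
```

With these assumed numbers the buffer would cover roughly eleven minutes of failed writes, so a full buffer alone would not explain metrics staying away for hours.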

Did I miss something, or some kind of config?

Thanks in advance, Pecadis

mimugmail commented 6 years ago

Does the process stop on both firewalls, or does the graphing stop on Influx?

Pecadis commented 6 years ago

Interestingly, Influx is not getting any more data, but Opnsense says that the services are still running. When I manually restart the services, Influx gets the data again instantly.

mimugmail commented 6 years ago

Then check the firewall logs and traffic on FW03.

Pecadis commented 6 years ago

I already monitored it and couldn't find anything obvious. The other VMs in the same network as FW01 and FW02 could push their data through FW03 without any issues.

mimugmail commented 6 years ago

On FW03, look at the traffic on the inside interface. Is it blocked? Forwarded? There must be something ...

Pecadis commented 6 years ago

Thanks,

I'll check it.

Pecadis commented 6 years ago

I have monitored it so far and couldn't find any indication of why it stopped working (red marker). On FW01 and FW03 I can still see the other machines sending their metrics. FW02 is still sending them at the moment.

Edit1: I have rebooted all firewalls and the issue is also appearing on FW03.

mimugmail commented 6 years ago

You have to check the logs and traffic of FW03; we don't need graphs for this. If you see a packet coming in on the interface connected to FW01+FW02, good. Then look at the interface where Influx sits to see whether it leaves. If not, check the logs on FW03; there must be something.
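The comparison described here (seen coming in on one interface, then seen leaving on the other) can be sketched as a set difference over observed connections. This is only an illustration: the `Flow` type, the function name, and the documentation-range addresses are hypothetical, and it assumes FW03 does not rewrite addresses (no NAT) between the two interfaces.

```python
from typing import Iterable, List, NamedTuple


class Flow(NamedTuple):
    """One observed connection attempt (hypothetical record type)."""
    src: str
    dst: str
    dport: int


def not_forwarded(seen_inside: Iterable[Flow], seen_outside: Iterable[Flow]) -> List[Flow]:
    """Flows that arrived on the inside interface but never left on the Influx side."""
    return sorted(set(seen_inside) - set(seen_outside))


if __name__ == "__main__":
    # Example addresses are placeholders from the documentation ranges.
    inside = [
        Flow("192.0.2.11", "198.51.100.5", 8086),  # e.g. a firewall sending metrics
        Flow("192.0.2.20", "198.51.100.5", 8086),  # e.g. another VM
    ]
    outside = [Flow("192.0.2.20", "198.51.100.5", 8086)]
    for flow in not_forwarded(inside, outside):
        print("arrived but did not leave:", flow)
```

An empty result for the firewall's address would point away from FW03 dropping the traffic and toward the sender itself.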

Pecadis commented 6 years ago

OK, I was able to redirect all the syslog logs and align them (I need to align the timezone on all machines later...).

Basically, here are my findings.

My central log server still receives some POSTs of /write?db=monitor HTTP/1.1 (from FW02 in this case), but no longer any /write?consistency=any&db=monitor HTTP/1.1 once the metrics stop being sent.
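To make that difference easier to spot in the attached logs, the write requests can be counted by whether they carry the consistency parameter. This is a rough sketch: the exact log line format is an assumption (any line containing `POST <path> HTTP/1.1`), and the function name and category labels are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable
from urllib.parse import parse_qs, urlsplit

# Assumed format: the request line appears somewhere in each log line.
REQUEST_RE = re.compile(r"POST\s+(\S+)\s+HTTP/1\.1")


def classify_writes(lines: Iterable[str]) -> Counter:
    """Count /write requests with and without a 'consistency' query parameter."""
    counts: Counter = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a POST request line
        target = urlsplit(match.group(1))
        if target.path != "/write":
            continue  # some other endpoint
        params = parse_qs(target.query)
        key = "with_consistency" if "consistency" in params else "without_consistency"
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Sample lines are invented for illustration only.
    sample = [
        'fw02 "POST /write?db=monitor HTTP/1.1" 204',
        'fw02 "POST /write?consistency=any&db=monitor HTTP/1.1" 204',
    ]
    print(classify_writes(sample))
```

Running it separately on the "stopped" and "started" logs would show whether one kind of request disappears entirely.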

I have attached the logs.

stopped_19-49-59.410801.txt started_22-04-53.366433.txt

mimugmail commented 6 years ago

@Pecadis to me this looks like some kind of configuration error or a firewall in between, not something related to the plugin. Can we close this one?

Pecadis commented 6 years ago

@mimugmail I understand your point entirely, and I agree that it is partly a configuration error, but I would expect the plugin to work even with non-ideal settings, or at least to show an error message when it can't send certain messages to the collector.

mimugmail commented 6 years ago

But this is not related to the plugin; it is related to the software itself. So Influx should send an alert if it hasn't received metrics for the last X cycles. I don't think there is a way to achieve this via the client.
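The "no metrics for the last X cycles" check can be sketched on the receiving side as follows. This is a minimal illustration, not how Influx itself implements alerting: the function names are hypothetical, and it assumes the timestamp of the last received point per host has already been obtained somehow.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def missed_cycles(last_seen: datetime, now: datetime, interval_s: float) -> int:
    """Number of whole collection intervals that have passed since the last point."""
    return int((now - last_seen).total_seconds() // interval_s)


def stale_hosts(last_seen_by_host: Dict[str, datetime], now: datetime,
                interval_s: float, max_missed: int) -> List[str]:
    """Hosts whose last point is more than max_missed intervals old."""
    return sorted(
        host
        for host, last_seen in last_seen_by_host.items()
        if missed_cycles(last_seen, now, interval_s) > max_missed
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Timestamps are invented for illustration.
    last_seen = {
        "FW01": now - timedelta(minutes=2),
        "OtherVM": now - timedelta(seconds=1),
    }
    print(stale_hosts(last_seen, now, interval_s=1.0, max_missed=10))
```

The threshold of 10 missed cycles is an arbitrary choice; with a 1 second interval, something larger may be needed to avoid false alarms during a restart.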

Pecadis commented 6 years ago

InfluxDB does notify me that the metrics are not coming in, as I already posted.

The points are:

- When I just restart the plugin after the loss of metrics, without changing anything in the config, it works again until it crashes again.
- When I change the config to the right conditions (which are hard to find), it works (most probably).
- The central log server and the firewall rules haven't been changed since the beginning. The only part I adjusted was the config on the plugin side.

I would like to help as best I can to avoid this kind of inconsistency in the future, at least by pointing out a combination of settings that one should avoid. If you want, I can send you the actual settings for all of my firewalls.

mimugmail commented 6 years ago

Ok, just to summarize (I don't get notified when you edit existing comments):

1. On the right side of FW3 is your InfluxDB
2. On the left side of FW3 are FW1 + FW2
3. Metrics from FW3 to Influx stop suddenly
4. Metrics from FW1+FW2 stop suddenly
5. Other systems on the left side of FW3 work flawlessly

Correct?

If yes, are the working systems in 5. also FreeBSD? I'd suspect the FreeBSD port of Telegraf and not the plugin itself. The plugin only calls Telegraf to stop or start it. We are not responsible for the Telegraf code. There were some bugs with the ping plugin some months ago, so the chance that this problem is specific to FreeBSD is high.

After this we could try to debug within a TeamViewer session and might open an issue on the telegraf project.

Pecadis commented 6 years ago

Yup, that's right.

I wasn't aware that there is a difference between the FreeBSD port and the plugin. Sorry for that. Generally your argument is right, and my other machines are non-BSD. What is the right channel to address this kind of issue? I opened an issue on the InfluxDB forum at the same time as this one but haven't received any feedback yet.

https://community.influxdata.com/t/telegraf-plugin-on-opnsense-stops-suddenly/5746

mimugmail commented 6 years ago

As a next step you should try to install an OPNsense device in the same network as the Influx to completely rule out any firewall issues. If the problem still occurs on that device, we have to search the logs for errors when the crash happens and open an issue on the Telegraf tracker (https://github.com/influxdata/telegraf/issues). But ATM we don't have enough data, since you are the only one reporting such an error.
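Before digging into Telegraf itself, a plain TCP connection test from the test device to the Influx port can confirm that nothing in between is blocking. This is a minimal sketch using only the Python standard library; the function name is hypothetical and the host in the example is a placeholder.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder host; 8086 is the InfluxDB port mentioned in this thread.
    print(can_connect("127.0.0.1", 8086))
```

A successful connect only shows that the port is reachable at that moment; it says nothing about whether writes are accepted.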

mimugmail commented 6 years ago

btw .. did you update to the latest 18.7.2, since there was an update to the Telegraf pkg?

Pecadis commented 6 years ago

Good idea, I will set up a new OPNsense directly in the network of Influx. That might take some time to test, but I will come back.

Yes, I updated to 18.7.2 yesterday. I will keep an eye on how it behaves.

mimugmail commented 6 years ago

Thanks 👍

mimugmail commented 6 years ago

Any progress on this?

Pecadis commented 6 years ago

I have tested it in different ways, but it looks like the issue doesn't appear on a plain Opnsense without any load or configuration. I would say that we close this issue until more people are affected by it.

Thanks for your ideas and support anyway.

opnsense / plugins

Telegraf plugin on Opnsense stops suddenly #724