Closed: Pecadis closed this issue 6 years ago
Does the process stop on both firewalls, or does the graphing stop on Influx?
Interestingly, Influx is not getting any more data, but OPNsense says that the services are still running. When I manually restart the services, Influx gets the data again instantly.
Then check the firewall logs and traffic on FW03.
I already monitored it and couldn't find anything obvious. The other VMs in the same network as FW01 and FW02 could push their data through FW03 without any issues.
On FW03, look at the traffic on the inside interface. Is it blocked? Forwarded? There must be something ...
Thanks,
I'll check it.
I have monitored it so far and couldn't find any indication of why it stopped working (red marker). On FW01 and FW03 I can still see the other machines sending their metrics. FW02 is still sending them at the moment.
Edit1: I have rebooted all firewalls, and the issue is also appearing on FW03.
You have to check the logs and traffic of FW03; we don't need graphs for this. Do you see packets coming in on the interface connected to FW01 and FW02? Good. Then look at the interface where Influx sits to see whether they leave. If not, check the logs on FW03; there must be something.
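As a complement to watching the packets on FW03, a quick probe run from FW01 or FW02 can show whether a TCP connection to the InfluxDB port can be opened at all. The following is only a minimal sketch, assuming Python 3 is available on the host you test from; the function name `can_connect` is made up for this example, and a successful connect says nothing about whether Telegraf itself is still writing.

```python
import socket


def can_connect(host: str, port: int = 8086, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only. 8086 is the InfluxDB port
    mentioned in this thread; the timeout mirrors the configured value of 5.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `can_connect("ip.of.influxdb")` with the real address in place of the placeholder returns `False` if the connection is refused, times out, or the name cannot be resolved. If it returns `True` while metrics are missing, the path through FW03 is probably open and the sender deserves a closer look.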
OK, I was able to redirect all the syslog logs and align them (I need to align the time zone on all machines later...).
Basically, here are my findings.
My central log server is still receiving some `/write?db=monitor HTTP/1.1` requests
from FW02 in this case, but no `/write?consistency=any&db=monitor HTTP/1.1` requests
once the metrics are no longer being sent.
I have attached the logs.
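To make the difference between the two request types easier to spot in a large log, the lines can be counted by whether the `/write` query string carries a `consistency` parameter. This is a rough sketch only: the exact log line format is an assumption (any whitespace-separated line containing the request target should work), and the names `classify_write` and `count_writes` are invented for this example.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def classify_write(line: str) -> str:
    """Label a log line by the kind of /write request it contains.

    Assumes the request target (e.g. "/write?db=monitor") appears as a
    whitespace-separated token somewhere in the line.
    """
    target = next((tok for tok in line.split() if tok.startswith("/write")), None)
    if target is None:
        return "other"
    # parse_qs turns "consistency=any&db=monitor" into a dict of lists.
    query = parse_qs(urlsplit(target).query)
    if "consistency" in query:
        return "write_with_consistency"
    return "write_plain"


def count_writes(lines) -> Counter:
    """Count how many lines fall into each category."""
    return Counter(classify_write(line) for line in lines)
```

Running `count_writes` over the log of each firewall, split into the period before and after the metrics stop, would show whether only the requests with `consistency=any` disappear.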
@Pecadis for me it's some kind of configuration error or a firewall in between, not related to the plugin. Can we close this one?
@mimugmail I understand your point entirely, and I agree with you that it is partly a configuration error, but I would expect the plugin to work even with non-ideal settings, or at least to show an error message when it can't send certain messages to the collector.
But this is not related to the plugin, more to the software itself. So Influx should send an alert if it hasn't received metrics for the last X cycles. I don't think there is a way to achieve this via the client.
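The alerting idea (flag a sender once it has been silent for more than X collection cycles) could look roughly like this on the receiving side. It is a minimal sketch in Python and not an InfluxDB or Telegraf feature: `stale_hosts`, the `last_seen` mapping, and how the timestamps are obtained are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def stale_hosts(last_seen, now, interval_s, max_missed_cycles):
    """Return the sorted names of hosts whose newest metric is too old.

    last_seen: dict mapping host name -> datetime of its most recent metric
               (hypothetical input, e.g. taken from a query on the database).
    interval_s: collection interval in seconds (1 in the setup from this thread).
    max_missed_cycles: the "X cycles" tolerated before a host counts as stale.
    """
    threshold = timedelta(seconds=interval_s * max_missed_cycles)
    return sorted(host for host, ts in last_seen.items() if now - ts > threshold)
```

For example, with an interval of 1 second and a tolerance of 5 cycles, a host whose last metric is 30 seconds old is reported, while one that sent data a second ago is not.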
InfluxDB does notify me that the metrics are not coming in, as I already posted.
The Points are:
I would like to help as best I can to avoid this kind of inconsistency in the future, at least by pointing out a combination of settings that one should avoid. If you want, I can send you the current settings of all my firewalls.
Ok, just to summarize (I don't get notified when you edit existing comments):
Correct?
If yes, are the working systems in 5. also FreeBSD? I'd suspect the FreeBSD port of telegraf and not the plugin itself. The plugin only calls telegraf to stop or start. We are not responsible for the telegraf code. There were some bugs with the ping plugin some months ago, so the chance that this problem is specific to FreeBSD is high.
After this we could try to debug within a TeamViewer session and might open an issue on the telegraf project.
Yup, that's right.
I wasn't aware that there is a difference between the FreeBSD port and the plugin. Sorry for that. Generally your argument is right, and my other machines are non-BSD. What is the right channel to address this kind of issue? I opened an issue on the InfluxDB forum at the same time as this one but haven't received any feedback yet.
https://community.influxdata.com/t/telegraf-plugin-on-opnsense-stops-suddenly/5746
As a next step you should try to install an OPNsense device in the same network as the Influx to totally exclude any firewall issues. If the problem still occurs on that device, we have to search the logs for errors when the crash happens and open an issue at https://github.com/influxdata/telegraf/issues. But at the moment we don't have enough data, since you are the only one reporting such an error.
By the way, have you updated to the latest 18.7.2? There was an update to the Telegraf package.
Good idea, I will set up a new OPNsense directly in the network of Influx. That might take some time to test, but I will come back.
Yes, I updated to 18.7.2 yesterday. I will observe how it behaves.
Thanks 👍
Any progress on this?
I have tested it in different ways, but it looks like the issue doesn't appear on a plain OPNsense without any load or configuration. I would suggest that we close this issue until more people are affected by it.
Thanks for your ideas and support anyway.
Hi,
First of all, thank you for implementing Telegraf as a plugin for OPNsense. It is a great benefit to usability in terms of configuration. Unfortunately, I am currently scratching my head because the Telegraf plugin on 2 of 3 (FW01 and FW02) of my OPNsense firewalls always stops after a certain time. ~On the 3rd Opnsense (FW03), it is working flawlessly.~ It also appears on FW03, but the interesting part is that all other machines (OtherVM in the graph) in the same network as FW01 and FW02 are sending their metrics without any issues.
FW03 has a firewall rule set to allow any connection to the InfluxDB from any source on port 8086.
Here is a brief overview of my network.
And here are the Telegraf settings of the firewalls. All firewalls are set up identically and have the latest update (OPNsense 18.1.11-amd64, FreeBSD 11.1-RELEASE-p11, OpenSSL 1.0.2o 27 Mar 2018).
Interval: 1
Round Interval: true
Metric Batch Size: 10000
Metric Buffer Limit: 100000
Collection Jitter: 0
Flush Interval: 2
Flush Jitter: 1

Activated Inputs: CPU, Per-CPU, Total CPU, Disk, Disk IO, Memory, Processes, Swap, System, Network

Output Settings:
InfluxURL: http://ip.of.influxdb:8086
database: monitor
Timeout: 5

Playing around with different batch sizes, buffer limits, and timeouts didn't help either.
Did I miss something or any kind of config?
Thanks in advance, Pecadis