Closed v98765 closed 1 year ago
Hi,
At least a log message to STDERR, on init, about a failing conn to a vcenter server (with no retry mechanism)
Here is the error message from your logs, showing the failing connection error immediate on the start of Telegraf:
Jan 26 08:40:46 centos8 telegraf[15954]: 2022-01-26T08:40:46Z E! [telegraf] Error running agent: starting input inputs.vsphere: Post "https://localhost/sdk": dial tcp 127.0.0.1:443: connect: connection refused
To be clear this is expected behavior. If a connection is not made, then we exit. If we were to ignore the connection issue, you could possibly never get metrics or think that since Telegraf continued to run, everything is in a good state and therefore no action is required on your part.
Based on your expected behavior I am not sure there is anything further to add here. Could you explain more of what you were after?
Thanks
I'll try to explain
/tmp/telegraf-1.21.2/usr/bin# ./telegraf -config /etc/telegraf/telegraf.d/vcenter_vm.conf -test > out
2022-01-27T09:18:31Z I! Starting Telegraf 1.21.2
2022-01-27T09:18:31Z I! Loaded inputs: vsphere
2022-01-27T09:18:31Z I! Loaded aggregators:
2022-01-27T09:18:31Z I! Loaded processors:
2022-01-27T09:18:31Z W! Outputs are not used in testing mode!
2022-01-27T09:18:31Z I! Tags enabled: host=distktrlmon
2022-01-27T09:18:31Z I! [inputs.vsphere] Starting plugin
2022-01-27T09:18:36Z I! [inputs.vsphere] Stopping plugin
/tmp/telegraf-1.21.2/usr/bin# echo $?
0
then copy vcenter_vm.conf to vcenter_vm1.conf and change url to localhost
/tmp/telegraf-1.21.2/usr/bin# ./telegraf -config /etc/telegraf/telegraf.d/vcenter_vm1.conf -test
2022-01-27T09:13:48Z I! Starting Telegraf 1.21.2
2022-01-27T09:13:48Z I! Loaded inputs: vsphere
2022-01-27T09:13:48Z I! Loaded aggregators:
2022-01-27T09:13:48Z I! Loaded processors:
2022-01-27T09:13:48Z W! Outputs are not used in testing mode!
2022-01-27T09:13:48Z I! Tags enabled: host=distktrlmon
2022-01-27T09:13:48Z I! [inputs.vsphere] Starting plugin
2022-01-27T09:13:48Z E! [agent] Starting input inputs.vsphere: Post "https://localhost/sdk": dial tcp 127.0.0.1:443: connect: connection refused
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2fe0652]
goroutine 67 [running]:
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*Endpoint).Collect(0x0, {0x5863760, 0xc0000740c0}, {0x58cfeb8, 0xc0005f6dc0})
/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/endpoint.go:826 +0x72
github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather.func1(0x0)
/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:328 +0x78
created by github.com/influxdata/telegraf/plugins/inputs/vsphere.(*VSphere).Gather
/go/src/github.com/influxdata/telegraf/plugins/inputs/vsphere/vsphere.go:326 +0x68
/tmp/telegraf-1.21.2/usr/bin# echo $?
2
PLAY [Converge] ****************************************************************
TASK [Gathering Facts] *********************************************************
ok: [centos8]
TASK [Include v98765_telegraf] *************************************************
TASK [v98765_telegraf : Create system group] ***********************************
ok: [centos8]
TASK [v98765_telegraf : Create system user] ************************************
ok: [centos8]
TASK [v98765_telegraf : Ensure existense of /tmp/telegraf-v1.21.1] *************
ok: [centos8]
TASK [v98765_telegraf : Unarchive telegraf files] ******************************
skipping: [centos8]
TASK [v98765_telegraf : Copy telegraf to /usr/local/bin] ***********************
ok: [centos8]
TASK [v98765_telegraf : create telegraf configs directory] *********************
ok: [centos8] => (item=/etc/telegraf)
ok: [centos8] => (item=/etc/telegraf/telegraf.d)
TASK [v98765_telegraf : create systemd service unit] ***************************
ok: [centos8]
TASK [v98765_telegraf : configure telegraf.conf] *******************************
ok: [centos8]
TASK [v98765_telegraf : configure telegraf_vmware_vm.conf] *********************
changed: [centos8]
TASK [v98765_telegraf : ensure telegraf service is started and enabled] ********
ok: [centos8]
RUNNING HANDLER [v98765_telegraf : reload telegraf] ****************************
fatal: [centos8]: FAILED! => {"changed": false, "msg": "Unable to reload service telegraf: telegraf.service is not active, cannot reload.\n"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
centos8 : ok=10 changed=1 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
CRITICAL Ansible return code was 2
If we were to ignore the connection issue, you could possibly never get metrics or think that since Telegraf continued to run, everything is in a good state and therefore no action is required on your part.
solution: telegraf metrics for every state of input plugin
curl http://telegraf-host:xxx/metrics
telegraf_plugins{name="vshpere", status="down"} 1
telegraf_plugins{name="vshpere", status="up"} 0
telegraf_plugins{name="snmp", status="down"} 0
telegraf_plugins{name="snmp", status="up"} 10
telegraf_plugins{name="kernel", status="down"} 0
telegraf_plugins{name="kernel", status="up"} 1
I agree with @v98765 that this is very unexpected behavior. I would have expected the vsphere plugin to exit itself, not the entire telegraf process. This has precedent and is handled as I would expect it with other plugins by writing to the log both in test and in production which is what the following plugins do.
E! [inputs.docker] Error in plugin: Cannot connect to the Docker daemon at unix:///var/run/docker.sock2. Is the docker daemon running?
E! [inputs.exec] Error in plugin: exec: fork/exec /usr/local/bin/myexe: no such file or directory for command '/usr/local/bin/myexe':
E! [inputs.http] Error in plugin: [url=http://localhost:5764/api/v1/_status]: Get "http://localhost:5764/api/v1/_status": dial tcp 127.0.0.1:5764: connect: connection refused
E! [inputs.prometheus] Error in plugin: error making HTTP request to http://localhost:9100/metrics: Get "http://localhost:9100/metrics": dial tcp 127.0.0.1:9100: connect: connection refused
E! [inputs.redfish] Error in plugin: received status code 401 (Unauthorized) for address https://10.0.10.1/, expected 200
In other words, if my vsphere happens to not be accessible, the very last thing I want to have happen is to immediately lose all of my metrics and page the entire team for what looks like a massive outage.
Is this something where a PR would be accepted or usually handled by the internal team?
I'm a bit confused over what the desired behavior is supposed to be. My code is just passing the error back to the caller, which seems to cause the exit if the error is non-nil. What is the desired behavior here?
Also, I noticed that it crashed with a nil-pointer error in test mode. I pulled the latest code from "master" and the error disappeared.
@powersj Can you give me some guidance on how to address this? If the vsphere plugin handles this the way it's supposed to, maybe we should close this?
I'm a bit confused over what the desired behavior is supposed to be.
If you are connecting to an external service, then plugins should error on init if you cannot connect. I know this is less than ideal for some, but as I have explained to others before, we have seen time and time again where telegraf "just starts up" and users think everything is golden, when in fact metrics are not getting collected. This results in angry users.
Can you give me some guidance on how to address this?
We have agreed that this can be less than ideal in situations where services or networks are flaky or timing is off. As such we have allowed for plugin-by-plugin options that allow for starting the plugin even if issues are found. The MongoDB and I think a couple of other plugins now have options to ignore connection errors.
We would accept a PR that adds an option to ignore connection failures on start to resolve this issue. It must be an option that is off by default.
Adding a bit of clarification to why I added to this thread, hopefully to help others who may be in a similar situation. What we were running into was one of our vcenters falling offline (reasons are unimportant) which then caused a running Telegraf instance to fail, leading to all of our metrics going down. To us, this was much, much more drastic as it was all data lost, not just the one input. In fact, we first assumed a core routing problem on the network. It never crossed our minds that Telegraf would exit if one of its inputs wasn't online if it had been working properly.
Our solution was to fork Telegraf, have it not exit on errors like this, and have inputs give booleans on if they were alive or not (basically mimicking the Ping input). Simple monitoring on that boolean now immediately points us to the misbehaving input without affecting any other metric gathering. A bit more work to maintain from upstream but this usage fits our needs much better.
instance to fail, leading to all of our metrics going down
If telegraf was already started and collecting metrics and then crash/panic or exited after telegraf was already collecting metrics for some time, then it is a bug. Instead of a fork would you be willing to put up a PR to see what changes you had to make?
This is different than the original reporter, so ideally a new issue may be better to capture this.
I'll try my best to get a PR (and open a new issue) but the codebase is fairly deeply integrated with our internal systems so I'll have to create a sample version. The quick one to demonstrate would be the mqtt_consumer where, on Gather()
(https://github.com/influxdata/telegraf/blob/master/plugins/inputs/mqtt_consumer/mqtt_consumer.go#L312), if the state is disconnected, it will try connect()
(https://github.com/influxdata/telegraf/blob/master/plugins/inputs/mqtt_consumer/mqtt_consumer.go#L162) which, in the event of not being able to connect, will return err
and thus stop the main Telegraf process.
It's been quite some time and many, many commits since we've worked on this, apologies I can't give more info quickly. It is also extremely possible that I'm crossing some wires here so if something doesn't add up, feel free to assume it is a mistake on the part of my memory.
@mlschuh So you're saying that the plugin crashes if it can't connect to vCenter AFTER the initialization? I tried this by simply dropping the network connection to vCenter after a few minutes and it worked as expected. It was throwing this error (obviously), but kept running and was able to resume as soon as I restored the network connection. I tested this with 1.25.0.
2023-01-11T15:44:36Z E! [inputs.vsphere] Error in plugin: Post "https://192.168.1.151/sdk": dial tcp 192.168.1.151:443: connect: network is unreachable
Could you provide more information on the conditions when this happens along with a log (preferably)? I agree with @powersj that this warrants a separate issue as this would be a very serious bug!
@prydin My team has refreshed my memory on this. I can summarize by saying it was other Inputs (like MQTT) that would stop Telegraf, which would then not restart because of vcenter not being accessible. A cascading set of circumstances that do not really warrant any further attention. I will experiment with the mqtt consumer and open an issue if I can reproduce the exit on the current release.
Thank you for engaging with me on this, sorry for the confusion.
telegraf config with 2 inputs plugins
# Ansible managed
[inputs]
[[inputs.net]]
[[inputs.vsphere]]
vcenters = [ "https://localhost/sdk",]
username = "vmuser"
password = "pswd"
tagexclude = [ "dcname", "source", "vcenter", "uuid", "moid",]
vm_metric_include = [ "cpu.costop.summation", "cpu.ready.summation", "cpu.wait.summation", "cpu.run.summation", "cpu.idle.summation", "cpu.used.summation", "cpu.demand.average", "cpu.usagemhz.average", "cpu.usage.average", "mem.active.average", "mem.granted.average", "mem.consumed.average", "mem.usage.average", "mem.vmmemctl.average", "net.bytesRx.average", "net.bytesTx.average", "net.droppedRx.summation", "net.droppedTx.summation", "net.usage.average", "power.power.average", "virtualDisk.numberReadAveraged.average", "virtualDisk.numberWriteAveraged.average", "virtualDisk.read.average", "virtualDisk.readOIO.latest", "virtualDisk.throughput.usage.average", "virtualDisk.totalReadLatency.average", "virtualDisk.totalWriteLatency.average", "virtualDisk.write.average", "virtualDisk.writeOIO.latest", "sys.uptime.latest",]
host_include = []
host_exclude = [ "*",]
cluster_include = []
cluster_exclude = [ "*",]
datastore_include = []
datastore_exclude = [ "*",]
datacenter_include = []
datacenter_exclude = [ "*",]
separator = "_"
collect_concurrency = 10
discover_concurrency = 10
custom_attribute_include = [ "*",]
custom_attribute_exclude = []
insecure_skip_verify = true
historical_interval = "5m"
config test
[root@instance /]# telegraf --config /etc/telegraf/telegraf.d/telegraf_vmware_vm.conf --test
2023-01-12T15:20:35Z W! DeprecationWarning: Option "force_discover_on_init" of plugin "inputs.vsphere" deprecated since version 1.14.0 and will be removed in 2.0.0: option is ignored
2023-01-12T15:20:35Z I! Starting Telegraf 1.25.0
2023-01-12T15:20:35Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-01-12T15:20:35Z I! Loaded inputs: net vsphere
2023-01-12T15:20:35Z I! Loaded aggregators:
2023-01-12T15:20:35Z I! Loaded processors:
2023-01-12T15:20:35Z I! Loaded secretstores:
2023-01-12T15:20:35Z W! Outputs are not used in testing mode!
2023-01-12T15:20:35Z I! Tags enabled: host=instance
2023-01-12T15:20:35Z W! Deprecated inputs: 0 and 1 options
2023-01-12T15:20:35Z I! [inputs.vsphere] Starting plugin
2023-01-12T15:20:35Z E! [agent] Starting input inputs.vsphere: Post "https://localhost/sdk": dial tcp 127.0.0.1:443: connect: connection refused
> net,host=instance,interface=eth0 bytes_recv=4374i,bytes_sent=0i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=39i,packets_sent=0i 1673536836000000000
2023-01-12T15:20:35Z I! [inputs.vsphere] Stopping plugin
> net,host=instance,interface=all icmp_inaddrmaskreps=0i,icmp_inaddrmasks=0i,icmp_incsumerrors=0i,icmp_indestunreachs=0i,icmp_inechoreps=0i,icmp_inechos=0i,icmp_inerrors=0i,icmp_inmsgs=0i,icmp_inparmprobs=0i,icmp_inredirects=0i,icmp_insrcquenchs=0i,icmp_intimeexcds=0i,icmp_intimestampreps=0i,icmp_intimestamps=0i,icmp_outaddrmaskreps=0i,icmp_outaddrmasks=0i,icmp_outdestunreachs=0i,icmp_outechoreps=0i,icmp_outechos=0i,icmp_outerrors=0i,icmp_outmsgs=0i,icmp_outparmprobs=0i,icmp_outredirects=0i,icmp_outsrcquenchs=0i,icmp_outtimeexcds=0i,icmp_outtimestampreps=0i,icmp_outtimestamps=0i,ip_defaultttl=64i,ip_forwarding=1i,ip_forwdatagrams=0i,ip_fragcreates=0i,ip_fragfails=0i,ip_fragoks=0i,ip_inaddrerrors=0i,ip_indelivers=14i,ip_indiscards=0i,ip_inhdrerrors=0i,ip_inreceives=16i,ip_inunknownprotos=0i,ip_outdiscards=0i,ip_outnoroutes=0i,ip_outrequests=14i,ip_reasmfails=0i,ip_reasmoks=0i,ip_reasmreqds=0i,ip_reasmtimeout=0i,tcp_activeopens=7i,tcp_attemptfails=7i,tcp_currestab=0i,tcp_estabresets=0i,tcp_incsumerrors=0i,tcp_inerrs=0i,tcp_insegs=14i,tcp_maxconn=-1i,tcp_outrsts=7i,tcp_outsegs=14i,tcp_passiveopens=0i,tcp_retranssegs=0i,tcp_rtoalgorithm=1i,tcp_rtomax=120000i,tcp_rtomin=200i,udp_ignoredmulti=0i,udp_incsumerrors=0i,udp_indatagrams=0i,udp_inerrors=0i,udp_memerrors=0i,udp_noports=0i,udp_outdatagrams=0i,udp_rcvbuferrors=0i,udp_sndbuferrors=0i,udplite_ignoredmulti=0i,udplite_incsumerrors=0i,udplite_indatagrams=0i,udplite_inerrors=0i,udplite_memerrors=0i,udplite_noports=0i,udplite_outdatagrams=0i,udplite_rcvbuferrors=0i,udplite_sndbuferrors=0i 1673536836000000000
start telegraf
Jan 12 15:21:40 instance systemd[1]: Started Telegraf.
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z W! DeprecationWarning: Value "netsnmp" for option "snmp_translator" of plugin "agent" deprecated since version 1.25.0 and will be removed in 2.0.0: Use 'gosmi' instead
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z W! DeprecationWarning: Option "force_discover_on_init" of plugin "inputs.vsphere" deprecated since version 1.14.0 and will be removed in 2.0.0: option is ignored
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Starting Telegraf 1.25.0
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Loaded inputs: net vsphere
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Loaded aggregators:
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Loaded processors:
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Loaded secretstores:
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Loaded outputs: influxdb
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! Tags enabled:
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z W! Deprecated inputs: 0 and 1 options
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:10s
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z I! [inputs.vsphere] Starting plugin
Jan 12 15:21:40 instance telegraf[1129]: 2023-01-12T15:21:40Z E! [telegraf] Error running agent: starting input inputs.vsphere: Post "https://localhost/sdk": dial tcp 127.0.0.1:443: connect: connection refused
Jan 12 15:21:40 instance systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Jan 12 15:21:40 instance systemd[1]: telegraf.service: Failed with result 'exit-code'.
process exited
@v98765 This is by design as @powersj pointed out. I'm about to push some code that allows you to change the behavior the same way the MongoDB plugin does. Stay tuned.
Relevent telegraf.conf
Logs from Telegraf
System info
centos 8
Docker
No response
Steps to reproduce
Expected behavior
At least a log message to STDERR, on init, about a failing conn to a vcenter server (with no retry mechanism)
Actual behavior
telegraf.service: Main process exited
Additional info
same issue, but mongodb plugin https://github.com/influxdata/telegraf/issues/10078