VMware vSphere: no dashboard, no metrics

sykss commented 5 years ago

Relevant telegraf.conf:


/etc/telegraf/telegraf.conf 

## VMs
## Typical VM metrics (if omitted or empty, all metrics are collected)
[[inputs.vsphere]]
  interval = "60s"
   vcenters = [ "https://pcc.ovh.com/sdk" ]
   username = "grafana@VCENTER"
   password = "123456"

vm_metric_include = [
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.run.summation",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.wait.summation",
"mem.active.average",
"mem.granted.average",
"mem.latency.average",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.usage.average",
"power.power.average", 
"virtualDisk.numberReadAveraged.average",
"virtualDisk.numberWriteAveraged.average",
"virtualDisk.read.average",
"virtualDisk.readOIO.latest",
"virtualDisk.throughput.usage.average",
"virtualDisk.totalReadLatency.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.write.average",
"virtualDisk.writeOIO.latest",
"sys.uptime.latest",
]
# vm_metric_exclude = [] ## Nothing is excluded by default
# vm_instances = true ## true by default

## Hosts 
## Typical host metrics (if omitted or empty, all metrics are collected)
host_metric_include = [
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"disk.deviceReadLatency.average",
"disk.deviceWriteLatency.average",
"disk.kernelReadLatency.average",
"disk.kernelWriteLatency.average",
"disk.numberReadAveraged.average",
"disk.numberWriteAveraged.average",
"disk.read.average",
"disk.totalReadLatency.average",
"disk.totalWriteLatency.average",
"disk.write.average",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"net.usage.average",
"power.power.average",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"sys.uptime.latest",
]
# host_metric_exclude = [] ## Nothing excluded by default
# host_instances = true ## true by default

## Clusters 
# cluster_metric_include = [] ## if omitted or empty, all metrics are collected
# cluster_metric_exclude = [] ## Nothing excluded by default
# cluster_instances = true ## true by default

## Datastores 
datastore_metric_include = [] ## if omitted or empty, all metrics are collected
# datastore_metric_exclude = [] ## Nothing excluded by default
# datastore_instances = false ## false by default for Datastores only

## Datacenters
datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
# datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
# datacenter_instances = false ## false by default for Datastores only

## Plugin Settings 
## separator character to use for measurement and field names (default: "_")
# separator = "_"

## number of objects to retrieve per query for realtime resources (vms and hosts)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 max_query_objects = 64

## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 max_query_metrics = 64

## number of go routines to use for collection and discovery of objects and metrics
 collect_concurrency = 2
 discover_concurrency = 1

## whether or not to force discovery of new objects on initial gather call before collecting metrics
## when true for large environments this may cause errors for time elapsed while collecting metrics
## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
# force_discover_on_init = false

## the interval before (re)discovering objects subject to metrics collection (default: 300s)
 object_discovery_interval = "300s"

## timeout applies to any of the api request made to vcenter
 timeout = "1800s"

## Optional SSL Config
# ssl_ca = "/path/to/cafile"
# ssl_cert = "/path/to/certfile"
# ssl_key = "/path/to/keyfile"
## Use SSL but skip chain & host verification
insecure_skip_verify = true

System info:

telegraf version 1.11, grafana v4.6.3, ubuntu 16.04, influxdb, esxi v6.0

Steps to reproduce:

systemctl start telegraf
systemctl restart grafana

Expected behavior:

Actual behavior:

Telegraf seems to work: there are no errors in the logs (journalctl -fu telegraf / journalctl -fu influxdb). However, I can't see any graphs or metrics in the Grafana dashboard "VMware vSphere overview", even though the data source is configured correctly. I don't understand why, because in every tutorial and configuration I have seen, the dashboard shows data (average CPU, disk space, RAM consumption) right after Grafana is installed. In my case I only see the menus with the names of my datastores, VMs, hosts and my PCC.

Additional info:

Jul 23 13:55:10 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:10Z D! [outputs.influxdb] wrote batch of 3995 metrics in 271.794872ms
Jul 23 13:55:10 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:10Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
Jul 23 13:55:10 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:10Z D! [input.vsphere] Discovering resources for datastore
Jul 23 13:55:13 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:13Z D! [input.vsphere] Discovering resources for datastore
Jul 23 13:55:20 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:20Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
Jul 23 13:55:20 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:20Z D! [input.vsphere] Discovering resources for datacenter
Jul 23 13:55:20 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:20Z D! [input.vsphere]: No parent found for Folder:group-d1 (ascending from Folder:group-d1)
Jul 23 13:55:21 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:21Z D! [input.vsphere] Discovering resources for cluster
Jul 23 13:55:21 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:21Z D! [input.vsphere] Discovering resources for host
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [outputs.influxdb] wrote batch of 2 metrics in 4.391013ms
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [outputs.influxdb] buffer fullness: 0 / 100000 metrics.
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [input.vsphere]: Latest: 2019-07-23 13:54:31.477044 +0000 UTC, elapsed: 63.679479, resource: datacenter
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [input.vsphere]: Sampling period for datacenter of 300 has not elapsed on pcc.ovh.com
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [input.vsphere]: Latest: 2019-07-23 13:54:30.152952 +0000 UTC, elapsed: 65.056572, resource: datacenter
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [input.vsphere]: Sampling period for datacenter of 300 has not elapsed on pcc.ovh.com
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [input.vsphere]: Latest: 2019-07-23 13:55:01.366873 +0000 UTC, elapsed: 33.923656, resource: host
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [input.vsphere]: Collecting metrics for 11 objects of type host for pcc.ovh.com
Jul 23 13:55:30 prod-monitoring-01-rbx telegraf[11208]: 2019-07-23T13:55:30Z D! [input.vsphere]: Queuing query: 11 objects, 2180 metrics (0 remaining) of type host for pcc.ovh.com. Total objects 11 (final chunk)

danielnelson commented 5 years ago

You will want to look at the data used by the dashboard and check if your database contains values for those items. Which dashboard exactly are you using?

sykss commented 5 years ago

I use the "VMware vSphere overview" dashboard and the "Telegraf metrics" dashboard. The VMware vSphere overview dashboard is empty, but the Telegraf metrics dashboard (which also uses InfluxDB) shows some graphs for my monitoring VM.

Here is the output of SHOW MEASUREMENTS on my database "influx_db_telegraf":

name: measurements
name
----
cpu
disk
diskio
kernel
mem
net
processes
swap
system
vsphere_cluster_clusterServices
vsphere_cluster_cpu
vsphere_cluster_mem
vsphere_cluster_vmop
vsphere_datacenter_vmop
vsphere_datastore_datastore
vsphere_datastore_disk
vsphere_host_cpu
vsphere_host_datastore
vsphere_host_disk
vsphere_host_hbr
vsphere_host_mem
vsphere_host_net
vsphere_host_power
vsphere_host_rescpu
vsphere_host_storageAdapter
vsphere_host_storagePath
vsphere_host_sys
vsphere_host_vflashModule
vsphere_vm_cpu
vsphere_vm_datastore
vsphere_vm_disk
vsphere_vm_mem
vsphere_vm_net
vsphere_vm_power
vsphere_vm_rescpu
vsphere_vm_sys
vsphere_vm_virtualDisk

danielnelson commented 5 years ago

It looks like it is working on the Telegraf side of things, does this query return results?:

SELECT last("uptime_latest") FROM "vsphere_host_sys" WHERE time > now() - 30m GROUP BY time(5m);
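
For anyone who wants to run a check like this outside Grafana, here is a minimal Python sketch that builds the request URL for the InfluxDB 1.x HTTP /query endpoint, the same endpoint that appears in the influxd access logs in this thread. The address http://localhost:8086 is an assumption (8086 is the InfluxDB default port), and the function name is made up for illustration.

```python
from urllib.parse import urlencode

# InfluxQL needs WHERE before the time filter and an unquoted duration in time().
QUERY = (
    'SELECT last("uptime_latest") FROM "vsphere_host_sys" '
    "WHERE time > now() - 30m GROUP BY time(5m)"
)


def build_query_url(base_url, database, influxql):
    """Return a GET URL for the InfluxDB 1.x /query endpoint (hypothetical helper)."""
    params = urlencode({"db": database, "epoch": "ms", "q": influxql})
    return f"{base_url.rstrip('/')}/query?{params}"


if __name__ == "__main__":
    # Assumed address; replace with the real InfluxDB host.
    print(build_query_url("http://localhost:8086", "influx_db_telegraf", QUERY))
```

The printed URL can be pasted into curl or a browser. An empty "series" list in the JSON response means no host uptime points were written in the last 30 minutes.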

sykss commented 5 years ago

Hi, the query returns nothing. I also tried the Oxalide plugin: I get the dashboard but no data points. I don't understand why, but I would prefer to use Telegraf + InfluxDB with Grafana rather than Oxalide + InfluxDB.

sykss commented 5 years ago

Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: [httpd] 172.30.0.6, 172.30.0.6,::1 - telegraf [25/Jul/2019:09:46:23 +0000] "GET /query?db=influx_db_telegraf&epoch=ms&q=SHOW+TAG+VALUES+FROM+vsphere_host_cpu+WITH+KEY%3Dvcenter HTTP/1.1" 200 157 "http://10.5.2.1:3000/dashboard/db/vmware-vsphere-vms" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 104f81d4-aec1-11e9-802d-005056b86c64 1605
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: ts=2019-07-25T09:46:23.740123Z lvl=info msg="Executing query" log_id=0Gr6brSW000 service=query query="SHOW TAG VALUES ON influx_db_telegraf WITH KEY = clustername WHERE (_name = 'vsphere_host_cpu') AND (_tagKey = 'clustername')"
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: [httpd] 172.30.0.6, 172.30.0.6,::1 - telegraf [25/Jul/2019:09:46:23 +0000] "GET /query?db=influx_db_telegraf&epoch=ms&q=SHOW+TAG+VALUES+FROM+vsphere_host_cpu+WITH+KEY%3Dclustername HTTP/1.1" 200 128 "http://10.5.2.1:3000/dashboard/db/vmware-vsphere-vms" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 10546661-aec1-11e9-802e-005056b86c64 838
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: ts=2019-07-25T09:46:23.746842Z lvl=info msg="Executing query" log_id=0Gr6brSW000 service=query query="SHOW TAG VALUES ON influx_db_telegraf WITH KEY = esxhostname WHERE (_name = 'vsphere_host_cpu') AND (_tagKey = 'esxhostname')"
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: [httpd] 172.30.0.6, 172.30.0.6,::1 - telegraf [25/Jul/2019:09:46:23 +0000] "GET /query?db=influx_db_telegraf&epoch=ms&q=SHOW+TAG+VALUES+FROM+vsphere_host_cpu+WITH+KEY%3Desxhostname HTTP/1.1" 200 166 "http://10.5.2.1:3000/dashboard/db/vmware-vsphere-vms" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 105570d8-aec1-11e9-802f-005056b86c64 598
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: ts=2019-07-25T09:46:23.752466Z lvl=info msg="Executing query" log_id=0Gr6brSW000 service=query query="SHOW TAG VALUES ON influx_db_telegraf WITH KEY = source WHERE (_name = 'vsphere_datastore_disk') AND (_tagKey = 'source')"
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: [httpd] 172.30.0.6, 172.30.0.6,::1 - telegraf [25/Jul/2019:09:46:23 +0000] "GET /query?db=influx_db_telegraf&epoch=ms&q=SHOW+TAG+VALUES+FROM+vsphere_datastore_disk+WITH+KEY%3Dsource HTTP/1.1" 200 265 "http://10.5.2.1:3000/dashboard/db/vmware-vsphere-vms" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 10564c54-aec1-11e9-8030-005056b86c64 700
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: ts=2019-07-25T09:46:23.757971Z lvl=info msg="Executing query" log_id=0Gr6brSW000 service=query query="SHOW TAG VALUES ON influx_db_telegraf WITH KEY = vmname WHERE (_name = 'vsphere_vm_cpu') AND (_tagKey = 'vmname')"
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: [httpd] 172.30.0.6, 172.30.0.6,::1 - telegraf [25/Jul/2019:09:46:23 +0000] "GET /query?db=influx_db_telegraf&epoch=ms&q=SHOW+TAG+VALUES+FROM+vsphere_vm_cpu+WITH+KEY%3Dvmname HTTP/1.1" 200 690 "http://10.5.2.1:3000/dashboard/db/vmware-vsphere-vms" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 105722f9-aec1-11e9-8031-005056b86c64 1060
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: ts=2019-07-25T09:46:23.801603Z lvl=info msg="Executing query" log_id=0Gr6brSW000 service=query query="SHOW TAG VALUES ON influx_db_telegraf WITH KEY = guest WHERE _tagKey = 'guest'"
Jul 25 09:46:23 prod-monitoring-01-rbx influxd[3037]: [httpd] 172.30.0.6, 172.30.0.6,::1 - telegraf [25/Jul/2019:09:46:23 +0000] "GET /query?db=influx_db_telegraf&epoch=ms&q=show+tag+values+with+key+%3D+guest HTTP/1.1" 200 243 "http://10.5.2.1:3000/dashboard/db/vmware-vsphere-vms" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 105dc950-aec1-11e9-8032-005056b86c64 1124
Jul 25 09:46:40 prod-monitoring-01-rbx influxd[3037]: [httpd] 10.5.2.1 - telegraf [25/Jul/2019:09:46:40 +0000] "POST /write?db=influx_db_telegraf HTTP/1.1" 204 0 "-" "Telegraf/1.9.1" 1a07a159-aec1-11e9-8033-005056b86c64 61016
Jul 25 09:46:50 prod-monitoring-01-rbx influxd[3037]: [httpd] 10.5.2.1 - telegraf [25/Jul/2019:09:46:50 +0000] "POST /write?db=influx_db_telegraf HTTP/1.1" 204 0 "-" "Telegraf/1.9.1" 1ffd3675-aec1-11e9-8034-005056b86c64 21488
Jul 25 09:47:00 prod-monitoring-01-rbx influxd[3037]: [httpd] 10.5.2.1 - telegraf [25/Jul/2019:09:47:00 +0000] "POST /write?db=influx_db_telegraf HTTP/1.1" 204 0 "-" "Telegraf/1.9.1" 25f2c853-aec1-11e9-8035-005056b86c64 138638
Jul 25 09:47:10 prod-monitoring-01-rbx influxd[3037]: [httpd] 10.5.2.1 - telegraf [25/Jul/2019:09:47:10 +0000] "POST /write?db=influx_db_telegraf HTTP/1.1" 204 0 "-" "Telegraf/1.9.1" 2be8a9e0-aec1-11e9-8036-005056b86c64 201640
Jul 25 09:47:40 prod-monitoring-01-rbx influxd[3037]: [httpd] 10.5.2.1 - telegraf [25/Jul/2019:09:47:40 +0000] "POST /write?db=influx_db_telegraf HTTP/1.1" 204 0 "-" "Telegraf/1.9.1" 3dca471b-aec1-11e9-8037-005056b86c64 157336

Above is the output of journalctl -fu influxdb.

danielnelson commented 5 years ago

I would add a file output to your Telegraf config:

[[outputs.file]]
  files = ["stdout"]

Then try running this plugin in the foreground:

telegraf --input-filter=vsphere --output-filter=file

If any data is being produced, it should appear here (but it won't be sent to the database in this setup).
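
When the foreground run prints a lot of text, it can help to separate metric lines from Telegraf's own log lines. Below is a rough Python sketch that counts metrics per measurement. It assumes the file output uses its default line-protocol format and that measurement names contain no escaped commas or spaces. The sample metric lines in the usage part are invented for illustration, not taken from this thread.

```python
import re
from collections import Counter

# Telegraf log lines look like "2019-07-26T14:15:21Z D! [agent] ...".
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}T\S+ [DIWE]! ")


def count_measurements(lines):
    """Count line-protocol metrics per measurement, skipping log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or LOG_LINE.match(line):
            continue
        # The measurement is everything before the first comma or space.
        measurement = re.split(r"[ ,]", line, maxsplit=1)[0]
        counts[measurement] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2019-07-26T14:15:21Z D! [agent] Connecting outputs",
        # Hypothetical metric lines:
        "vsphere_host_sys,vcenter=example uptime_latest=123i 1564150530000000000",
        "vsphere_vm_cpu,vmname=demo usage_average=1.5 1564150530000000000",
    ]
    print(dict(count_measurements(sample)))
```

If the result contains no vsphere_* measurements after a few collection intervals, the plugin is not producing data and the problem is on the Telegraf side.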

sykss commented 5 years ago

Hi, I edited telegraf.conf to add [[outputs.file]] files = ["stdout"]. Below is the result of running telegraf --input-filter=vsphere --output-filter=file:

2019-07-26T14:15:21Z I! Using config file: /etc/telegraf/telegraf.conf
2019-07-26T14:15:21Z I! Loaded inputs: inputs.vsphere
2019-07-26T14:15:21Z I! Loaded aggregators:
2019-07-26T14:15:21Z I! Loaded processors:
2019-07-26T14:15:21Z I! Loaded outputs: file
2019-07-26T14:15:21Z I! Tags enabled: host=prod-monitoring-01-rbx
2019-07-26T14:15:21Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"prod-monitoring-01-rbx", Flush Interval:30s
2019-07-26T14:15:21Z D! [agent] Connecting outputs
2019-07-26T14:15:21Z D! [agent] Attempting connection to output: file
2019-07-26T14:15:21Z D! [agent] Successfully connected to output: file
2019-07-26T14:15:21Z D! [agent] Starting service inputs
2019-07-26T14:15:21Z D! [input.vsphere]: Starting plugin
2019-07-26T14:15:21Z D! [input.vsphere]: Creating client: pcc-5-196-231-109.ovh.com
2019-07-26T14:15:21Z I! [input.vsphere] Option query for maxQueryMetrics failed. Using default
2019-07-26T14:15:21Z D! [input.vsphere] vCenter version is: 6.0.0
2019-07-26T14:15:21Z D! [input.vsphere] vCenter says max_query_metrics should be 64
2019-07-26T14:15:21Z D! [input.vsphere]: Discover new objects for pcc-5-196-231-109.ovh.com
2019-07-26T14:15:21Z D! [input.vsphere] Discovering resources for host
2019-07-26T14:15:23Z D! [input.vsphere] Discovering resources for vm
2019-07-26T14:15:30Z D! [input.vsphere]: Collecting metrics for 0 objects of type datacenter for pcc-5-196-231-109.ovh.com
2019-07-26T14:15:30Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for pcc-5-196-231-109.ovh.com
2019-07-26T14:15:30Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for pcc-5-196-231-109.ovh.com
2019-07-26T14:15:30Z D! [input.vsphere]: Collecting metrics for 0 objects of type datastore for pcc-5-196-231-109.ovh.com
2019-07-26T14:15:30Z D! [input.vsphere] Purged timestamp cache. 0 deleted with 0 remaining
2019-07-26T14:16:00Z D! [outputs.file] buffer fullness: 0 / 10000 metrics.
2019-07-26T14:16:00Z D! [input.vsphere]: Latest: 2019-07-26 12:15:36.598531 +0000 UTC, elapsed: 35.004718, resource: datacenter
2019-07-26T14:16:00Z D! [input.vsphere]: Sampling period for datacenter of 300 has not elapsed on pcc-5-196-231-109.ovh.com
2019-07-26T14:16:00Z D! [input.vsphere]: Latest: 2019-07-26 12:15:36.707539 +0000 UTC, elapsed: 35.011717, resource: host
2019-07-26T14:16:00Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for pcc-5-196-231-109.ovh.com
2019-07-26T14:16:00Z D! [input.vsphere]: Latest: 2019-07-26 12:15:36.814573 +0000 UTC, elapsed: 35.020694, resource: vm
2019-07-26T14:16:00Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for pcc-5-196-231-109.ovh.com
2019-07-26T14:16:00Z D! [input.vsphere]: Latest: 2019-07-26 12:15:36.923549 +0000 UTC, elapsed: 35.017718, resource: datastore
2019-07-26T14:16:00Z D! [input.vsphere]: Sampling period for datastore of 300 has not elapsed on pcc-5-196-231-109.ovh.com
2019-07-26T14:16:00Z D! [input.vsphere] Purged timestamp cache. 0 deleted with 0 remaining
2019-07-26T14:16:30Z D! [outputs.file] buffer fullness: 0 / 10000 metrics.
2019-07-26T14:16:30Z D! [input.vsphere]: Latest: 2019-07-26 12:15:36.598531 +0000 UTC, elapsed: 65.031477, resource: datacenter
2019-07-26T14:16:30Z D! [input.vsphere]: Sampling period for datacenter of 300 has not elapsed on pcc-5-196-231-109.ovh.com
2019-07-26T14:16:30Z D! [input.vsphere]: Latest: 2019-07-26 12:16:06.719256 +0000 UTC, elapsed: 35.021719, resource: host
2019-07-26T14:16:30Z D! [input.vsphere]: Collecting metrics for 0 objects of type host for pcc-5-196-231-109.ovh.com
2019-07-26T14:16:30Z D! [input.vsphere]: Latest: 2019-07-26 12:16:06.835267 +0000 UTC, elapsed: 35.015717, resource: vm
2019-07-26T14:16:30Z D! [input.vsphere]: Collecting metrics for 0 objects of type vm for pcc-5-196-231-109.ovh.com
2019-07-26T14:16:30Z D! [input.vsphere]: Latest: 2019-07-26 12:15:36.923549 +0000 UTC, elapsed: 65.034439, resource: datastore
2019-07-26T14:16:30Z D! [input.vsphere]: Sampling period for datastore of 300 has not elapsed on pcc-5-196-231-109.ovh.com
2019-07-26T14:16:30Z D! [input.vsphere] Purged timestamp cache. 0 deleted with 0 remaining
2019-07-26T14:16:34Z D! [input.vsphere] Discovering resources for datastore
2019-07-26T14:16:34Z D! [agent] Stopping service inputs
2019-07-26T14:16:34Z D! [input.vsphere]: Stopping plugin
2019-07-26T14:16:34Z D! [input.vsphere]: Waiting for endpoint pcc-5-196-231-109.ovh.com to finish
2019-07-26T14:16:34Z E! [input.vsphere]: Error while getting metric metadata. Discovery will be incomplete. Error: Post https://pcc-5-196-231-109.ovh.com/sdk: context canceled
2019-07-26T14:16:34Z D! [input.vsphere]: Stop requested for worker pool. Exiting.
2019-07-26T14:16:34Z D! [input.vsphere] Discovering resources for datacenter
2019-07-26T14:16:34Z E! [input.vsphere]: Error in discovery for pcc-5-196-231-109.ovh.com: Post https://pcc-5-196-231-109.ovh.com/sdk: context canceled
2019-07-26T14:16:34Z D! [input.vsphere]: Exiting discovery goroutine for pcc-5-196-231-109.ovh.com
2019-07-26T14:16:34Z D! [agent] Input channel closed
2019-07-26T14:16:34Z I! [agent] Hang on, flushing any cached metrics before shutdown
2019-07-26T14:16:34Z D! [outputs.file] buffer fullness: 0 / 10000 metrics.
2019-07-26T14:16:34Z D! [agent] Closing outputs

I don't know what this means, but there are several errors in the discovery.

ilyam8 commented 5 years ago

Hey @sykss, this is not related to the problem, but

## timeout applies to any of the api request made to vcenter
timeout = "1800s"

30 minutes, really? 😄

sykss commented 5 years ago

Haha, I tried everything I could, but it is still not working.

danielnelson commented 5 years ago

@prydin Any advice on this error, is it a timeout? Might be a follow on error that we are reporting instead of the initial error?

2019-07-26T14:16:34Z E! [input.vsphere]: Error while getting metric metadata. Discovery will be incomplete. Error: Post https://pcc-5-196-231-109.ovh.com/sdk: context canceled

prydin commented 5 years ago

Looks like a benign race condition. The Telegraf process is on its way to shut down, but there's a vSphere object discovery running in the background. To make sure that the background discovery exits quickly and gracefully, we simply cancel the context that was passed down to it. If it's currently waiting for an API call to complete, it will return an error, since its context was cancelled. This is totally fine and by design, but we may want to suppress the error message.

prydin commented 5 years ago

(I wouldn't even call it a race condition. It's just a graceful shutdown that generates an expected error that probably shouldn't be printed)
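
As a rough analogy only (Telegraf is written in Go and this is not its code), the same pattern can be sketched in Python with asyncio: a background task that is cancelled during shutdown sees a cancellation error that is expected and can be recorded quietly instead of being reported as a failure. All names here are hypothetical.

```python
import asyncio


async def discover(log):
    """Stand-in for a background discovery waiting on a slow API call."""
    try:
        await asyncio.sleep(3600)  # pretend this is a pending vCenter request
        log.append("discovery finished")
    except asyncio.CancelledError:
        # Expected during shutdown: note it quietly rather than as an error.
        log.append("discovery cancelled during shutdown")
        raise


async def run_and_stop():
    log = []
    task = asyncio.create_task(discover(log))
    await asyncio.sleep(0)  # let the task start and block on its "API call"
    task.cancel()           # shutdown: cancel the in-flight work
    try:
        await task
    except asyncio.CancelledError:
        pass                # benign: we requested the cancellation ourselves
    return log


if __name__ == "__main__":
    print(asyncio.run(run_and_stop()))
```

The cancelled call still surfaces an error to whoever was waiting on it, which mirrors the "context canceled" messages in the log above; whether to print it is a logging decision, not a sign that discovery was broken.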

danielnelson commented 5 years ago

@sykss Let's enable the internal plugin as well:

[[inputs.internal]]
[[outputs.file]]
  files = ["stdout"]

Run it for a few minutes:

telegraf --input-filter=vsphere:internal --output-filter=file

sykss commented 5 years ago

Hi, thank you for all your responses. I configured telegraf.conf with

[[inputs.internal]]
[[outputs.file]]
  files = ["stdout"]

and ran:

telegraf --input-filter=vsphere:internal --output-filter=file

The result is below in a text file (the output is very long):

debug_telegraf.txt

Thank you very much again. syks

danielnelson commented 5 years ago

From this log it looks like everything worked, perhaps you are having the same issue as in #6187, where when Telegraf first starts up everything is working, but then over time it stops reporting.

prydin commented 5 years ago

Not sure if this has any impact here, but you may want to try force_discover_on_init = true in your config file. In fact (note to self), we should change that to be the default.

yishaihl commented 5 years ago

@prydin you think it might solve the problem?

sykss commented 5 years ago

I'm going to look at that issue. I tried force_discover_on_init = true, but it did not help: the dashboard still shows nothing. I have two questions.

First, when I use the dashboard named "Telegraf metrics", I see graphs, but only for my monitoring server (Grafana + InfluxDB + Telegraf). That dashboard uses Telegraf and InfluxDB too, just like the VMware vSphere overview dashboard. So why do I get data there and none in the vSphere overview?

Second, I use Grafana v4.6.3, and the dependencies listed on the VMware vSphere overview dashboard page recommend Grafana 6.2.1. Could my issue come from that? Thanks for your answer.

prydin commented 5 years ago

@sykss I'm not sure how to help you here. From the logs, it looks like everything is working. Were you ever able to run it with the file output to verify that all the data is there? If data stops flowing at some point, it would be valuable to look at debug logs for that time period. If data is flowing, the problem is more likely to be on the Grafana side.
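
One way to skim long debug logs for this is sketched below in Python. It pulls the "Collecting metrics for N objects of type T" lines (the wording is taken from the logs posted in this thread) and lists resource types that never collected anything. The function names are made up, and the message format may differ in other Telegraf versions.

```python
import re
from collections import defaultdict

# Matches e.g. "Collecting metrics for 11 objects of type host for pcc.ovh.com"
COLLECT = re.compile(r"Collecting metrics for (\d+) objects of type (\w+) for (\S+)")


def objects_per_cycle(lines):
    """Map resource type -> list of object counts, one entry per collection cycle."""
    seen = defaultdict(list)
    for line in lines:
        match = COLLECT.search(line)
        if match:
            seen[match.group(2)].append(int(match.group(1)))
    return dict(seen)


def never_collected(lines):
    """Return resource types whose every cycle reported 0 objects."""
    cycles = objects_per_cycle(lines)
    return sorted(kind for kind, counts in cycles.items() if not any(counts))


if __name__ == "__main__":
    import sys

    # Usage: journalctl -u telegraf | python this_script.py
    print(never_collected(sys.stdin))
```

A type that stays at 0 objects for the whole log suggests discovery never completed for it, while non-zero counts point away from Telegraf and toward InfluxDB or Grafana.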

sykss commented 5 years ago

@prydin I ran it with the file output, and all the data seems to be there, as shown in the debug file I sent previously.

sykss commented 5 years ago

I found the issue: the version of grafana-server was too old. I updated it and now it works! Thank you for your help :)
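
For reference, a version mismatch like this one can be checked with a few lines of Python. This is only a sketch that handles plain dotted numeric versions such as the two mentioned in this thread (4.6.3 installed, 6.2.1 recommended by the dashboard page); the helper names are hypothetical.

```python
def parse_version(text):
    """Turn 'v4.6.3' or '6.2.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(installed, required):
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    print(meets_minimum("v4.6.3", "6.2.1"))
```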
