pryorda / vmware_exporter

VMWare vCenter Exporter for Prometheus
BSD 3-Clause "New" or "Revised" License

Errors during vm or host metrics collection #217

Open seyfbarhoumi opened 4 years ago

seyfbarhoumi commented 4 years ago

2020-07-24 14:33:36,624 INFO:Starting vm metrics collection
2020-07-24 14:33:36,624 INFO:Fetching vim.VirtualMachine inventory
2020-07-24 14:33:36,624 INFO:Retrieving service instance content
2020-07-24 14:33:36,627 INFO:START: _vmware_get_vm_perf_manager_metrics
2020-07-24 14:33:37,288 INFO:Retrieved service instance content
2020-07-24 14:33:58,121 INFO:FIN: _vmware_get_vm_perf_manager_metrics
2020-07-24 14:33:58,186 INFO:Finished collecting metrics from bams-vcenter.bams.corp
2020-07-24 14:33:59,125 ERROR:Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks result = g.send(result) StopIteration: [<prometheus_client.core.GaugeMetricFamily object at 0x7f1f70c75710>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f70d13f28>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f70d13358>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73b86c50>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73b86358>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73b86fd0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f70c7bcc0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f70a23550>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f70a23eb8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f70a10048>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f70a10898>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f715f7b38>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f715f7a20>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80320>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e803c8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e804a8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e805f8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80630>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e805c0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e806d8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80748>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80780>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e807f0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80898>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80828>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80908>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80940>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e808d0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e809b0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80ba8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80eb8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80ef0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80e80>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e80f98>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e8d0b8>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e8d0f0>, <prometheus_client.core.GaugeMetricFamily object at 0x7f1f73e8d128>]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 1786, in _async_render_GET
    yield self.generate_latest_metrics(request)
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 1833, in generate_latest_metrics
    request.finish()
  File "/usr/local/lib/python3.6/site-packages/twisted/web/server.py", line 286, in finish
    return http.Request.finish(self)
  File "/usr/local/lib/python3.6/site-packages/twisted/web/http.py", line 1080, in finish
    "Request.finish called on a request after its connection was lost; "
RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.

2020-07-24 14:33:59,126 INFO:Fetched vim.VirtualMachine inventory (0:00:22.501970)
Unhandled error in Deferred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
    _inlineCallbacks(r, g, status)
--- ---
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.6/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 1791, in _async_render_GET
    request.finish()
  File "/usr/local/lib/python3.6/site-packages/twisted/web/server.py", line 286, in finish
    return http.Request.finish(self)
  File "/usr/local/lib/python3.6/site-packages/twisted/web/http.py", line 1080, in finish
    "Request.finish called on a request after its connection was lost; "
builtins.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.

2020-07-24 14:34:00,061 INFO:Finished vm metrics collection
2020-07-24 14:34:01,579 ERROR:Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 1786, in _async_render_GET
    yield self.generate_latest_metrics(request)
twisted.internet.defer.FirstError: FirstError[#1, [Failure instance: Traceback: <class 'twisted.internet.defer.FirstError'>: FirstError[#0, [Failure instance: Traceback: <class 'pyVmomi.VmomiSupport.vmodl.fault.ManagedObjectNotFound'>: (vmodl.fault.ManagedObjectNotFound) { dynamicType = , dynamicProperty = (vmodl.DynamicProperty) [], msg = 'This object has been deleted or haven't been entirely created', faultCause = , faultMessage = (vmodl.LocalizableMessage) [], obj = 'vim.VirtualMachine:vm-10546' }
/usr/local/lib/python3.6/threading.py:916:_bootstrap_inner
/usr/local/lib/python3.6/threading.py:864:run
/usr/local/lib/python3.6/site-packages/twisted/_threads/_threadworker.py:46:work
/usr/local/lib/python3.6/site-packages/twisted/_threads/_team.py:190:doWork
--- ---
/usr/local/lib/python3.6/site-packages/twisted/python/threadpool.py:250:inContext
/usr/local/lib/python3.6/site-packages/twisted/python/threadpool.py:266:
/usr/local/lib/python3.6/site-packages/twisted/python/context.py:122:callWithContext
/usr/local/lib/python3.6/site-packages/twisted/python/context.py:85:callWithContext
/usr/local/lib/python3.6/site-packages/pyVmomi/VmomiSupport.py:706:
/usr/local/lib/python3.6/site-packages/pyVmomi/VmomiSupport.py:512:_InvokeMethod
/usr/local/lib/python3.6/site-packages/pyVmomi/SoapAdapter.py:1397:InvokeMethod
]]
--- ---
/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py:1342:_vmware_get_vm_perf_manager_metrics
/usr/local/lib/python3.6/site-packages/vmware_exporter/defer.py:99:parallelize
]]

Unhandled Error
Traceback (most recent call last):
  File "/usr/local/bin/vmware_exporter", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 1899, in main
    reactor.run()
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- ---
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 886, in runUntilCurrent
    f(*a, **kw)
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 699, in _runCallbacks
    current.result.cleanFailure()
  File "/usr/local/lib/python3.6/site-packages/twisted/python/failure.py", line 627, in cleanFailure
    self.value.__traceback__ = None
  File "/usr/local/lib/python3.6/site-packages/pyVmomi/VmomiSupport.py", line 663, in __setattr__
    CheckField(self._GetPropertyInfo(name), val)
  File "/usr/local/lib/python3.6/site-packages/pyVmomi/VmomiSupport.py", line 468, in GetPropertyInfo
    raise AttributeError(name)
builtins.AttributeError: __traceback__

2020-07-24 14:34:06,631 INFO:Start collecting metrics from bams-vcenter.bams.corp
2020-07-24 14:34:06,631 INFO:Starting vm metrics collection
2020-07-24 14:34:06,631 INFO:Fetching vim.VirtualMachine inventory
2020-07-24 14:34:06,631 INFO:Retrieving service instance content
2020-07-24 14:34:06,634 INFO:START: _vmware_get_vm_perf_manager_metrics
2020-07-24 14:34:07,118 INFO:Retrieved service instance content
Unhandled error in Deferred:

Traceback (most recent call last):
--- ---
  File "/usr/local/lib/python3.6/site-packages/vmware_exporter/vmware_exporter.py", line 1342, in _vmware_get_vm_perf_manager_metrics
    self.vm_labels,
  File "/usr/local/lib/python3.6/site-packages/vmware_exporter/defer.py", line 99, in parallelize
    results = yield defer.DeferredList(args, fireOnOneErrback=True)
twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'pyVmomi.VmomiSupport.vmodl.fault.ManagedObjectNotFound'>: (vmodl.fault.ManagedObjectNotFound) { dynamicType = , dynamicProperty = (vmodl.DynamicProperty) [], msg = 'This object has been deleted or haven't been entirely created', faultCause = , faultMessage = (vmodl.LocalizableMessage) [], obj = 'vim.VirtualMachine:vm-10546' }
/usr/local/lib/python3.6/threading.py:916:_bootstrap_inner
/usr/local/lib/python3.6/threading.py:864:run
/usr/local/lib/python3.6/site-packages/twisted/_threads/_threadworker.py:46:work
/usr/local/lib/python3.6/site-packages/twisted/_threads/_team.py:190:doWork
--- ---
/usr/local/lib/python3.6/site-packages/twisted/python/threadpool.py:250:inContext
/usr/local/lib/python3.6/site-packages/twisted/python/threadpool.py:266:
/usr/local/lib/python3.6/site-packages/twisted/python/context.py:122:callWithContext
/usr/local/lib/python3.6/site-packages/twisted/python/context.py:85:callWithContext
/usr/local/lib/python3.6/site-packages/pyVmomi/VmomiSupport.py:706:
/usr/local/lib/python3.6/site-packages/pyVmomi/VmomiSupport.py:512:_InvokeMethod
/usr/local/lib/python3.6/site-packages/pyVmomi/SoapAdapter.py:1397:InvokeMethod
]]

2020-07-24 14:34:18,899 INFO:Fetched vim.VirtualMachine inventory (0:00:12.267172)
2020-07-24 14:34:19,117 INFO:Finished vm metrics collection
2020-07-24 14:34:23,446 INFO:FIN: _vmware_get_vm_perf_manager_metrics
2020-07-24 14:34:23,462 INFO:Finished collecting metrics from bams-vcenter.bams.corp

pryorda commented 4 years ago

Update the Prometheus scrape timeout for this endpoint.
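
For anyone who finds this later: that means raising scrape_timeout on the exporter's job in prometheus.yml so the exporter has enough time to walk the inventory, while keeping it below scrape_interval. A minimal sketch (job name, target and timings are placeholders, not taken from this setup):

    scrape_configs:
      - job_name: 'vmware_exporter'          # placeholder job name
        scrape_interval: 60s
        # Leave headroom for slow vCenter responses, but stay below the interval.
        scrape_timeout: 55s
        static_configs:
          - targets: ['vmware-exporter:9272']   # exporter's default port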

seyfbarhoumi commented 4 years ago

This resolved the problem, but there's something I wanted to let you know: sometimes a scrape takes longer than expected and exceeds the scrape timeout. The scrape that timed out then overlaps with the next one, which makes subsequent scrapes take even longer, and at some point the exporter gets stuck and no longer returns any metrics.

pryorda commented 4 years ago

If I'm understanding you correctly, we need to implement some sort of lock to prevent multiple scrapes from running at the same time?

vsulimanec commented 3 years ago

Hi,

I would say yes; we have the same problem on version 0.13.0. We'll test 0.13.2 today.

billabongrob commented 3 years ago

I'm finding that this occurs in large environments, and I'm unsure whether it's a slow response from the vSphere (6.7) API or the exporter itself. If I disable VM collection, it works relatively well. As soon as it's enabled, it hits the fan: the first run takes ~26 seconds, the second upwards of 10-15 minutes. I'm unsure what the best practice would be for multiple datacenters/clusters.

pryorda commented 3 years ago

@billabongrob You can use different sections.
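
Splitting collection across sections means the slow VM metrics can be scraped as their own target, so they stop holding up hosts/datastores. A rough sketch of what a sectioned config.yml could look like (hostnames, credentials and the section name are placeholders, and the exact keys should be checked against the README):

    default:
      vsphere_host: "vcenter.example.local"
      vsphere_user: "monitoring_user"
      vsphere_password: "secret"
      ignore_ssl: True
      collect_only:
        hosts: True
        datastores: True
        snapshots: True
        vms: False
        vmguests: False
    vms_only:
      vsphere_host: "vcenter.example.local"
      vsphere_user: "monitoring_user"
      vsphere_password: "secret"
      ignore_ssl: True
      collect_only:
        hosts: False
        datastores: False
        snapshots: False
        vms: True
        vmguests: True

Each section can then be scraped as a separate Prometheus job, so one slow scrape does not block the other.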

kong62 commented 3 years ago

I'm finding that this occurs in large environments, and I'm unsure whether it's a slow response from the vSphere (6.7) API or the exporter itself. If I disable VM collection, it works relatively well. As soon as it's enabled, it hits the fan: the first run takes ~26 seconds, the second upwards of 10-15 minutes. I'm unsure what the best practice would be for multiple datacenters/clusters.

version: pryorda/vmware_exporter:v0.16.1

  1. I have the same problem. For now I have disabled VM collection, and it works.

  2. The Prometheus timeout can't solve the problem:

    spec:
      podMetricsEndpoints:
      - interval: 60s
        scrapeTimeout: 55s
        path: /metrics
        port: http
  3. When I set up the LIMITED section, there is no error, but there are no metrics either (see the note after the config below):

    # vi config.yml
    kind: ConfigMap
    metadata:
      labels:
        app: vmware-exporter
      name: vmware-exporter-config
    apiVersion: v1
    data:
      VSPHERE_USER: "administrator@hupu.local"
      VSPHERE_HOST: "vCenterSRV01.hupu.local"
      VSPHERE_IGNORE_SSL: "True"
      VSPHERE_COLLECT_HOSTS: "True"
      VSPHERE_COLLECT_DATASTORES: "True"
      VSPHERE_COLLECT_SNAPSHOTS: "true"
      VSPHERE_COLLECT_VMS: "false"
      VSPHERE_COLLECT_VMGUESTS: "false"
      VSPHERE_LIMITED_USER: "administrator@hupu.local"
      VSPHERE_LIMITED_HOST: "vCenterSRV01.hupu.local"
      VSPHERE_LIMITED_PASSWORD: "ss%%m#sE2L"
      VSPHERE_LIMITED_IGNORE_SSL: "True"
      VSPHERE_LIMITED_COLLECT_HOSTS: "false"
      VSPHERE_LIMITED_COLLECT_DATASTORES: "false"
      VSPHERE_LIMITED_COLLECT_SNAPSHOTS: "false"
      VSPHERE_LIMITED_COLLECT_VMS: "true"
      VSPHERE_LIMITED_COLLECT_VMGUESTS: "false"
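
One thing worth double-checking for point 3: as far as I understand, the exporter only serves a non-default section when the scrape asks for it with the section query parameter (please verify against the README), so with the LIMITED config above the scrape would also need to pass section=limited. A rough sketch (values are placeholders):

    spec:
      podMetricsEndpoints:
      - interval: 60s
        scrapeTimeout: 55s
        path: /metrics
        port: http
        # Assumption: "section" selects the VSPHERE_LIMITED_* settings above.
        params:
          section: ["limited"]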