rafal-szypulka / itm_exporter

ITM Exporter for Prometheus

Scrapes have super long runtimes #6

Closed: ruohki closed this issue 3 years ago

ruohki commented 3 years ago

At some point the ITM API seems to slow down significantly; scraping and the request to /metrics take more than a minute with the default config and around 3 managed systems. Eventually the portal server freezes as well. Any idea?

/metrics sample:

itm_scrape_duration_seconds{group="NTMEMORY"} 59.202864944
itm_scrape_duration_seconds{group="WTSYSTEM"} 59.197552453
- name: "WTSYSTEM"
  datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC021/datasets"
  labels: ["ORIGINNODE"]
  metrics: ["PCTTLPCSRT","PCTTLPRIVT","PCTTLUSERT","SYSUPDAYS"]
  managed_system_group: "*NT_SYSTEM"
- name: "NTMEMORY"
  datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC021/datasets"
  labels: ["ORIGINNODE"]
  metrics: ["AVAILBTMEM","COMMBYTE", "TOTMEMBYTE","CACHEBTS","MEMUPCT","AVAILPCT","CACHEPCT"]
  managed_system_group: "*NT_SYSTEM"
rafal-szypulka commented 3 years ago

Hi, try increasing the TEPS heap size: https://www.ibm.com/support/pages/resolving-teps-ewas-memory-issues-increasing-jvm-heap-size (especially if you find OOM errors in the TEPS eWAS logs).

ruohki commented 3 years ago

The machine is a 4-core, 8 GB RAM RHEL 8 box. I did increase the heap to 2 GB and still ran into this issue. Nothing is really being monitored yet, but judging from itm_scrape_duration_seconds I think the Windows OS agent might be the bottleneck here for some reason.

I also use the other data source you provide (the Grafana plugin) at the same time.
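
Since itm_scrape_duration_seconds already carries a group label, a simple alerting rule can flag which attribute group is slow to collect. This is only a sketch; the rule group name, the alert name, the 30-second threshold, and the severity label are assumptions, not values from this issue:

groups:
  - name: itm_exporter_rules                     # hypothetical rule group name
    rules:
      - alert: ITMGroupScrapeSlow
        expr: itm_scrape_duration_seconds > 30   # assumed threshold in seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ITM group {{ $labels.group }} took {{ $value }}s to collect"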

ruohki commented 3 years ago

To finish this off: I think the issue is related to the grafana-apm-datasource. After a few queries the system locks up and the TEPS API can't be queried anymore. Heap usage is fine.

ruohki commented 3 years ago

Okay, turns out the issue is neither the plugin nor the TEPS... drumroll... the NT agent is absolute garbage and ramps CPU load up to 50% after 1-2 queries.