itm_exporter timeout error

romulolibralon commented 4 years ago

Hi,

I'm facing this error when I try to run itm_exporter with many servers tied to the managed system group. Is there a timeout limit? I changed the timeout parameter in the scrape config to 300s.

I ran the same request with curl and got the result in a few seconds.

Is there a timeout setting in itm_exporter?

```
curl -u sysadmin:xxxxxxx "http://localhost:15200/ibm/tivoli/rest/providers/itm.itm.brhocam01ir00uj/datasources/TMSAgent.%25IBM.STATIC134/datasets/MetricGroup.KLZNET/items?param_SourceToken=teste_prometheus&optimize=true&param_refId=br61qrufosfv8knt1p0g&properties=all"
```

```
[root@brhodsh01srsprj itm_exporter]# ./itm_exporter export
time="2020-05-25T16:32:40-03:00" level=info msg="Starting itm_exporter in export mode..."
time="2020-05-25T16:32:40-03:00" level=info msg="Author: Rafal Szypulka"
time="2020-05-25T16:32:40-03:00" level=info msg="itm_exporter listening on port::8000"
time="2020-05-25T16:33:43-03:00" level=error msg="Get http://129.39.186.133:15200/ibm/tivoli/rest/providers/itm.itm.brhocam01ir00uj/datasources/TMSAgent.%25IBM.STATIC134/datasets/MetricGroup.KLZNET/items?param_SourceToken=teste_prometheus&optimize=true&param_refId=br61qrufosfv8knt1p0g&properties=all: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x884a3f]

goroutine 29 [running]:
main.MakeAsyncRequest(0xc0000dc700, 0xf6, 0xc0002aa800, 0x6, 0xc000084900)
	/Users/rafalszypulka/goprojects/src/itm_exporter/main.go:119 +0x2bf
created by main.ITMCollector.Collect
	/Users/rafalszypulka/goprojects/src/itm_exporter/main.go:262 +0x19d3
```

romulolibralon commented 4 years ago

I increased the timeout in config.yaml and it is working fine now.

romulolibralon commented 4 years ago

Now I have this issue: Get "http://localhost:8000/metrics": context deadline exceeded

rafal-szypulka commented 4 years ago

Do you still need help with this? What do you have in the logs? What is the /metrics output?

On Mon, May 25, 2020 at 10:10 PM romulolibralon notifications@github.com wrote:

> Now i have this issue: Get "http://localhost:8000/metrics": context deadline exceeded


romulolibralon commented 4 years ago

When I put too many servers in the managed system group, the exporter can't start. I changed the collection timeout and that solved the problem. Now I can get the metrics with curl http://localhost:8000/metrics, but Prometheus shows the error "Get "http://localhost:8000/metrics": context deadline exceeded". Is there some configuration to do on the Prometheus side?

rafal-szypulka commented 4 years ago

It most probably means that the scrape request timed out. What is your scrape_timeout in Prometheus for the itm-exporter job? In the sample config I shared, it is 45s, which is already quite high:

```yaml
scrape_configs:
  - job_name: 'itm-exporter'
    scrape_interval: 60s
    scrape_timeout: 45s
```

If you configured it differently, try setting it to 45-50s. Are your exporter and Prometheus in the same local network as the TEPS server? Try to isolate the problematic attribute group: metrics like itm_scrape_duration_seconds show how long (in seconds) it took to collect each attribute group.
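To isolate the slow attribute group, the itm_scrape_duration_seconds samples can be pulled out of the /metrics output and sorted from slowest to fastest. This is a minimal sketch, assuming the standard Prometheus text format with one sample per line, no timestamps, and no spaces inside label values; the thread does not show the exporter's label names, so the whole label set is reported as-is.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"sort"
	"strconv"
	"strings"
)

// groupDuration pairs the label set of one itm_scrape_duration_seconds
// sample with its value in seconds.
type groupDuration struct {
	Labels  string
	Seconds float64
}

// slowestGroups scans Prometheus text-format output and returns the
// itm_scrape_duration_seconds samples sorted from slowest to fastest.
// Simplified parser: one sample per line, no timestamps, no spaces in labels.
func slowestGroups(metrics string) ([]groupDuration, error) {
	const name = "itm_scrape_duration_seconds"
	var out []groupDuration
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, name) {
			continue // # HELP / # TYPE comments and unrelated metrics
		}
		rest := line[len(name):]
		// Skip metrics that only share the prefix, e.g. a "_count" series.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		fields := strings.Fields(rest)
		labels := ""
		if len(fields) == 2 {
			labels = fields[0]
		} else if len(fields) != 1 {
			return nil, fmt.Errorf("unsupported sample line: %q", line)
		}
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		out = append(out, groupDuration{Labels: labels, Seconds: v})
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Seconds > out[j].Seconds })
	return out, nil
}

func main() {
	// Reads the /metrics body from standard input.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	groups, err := slowestGroups(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse error:", err)
		os.Exit(1)
	}
	for _, g := range groups {
		fmt.Printf("%8.1fs %s\n", g.Seconds, g.Labels)
	}
}
```

A possible invocation is `curl -s http://localhost:8000/metrics | go run slowest.go` (the file name is arbitrary). The group at the top of the list is the first candidate to drop from config.yaml while testing.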


romulolibralon commented 4 years ago

```yaml
global:
  scrape_interval: 60s
  evaluation_interval: 20s

rule_files:
  - 'alert.rules'

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:80']
  - job_name: 'itm-exporter'
    scrape_interval: 300s
    scrape_timeout: 300s
    static_configs:
      - targets: ['x.x.x.x:8000']
```

The exporter is running on another server, because the TEPS server is a Unix system.

The issue occurs when I try to get data from a managed system group with too many servers.

rafal-szypulka commented 4 years ago

Is the exporter installed in the TEPS local network? Can you share the values of itm_scrape_duration_seconds for all configured attribute groups? How many agent instances do you have? Do you see any OutOfMemory errors in the TEPS SystemOut.log or SystemErr.log?

romulolibralon commented 4 years ago

An error has occurred while serving metrics:

96 error(s) occurred:

```
[root@hograd001ger itm_exporter]# vim /etc/prometheus/prometheus.yml

global:
  scrape_interval: 60s
  evaluation_interval: 20s

rule_files:
  - 'alert.rules'

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'itm-exporter'
    scrape_interval: 60s
    scrape_timeout: 60s
    static_configs:
      - targets: ['localhost:8000']
```

```
[root@hograd001ger itm_exporter]# vim config.yaml

itm_server_url: "http://:15200"
itm_server_user: "user"
itm_server_password: "pwd"
connection_timeout: 60
collection_timeout: 200
groups:
  - name: "KLZCPU"
    datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC134/datasets"
    labels: ["CPUID", "ORIGINNODE"]
    metrics: ["BUSYCPU", "IDLECPU", "SYSCPU", "USRCPU", "WAITCPU"]
    managed_system_group: "teste_prom"
  - name: "KLZVM"
    datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC134/datasets"
    labels: ["ORIGINNODE"]
    metrics: ["VSUSEDPCT", "VSFREEPCT", "MEMUSEDPCT", "MEMFREEPCT", "MEMTOT"]
    managed_system_group: "teste_prom"
  - name: "KLZDISK"
    datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC134/datasets"
    labels: ["ORIGINNODE", "DSKNAME", "MOUNTPT"]
    metrics: ["DSKFREEPCT", "DSKUSEDPCT", "DSKFREE", "DSKUSED", "INDFREEPCT"]
    managed_system_group: "teste_prom"
  - name: "KLZDSKIO"
    datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC134/datasets"
    labels: ["ORIGINNODE", "DSKNAME"]
    metrics: ["TPS", "WRITETIME"]
    managed_system_group: "teste_prom"
  - name: "msys"
    datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%26IBM.STATIC000/datasets"
    labels: ["ORIGINNODE", "PRODUCT", "AFFPRODUCT", "VERSION", "OSPLATFORM", "NETADDR", "HOSTNAME"]
    metrics: ["AVAILABLE"]
    managed_system_group: "*TEMS"
```

romulolibralon commented 4 years ago

I don't seem to be able to collect much data from TEPS. I already ran the test against a different TEPS, with only 3 servers in the managed system list:

Get "http://10.1.10.165:8000/metrics": context deadline exceeded

rafal-szypulka commented 4 years ago

This is wrong:

```
name: "KLZDSKIO"
datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC134/datasets"
labels: ["ORIGINNODE", "DSKNAME"]
metrics: ["TPS", "WRITETIME"]
managed_system_group: "teste_prom"
```

it should be:

```
name: "KLZDSKIO"
datasets_uri: "/providers/itm.TEMS/datasources/TMSAgent.%25IBM.STATIC134/datasets"
labels: ["ORIGINNODE", "DKNAME"]
metrics: ["TPS"]
managed_system_group: "teste_prom"
```

Use `itm_exporter listAttributes` to find the correct metric names.
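Misspelled names such as DSKNAME (instead of DKNAME) can be caught before starting the exporter by comparing each group's labels and metrics against the names that `itm_exporter listAttributes` reports. A hypothetical sketch: `unknownNames` is an illustrative helper, the valid list below contains only the three names from the corrected KLZDSKIO group rather than the full attribute list, and parsing listAttributes output is left out because its format is not shown in this thread.

```go
package main

import "fmt"

// unknownNames returns the configured names that do not appear in the
// list of valid attribute names, preserving configuration order.
func unknownNames(configured, valid []string) []string {
	known := make(map[string]bool, len(valid))
	for _, v := range valid {
		known[v] = true
	}
	var unknown []string
	for _, c := range configured {
		if !known[c] {
			unknown = append(unknown, c)
		}
	}
	return unknown
}

func main() {
	// Labels and metrics from the original KLZDSKIO group.
	configured := []string{"ORIGINNODE", "DSKNAME", "TPS", "WRITETIME"}
	// Partial list: only the names used in the corrected KLZDSKIO group.
	valid := []string{"ORIGINNODE", "DKNAME", "TPS"}
	fmt.Println(unknownNames(configured, valid)) // [DSKNAME WRITETIME]
}
```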

romulolibralon commented 4 years ago

The "URL" error is not related to this problem; this error occurs when I put many machines in the managed system list. I already tried without KLZDSKIO. TEPS is not showing any error, so I conclude that some other problem must be occurring. From what I saw in --help, diagnostic mode does not report any error either.

romulolibralon commented 4 years ago

Get "http://:8000/metrics": context deadline exceeded

rafal-szypulka commented 4 years ago

If it is a new problem, please open a new issue and include as many details as possible, especially: the exporter log collected when running with the '-v' option (verbose log), the exporter config (use a minimal set of attribute groups that causes the problem), the number of agent instances, and the curl /metrics output. BTW, diagnostic mode (the command itm_exporter test) is not meant for trace logging. I created it for offline conversion of ITM API responses (JSON) to Prometheus metrics; it is useful only if you don't have access to the ITM server.

rafal-szypulka / itm_exporter

itm_exporter timeout error #2