peimanja / artifactory_exporter

JFrog Artifactory Prometheus Exporter written in Go
Apache License 2.0

Response time from metrics page takes forever #71

Closed: t-kap closed this issue 2 years ago

t-kap commented 2 years ago

Overview of the Issue

Immediately after a restart, curl localhost:9531/metrics responds in about 25 seconds. But with each subsequent request the response time roughly doubles, ending up in an endless wait after 30 minutes or so.

[PROD] [22:10:49] root@zi-repo:~
# time curl localhost:9531/metrics
---
real    0m25.396s
user    0m0.005s
sys 0m0.010s

[PROD] [22:11:23] root@zi-repo:~
# time curl localhost:9531/metrics
---
real    0m45.626s
user    0m0.003s
sys 0m0.010s

[PROD] [22:12:44] root@zi-repo:~
# time curl localhost:9531/metrics
---
real    1m21.190s
user    0m0.005s
sys 0m0.010s

[PROD] [22:15:34] root@zi-repo:~
# time curl localhost:9531/metrics
---
real    2m28.466s
user    0m0.003s
sys 0m0.016s

[PROD] [22:18:29] root@zi-repo:~
# time curl localhost:9531/metrics
real    3m38.536s
user    0m0.009s
sys 0m0.015s

[PROD] [22:28:21] root@zi-repo:~
# time curl localhost:9531/metrics
real    6m58.838s
user    0m0.010s
sys 0m0.020s

Operating system and Environment details

Centos 7

Logs

[PROD] [22:10:41] root@zi-repo:~
# service artifactory_exporter.service status
Redirecting to /bin/systemctl status artifactory_exporter.service
● artifactory_exporter.service - Artifactory Exporter
   Loaded: loaded (/etc/systemd/system/artifactory_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-02-16 22:10:41 UTC; 3s ago
 Main PID: 26577 (artifactory_exp)
   CGroup: /system.slice/artifactory_exporter.service
           └─26577 /usr/bin/artifactory_exporter --artifactory.scrape-uri=http://localhost:8081/artifactory --log.level=debug --artifactory.timeout=15s

Feb 16 22:10:41 zi-repo systemd[1]: Started Artifactory Exporter.
Feb 16 22:10:41 zi-repo artifactory_exporter[26577]: level=info ts=2022-02-16T22:10:41.653Z caller=artifactory_exporter.go:30 msg="Starting artifactory_exporter" version="(version=v1.9.1, branch=refs/tags/v1.9.1, revision=07d2a646a...c7e3724ebf3db87)"
Feb 16 22:10:41 zi-repo artifactory_exporter[26577]: level=info ts=2022-02-16T22:10:41.654Z caller=artifactory_exporter.go:31 msg="Build context" context="(go=go1.14, user=, date=2021-02-01T10:00:53Z)"
Feb 16 22:10:41 zi-repo artifactory_exporter[26577]: level=info ts=2022-02-16T22:10:41.654Z caller=artifactory_exporter.go:32 msg="Listening on address" address=:9531

No errors in logs

Feb 16 22:20:29 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:28.791Z caller=system.go:66 msg="Fetching license stats"
Feb 16 22:20:29 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:28.791Z caller=utils.go:21 msg="Fetching http" path=http://localhost:8081/artifactory/api/system/license
Feb 16 22:20:29 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:28.793Z caller=security.go:23 msg="Fetching users stats"
Feb 16 22:20:29 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:28.793Z caller=utils.go:21 msg="Fetching http" path=http://localhost:8081/artifactory/api/security/users
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.816Z caller=security.go:20 msg="Counting users"
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.817Z caller=security.go:50 msg="Registering metric" metric=users realm=ldap value=1188
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.817Z caller=security.go:50 msg="Registering metric" metric=users realm=internal value=27
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.817Z caller=security.go:50 msg="Registering metric" metric=users realm=saml value=2
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.817Z caller=security.go:47 msg="Fetching groups stats"
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.817Z caller=utils.go:21 msg="Fetching http" path=http://localhost:8081/artifactory/api/security/groups
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.818Z caller=security.go:69 msg="Registering metric" metric=groups value=25
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.818Z caller=replication.go:29 msg="Fetching replications stats"
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.818Z caller=utils.go:21 msg="Fetching http" path=http://localhost:8081/artifactory/api/replications
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.820Z caller=replication.go:31 msg="Registering metric" metric=enabled repo=libs-release-local-recovered type=pull url= cron="0 0 0 * * ?" value=1
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.820Z caller=replication.go:31 msg="Registering metric" metric=enabled repo=old-manual-import type=pull url= cron="1 * * * * ?" value=0
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.820Z caller=replication.go:31 msg="Registering metric" metric=enabled repo=old-repo type=pull url= cron="0 * * * * ?" value=0
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.820Z caller=system.go:17 msg="Fetching health stats"
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.820Z caller=utils.go:21 msg="Fetching http" path=http://localhost:8081/artifactory/api/system/ping
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.823Z caller=system.go:24 msg="System ping returned OK"
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.823Z caller=system.go:41 msg="Fetching build stats"
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.823Z caller=utils.go:21 msg="Fetching http" path=http://localhost:8081/artifactory/api/system/version
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.824Z caller=storageinfo.go:45 msg="Fetching storage info stats"
Feb 16 22:20:31 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:20:31.824Z caller=utils.go:21 msg="Fetching http" path=http://localhost:8081/artifactory/api/storageinfo

regular "Registering metric messages" and "Converting size to bytes"

Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=utils.go:30 msg="Converting size to bytes"
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=utils.go:13 msg="Removing other characters to extract number from string"
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=utils.go:24 msg="Successfully converted string to number" string="12.38 KB" number=12.38
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=utils.go:49 msg="Successfully converted string to bytes" string="12.38 KB" value=12677.12
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=utils.go:13 msg="Removing other characters to extract number from string"
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=utils.go:24 msg="Successfully converted string to number" string=0% number=0
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=storage.go:116 msg="Registering metric" metric=repoUsed repo=rpm-test type=local package_type=npm value=9.21698304e+06
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=storage.go:119 msg="Registering metric" metric=repoFolders repo=rpm-test type=local package_type=npm value=0
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=storage.go:122 msg="Registering metric" metric=repoItems repo=rpm-test type=local package_type=npm value=1
Feb 16 22:26:12 zi-repo artifactory_exporter: level=debug ts=2022-02-16T22:26:12.469Z caller=storage.go:125 msg="Registering metric" metric=repoFiles repo=rpm-test type=local package_type=npm value=1

The system is not busy:

top - 21:52:57 up 19 days, 12:00,  2 users,  load average: 8.04, 7.57, 7.93
Tasks: 305 total,   1 running, 304 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.9 us,  3.7 sy,  0.0 ni, 83.9 id,  0.0 wa,  0.0 hi,  1.5 si,  0.0 st
KiB Mem : 36899112 total,   250124 free, 13836920 used, 22812068 buff/cache
KiB Swap:  1257468 total,  1246708 free,    10760 used. 22645980 avail Mem

After several days, "too many open files" errors appear in the logs. At that point lsof -p shows about 1000 open files.

Feb 16 15:20:41 zi-repo artifactory_exporter: 2022/02/16 15:20:41 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:42 zi-repo artifactory_exporter: 2022/02/16 15:20:42 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 5ms
Feb 16 15:20:42 zi-repo artifactory_exporter: 2022/02/16 15:20:42 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 10ms
Feb 16 15:20:42 zi-repo artifactory_exporter: 2022/02/16 15:20:42 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 20ms
Feb 16 15:20:42 zi-repo artifactory_exporter: 2022/02/16 15:20:42 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 40ms
Feb 16 15:20:42 zi-repo artifactory_exporter: 2022/02/16 15:20:42 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 80ms
Feb 16 15:20:43 zi-repo artifactory_exporter: 2022/02/16 15:20:43 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 160ms
Feb 16 15:20:43 zi-repo artifactory_exporter: 2022/02/16 15:20:43 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 320ms
Feb 16 15:20:43 zi-repo artifactory_exporter: 2022/02/16 15:20:43 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 640ms
Feb 16 15:20:44 zi-repo artifactory_exporter: 2022/02/16 15:20:44 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:45 zi-repo artifactory_exporter: 2022/02/16 15:20:45 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:46 zi-repo artifactory_exporter: 2022/02/16 15:20:46 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:47 zi-repo artifactory_exporter: 2022/02/16 15:20:47 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:48 zi-repo artifactory_exporter: 2022/02/16 15:20:48 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:49 zi-repo artifactory_exporter: 2022/02/16 15:20:49 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:50 zi-repo artifactory_exporter: 2022/02/16 15:20:50 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:51 zi-repo artifactory_exporter: 2022/02/16 15:20:51 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:52 zi-repo artifactory_exporter: 2022/02/16 15:20:52 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s
Feb 16 15:20:53 zi-repo artifactory_exporter: 2022/02/16 15:20:53 http: Accept error: accept tcp 0.0.0.0:9531: accept4: too many open files; retrying in 1s

After a restart, lsof shows:

[PROD] [22:37:02] root@zi-repo:~
# lsof -p 26577 |wc -l
61
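
For reference, the file-descriptor limit the exporter is actually running with can be checked directly (a quick sketch, assuming the current PID 26577 as above):

# "Max open files" shows the soft/hard fd limits of the running exporter process
cat /proc/26577/limits | grep 'Max open files'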

Any thoughts?

peimanja commented 2 years ago

Hey @t-kap, interesting issue. Do you have debug logging enabled by default, or was it just for this testing session? If it's on by default, that's not recommended, since it's too noisy and impacts performance.

t-kap commented 2 years ago

Yes, debug was enabled for the testing session only.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

alosenkov-infobip commented 2 years ago

Hi, @peimanja! Any chances to work this out?

alosenkov-infobip commented 2 years ago

I see there's a fresh release, v1.9.3; we will try it.

alosenkov-infobip commented 2 years ago

About 15 minutes after starting the upgraded exporter, the timeout is back:

$ time curl localhost:9531/metrics --connect-timeout 600

real    5m33.418s
user    0m0.002s
sys 0m0.032s

alosenkov-infobip commented 2 years ago

There is hope. It looks like our Prometheus server was scraping the exporter every 30s, while we had artifactory.timeout=15s in the exporter parameters. The exporter probably drowned in Prometheus requests: while it was still handling a previous scrape, new ones kept arriving.
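
A rough way to check this hypothesis (a sketch, assuming PID 26577 as in the status output above) is to watch how many TCP sockets the exporter process holds while Prometheus keeps scraping; a steadily growing count means new scrapes arrive faster than old ones finish:

# count of TCP sockets held by the exporter process (PID is an example)
watch -n 30 'lsof -p 26577 | grep -c TCP'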

For now we've got scrapes stable at 15-30s:

$ time curl localhost:9531/metrics --connect-timeout 600 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  175k    0  175k    0     0   8552      0 --:--:--  0:00:21 --:--:-- 44719

real    0m21.057s
user    0m0.008s
sys 0m0.008s

$ time curl localhost:9531/metrics --connect-timeout 600 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  175k    0  175k    0     0  11760      0 --:--:--  0:00:15 --:--:-- 41965

real    0m15.313s
user    0m0.005s
sys 0m0.009s

after setting:

1) artifactory.timeout=60s on the exporter side
2) the following timings on the Prometheus side:

scrape_interval: 120s
scrape_timeout: 90s
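
For context, a sketch of where these settings live on the Prometheus side, together with --artifactory.timeout=60s on the exporter's command line (job name and target are placeholders; scrape_timeout must not exceed scrape_interval):

# prometheus.yml (fragment)
scrape_configs:
  - job_name: 'artifactory_exporter'   # placeholder job name
    scrape_interval: 120s
    scrape_timeout: 90s
    static_configs:
      - targets: ['zi-repo:9531']      # placeholder target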

Hope it helps.