go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang

[Performance] A render query with many metrics takes a very long time (at least a minute) #823

Open interfan7 opened 5 months ago

interfan7 commented 5 months ago

Problem description: a render query for a.*b*.c.d.* times out after a minute.

carbonapi's version: v0.16.0

Did this happen before: not applicable (I didn't run an earlier version).

carbonapi's config: will be provided if relevant.

Backend software and config: go-carbon 0.17.3. Its config will be provided if needed.

carbonapi performance metrics:

  1. Please enable expvars in the config and provide its output. I believe this is the required expvar config (how the output can be collected is sketched after this list):

    expvar:
      enabled: true
      pprofEnabled: true
      listen: ""

    I'm not sure which output this question refers to.

  2. What's the system load on the server? What's the memory consumption? top doesn't indicate any kind of throttling. Userspace CPU load is somewhat higher during the query, but still very low. Memory, per free -h:

    me@graphite-dev:~$ free -h
              total        used        free      shared  buff/cache   available
    Mem:           30Gi       3.1Gi       628Mi       1.0Mi        27Gi        27Gi
    Swap:            0B          0B          0B

    This is our dev host.

  3. If possible, please provide the profiler's output (https://blog.golang.org/pprof explains how; the debug handlers can be enabled in carbonapi's config on a separate port). Go supports several profile types; CPU, memory, and lock profiles are usually the most useful, depending on the symptoms. Two CPU profiles captured during the request are attached (see the capture sketch after this list): curl_cpu_3 curl_cpu_4

  4. Debug logs could be helpful as well. Setting the log level to debug has a severe impact on performance, so for now I've set it to "info". The only two relevant log entries for this request (grepped for a\.\*b\*\.c\.d\.\*) are:

    "level":"ERROR","timestamp":"2024-02-18T13:51:21.454Z","logger":"access","message":"request failed","data":{"handler":"render","carbonapi_uuid":"d1118526-3647-4c52-99ea-49d19b549f74","url":"/render","peer_ip":"10.42.136.247","host":"{our graphite dev host}:4040","format":"json","use_cache":true,"targets":["a.*b*.c.d.*"],"cache_timeout":180,"runtime":59.89556336,"http_code":503,"reason":"timeout while fetching Response","from":1708263920,"until":1708264222,"from_raw":"1708263920","until_raw":"1708264222","uri":"/render","from_cache":false,"used_backend_cache":false,"request_headers":{"X-Grafana-Org-Id":"1","X-Panel-Id":"1"}}}

    and an entry some tens of MB long, which starts like this:

    {"level":"WARN","timestamp":"2024-02-18T13:52:40.013Z","logger":"zipper","message":"errors occurred while getting results","type":"protoV2Group","name":"http://127.0.0.1:8080","type":"fetch","request":"&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{Name:
    {{  removed for brevity }}
    ,"carbonapi_uuid":"d1118526-3647-4c52-99ea-49d19b549f74","errors":"max tries exceeded","errorsVerbose":"max tries exceeded\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: error while fetching Response\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:34\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"
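
Regarding the expvar output asked for in item 1, here is a minimal sketch of how it could be collected, assuming the debug handlers are served on carbonapi's main HTTP port via the standard Go /debug/vars handler (the port below is carbonapi's default and is an assumption here; adjust it to match the listen setting):

    # Dump carbonapi's internal expvar counters as JSON
    curl -s http://127.0.0.1:8081/debug/vars > carbonapi_expvars.json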
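
For the profiles in item 3, a sketch of how CPU and heap profiles can be captured through the standard Go net/http/pprof endpoints while the slow render request is in flight (the port is an assumption, as above):

    # 30-second CPU profile taken while the slow /render request is running
    curl -s -o cpu.pprof 'http://127.0.0.1:8081/debug/pprof/profile?seconds=30'
    # Heap profile, in case memory usage becomes a concern
    curl -s -o heap.pprof 'http://127.0.0.1:8081/debug/pprof/heap'
    # Show the hottest functions from the CPU profile
    go tool pprof -top cpu.pprof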

Query that causes problems

  1. Grafana issues the query. It's a render query whose target is a.*b*.c.d.* (a direct reproduction is sketched after this list).
  2. According to the go-carbon logs, ~65.5K files (i.e. metrics) match and are returned.
  3. Resolution: I think it's irrelevant here.
  4. Number of datapoints: also irrelevant, to my mind?
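
For completeness, the failing request can be reproduced against carbonapi directly, bypassing Grafana and nginx (the port and time range below are assumptions):

    # Hypothetical direct reproduction of the slow render query
    curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
      'http://127.0.0.1:8081/render?target=a.*b*.c.d.*&from=-5min&format=json'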

Additional context: The carbonapi log reports status code 503. The nginx reverse proxy logs code 499, which means the client closed the connection before a response was sent. Either way, one minute feels far too long for this query.
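
As a possible mitigation while the slowness itself is investigated, the one-minute ceiling presumably corresponds to a configured timeout somewhere in the chain. A sketch of the carbonapi knob involved, with assumed values (the exact keys depend on the carbonapi version and config layout; nginx's proxy read timeout in front would need to be raised as well):

    # carbonapi.yaml (excerpt, values are assumptions)
    upstreams:
      timeouts:
        find: "10s"
        render: "120s"
        connect: "200ms"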