go-graphite / carbonzipper

proxy to transparently merge graphite carbon backends

Q: troubleshoot carbonzipper -> carbonserver (via go-carbon) timeouts? #45

Closed blysik closed 7 years ago

blysik commented 7 years ago

Hi,

First, I love this project! I'm in the middle of implementing this, and currently have this setup:

grafana -> carbonapi -> carbonzipper -> carbonserver on 3 servers

This morning I've seen a few timeouts being reported by carbonzipper. Nothing in carbonserver stats or logs that I can see.

Where should I be looking to see what's going on here? Thanks!

deniszh commented 7 years ago

Do you use the trigram index in go-carbon? (i.e., what's "max-globs" in your go-carbon config?)

blysik commented 7 years ago
[carbonserver]
listen = ":8080"
enabled = true
buckets = 10
max-globs = 100
metrics-as-counters = false
read-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "5m0s"

I don't have trigram-index defined, but I'm guessing it defaults to true. The carbonserver hosts appear to have plenty of free memory (72G each).
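
If you want to be explicit rather than rely on the default, the trigram index is toggled in the same [carbonserver] section. A minimal sketch (option name as in go-carbon's example config; exact defaults may vary by version):

[carbonserver]
listen = ":8080"
enabled = true
# explicitly enable the trigram index used to speed up find/glob requests
trigram-index = true
max-globs = 100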

azhiltsov commented 7 years ago

One of the reasons for time-outs could be heavy find requests containing a weakly matching glob. If the trigram index returns many results, carbonserver then tries to narrow the list down by matching the glob against them. We see this quite often with Grafana templates, where something like bla.bla.* internally matches hundreds of thousands of metrics after the trigram pass and the glob then reduces it to hundreds, but this is very costly. Another possible reason is queries for a huge amount of data, for instance 5 days of per-second data over thousands of metrics. In both of these cases you might try increasing the afterStarted: "2s" timeout setting on the zipper, but doing so only hides the problem. The latest versions of go-carbon, zipper and api have comprehensive logging; with it enabled you can match the timed-out query on the zipper with the actual queries on go-carbon.
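
For illustration, the setting being referred to lives in the zipper's timeouts block (see the full config below); a minimal sketch, where "10s" is just an example of a larger value, not a recommendation:

timeouts:
    global: "10s"
    # raising this above the 2s default masks slow queries rather than fixing them
    afterStarted: "10s"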

Civil commented 7 years ago

Can you also post the full carbonzipper config?

blysik commented 7 years ago
listen: "localhost:8070"
maxProcs: 0
graphite:
    host: "graphitehost.example.com:2003"
    interval: "60s"
    prefix: "carbon.zipper"

buckets: 10

timeouts:
    global: "10s"

    afterStarted: "2s"

concurrencyLimit: 0

maxIdleConnsPerHost: 100

# If not zero, enables the cache for find requests
# This parameter controls when it will expire (in seconds)
# Default: 600 (10 minutes)
expireDelaySec: 10

backends:
    - "http://example1.com:8080"
    - "http://example2.com:8080"
    - "http://example3.com:8080"

carbonsearch:
    # Instance of carbonsearch backend
    #backend: "http://127.0.0.1:8070"
    # carbonsearch prefix to reserve/register
    #prefix: "virt.v1.*"

graphite09compat: false

logger:
   -
       logger: ""
       file: "/var/log/carbonzipper.log"
       level: "info"
       encoding: "json"

Civil commented 7 years ago

carbonzipper always assigns a carbonzipper UUID, and also passes the carbonapi UUID through the pipeline. That might help you a bit in debugging this issue: you can take both UUIDs and search for them in the go-carbon log to see whether it actually received the requests or not.
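
For example, with JSON logging enabled on both ends, you can take the UUID of a timed-out request from the zipper access log and grep for it in the go-carbon log (the go-carbon log path below is an assumption; adjust it to your setup):

# <uuid> is the value taken from /var/log/carbonzipper.log for the timed-out request
grep '<uuid>' /var/log/go-carbon/go-carbon.log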

blysik commented 7 years ago

So I noticed some slow queries in the carbonzipper logs that created timeouts. I bumped afterStarted to 10s and started to tail the log again.

There are definitely slow queries that take between 1 and 3 seconds, with even the occasional 6-second query. This is going to help greatly in troubleshooting.

Thanks for the suggestion of increasing afterStarted. Now I can start trying to drill down to figure out why I get occasional slow queries from carbonserver on the backend.

blysik commented 7 years ago

Going deeper down my stack, I've discovered that one of my three graphite servers storing the whisper files has significantly higher disk write utilization (as reported by iostat). I believe this is related.
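
For reference, per-device write load can be compared across the servers with iostat's extended statistics, e.g.:

# print extended device stats every 5 seconds; compare w/s and %util between the three backends
iostat -x 5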

I'm going to close this out, as you answered the question and enabled my research to continue. Thanks again.