go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang
Other
309 stars 140 forks source link

Can't make it to work with graphite-clickhouse #392

Closed blind-oracle closed 5 years ago

blind-oracle commented 5 years ago

As I already told @Civil in some other thread somewhere I wasn't able to make this carbonapi with graphite-clickhouse backend some months ago.

I tried again today with same result. I used the carbonapi.example.clickhouse.yaml config as base and can't get any output.

On request from Grafana it just hangs for 50s and then returns empty set:

2019-06-22T17:55:55.190+0200    DEBUG   zipper  will try to fetch data  {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out"]}
2019-06-22T17:55:55.190+0200    DEBUG   zipper  will do query with timeout  {"type": "broadcastGroup", "groupName": "root", "type": "find", "request": ["SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out"], "timeout": 10}
2019-06-22T17:55:55.190+0200    DEBUG   zipper  waiting for a slot  {"type": "broadcastGroup", "groupName": "root", "type": "find", "request": ["SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out"], "group_name": "root", "client_name": "ch"}
2019-06-22T17:55:55.190+0200    DEBUG   zipper  got a slot  {"type": "broadcastGroup", "groupName": "root", "type": "find", "request": ["SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out"], "group_name": "root", "client_name": "ch"}
2019-06-22T17:55:55.190+0200    DEBUG   zipper  picked server   {"type": "protoV2Group", "name": "ch", "action": "query", "server": "http://127.0.0.1:9090"}
2019-06-22T17:55:55.190+0200    DEBUG   zipper  trying to get slot  {"type": "protoV2Group", "name": "ch", "action": "query", "server": "http://127.0.0.1:9090", "name": "ch", "uri": "http://127.0.0.1:9090/metrics/find/?format=protobuf&query=SC.PROD.%7BAT%2CNFT%2COT%2CUFT%2CVT%7D.%2A.%2A.%2A%5Brs%5D%2A.network.nic.bond3.bw_out"}
2019-06-22T17:55:55.190+0200    DEBUG   zipper  got slot    {"type": "protoV2Group", "name": "ch", "action": "query", "server": "http://127.0.0.1:9090", "name": "ch", "uri": "http://127.0.0.1:9090/metrics/find/?format=protobuf&query=SC.PROD.%7BAT%2CNFT%2COT%2CUFT%2CVT%7D.%2A.%2A.%2A%5Brs%5D%2A.network.nic.bond3.bw_out"}
2019-06-22T17:56:45.190+0200    WARN    zipper  timeout waiting for more responses  {"type": "broadcastGroup", "groupName": "root", "type": "find", "request": ["SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out"], "no_answers_from": ["ch"]}
2019-06-22T17:56:45.190+0200    WARN    zipper  Find request failed when resolving globs    {"type": "broadcastGroup", "groupName": "root", "metric_name": "SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out", "errors": [{"error": "timeout while fetching Response"}, {"error": "failed to fetch response from the server root"}]}
2019-06-22T17:56:45.190+0200    INFO    render  empty response or no response   {"carbonapi_uuid": "c10edd20-f47c-4ec5-8ad5-e60600b931d7", "username": ""}
2019-06-22T17:56:45.190+0200    INFO    access  request served  {"data": {"handler":"render","carbonapi_uuid":"c10edd20-f47c-4ec5-8ad5-e60600b931d7","url":"/render","peer_ip":"10.91.164.212","host":"10.91.2.59:8888","format":"json","use_cache":true,"targets":["aliasByNode(consolidateBy(removeAboveValue(removeBelowValue(SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out, 0), 42949672960), 'max'), 5)"],"cache_timeout":60,"metrics":["SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out"],"have_non_fatal_errors":true,"runtime":50.000789624,"http_code":200,"from":1561132555,"until":1561218955,"from_raw":"-24h","until_raw":"now","uri":"/render","from_cache":false,"zipper_requests":1}}
2019-06-22T17:56:45.190+0200    WARN    slow    Slow Request    {"time": 50.000889425, "url": "/render", "referer": ""}

On the graphite-clickhouse side I just get one query:

[2019-06-22T17:55:55.269+0200] INFO [query] query {"query": "SELECT Path FROM graphite_tree WHERE (Level = 10) AND (Path LIKE 'SC.PROD.%') AND (match(Path, '^SC[.]PROD[.](AT|NFT|OT|UFT|VT)[.]([^.]*?)[.]([^.]*?)[.]([^.]*?)[rs]([^.]*?)[.]network[.]nic[.]bond3[.]bw_out[.]?$')) AND (Deleted = 0) GROUP BY Path", "request_id": "b97e5b7f7f79a0fe699facf122fdfd48", "time": 0.078374903}
[2019-06-22T17:55:55.269+0200] INFO access {"request_id": "b97e5b7f7f79a0fe699facf122fdfd48", "time": 0.078648707, "method": "GET", "url": "/metrics/find/?format=protobuf&query=SC.PROD.%7BAT%2CNFT%2COT%2CUFT%2CVT%7D.%2A.%2A.%2A%5Brs%5D%2A.network.nic.bond3.bw_out", "peer": "127.0.0.1:44472", "status": 200}

What am I doing wrong? bookingcom fork works without any issues.

Thanks in advance.

Civil commented 5 years ago

Can you show what your config looks like?

Civil commented 5 years ago

Usually it happens when you have a 0 of max requests per backend, as in bookingcom's fork they threat it as "unlimited" but here it's treated as "0" as of now (which is probably should be fixed, as having 0 requests per backend makes no sense)

blind-oracle commented 5 years ago

Sure.

listen: "0.0.0.0:8888"

concurency: 20
cache:
   type: "mem"
   size_mb: 0
   defaultTimeoutSec: 60

cpus: 0
tz: ""

maxBatchSize: 0

graphite:
    host: ""
    interval: "60s"
    prefix: "carbon.api"
    pattern: "{prefix}.{fqdn}"

idleConnections: 1000
pidFile: ""

upstreams:
    buckets: 10
    timeouts:
        find: "10s"
        render: "50s"
        connect: "200ms"

    concurrencyLimitPerServer: 1000
    keepAliveInterval: "30s"
    maxIdleConnsPerHost: 1000

    backendsv2:
        backends:
          -
            groupName: "ch"
            protocol: "carbonapi_v2_pb"
            lbMethod: "rr"
            maxTries: 3
            maxBatchSize: 0
            keepAliveInterval: "30s"
            concurrencyLimit: 1000
            maxIdleConnsPerHost: 1000

            servers:
                - "http://127.0.0.1:9090"

expireDelaySec: 10

logger:
    - logger: ""
      file: "stderr"
      level: "debug"
      encoding: "console"
      encodingTime: "iso8601"
      encodingDuration: "seconds"
    - logger: ""
      file: "carbonapi.log"
      level: "info"
      encoding: "json"

Usually it happens when you have a 0 of max requests per backend

What parameter is this?

Civil commented 5 years ago

And what's in the logs when carbonapi start?

And the config from http://localhost:8888/debug/vars after you do couple of requests?

About the option, I'll check when I got home.

blind-oracle commented 5 years ago

And what's in the logs when carbonapi start?

2019-06-22T20:11:48.613+0200    INFO    main    will parse config as yaml   {"config_file": "/opt/carbonapi.yml"}
2019-06-22T20:11:48.617+0200    ERROR   functionInit    failed to read config file  {"function": "aliasByPostgres", "error": "Config File \"config\" Not Found in \"[]\""}
2019-06-22T20:11:48.617+0200    WARN    functionInit    aliasByPostgres config found but aliasByPostgres is disabled    {"function": "aliasByPostgres"}
2019-06-22T20:11:48.617+0200    WARN    functionInit    no config file specified    {"function": "graphiteWeb", "message": "this function requrires config file to work properly"}
2019-06-22T20:11:48.617+0200    INFO    main    starting carbonapi  {"build_version": "0.11.0-490-g5a4f", "config": "{\"ExtrapolateExperiment\":false,\"Logger\":[{\"logger\":\"\",\"file\":\"stderr\",\"level\":\"debug\",\"encoding\":\"console\",\"encoding-time\":\"iso8601\",\"encoding-duration\":\"seconds\",\"sample-tick\":\"\",\"sample-initial\":0,\"sample-thereafter\":0},{\"logger\":\"\",\"file\":\"carbonapi.log\",\"level\":\"info\",\"encoding\":\"json\",\"encoding-time\":\"\",\"encoding-duration\":\"\",\"sample-tick\":\"\",\"sample-initial\":0,\"sample-thereafter\":0}],\"Listen\":\"0.0.0.0:8888\",\"Buckets\":10,\"Concurency\":20,\"Cache\":{\"Type\":\"mem\",\"Size\":0,\"MemcachedServers\":[],\"DefaultTimeoutSec\":60},\"Cpus\":0,\"TimezoneString\":\"\",\"UnicodeRangeTables\":null,\"Graphite\":{\"Pattern\":\"{prefix}.{fqdn}\",\"Host\":\"\",\"Interval\":60000000000,\"Prefix\":\"carbon.api\"},\"IdleConnections\":1000,\"PidFile\":\"\",\"SendGlobsAsIs\":false,\"AlwaysSendGlobsAsIs\":false,\"MaxBatchSize\":0,\"Zipper\":\"\",\"Upstreams\":{\"ConcurrencyLimitPerServer\":1000,\"MaxIdleConnsPerHost\":1000,\"Backends\":null,\"BackendsV2\":{\"Backends\":[{\"GroupName\":\"ch\",\"Protocol\":\"carbonapi_v2_pb\",\"LBMethod\":\"all\",\"Servers\":[\"http://127.0.0.1:9090\"],\"Timeouts\":null,\"ConcurrencyLimit\":1000,\"KeepAliveInterval\":30000000000,\"MaxIdleConnsPerHost\":1000,\"MaxTries\":3,\"MaxBatchSize\":0}],\"MaxIdleConnsPerHost\":0,\"ConcurrencyLimitPerServer\":0,\"Timeouts\":{\"Find\":0,\"Render\":0,\"Connect\":0},\"KeepAliveInterval\":0,\"MaxTries\":0,\"MaxBatchSize\":0},\"MaxBatchSize\":0,\"MaxTries\":0,\"CarbonSearch\":{\"Backend\":\"\",\"Prefix\":\"virt.v1.*\"},\"CarbonSearchV2\":{\"Backends\":null,\"MaxIdleConnsPerHost\":0,\"ConcurrencyLimitPerServer\":0,\"Timeouts\":{\"Find\":0,\"Render\":0,\"Connect\":0},\"KeepAliveInterval\":0,\"MaxTries\":0,\"MaxBatchSize\":0,\"Prefix\":\"\"},\"ExpireDelaySec\":0,\"InternalRoutingCache\":600000000000,\"Timeouts\":{\"Find\":10000000000,\"Render\":50000000000,\"Connect\":200000000},\"KeepAliveInterval\":30000000000},\"ExpireDelaySec\":10,\"GraphiteWeb09Compatibility\":false,\"IgnoreClientTimeout\":false,\"DefaultColors\":null,\"GraphTemplates\":\"\",\"FunctionsConfigs\":{}}"}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  initializing zipper
2019-06-22T20:11:48.618+0200    DEBUG   zipper  creating lb group   {"name": "ch", "servers": ["http://127.0.0.1:9090"], "type": "all"}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  created broadcast group {"type": "broadcastGroup", "groupName": "ch", "group_name": "ch", "clients": ["http://127.0.0.1:9090"]}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  created broadcast group {"type": "broadcastGroup", "groupName": "root", "group_name": "root", "clients": ["ch"]}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  zipper config   {"config": {"ConcurrencyLimitPerServer":1000,"MaxIdleConnsPerHost":1000,"Backends":null,"BackendsV2":{"Backends":[{"GroupName":"ch","Protocol":"carbonapi_v2_pb","LBMethod":"all","Servers":["http://127.0.0.1:9090"],"Timeouts":{"Find":10000000000,"Render":50000000000,"Connect":200000000},"ConcurrencyLimit":1000,"KeepAliveInterval":30000000000,"MaxIdleConnsPerHost":1000,"MaxTries":3,"MaxBatchSize":0}],"MaxIdleConnsPerHost":0,"ConcurrencyLimitPerServer":0,"Timeouts":{"Find":10000000000,"Render":50000000000,"Connect":200000000},"KeepAliveInterval":0,"MaxTries":0,"MaxBatchSize":0},"MaxBatchSize":0,"MaxTries":0,"CarbonSearch":{"Backend":"","Prefix":"virt.v1.*"},"CarbonSearchV2":{"Backends":null,"MaxIdleConnsPerHost":0,"ConcurrencyLimitPerServer":0,"Timeouts":{"Find":0,"Render":0,"Connect":0},"KeepAliveInterval":0,"MaxTries":0,"MaxBatchSize":0,"Prefix":""},"ExpireDelaySec":0,"InternalRoutingCache":600000000000,"Timeouts":{"Find":10000000000,"Render":50000000000,"Connect":200000000},"KeepAliveInterval":30000000000}}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  doing request   {"type": "protoV2Group", "name": "http://127.0.0.1:9090", "function": "prober", "request": ["*"]}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  picked server   {"type": "protoV2Group", "name": "http://127.0.0.1:9090", "action": "query", "server": "http://127.0.0.1:9090"}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  trying to get slot  {"type": "protoV2Group", "name": "http://127.0.0.1:9090", "action": "query", "server": "http://127.0.0.1:9090", "name": "http://127.0.0.1:9090", "uri": "http://127.0.0.1:9090/metrics/find/?format=protobuf&query=%2A"}
2019-06-22T20:11:48.618+0200    DEBUG   zipper  got slot    {"type": "protoV2Group", "name": "http://127.0.0.1:9090", "action": "query", "server": "http://127.0.0.1:9090", "name": "http://127.0.0.1:9090", "uri": "http://127.0.0.1:9090/metrics/find/?format=protobuf&query=%2A"}
2019-06-22T20:11:48.627+0200    DEBUG   zipper  will return data    {"type": "protoV2Group", "name": "http://127.0.0.1:9090", "function": "prober", "tlds": ["SC", "RIEMANN", "carbon"]}

And the config from http://localhost:8888/debug/vars after you do couple of requests?

https://gist.github.com/blind-oracle/07f08e73f2ef28279947c2e64429b97d

Civil commented 5 years ago

Hm...

I can't reproduce your issue with my test setup... (I've tried your config as well)

However I've found couple of other smaller issues (e.x. maxBatchSize was hardcoded to "100" since https://github.com/go-graphite/carbonapi/commit/06f88803556e49ce22b85bfebadee0510385eca9). So I've fixed them in current master.

Could you try to build again from master and also enable debug logging (level="debug" in logging section)?

Also can you tell me to how many metrics your query actually resolves?

The option I was talking about was "concurrencyLimit", but in your case it's set to 1000, so it should be fine.

blind-oracle commented 5 years ago

@Civil

Could you try to build again from master

Checked out latest master (0.11.0-504-g98cc). BTW I'm building with nocairo if that's relevant. The result is the same.

enable debug logging (level="debug" in logging section)

It's already debug in your example yaml, so I left it as is.

Also can you tell me to how many metrics your query actually resolves?

Now I'm testing with just one metric with 1440 values over a requested period. So it's not a lot.

Funny thing that it does HTTP call to graphite-clickhouse only on start:

GET /metrics/find/?format=protobuf&query=%2A HTTP/1.1
Host: 127.0.0.1:9090
User-Agent: Go-http-client/1.1
X-Ctx-Carbonapi-Uuid: 47c16f97-11f4-4111-ab89-6b87fe806d46
Accept-Encoding: gzip

HTTP/1.1 200 OK
Date: Sun, 23 Jun 2019 09:06:59 GMT
Content-Length: 30
Content-Type: application/octet-stream

.*..
.SC.
.RIEMANN..
.carbon

Then when I request the actual metrics it hangs on waiting for slot and then times out. But at start it goes further to got slot.

Also at start it works with "type": "protoV2Group", "name": "http://127.0.0.1:9090", but with actual request it goes to "type": "broadcastGroup", "groupName": "root"

blind-oracle commented 5 years ago

Ah ok I got it, for some reason it was now set to broadcast instead of round-robin as I specified in the config above. When I changed to rr it started working 😕

But why? My guess is that with 1 server rr should be the same as all, no?

Civil commented 5 years ago

It shouldn't matter. In fact my default clickhouse test setup is using "broadcast" mode and current test setup for Prometheus is using "rr".

So there is something else that I'm missing here and therefore still cannot reproduce.

I've currently tagged current master as "0.12.0-rc.0" and it also include last commit where I've tried to unify all the fetching logic. I'll have a closer look on how the rate limiter is working at this moment, as if there is a deadlock somewhere it's likely be there.

blind-oracle commented 5 years ago

Я сейчас специально проверил на всякий случай. В том конфиге который я выше постил если изменить rr на all - работать перестает. Меняешь обратно - продолжает работать. Больше ничего не трогал.

В любом случае, спасибо - по крайней мере сервис теперь можно использовать. Судя по моим нескольким тестам, он лучше заменяет graphite-web чем букинговый форк. По крайней мере часть графиков в их форке у меня выводилать не совсем так как надо.

Если нужно будет помочь с тестированием - дай знать. Руки дойдут - может тоже поищу проблему...

ЗЫ: О, Цюрих, привет сосед :)

Civil commented 5 years ago

Ок, посмотрю как смогу, может действительно все дело в all, у меня то в конфиге broadcast который должен был бы быть синонимом.

Про функции не удивительно, у меня их и больше поддерживается и я старался исправлять ошибки совместимости, где находил.

С тестированием - багрепорты всегда полезны. Так у меня нет продакшена где бы я мог использовать графит, то багрепорты - лучший источник ошибок.

ЗЫ: Ага, есть такое дело)

blind-oracle commented 5 years ago

@Civil

В общем проблема понятна. При инициализации он создает две иерархические броадкаст-группы:

2019-06-24T10:33:31.467+0200    DEBUG   zipper  created broadcast group {"type": "broadcastGroup", "groupName": "ch", "group_name": "ch", "clients": ["http://127.0.0.1:9090"], "max_batch_size": 0}
2019-06-24T10:33:31.467+0200    DEBUG   zipper  created broadcast group {"type": "broadcastGroup", "groupName": "root", "group_name": "root", "clients": ["ch"], "max_batch_size": 0}

У вышестоящей root один клиент - группа серверов (в моем случае ch), у нижестоящей - уже сами сервера (http://127.0.0.1:9090).

При входе в группу root лимитер пытается получить слот по имени сервера http://127.0.0.1:9090, хотя при создании группы лимитеру было передано только ch.

2019-06-24T10:33:37.006+0200    DEBUG   zipper  waiting for slot    {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["SC.PROD.{AT,NFT,OT,UFT,VT}.*.*.*[rs]*.network.nic.bond3.bw_out"], "client_name": "http://127.0.0.1:9090", "maxConns": 1000}

И он таймаутится по контексту при записи в канал:

    select {
    case sl.m[s] <- struct{}{}:
        return nil
    case <-ctx.Done():
        return errors.New("timeout exceeded")
    }

Вообще, по идее, на sl.m[s] он должен тупо паниковать, так как переданного ключа s в мапе нет, но, возможно, я что-то не знаю о Go :)

В качестве workaround я назвал группу и сервер одинаково (http://127.0.0.1:9090) - и все заработало в броадкасте. Дальше копать пока времени нет, возможно у тебя быстрее получится.

Civil commented 5 years ago

Спасибо, посмотрю. Возможно в этом и причина.

blind-oracle commented 5 years ago

Спасибо! А про мапы и панику это у меня Го с Питоном и его KeyError в голове смешались... Так что оно совершенно законно не паникует, а пытается писать вникуда.

Civil commented 5 years ago

Кажется я нашел проблему, теоретически должно поправится в мастере.

Проблема была в методе Children() который возвращал не своих детей.

blind-oracle commented 5 years ago

Подтверждаю, теперь все работает. Спасибо!