joemiller opened 4 years ago
I may have spoken too soon. I am able to get a successful response after setting up an isolated instance of 0.14 that is not receiving any other traffic and setting maxBatchSize to 0 within the backend block, albeit more slowly: about 60s compared to 10-15s.
I'll probably close this bug in a few minutes if I can replicate this success with the primary instance of carbonapi.
EDIT:
The next time I made the same request (after letting the 60s cache expire) it responded in ~2s with a 500:
{"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"zipper","message":"find request failed when resolving globs","type":"broadcastGroup","groupName":"root","metric_name":"foo.*.live.*.mysql_replica.lag_seconds.value","error":"timeout while fetching data"}
carbonapi-0-14-7b8dbf7cd-qbdsw carbonapi {"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"zipper","message":"errors occurred while getting results","type":"autoGroup","name":"http://go-carbon-1.go-carbon:8000","type":"protoV2Group","name":"http://go-carbon-1.go-carbon:8000_protobuf","type":"find","request":["foo.*.live.*.mysql_replica.lag_seconds.value"],"errors":"max tries exceeded: failed to fetch data from server/group: Get \"http://go-carbon-1.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded","errorsVerbose":"max tries exceeded\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:21\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: failed to fetch data from server/group\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:22\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: Get \"http://go-carbon-1.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded"}
{"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"zipper","message":"errors occurred while getting results","type":"autoGroup","name":"http://go-carbon-0.go-carbon:8000","type":"protoV2Group","name":"http://go-carbon-0.go-carbon:8000_protobuf","type":"find","request":["foo.*.live.*.mysql_replica.lag_seconds.value"],"errors":"max tries exceeded: failed to fetch data from server/group: Get \"http://go-carbon-0.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded","errorsVerbose":"max tries exceeded\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:21\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: failed to fetch data from server/group\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:22\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: Get \"http://go-carbon-0.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded"}
{"level":"ERROR","timestamp":"2020-08-04T00:02:47.373Z","logger":"access","message":"request failed","data":{"handler":"render","carbonapi_uuid":"5127fb89-9f6a-458c-93eb-c32120974d0b","url":"/render/?from=-5minutes&target=currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)","peer_ip":"10.48.5.1","host":"proxy","use_cache":true,"targets":["currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)"],"cache_timeout":60,"runtime":2.00074282,"http_code":500,"reason":"error or no response: no metrics in the Response","from":1596499065,"until":1596499365,"from_raw":"-5minutes","uri":"/render/?from=-5minutes&target=currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)","from_cache":false,"used_backend_cache":false,"request_headers":{}}}
{"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"slow","message":"Slow Request","time":2.000934282,"url":"https:///render/?from=-5minutes&target=currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)","referer":""}
Immediately after the failed 2nd request, the 3rd request succeeded in ~89 seconds.
EDIT 2:
After analyzing the parsed config struct that is printed during startup, I noticed a find timeout in the Upstreams/Backends section that was defaulting to 2 seconds. I think I resolved the 500 error that returns after 2 seconds by adding a find timeout to the global upstreams section:
```yaml
upstreams:
  buckets: 10
  timeouts:
    find: "300s"
    global: "300s"
    afterStarted: "120s"
    connect: "500ms"
```
It is difficult to know for sure, however, as the results have been a bit unpredictable overall.
EDIT 4:
It seems like the upstreams/timeouts/find setting has helped resolve the intermittent 500s that occur after 2 seconds.
However, overall performance is lower than 0.10. The fastest result I've achieved with 0.14 is ~45s, with most runs between 60-120s, compared to the 10-20s responses I often saw with 0.10. Not sure if there's any additional tuning I can do here though.
EDIT 5: sorry
From watching the queries coming into go-carbon via tcpdump, some observations:
With my 0.10 config I see a lot of small/fast queries for the individual timeseries, e.g.:
```
foo.A.a.value
foo.A.b.value
foo.B.a.value
foo.B.b.value
```
With the 0.14 config I see batches of queries from the "parent", e.g.:
```
foo.A.a.value, foo.A.b.value
foo.B.a.value, foo.B.b.value
```
I would think the batches would be faster, but that doesn't seem to be the case from my observations so far.
EDIT 6:
After observing that the batched requests appear to be issued serially, one at a time, I took a look at netstat and noticed there's only ever 1 connection from my carbonapi pod to the go-carbon backends for 0.14, whereas my 0.10 pods have many more connections. I've tried all sorts of combinations of the various concurrency settings, including leaving the defaults.
It seems you are right, there is a behavioral change in terms of "sendGlobsAsIs" since 0.10. Back then all requests were sent in parallel; now, after the split, they are sent one by one.
I'll try to think of a way to make the old-style config use the old behavior by default; in the new-style config (backendsv2) it will be configurable, with defaults that match current behavior.
EDIT:
Ah no, I'm wrong: while it's a different limiter now, requests are still sent in parallel, even to the same backend. This is regulated by concurrencyLimit, and with 0 all the requests should be sent out in parallel without any restrictions...
Ok, I think I found a place that was parallel in older versions of carbonapi but now isn't. I've added a flag that should enable the old behavior (and if you use the old config format, it should be set to true automatically).
Could you please give it a try and tell if it helps in your case?
I'm also not sure if I should have different limiter for that use case.
@Civil Thank you for your efforts!
Results using 0.14.0-67_amd64.deb from the autobuild repo:
(I am using the v2 config now, btw.) I tried a few different settings and the results were consistently in the ~34-42 second range. It's possible I don't have the right combination of settings. It seems like any or all of maxBatchSize, concurrencyLimit, sendGlobsAsIs, and the new doMultipleRequestsIfSplit might matter, but I'm not sure.
```yaml
upstreams:
  buckets: 10
  timeouts:
    find: "300s"
    global: "300s"
    afterStarted: "120s"
    connect: "500ms"
  # alwaysSendGlobsAsIs: true
  # maxBatchSize: 0
  # concurrencyLimit: 1000
  # concurency: 1000
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 100
  graphite09compat: false
  # new style backend config, carbonapi 0.13+
  # https://github.com/go-graphite/carbonapi/blob/master/doc/configuration.md
  backendsv2:
    concurrencyLimit: 0
    backends:
      -
        groupName: "go-carbon"
        protocol: "auto"
        lbMethod: "broadcast"
        maxTries: 3
        # maxBatchSize: 10000
        # sendGlobsAsIs: true
        maxBatchSize: 0
        # doMultipleRequestsIfSplit: false
        keepAliveInterval: "30s"
        # concurrencyLimitPerServer: 500
        concurrencyLimit: 0
        maxIdleConnsPerHost: 100
        timeouts:
          find: "300s"
          render: "300s"
          connect: "500ms"
        servers:
          - "http://go-carbon-0.go-carbon:8000"
          - "http://go-carbon-1.go-carbon:8000"
```
I would suggest trying something like:
```yaml
upstreams:
  buckets: 10
  timeouts:
    find: "300s"
    global: "300s"
    afterStarted: "120s"
    connect: "500ms"
  # alwaysSendGlobsAsIs: true
  maxBatchSize: 100
  concurrencyLimitPerServer: 1000
  concurrencyLimit: 1000
  concurency: 1000
  keepAliveInterval: "30s"
  graphite09compat: false
  doMultipleRequestsIfSplit: true
  # new style backend config, carbonapi 0.13+
  # https://github.com/go-graphite/carbonapi/blob/master/doc/configuration.md
  backendsv2:
    concurrencyLimit: 250
    backends:
      -
        groupName: "go-carbon"
        protocol: "auto"
        lbMethod: "broadcast"
        maxTries: 3
        # maxBatchSize: 10000
        # sendGlobsAsIs: true
        maxBatchSize: 100
        doMultipleRequestsIfSplit: true
        keepAliveInterval: "30s"
        # concurrencyLimitPerServer: 500
        concurrencyLimit: 0
        maxIdleConnsPerHost: 200
        timeouts:
          find: "300s"
          render: "300s"
          connect: "500ms"
        servers:
          - "http://go-carbon-0.go-carbon:8000"
          - "http://go-carbon-1.go-carbon:8000"
```
Apologies for the delay in getting back to this. I tried those settings but the result is a crash. The log line is massive; a snippet with the beginning and end looks like:
```
2020-08-23T20:27:27.354Z WARN zipper errors occurred while getting results {"type": "autoGroup", "name": "http://go-carbon-1.go-carbon:8000", "type": "protoV2Group", "name": "http://go-carbon-1.go-carbon:8000_protobuf", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[{...
...
context canceled"}
```
From the log output I've been able to examine, I'm unsure what errors actually occurred.
I think I'm hitting this bug as well. On 0.12.3, I was able to handle a massive query load without throwing any 500s: 100% of my internal monitoring system's query load, which was around 70,000 metrics every 15 seconds. Under 0.15.3, I'm able to direct only 50% of the query load to carbonapi; beyond that, I start getting 500s without any useful error.
Under 0.12.3 I had concurrencyLimit set to 100; under 0.15.3 I've pushed that to 1000, but I cannot get past 50% of the query load, regardless of the concurrencyLimit setting, before the system falls over.
Here's a snippet:
2021-06-07T17:46:16.649Z ERROR access request failed {"data": {"handler":"render", ...
"cache_timeout":60,"runtime":1.3171156640000001,"http_code":500,"reason":"failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error", ... , "from_cache":false,"used_backend_cache":false,"request_headers":{}}}
Here's my config:
```yaml
graphite09compat: true
concurency: 1000
cpus: 0
maxBatchSize: 100
upstreams:
  buckets: 10
  timeouts:
    find: "300s"
    global: "300s"
    afterStarted: "120s"
    connect: "500ms"
  concurrencyLimitPerServer: 1000
  concurrencyLimit: 1000
  maxBatchSize: 100
  slowLogThreshold: "1s"
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 100
  backendsv2:
    backends:
      -
        groupName: "misc"
        protocol: "auto"
        lbMethod: "broadcast"
        concurrencyLimit: 0
        maxIdleConnsPerHost: 100
        servers:
          - ...
```
0.15.4 should have a fix to make errors more useful.
What worries me is that the error happens after 1.3 second, which is way below all timeout limits.
Is there anything useful in debug logs about what's going on with a query?
@reyjrar For you I also suggest setting doMultipleRequestsIfSplit: true, as it might be important when a huge request is split into a series of smaller ones (e.g. 70k metrics for a single fetch with maxBatchSize: 100 will result in 700 requests performed one after the other, which is suboptimal).
I tweaked concurrency and other settings. It still appears to be having issues, namely making too many requests to carbonserver:
2021-10-28T19:13:20.837Z WARN zipper errors occurred while getting results {"type": "autoGroup", "name": "http://10.20.11.96:8080", "type": "protoV3Group", "name": "http://10.20.11.96:8080_carbonapi_v3_pb", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[{l.... 1635448099 1635448399 false l..... [] 0}],}", "errors": "max tries exceeded: failed to fetch data from server/group: error fetching data from backend: dial tcp 10.20.11.96:8080: connect: connection refused"
Settings are:
```yaml
concurency: 500
maxBatchSize: 500
upstreams:
  buckets: 10
  timeouts:
    find: "5s"
    render: "60s"
    connect: "200ms"
  concurrency: 1000
  concurrencyLimit: 1000
  concurrencyLimitPerServer: 1000
  maxBatchSize: 500
  doMultipleRequestsIfSplit: true
  slowLogThreshold: "1s"
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 100
  backendsv2:
    backends:
      -
        groupName: "misc"
        protocol: "auto"
        lbMethod: "broadcast"
        maxBatchSize: 500
        concurrencyLimit: 0
        doMultipleRequestsIfSplit: true
        servers: []
```
"Connection refused" means that no one was listening on that port, not that there were "too many connections".
At the TCP layer: carbonapi initiates the connection, the OS sends a SYN, then gets an RST back. That happens before carbonserver sees anything.
Well, carbonserver is in fact listening. A connection refused can also happen when the listen backlog is overflowing, which appears to be the case. It seems like something changed and requests that used to be batched are now being sent in either smaller batches or as single queries.
At least on Linux this is not the default behavior, unless you set the tcp_abort_on_overflow sysctl, which by default is 0:
```
tcp_abort_on_overflow - BOOLEAN
    If listening service is too slow to accept new connections,
    reset them. Default state is FALSE. It means that if overflow
    occurred due to a burst, connection will recover. Enable this
    option _only_ if you are really sure that listening daemon
    cannot be tuned to accept connections faster. Enabling this
    option can harm clients of your server.
```
However, other OSes might have different defaults, yeah.
Anyway, it would be helpful if you could gather some statistics about how many new connections get established, or how short-lived they are, to verify whether something changed in that regard.
If I attempt 100 simultaneous clients against carbonapi, I wind up with over 1,000 new connection opens per minute per carbonserver backend. These appear to be short-lived, as the "CurrEstab" counters jump by only 5-10 per node.
I am seeing a crazy high number of requests to the carbonservers in the backend.
Well, GitHub isn't letting me attach an image, cool. The gist is this: I see carbonapi handling 20k requests while carbonserver handles 391k, so carbonapi is making roughly 20 times as many requests to carbonserver. More data: looking at zipper render requests, it's at 4 million for the same duration.
OK, I installed Booking's carbonapi/carbonzipper on a host in the cluster and ran my brutal query test against Booking's fork and 0.15.4:
0.15.4 fell over at 100 concurrent clients, and zipper was recording 50 times as many render_requests as the Booking fork.
Booking's fork fell over at 1000 concurrent clients, and its zipper was reporting 1/5 the number of requests that 0.15.4 was at 100 concurrent clients.
The bug appears to be in carbonapi's interaction with carbonzipper.
Confirmed that switching to Booking's fork allows me to cut our monitoring stack over to using carbonapi. If this gets addressed, I can retry this version and see if it scales as well.
FWIW, this is still a thing. I set up a script on a storage node to watch the "ActiveOpens / PassiveOpens" counters as reported by /proc/net/snmp.
I set:
```yaml
concurrency: 2048
maxBatchSize: 100
idleConnections: 100
doMultipleRequestsIfSplit: true
```
With bookingcom-carbonapi@0.3.0, the test script finished 100 workers in 8.165s:
```
--> 20:34:01 <---
- ActiveOpens changed by: 6
- PassiveOpens changed by: 18
--> 20:34:06 <---
- ActiveOpens changed by: 6
- PassiveOpens changed by: 137
--> 20:34:11 <---
- ActiveOpens changed by: 52
- PassiveOpens changed by: 46
```
And with carbonapi@0.15.6, the test script finished 100 workers in 20.185s:
```
--> 20:41:30 <---
--> 20:44:44 <---
- PassiveOpens changed by: 5239
--> 20:44:49 <---
- ActiveOpens changed by: 3
- PassiveOpens changed by: 14010
--> 20:44:54 <---
- PassiveOpens changed by: 13960
--> 20:44:59 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 13633
--> 20:45:04 <---
- ActiveOpens changed by: 11
- PassiveOpens changed by: 6680
--> 20:45:09 <---
- ActiveOpens changed by: 53
- PassiveOpens changed by: 64
```
That's a lot more connections by several orders of magnitude. This really needs to get fixed.
FWIW, I set doMultipleRequestsIfSplit: false and got 100 workers in 20.183s:
```
--> 20:47:19 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 8954
--> 20:47:24 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 15086
--> 20:47:29 <---
- ActiveOpens changed by: 3
- PassiveOpens changed by: 14224
--> 20:47:34 <---
- PassiveOpens changed by: 14085
--> 20:47:39 <---
- ActiveOpens changed by: 1
- PassiveOpens changed by: 2383
```
FWIW, some background: my test script simulates the load of an existing monitoring tool which issues render requests with batches of target=foo.bar.thing&target=foo.baz.thing instead of using globs. It's a long story, but in this case each worker fetched 917 targets total, in batches of 50 targets, and I spun up 100 simultaneous workers.
This was the case with the bookingcom-carbonapi test as well. It might be as simple as Keep-Alive not being utilized, and/or a missing connection pool causing a new connection to be spun up for each request. Either way, the Booking fork does the right thing; the "official" version appears to be badly broken.
I'm planning to finally fix that as part of #742. That should refactor how keep-alive etc. is used (I think one of the problems with the current approach is that we set IdleConnectionTimeout to 0, which means Go falls back to whatever the read timeout is, so slow queries would cause a huge number of dropped connections). As part of that I'll fix #631 as well.
Actually, I would split the PR in two: the current one focuses on refactoring the HTTP client (it should fix keep-alive for failed requests and ensure that no more than X connections are established). The next one would refactor the expr package in an attempt to combine fetches for multiple targets.
As for a single fetch request for multiple targets: I think #743 should fix the problem once it's merged. It introduces a config option that sends multiple targets in a single request. It might need some work before it's fully compatible, however (I have some concerns about fallbackSeries).
If you have a way to test them safely, I would appreciate your feedback on both #742 and #743.
FWIW: testing with 0.16.0, things are the same, maybe slightly better. When I run my benchmarking script against go-graphite/carbonapi with 25 concurrent workers making about 40 requests each (containing 25 metrics each), I see an astronomical number of TIME_WAIT sockets after the run. This indicates there were a large number of short-lived connections.
For comparison, I'm observing TIME_WAIT sockets on the host running carbonapi; my data servers, running go-carbon, live on different hardware.
Running the benchmark against bookingcom/carbonapi yields around 400 TIME_WAIT connections by the time the benchmark finishes; cold-cached, it takes each worker about 2s to fetch the data.
Running the benchmark against go-graphite/carbonapi v0.16.0 yields around 108,000 TIME_WAIT connections by the time the benchmark finishes; cold-cached, it takes each worker about 9s to fetch the data.
For fun, I toggled sysctl -w net.ipv4.tcp_tw_recycle=1 on the host running go-graphite/carbonapi, and that dropped the cold-cache request time to 8.5s per worker. Since the kernel could aggressively reuse sockets, we maxed out at around 8,000 TIME_WAIT sockets during the benchmark run.
I'm pretty sure the dramatic performance issues OP and I noted are due to the excessive socket use present in go-graphite/carbonapi.
I'm using:
```yaml
# Both using:
concurrencyLimit: 2048
concurrencyLimitPerServer: 0
expireDelaySec: 600
maxIdleConnsPerHost: 100
# go-graphite/carbonapi "resolveGlobs"
# I've tried 0, 1, and 100; no difference in speed or number of TIME_WAITs
```
Benchmark took 9.052s. PassiveOpens on a storage node:
```
--> 01:31:10 <---
- ActiveOpens changed by: 52
- PassiveOpens changed by: 2729
--> 01:31:15 <---
- ActiveOpens changed by: 1
- PassiveOpens changed by: 8037
--> 01:31:20 <---
- PassiveOpens changed by: 2565
```
Benchmark took 3.036s. PassiveOpens on a storage node:
```
--> 01:32:42 <---
- ActiveOpens changed by: 1
- PassiveOpens changed by: 28
```
Derp, my previous runs checked passive opens while running 100 simultaneous clients. Results:
Benchmark took 7.162s. Passive opens on the storage node:
```
--> 01:53:06 <---
- ActiveOpens changed by: 8
- PassiveOpens changed by: 38
--> 01:53:11 <---
- ActiveOpens changed by: 54
- PassiveOpens changed by: 62
--> 01:53:16 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 101
--> 01:53:26 <---
- ActiveOpens changed by: 3
- PassiveOpens changed by: 1
```
Benchmark took 31.213s. Passive opens on the storage node:
```
--> 01:49:16 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 7733
--> 01:49:21 <---
- PassiveOpens changed by: 8787
--> 01:49:26 <---
- ActiveOpens changed by: 3
- PassiveOpens changed by: 9550
--> 01:49:31 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 8483
--> 01:49:36 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 8883
--> 01:49:41 <---
- PassiveOpens changed by: 9315
--> 01:49:46 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 725
--> 01:49:51 <---
```
I wonder if you've also tried the combineMultipleTargetsInOne flag?
But I'll have another look at why the http client doesn't respect the keepalive option.
Also a question: how unique are the queries? You mentioned multiple targets; are all of the targets unique, or is some percentage of targets the same?
Also, what percentage of them are invalid (or end with an error)? Is that for all backends or only for some?
Not that it would help me with keepalive, but it might give some idea of the particular use case.
The targets are all unique, but all 100 workers are asking the "same" questions. For the benchmark, I'm enumerating all the server names and requesting the 1 minute LoadAvg, so my metric space looks like "collector.$Hostname.LoadAvg.01" x ~900 hosts.
Wow! Adding combineMultipleTargetsInOne: true brought the runtime down to 3s, and the number of connections is way down:
```
--> 16:48:29 <---
- ActiveOpens changed by: 5
- PassiveOpens changed by: 1
--> 16:48:39 <---
- ActiveOpens changed by: 2
- PassiveOpens changed by: 1167
--> 16:48:49 <---
- ActiveOpens changed by: 1
- PassiveOpens changed by: 2
```
This is still quite a bit higher than what I see with bookingcom/carbonapi, but it may be fast and efficient enough for my immediate needs. That said, there's definitely something happening with connections that could be done more efficiently.
So the option I suggested is specifically designed to solve that particular problem: pass all targets in a single request (or re-split them later into a reasonably small number).
The caveat is that handling of fallbackSeries is a bit different. I currently don't pass enough information to identify whether a failed request had it or not (only whether it was present at least once), and to actually fix that I would need to rework the whole path.
I'll have a look at the http client later.
I don't know if I should open a new issue for this, but with doMultipleRequestsIfSplit: true, Grafana cannot talk to carbonapi as a data source. Tested with Grafana v9.2.4 and v9.3.1; both hit errors.
I'll try to reproduce it. I don't think it should matter whether it's Grafana or not.
I'm experimenting more with this setup. I have a broadcast cluster set up with three nodes; data is replicated to 2 of the 3 nodes. Most of the time things work fine, but I saw a few "blips" overnight where I can't put my finger on why the query failed. I'm sending over 10k requests a minute and have had 15 queries return 503 in the past 18h. The weird thing is, digging into the logs, I don't think those queries should have failed.
Access log from a failed query:
```json
{
  "data": {
    "cache_timeout": 60,
    "carbonapi_uuid": "cda21391-6188-4282-b78f-318553f53a17",
    "format": "json",
    "from": 1672901880,
    "from_cache": false,
    "from_raw": "06:58_20230105",
    "handler": "render",
    "host": "graphite-api.example.com",
    "http_code": 503,
    "peer_ip": "...",
    "reason": "timeout while fetching Response",
    "request_headers": {},
    "runtime": 5.00458162,
    "targets": [
      ....
    ],
    "tz": "UTC",
    "until": 1672902780,
    "until_raw": "07:13_20230105",
    "uri": "/render?until=07%3A13_20230105&target=...",
    "url": "/render?until=07%3A13_20230105&target=...",
    "use_cache": true,
    "used_backend_cache": false
  },
  "level": "ERROR",
  "logger": "access",
  "message": "request failed",
  "timestamp": "2023-01-05T07:13:28.665Z"
}
```
And when I check the zipper logs:
{
"errors": "max tries exceeded",
"errorsVerbose": "max tries exceeded\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/u2/cbs/workers/blhotsky_1671060092142841_centos-7-x86_64_carbonapi/GIT/craigslist-graphite/golang/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/u2/cbs/workers/blhotsky_1671060092142841_centos-7-x86_64_carbonapi/GIT/craigslist-graphite/golang/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571\n\nCaused By: error while fetching Response\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/u2/cbs/workers/blhotsky_1671060092142841_centos-7-x86_64_carbonapi/GIT/craigslist-graphite/golang/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:34\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571",
"level": "WARN",
"logger": "zipper",
"message": "errors occurred while getting results",
"name": "http://10.1.2.3:8080_carbonapi_v3_pb",
"request": "&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{Name:foo.dc.build.host.metric,StartTime:1672901880,StopTime:1672902780,HighPrecisionTimestamps:false,PathExpression:foo.dc.build.host.metric2,FilterFunctions:[]*FilteringFunction{},MaxDataPoints:0,},}",
"timestamp": "2023-01-05T07:13:28.665Z",
"type": "fetch"
}
If I'm reading these logs correctly, there was a timeout on the request to a single backend, but it failed the entire query? That shouldn't happen, should it?
When I check the go-carbon storage server logs, I see this access request item 5s earlier:
```
[2023-01-05T07:13:23.665Z] INFO [access] fetch served:
{
  "carbonapi_uuid": "cda21391-6188-4282-b78f-318553f53a17",
  "carbonzipper_uuid": "cda21391-6188-4282-b78f-318553f53a17",
  "format": "carbonapi_v3_pb",
  "from_cache": false,
  "handler": "render",
  "http_code": 200,
  "memory_used_bytes": 4566,
  "metrics_fetched": 19,
  "peer": "CARBONAPI_IP:28050",
  "query_cache_enabled": true,
  "runtime_seconds": 0.002854232,
  "targets": [
    ...
  ],
  "url": "/render/?format=carbonapi_v3_pb",
  "values_fetched": 285
}
```
This indicates the query returned successfully from the backend, or at least enough of it that go-carbon logged a 200.
Hey, was this resolved? We suffer from a similar bug. We upgraded carbonapi to 0.16.0; the Go version is 1.19.10.
I tinkered with the configuration a lot and nothing helped. Here is the latest configuration:
```yaml
cpus: 0
combineMultipleTargetsInOne: true
concurency: 2000
upstreams:
  graphite09compat: false
  buckets: 10
  doMultipleRequestsIfSplit: true
  concurrencyLimitPerServer: 0
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 100
  timeouts:
    find: "150s"
    render: "30s"
    connect: "500ms"
  backendsv2:
    backends:
      -
        groupName: "go-carbon"
        doMultipleRequestsIfSplit: true
        protocol: "carbonapi_v3_pb"
        lbMethod: "broadcast"
        maxTries: 3
        maxBatchSize: 100
        keepAliveInterval: "10s"
        concurrencyLimit: 2048
        #concurrencyLimit: 0
        maxIdleConnsPerHost: 1000
        timeouts:
          find: "2s"
          render: "50s"
          connect: "200ms"
        servers:
          - "http://127.0.0.1:8080"
```
The query we use has high cardinality, and we apply groupByNode and timeShift functions to the results.
The errors seem to come from carbonzipper, almost exclusively max tries exceeded / context deadline exceeded:
```
{"level":"WARN","timestamp":"2023-06-27T15:14:42.667Z","logger":"zipper","message":"errors occurred while getting results","type":"protoV3Group","name":"http://127.0.0.1:8080","type":"fetch","request":"&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{Name:stats.counters.health.status.health_v3.statusv.1_226_1.bcaeifjkoohhhhlmlnlikddldlkhnkggihjcadla.custom_info_error.count,StartTime:1687792472,StopTime:1687878872,HighPrecisionTimestamps:false,PathExpression:stats.counters.health.status.health_v3.statusv.*.*.custom_info_error.count,FilterFunctions:[]*FilteringFunction{},MaxDataPoints:0,},FetchRequest{Name:stats.counters.health.status.health_v3.statusv.1_226_2.bcaeifjkoohhhhlmlnlikddldlkhnkggihjcadla.custom_info_error.count,StartTime:1687792472,StopTime:1687878872,HighPrecisionTimestamps:false,PathExpression:stats.counters.health.status.health_v3.statusv.*.*.custom_info_error.count,FilterFunctions:[]*FilteringFunction{},MaxDataPoints:0,},},}","errors":"max tries exceeded","errorsVerbose":"max tries exceeded\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: timeout while fetching Response\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:20\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: Get \"http://127.0.0.1:8080/render/?format=carbonapi_v3_pb\": context deadline exceeded"}
```
Any help or ideas on how to solve this would be much appreciated.
There is no good solution for that at the moment, especially if the backend is go-carbon. If you migrated to graphite-clickhouse and relied on backend-side aggregation, that should work much better.
Overall, I only have occasional chunks of time to work on carbonapi, so I prefer issues that are easy to reproduce in a home lab; performance problems that depend on a lot of data are unfortunately not easy to reproduce.
OK, so it stays an open issue. For anyone who suffers from the same problem and is interested in how we bypassed it: I followed @reyjrar's advice and installed Booking's carbonapi fork. The queries still take some time, but there are no more timeouts.
I found that go-carbon's self-metric "carbon.agents.*.carbonserver.inflight_requests_count" sometimes reads 18446744073709552000: https://github.com/go-graphite/go-carbon/blob/115958e61f5ee608a871d43e82c31aeafad18c1f/carbonserver/carbonserver.go#L2212-L2222. Restarting go-carbon may help.
Describe the bug
We're seeing a consistent timeout on a specific query after upgrading from carbonapi 0.10 to 0.14 (running current master).
The query is (some redactions):
This query has high cardinality; I believe there are ~50k metrics, based on DEBUG output from carbonapi 0.10 with the attribute
zipper_requests: 49678
. I have enabled debug logging with both carbonapi 0.10 and 0.14, but I don't see any particularly helpful output. I can look for specific keywords if pointed in the right direction.
Instead, I've been running tcpdump on go-carbon's HTTP port to identify differences in the queries sent by the two carbonapi versions. My observations:
When running the query against carbonapi 0.10 I see mostly individual /render calls like:
And the query to carbonapi returns in ~10-15s usually.
When running the same query against carbonapi 0.14, I see many /metrics/find requests in addition to the /render requests to go-carbon (at a glance, one /metrics/find plus one /render call for each timeseries in the glob).
In other words, carbonapi 0.14 appears to send a /metrics/find request and a /render request for every timeseries, and the query eventually times out.
CarbonAPI Configuration:
0.10 configuration is:
Latest attempted 0.14 configuration is:
I have tried different values of
maxBatchSize
, such as 0, 1, and 2000, but the query times out in all cases.