go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang

[BUG] consistent timeouts on some large queries when upgrading from 0.10.x to 0.14.x #509

Open joemiller opened 4 years ago

joemiller commented 4 years ago

Describe the bug

We're seeing a consistent timeout on a specific query we run after upgrading from carbonapi 0.10 to 0.14 (running current master).

The query is (some redactions):

/render/?from=-5minutes&target=currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)

This query has high cardinality: I think there are ~50k metrics, based on DEBUG output from carbonapi 0.10 showing the attribute zipper_requests: 49678.

I have enabled debug logging with both carbonapi 0.10 and 0.14, but I don't see any particularly helpful output. I can look for specific keywords if pointed in the right direction.

Instead, I've been running tcpdump on go-carbon's HTTP port to identify any differences in the queries being sent by the different carbonapi versions. My observations:

When running the query against carbonapi 0.10 I see mostly individual /render calls like:

...wU..[GET /render/?format=protobuf&from=1596495516&target=foo.002a4a38.live.048c8.mysql_replica.lag_seconds.value&until=1596495816 HTTP/1.1
...wU..[GET /render/?format=protobuf&from=1596495516&target=foo.0018a9c8.live.5df8d.mysql_replica.lag_seconds.value&until=1596495816 HTTP/1.1

And the query to carbonapi usually returns in ~10-15s.

When running the same query against carbonapi 0.14, I see lots of /metrics/find requests in addition to the /render requests to go-carbon, like the following (at a glance it looks like one /metrics/find + one /render call for each timeseries in the glob):

{z......GET /metrics/find/?format=protobuf&query=foo.01b98791.live.352a969.mysql_replica.lag_seconds.value HTTP/1.1
{z......GET /render/?format=protobuf&from=1596495456&target=foo.01b98791.live.352a969.mysql_replica.lag_seconds.value&until=1596495756 HTTP/1.1

In other words, it appears that carbonapi 0.14 sends a /metrics/find request plus a /render request for every timeseries, and eventually the query times out.

CarbonAPI Configuration:

0.10 configuration is:

    listen: "0.0.0.0:8080"
    concurency: 50
    cache:
       # Type of caching. Valid: "mem", "memcache", "null"
       type: "mem"
       size_mb: 1024
       defaultTimeoutSec: 60
    cpus: 8
    tz: ""
    sendGlobsAsIs: false
    maxBatchSize: 2000
    graphite:
        # Host:port where to send internal metrics
        # Empty = disabled
        host: "graphite-ingest:2003"
        interval: "60s"
        prefix: "carbon.carbonapi"
        pattern: "{prefix}.{fqdn}"
    idleConnections: 100
    pidFile: ""
    # See https://github.com/go-graphite/carbonzipper/blob/master/example.conf#L70-L108 for format explanation
    upstreams:
        buckets: 10

        timeouts:
            global: "300s"
            afterStarted: "120s"
            connect: "500ms"

        concurrencyLimit: 0
        keepAliveInterval: "30s"
        maxIdleConnsPerHost: 100
        backends:
            - "http://go-carbon-0.go-carbon:8000"
            - "http://go-carbon-1.go-carbon:8000"
        graphite09compat: false
    expireDelaySec: 10
    logger:
        - logger: ""
          file: "stderr"
          level: "warn"
          encoding: "json"
          encodingTime: "iso8601"
          encodingDuration: "seconds"

Latest attempted 0.14 configuration is:

    listen: "0.0.0.0:8080"
    concurency: 50
    cache:
       # Type of caching. Valid: "mem", "memcache", "null"
       type: "mem"
       size_mb: 1024
       defaultTimeoutSec: 60
    cpus: 8
    tz: ""
    # sendGlobsAsIs: false
    # maxBatchSize: 2000
    graphite:
        # Host:port where to send internal metrics
        # Empty = disabled
        host: "graphite-ingest:2003"
        interval: "60s"
        prefix: "carbon.carbonapi"
        pattern: "{prefix}.{fqdn}"
    idleConnections: 100
    pidFile: ""
    # See https://github.com/go-graphite/carbonzipper/blob/master/example.conf#L70-L108 for format explanation
    upstreams:
        buckets: 10

        timeouts:
            global: "300s"
            afterStarted: "120s"
            connect: "500ms"

        # alwaysSendGlobsAsIs: true
        # maxBatchSize: 0
        concurrencyLimit: 0
        keepAliveInterval: "30s"
        maxIdleConnsPerHost: 100
        graphite09compat: false
        # new style backend config, carbonapi 0.13+
        # https://github.com/go-graphite/carbonapi/blob/master/doc/configuration.md
        backendsv2:
          backends:
            -
              groupName: "go-carbon"
              protocol: "auto"
              lbMethod: "broadcast"
              maxTries: 3
              maxBatchSize: 10000
              keepAliveInterval: "30s"
              concurrencyLimit: 0
              maxIdleConnsPerHost: 100
              timeouts:
                  find: "300s"
                  render: "300s"
                  connect: "500ms"
              servers:
                  - "http://go-carbon-0.go-carbon:8000"
                  - "http://go-carbon-1.go-carbon:8000"
    expireDelaySec: 10
    logger:
        - logger: ""
          file: "stderr"
          level: "debug"
          # level: "warn"
          encoding: "json"
          encodingTime: "iso8601"
          encodingDuration: "seconds"

I have tried several values of maxBatchSize (0, 1, 2000), but the query times out in all cases.

joemiller commented 4 years ago

I may have spoken too soon. I am able to get a successful response after setting up an isolated instance of 0.14 that is not receiving any other traffic and changing maxBatchSize within the backend block to 0, albeit more slowly: about 60s compared to 10-15s.
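Concretely, the only change relative to the 0.14 config above was in the backend block, roughly:

    backendsv2:
      backends:
        -
          groupName: "go-carbon"
          # ...everything else as in the 0.14 config above
          maxBatchSize: 0   # was 10000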

I'll probably close this bug in a few minutes if I can replicate this success with the primary instance of carbonapi.

EDIT:

The next time I made the same request (after letting the 60s cache expire), it responded in ~2s with a 500:

{"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"zipper","message":"find request failed when resolving globs","type":"broadcastGroup","groupName":"root","metric_name":"foo.*.live.*.mysql_replica.lag_seconds.value","error":"timeout while fetching data"}
carbonapi-0-14-7b8dbf7cd-qbdsw carbonapi {"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"zipper","message":"errors occurred while getting results","type":"autoGroup","name":"http://go-carbon-1.go-carbon:8000","type":"protoV2Group","name":"http://go-carbon-1.go-carbon:8000_protobuf","type":"find","request":["foo.*.live.*.mysql_replica.lag_seconds.value"],"errors":"max tries exceeded: failed to fetch data from server/group: Get \"http://go-carbon-1.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded","errorsVerbose":"max tries exceeded\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:21\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: failed to fetch data from server/group\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:22\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: Get \"http://go-carbon-1.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded"}

{"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"zipper","message":"errors occurred while getting results","type":"autoGroup","name":"http://go-carbon-0.go-carbon:8000","type":"protoV2Group","name":"http://go-carbon-0.go-carbon:8000_protobuf","type":"find","request":["foo.*.live.*.mysql_replica.lag_seconds.value"],"errors":"max tries exceeded: failed to fetch data from server/group: Get \"http://go-carbon-0.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded","errorsVerbose":"max tries exceeded\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:21\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: failed to fetch data from server/group\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:22\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5420\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5415\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\n\n\nCaused By: Get \"http://go-carbon-0.go-carbon:8000/metrics/find/?format=protobuf&query=foo.%2A.live.%2A.mysql_replica.lag_seconds.value\": context deadline exceeded"}

{"level":"ERROR","timestamp":"2020-08-04T00:02:47.373Z","logger":"access","message":"request failed","data":{"handler":"render","carbonapi_uuid":"5127fb89-9f6a-458c-93eb-c32120974d0b","url":"/render/?from=-5minutes&target=currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)","peer_ip":"10.48.5.1","host":"proxy","use_cache":true,"targets":["currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)"],"cache_timeout":60,"runtime":2.00074282,"http_code":500,"reason":"error or no response: no metrics in the Response","from":1596499065,"until":1596499365,"from_raw":"-5minutes","uri":"/render/?from=-5minutes&target=currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)","from_cache":false,"used_backend_cache":false,"request_headers":{}}}

{"level":"WARN","timestamp":"2020-08-04T00:02:47.373Z","logger":"slow","message":"Slow Request","time":2.000934282,"url":"https:///render/?from=-5minutes&target=currentAbove(foo.*.live.*.mysql_replica.lag_seconds.value,1200)","referer":""}

Immediately after the failed 2nd request, the 3rd request succeeded in ~89 seconds.

EDIT 2:

After analyzing the parsed config struct that is printed during startup, I noticed there was a find timeout in the Upstreams/Backends section that defaulted to 2 seconds. I think I may have resolved the 500 error that finishes in 2 seconds by adding a find timeout in the global upstreams section:

    upstreams:
        buckets: 10

        timeouts:
            find: "300s"
            global: "300s"
            afterStarted: "120s"
            connect: "500ms"

It is difficult to know for sure, however, as the results have been a bit unpredictable overall.

EDIT 4:

It seems like the upstreams/timeouts/find setting has helped resolve the intermittent 500s that occur after 2 seconds.

However, overall performance is lower than 0.10. The fastest result I've achieved with 0.14 is ~45s, with most runs between 60-120s, compared to the 10-20s responses I often saw with 0.10. Not sure if there's any additional tuning I can do here, though.

EDIT 5: sorry

From watching the queries coming into go-carbon via tcpdump, some observations:

With my 0.10 config I see a lot of small/fast queries for the individual timeseries, eg:

  1. foo.A.a.value
  2. foo.A.b.value
  3. foo.B.a.value
  4. foo.B.b.value

With the 0.14 config I see batches of queries from the "parent", eg:

  1. foo.A.a.value, foo.A.b.value
  2. foo.B.a.value, foo.B.b.value

I would think the batches would be faster but that doesn't seem to be the case from my observations thus far.

EDIT 6:

After observing that the batched requests appear to be sent serially, one at a time, I took a look at netstat and noticed there's only ever 1 connection from my carbonapi pod to the go-carbon backends for 0.14, whereas my 0.10 pods have many more connections. I've tried all sorts of combinations of the various concurrency settings, including leaving the defaults.

Civil commented 4 years ago

It seems you are right: there is a behavioral change in terms of "sendGlobsAsIs" since 0.10. Back then, all requests were sent in parallel; now, after the "split", they are sent one by one.

I'll try to think of a way to make the old-style config use it by default; in the new-style config (backendsv2) it will be configurable, with defaults that match the current behavior.

EDIT:

Ah no, I'm wrong: while it's a different limiter now, requests are still sent in parallel, even for the same backend. This is regulated by concurrencyLimit, and with 0 all requests should be sent out in parallel without any restrictions...

Civil commented 4 years ago

OK, I think I found a place that was parallel in older versions of carbonapi but no longer is. I've added a flag that should enable the old behavior (and if you use the old config format, it should be set to true automatically).

Could you please give it a try and tell me if it helps in your case?

I'm also not sure if I should have a different limiter for that use case.

joemiller commented 4 years ago

@Civil Thank you for your efforts!

Results using 0.14.0-67_amd64.deb from the autobuild repo:

(I am using the v2 config now, btw.) I tried a few different settings and the results were consistently in the ~34-42 second range. It's possible I don't have the right combination of settings. It seems like any or all of maxBatchSize, concurrencyLimit, sendGlobsAsIs, and the new doMultipleRequestsIfSplit might matter, but I'm not sure.

    upstreams:
        buckets: 10

        timeouts:
            find: "300s"
            global: "300s"
            afterStarted: "120s"
            connect: "500ms"

        # alwaysSendGlobsAsIs: true
        # maxBatchSize: 0
        # concurrencyLimit: 1000
        # concurency: 1000
        keepAliveInterval: "30s"
        maxIdleConnsPerHost: 100
        graphite09compat: false
        # new style backend config, carbonapi 0.13+
        # https://github.com/go-graphite/carbonapi/blob/master/doc/configuration.md
        backendsv2:
          concurrencyLimit: 0
          backends:
            -
              groupName: "go-carbon"
              protocol: "auto"
              lbMethod: "broadcast"
              maxTries: 3
              # maxBatchSize: 10000
              #sendGlobsAsIs: true
              maxBatchSize: 0
              # doMultipleRequestsIfSplit: false
              keepAliveInterval: "30s"
              # concurrencyLimitPerServer: 500
              concurrencyLimit: 0
              maxIdleConnsPerHost: 100
              timeouts:
                  find: "300s"
                  render: "300s"
                  connect: "500ms"
              servers:
                  - "http://go-carbon-0.go-carbon:8000"
                  - "http://go-carbon-1.go-carbon:8000"

Civil commented 4 years ago

I would suggest trying something like:

    upstreams:
        buckets: 10

        timeouts:
            find: "300s"
            global: "300s"
            afterStarted: "120s"
            connect: "500ms"

        # alwaysSendGlobsAsIs: true
        maxBatchSize: 100
        concurrencyLimitPerServer: 1000
        concurrencyLimit: 1000
        concurency: 1000
        keepAliveInterval: "30s"
        graphite09compat: false
        doMultipleRequestsIfSplit: true
        # new style backend config, carbonapi 0.13+
        # https://github.com/go-graphite/carbonapi/blob/master/doc/configuration.md
        backendsv2:
          concurrencyLimit: 250
          backends:
            -
              groupName: "go-carbon"
              protocol: "auto"
              lbMethod: "broadcast"
              maxTries: 3
              # maxBatchSize: 10000
              #sendGlobsAsIs: true
              maxBatchSize: 100
              doMultipleRequestsIfSplit: true
              keepAliveInterval: "30s"
              # concurrencyLimitPerServer: 500
              concurrencyLimit: 0
              maxIdleConnsPerHost: 200
              timeouts:
                  find: "300s"
                  render: "300s"
                  connect: "500ms"
              servers:
                  - "http://go-carbon-0.go-carbon:8000"
                  - "http://go-carbon-1.go-carbon:8000"

joemiller commented 4 years ago

Apologies for the delay in getting back on this. I tried those settings, but the result is a crash. The log line is massive; a snippet with the beginning and end looks like:

2020-08-23T20:27:27.354Z        WARN    zipper  errors occurred while getting results   {"type": "autoGroup", "name": "http://go-carbon-1.go-carbon:8000", "type": "protoV2Group", "name": "http://go-carbon-1.go-carbon:8000_protobuf", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[{...
...
 context canceled"}

From the log output I've been able to examine, I'm unsure what errors occurred.

reyjrar commented 3 years ago

I think I'm hitting this bug as well. In 0.12.3, I was able to handle a massive query load without throwing any 500s: 100% of my internal monitoring system's query load, which was around 70,000 metrics every 15 seconds. Under 0.15.3, I'm able to direct only 50% of the query load to carbonapi; beyond that, I start getting 500s without any useful error.

Under 0.12.3 I had concurrencyLimit set to 100; under 0.15.3 I've pushed that to 1000, but regardless of the concurrencyLimit setting I cannot get past 50% of the query load before the system falls over.

Here's a snippet:

2021-06-07T17:46:16.649Z        ERROR   access  request failed  {"data": {"handler":"render", ...
"cache_timeout":60,"runtime":1.3171156640000001,"http_code":500,"reason":"failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error,failed due to error", ... , "from_cache":false,"used_backend_cache":false,"request_headers":{}}}

Here's my config:

graphite09compat: true
concurency: 1000
cpus: 0
maxBatchSize: 100
upstreams:
    buckets: 10
    timeouts:
        find: "300s"
        global: "300s"
        afterStarted: "120s"
        connect: "500ms"
    concurrencyLimitPerServer: 1000
    concurrencyLimit: 1000
    maxBatchSize: 100
    slowLogThreshold: "1s"
    keepAliveInterval: "30s"
    maxIdleConnsPerHost: 100
    backendsv2:
        backends:
          -
             groupName: "misc"
             protocol: "auto"
             lbMethod: "broadcast"
             concurrencyLimit: 0
             maxIdleConnsPerHost: 100
             servers:
                - ...

Civil commented 3 years ago

0.15.4 should have a fix to make errors more useful.

What worries me is that the error happens after 1.3 seconds, which is way below all the timeout limits.

Is there anything useful in debug logs about what's going on with a query?

Civil commented 3 years ago

@reyjrar I also suggest you set doMultipleRequestsIfSplit: true, as it can be important when a huge request is split into a series of smaller ones (70k metrics in a single fetch with maxBatchSize: 100 will result in 700 requests performed one after the other, which is suboptimal).
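For reference, the flag can be set at the upstreams level and/or per backend; a sketch mirroring the example config I posted above (server URLs and the other values are just illustrative):

    upstreams:
        doMultipleRequestsIfSplit: true
        maxBatchSize: 100
        backendsv2:
          backends:
            -
              groupName: "go-carbon"
              protocol: "auto"
              lbMethod: "broadcast"
              maxBatchSize: 100
              doMultipleRequestsIfSplit: true
              servers:
                  - "http://go-carbon-0.go-carbon:8000"
                  - "http://go-carbon-1.go-carbon:8000"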

reyjrar commented 2 years ago

I tweaked concurrency and other settings. It still appears to have issues, namely making too many requests to carbonserver:

2021-10-28T19:13:20.837Z        WARN    zipper  errors occurred while getting results   {"type": "autoGroup", "name": "http://10.20.11.96:8080", "type": "protoV3Group", "name": "http://10.20.11.96:8080_carbonapi_v3_pb", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[{l.... 1635448099 1635448399 false l..... [] 0}],}", "errors": "max tries exceeded: failed to fetch data from server/group: error fetching data from backend: dial tcp 10.20.11.96:8080: connect: connection refused"

Settings are:

concurency: 500
maxBatchSize: 500
upstreams:
    buckets: 10
    timeouts:
        find: "5s"
        render: "60s"
        connect: "200ms"
    concurrency: 1000
    concurrencyLimit: 1000
    concurrencyLimitPerServer: 1000
    maxBatchSize: 500
    doMultipleRequestsIfSplit: true
    slowLogThreshold: "1s"
    keepAliveInterval: "30s"
    maxIdleConnsPerHost: 100
    backendsv2:
        backends:
          -
             groupName: "misc"
             protocol: "auto"
             lbMethod: "broadcast"
             maxBatchSize: 500
             concurrencyLimit: 0
             doMultipleRequestsIfSplit: true
             servers: []

Civil commented 2 years ago

"Connection refused" means that there was no one listening on that port, and not that there were "too many connections".

In terms of what happens at the TCP layer: carbonapi initiates a connection, the OS sends a SYN, then gets an RST back. That happens before carbonserver sees anything.

reyjrar commented 2 years ago

Well, carbonserver is in fact listening. A connection refused can also happen when the listen backlog is overflowing, which appears to be the case here. It seems like something changed, and requests that used to be batched are now being sent in either smaller batches or as single queries.

Civil commented 2 years ago

At least on Linux this is not the default behavior, unless you set the tcp_abort_on_overflow sysctl, which by default is 0:

tcp_abort_on_overflow - BOOLEAN
    If listening service is too slow to accept new connections,
    reset them. Default state is FALSE. It means that if overflow
    occurred due to a burst, connection will recover. Enable this
    option _only_ if you are really sure that listening daemon
    cannot be tuned to accept connections faster. Enabling this
    option can harm clients of your server.

However, other OSes might have different defaults, yes.

Anyway, it would be helpful if you could gather some statistics about how many new connections are being established, or how short-lived they are, to verify whether something changed in that regard.

reyjrar commented 2 years ago

TCP statistics

If I point 100 simultaneous clients at carbonapi, I wind up with over 1,000 new connection opens per minute per carbonserver backend. These appear to be short-lived, as the "CurrEstab" counters only jump by 5-10 per node.

Request breakdown

I am seeing a crazy high number of requests to the carbonservers in the backend.

Well, GitHub isn't letting me attach an image, cool. So the gist, pulling the request counters, is this:

I see ~20k carbonapi requests vs. ~391k carbonserver requests, so roughly 20 times as many requests are being made to carbonserver.

reyjrar commented 2 years ago

More data: looking at zipper render requests, it's at 4 million for the same duration.

reyjrar commented 2 years ago

OK, I installed Booking's carbonapi/carbonzipper on a host in the cluster and ran my brutal query test against both Booking's fork and 0.15.4:

0.15.4 fell over at 100 concurrent clients, and zipper was recording 50 times as many render_requests as the Booking fork.

Booking's fork fell over at 1000 concurrent clients, and zipper was reporting 1/5 the number of requests that 0.15.4 was reporting at 100 concurrent clients.

The bug appears to be in carbonapi's interaction with carbonzipper.

reyjrar commented 2 years ago

Confirmed that switching to Booking's fork allows me to cut over our monitoring stack to using carbonapi. If this gets addressed, I can retry this version and see whether it scales as well.

reyjrar commented 1 year ago

FWIW, this is still a thing. I set up a script on a storage node to watch the "ActiveOpens / PassiveOpens" counters as reported by /proc/net/snmp:

I set:

concurrency: 2048
maxBatchSize: 100
idleConnections: 100
doMultipleRequestsIfSplit: true

With bookingcom-carbonapi@0.3.0, the test script completed 100 workers in 8.165s:

--> 20:34:01 <---
 - ActiveOpens changed by: 6
 - PassiveOpens changed by: 18
--> 20:34:06 <---
 - ActiveOpens changed by: 6
 - PassiveOpens changed by: 137
--> 20:34:11 <---
 - ActiveOpens changed by: 52
 - PassiveOpens changed by: 46

And with carbonapi@0.15.6, the test script completed 100 workers in 20.185s:

--> 20:41:30 <---
--> 20:44:44 <---
 - PassiveOpens changed by: 5239
--> 20:44:49 <---
 - ActiveOpens changed by: 3
 - PassiveOpens changed by: 14010
--> 20:44:54 <---
 - PassiveOpens changed by: 13960
--> 20:44:59 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 13633
--> 20:45:04 <---
 - ActiveOpens changed by: 11
 - PassiveOpens changed by: 6680
--> 20:45:09 <---
 - ActiveOpens changed by: 53
 - PassiveOpens changed by: 64

That's a lot more connections by several orders of magnitude. This really needs to get fixed.

reyjrar commented 1 year ago

FWIW, I set doMultipleRequestsIfSplit: false and got 100 workers in 20.183s:

--> 20:47:19 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 8954
--> 20:47:24 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 15086
--> 20:47:29 <---
 - ActiveOpens changed by: 3
 - PassiveOpens changed by: 14224
--> 20:47:34 <---
 - PassiveOpens changed by: 14085
--> 20:47:39 <---
 - ActiveOpens changed by: 1
 - PassiveOpens changed by: 2383

reyjrar commented 1 year ago

FWIW, some background: my test script simulates the load from an existing monitoring tool, which makes render requests with batches of target=foo.bar.thing&target=foo.baz.thing instead of using globs. It's a long story, but in this case each worker was fetching 917 targets total, in batches of 50 targets, and I spun up 100 simultaneous workers.

This was the case with the bookingcom-carbonapi test as well. It might be as simple as Keep-Alive not being utilized, and/or a missing connection pool, so that a new connection is spun up for each request. Either way, the Booking fork does the right thing; the "official" version appears to be badly broken.

Civil commented 1 year ago

I'm planning to finally fix that as part of #742

That should refactor how keep-alive, etc. is used (I think one of the problems with the current approach is that we set IdleConnectionTimeout to 0, which means Go falls back to whatever the read timeout is, so slow queries would cause a huge number of dropped connections). As part of that I'll fix #631 as well.

Civil commented 1 year ago

Actually, I would split the PR in two: the current one focuses on refactoring the HTTP client (it should fix keep-alive for failed requests and ensure that no more than X connections are established). The next one would refactor the expr package in an attempt to combine fetches for multiple targets.

Civil commented 1 year ago

As for a single fetch request for multiple targets: I think #743 should fix the problem once it's merged. It introduces a config option that sends multiple targets in a single request. However, it might need some work before it's fully compatible (I have some concerns about fallbackSeries).

If you have a way to test it safely, I would appreciate your feedback on both #742 and #743.

reyjrar commented 1 year ago

FWIW: testing with 0.16.0, things are the same, maybe slightly better. When I run my benchmarking script against go-graphite/carbonapi, with 25 concurrent workers making about 40 requests (containing 25 metrics each), I see an astronomical number of TIME_WAIT sockets after the run. This indicates there were a large number of short-lived connections.

For context, I'm observing the TIME_WAIT sockets on the host running carbonapi; my data servers, running go-carbon, live on different hardware.

Running the benchmark on bookingcom/carbonapi yields around 400 time wait connections by the time the benchmark finishes. And cold-cached, it takes each worker about 2s to fetch the data.

Running the benchmark on go-graphite/carbonapi v0.16.0 yields around 108,000 time wait connections by the time the benchmark finishes. And cold-cached, it takes each worker about 9s to fetch the data.

For fun, I toggled sysctl -w net.ipv4.tcp_tw_recycle=1 on the host running go-graphite/carbonapi, and that dropped the cold-cache request time to 8.5s per worker. And since the kernel could aggressively reuse sockets, we maxed out at around 8,000 TIME_WAIT sockets during the benchmark run.

I'm pretty sure the dramatic performance issues the OP and I noted are due to the excessive socket use present in go-graphite/carbonapi.

I'm using:

# Both using:
concurrencyLimit: 2048
concurrencyLimitPerServer: 0
expireDelaySec: 600
maxIdleConnsPerHost: 100
# go-graphite/carbonapi "resolveGlobs"
# I've tried, 0, 1, and 100, no difference in speed or number of TIME_WAITs

go-graphite/carbonapi

Benchmark took 9.052s

PassiveOpens on a storage node:

--> 01:31:10 <---
 - ActiveOpens changed by: 52
 - PassiveOpens changed by: 2729
--> 01:31:15 <---
 - ActiveOpens changed by: 1
 - PassiveOpens changed by: 8037
--> 01:31:20 <---
 - PassiveOpens changed by: 2565

bookingcom/carbonapi

Benchmark took 3.036s

PassiveOpens on a storage node:

--> 01:32:42 <---
 - ActiveOpens changed by: 1
 - PassiveOpens changed by: 28

reyjrar commented 1 year ago

Derp: my previous runs checked passive opens while running 100 simultaneous clients. Results:

bookingcom/carbonapi

Benchmark took 7.162s.

Passive opens on the storage node.

--> 01:53:06 <---
 - ActiveOpens changed by: 8
 - PassiveOpens changed by: 38
--> 01:53:11 <---
 - ActiveOpens changed by: 54
 - PassiveOpens changed by: 62
--> 01:53:16 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 101
--> 01:53:26 <---
 - ActiveOpens changed by: 3
 - PassiveOpens changed by: 1

go-graphite/carbonapi

Benchmark took 31.213s.

Passive opens on the storage node.

--> 01:49:16 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 7733
--> 01:49:21 <---
 - PassiveOpens changed by: 8787
--> 01:49:26 <---
 - ActiveOpens changed by: 3
 - PassiveOpens changed by: 9550
--> 01:49:31 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 8483
--> 01:49:36 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 8883
--> 01:49:41 <---
 - PassiveOpens changed by: 9315
--> 01:49:46 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 725
--> 01:49:51 <---

Civil commented 1 year ago

I wonder if you've also tried the combineMultipleTargetsInOne flag?

https://github.com/go-graphite/carbonapi/blob/main/cmd/mockbackend/testcases/pr743/carbonapi.yaml#L15
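It's a top-level config option, i.e. something like:

    # at the top level of carbonapi.yaml; placement here follows the linked example config
    combineMultipleTargetsInOne: true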

But I'll have another look at why the HTTP client doesn't respect the keepalive option.

Civil commented 1 year ago

Also a question: how unique are the queries? You mentioned that there are multiple targets; are all of the targets unique, or is some percentage of the targets the same?

Also, what percentage of those are invalid (or end with an error)? Is that for all backends or only for some?

Not that it would help me with keepalive, but it might give some idea of the particular use case.

reyjrar commented 1 year ago

The targets are all unique, but all 100 workers are asking the "same" questions. For the benchmark, I'm enumerating all the server names and requesting the 1 minute LoadAvg, so my metric space looks like "collector.$Hostname.LoadAvg.01" x ~900 hosts.

Wow! Adding combineMultipleTargetsInOne: true brought the runtime down to 3s, and the number of connections is way down.

--> 16:48:29 <---
 - ActiveOpens changed by: 5
 - PassiveOpens changed by: 1
--> 16:48:39 <---
 - ActiveOpens changed by: 2
 - PassiveOpens changed by: 1167
--> 16:48:49 <---
 - ActiveOpens changed by: 1
 - PassiveOpens changed by: 2

This is still quite a bit higher than what I see with bookingcom/carbonapi, but it may be fast enough and efficient enough for my immediate needs. That said, there's definitely something happening with connections that could be done more efficiently.

Civil commented 1 year ago

So the option I suggested is specifically designed to solve that particular problem: pass all targets in a single request (or re-split it later into a reasonably small number of them).

The caveat is that handling of fallbackSeries is a bit different. I currently don't pass enough information to identify whether a failed request had it or not (only whether it was present at least once), and to actually fix that I will need to rework the whole path.

As for the HTTP client, I will have a look later.

reyjrar commented 1 year ago

I don't know if I should open a new issue for this, but with doMultipleRequestsIfSplit: true, Grafana cannot talk to carbonapi as a DataSource. Tested with Grafana v9.2.4 and v9.3.1, both experienced errors.

Civil commented 1 year ago

I'll try to reproduce it. I think it shouldn't matter whether it's Grafana or not.

reyjrar commented 1 year ago

I'm experimenting more with this setup. I have a broadcast cluster set up. I saw a few "blips" overnight where I can't put my finger on why the query failed. The cluster has three nodes, and data is replicated to 2 of the 3 nodes. Most of the time things work fine: I'm sending over 10k requests a minute and I've had 15 queries return a 503 in the past 18h. The weird thing is that, digging into the logs, I don't think those queries should have failed.

Access log from a failed query:

{
  "data": {
    "cache_timeout": 60,
    "carbonapi_uuid": "cda21391-6188-4282-b78f-318553f53a17",
    "format": "json",
    "from": 1672901880,
    "from_cache": false,
    "from_raw": "06:58_20230105",
    "handler": "render",
    "host": "graphite-api.example.com",
    "http_code": 503,
    "peer_ip": "...",
    "reason": "timeout while fetching Response",
    "request_headers": {},
    "runtime": 5.00458162,
    "targets": [
      ....
    ],
    "tz": "UTC",
    "until": 1672902780,
    "until_raw": "07:13_20230105",
    "uri": "/render?until=07%3A13_20230105&target=...",
    "url": "/render?until=07%3A13_20230105&target=...",
    "use_cache": true,
    "used_backend_cache": false
  },
  "level": "ERROR",
  "logger": "access",
  "message": "request failed",
  "timestamp": "2023-01-05T07:13:28.665Z"
}

And when I check the zipper logs:

{
  "errors": "max tries exceeded",
  "errorsVerbose": "max tries exceeded\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/u2/cbs/workers/blhotsky_1671060092142841_centos-7-x86_64_carbonapi/GIT/craigslist-graphite/golang/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/u2/cbs/workers/blhotsky_1671060092142841_centos-7-x86_64_carbonapi/GIT/craigslist-graphite/golang/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571\n\nCaused By: error while fetching Response\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/u2/cbs/workers/blhotsky_1671060092142841_centos-7-x86_64_carbonapi/GIT/craigslist-graphite/golang/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:34\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6222\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.doInit\n\t/usr/lib/golang/src/runtime/proc.go:6199\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571",
  "level": "WARN",
  "logger": "zipper",
  "message": "errors occurred while getting results",
  "name": "http://10.1.2.3:8080_carbonapi_v3_pb",
  "request": "&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{Name:foo.dc.build.host.metric,StartTime:1672901880,StopTime:1672902780,HighPrecisionTimestamps:false,PathExpression:foo.dc.build.host.metric2,FilterFunctions:[]*FilteringFunction{},MaxDataPoints:0,},}",
  "timestamp": "2023-01-05T07:13:28.665Z",
  "type": "fetch"
}

If I'm reading these logs correctly, there was a timeout on the request to a single backend, but it failed the entire query? That shouldn't happen?

When I check the go-carbon storage server's logs, I see this access log entry from 5s earlier:

[2023-01-05T07:13:23.665Z] INFO [access] fetch served:

{
  "carbonapi_uuid": "cda21391-6188-4282-b78f-318553f53a17",
  "carbonzipper_uuid": "cda21391-6188-4282-b78f-318553f53a17",
  "format": "carbonapi_v3_pb",
  "from_cache": false,
  "handler": "render",
  "http_code": 200,
  "memory_used_bytes": 4566,
  "metrics_fetched": 19,
  "peer": "CARBONAPI_IP:28050",
  "query_cache_enabled": true,
  "runtime_seconds": 0.002854232,
  "targets": [
     ...
  ],
  "url": "/render/?format=carbonapi_v3_pb",
  "values_fetched": 285
}

Which indicates the query returned successfully from the backend, or at least enough that go-carbon logged a 200.

BorisMan9 commented 1 year ago

Hey, was this resolved? We suffer from a similar bug. We upgraded carbonapi to 0.16.0; the golang version is 1.19.10.

I tinkered with the configuration a lot; nothing helped. Here is the latest configuration:

cpus: 0
combineMultipleTargetsInOne: true
concurency: 2000

upstreams:
    graphite09compat: false
    buckets: 10
    doMultipleRequestsIfSplit: true
    concurrencyLimitPerServer: 0
    keepAliveInterval: "30s"
    maxIdleConnsPerHost: 100
    timeouts:
        find: "150s"
        render: "30s"
        connect: "500ms"     

    backendsv2:
        backends:
          -
            groupName: "go-carbon"
            doMultipleRequestsIfSplit: true
            protocol: "carbonapi_v3_pb"
            lbMethod: "broadcast"
            maxTries: 3
            maxBatchSize: 100
            keepAliveInterval: "10s"
            concurrencyLimit: 2048
            #concurrencyLimit: 0
            maxIdleConnsPerHost: 1000
            timeouts:
                find: "2s"
                render: "50s"
                connect: "200ms"
            servers:
                - "http://127.0.0.1:8080"

The query we use has high cardinality, and we apply groupByNode and timeShift functions to the results.

The errors seem to come from carbonzipper, almost exclusively max tries exceeded / context deadline exceeded:

{"level":"WARN","timestamp":"2023-06-27T15:14:42.667Z","logger":"zipper","message":"errors occurred while getting results","type":"protoV3Group","name":"http://127.0.0.1:8080","type":"fetch","request":"&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{Name:stats.counters.health.status.health_v3.statusv.1_226_1.bcaeifjkoohhhhlmlnlikddldlkhnkggihjcadla.custom_info_error.count,StartTime:1687792472,StopTime:1687878872,HighPrecisionTimestamps:false,PathExpression:stats.counters.health.status.health_v3.statusv.*.*.custom_info_error.count,FilterFunctions:[]*FilteringFunction{},MaxDataPoints:0,},FetchRequest{Name:stats.counters.health.status.health_v3.statusv.1_226_2.bcaeifjkoohhhhlmlnlikddldlkhnkggihjcadla.custom_info_error.count,StartTime:1687792472,StopTime:1687878872,HighPrecisionTimestamps:false,PathExpression:stats.counters.health.status.health_v3.statusv.*.*.custom_info_error.count,FilterFunctions:[]*FilteringFunction{},MaxDataPoints:0,},},}","errors":"max tries exceeded","errorsVerbose":"max tries exceeded\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: timeout while fetching Response\nHTTP Code: 504\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:20\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: Get \"http://127.0.0.1:8080/render/?format=carbonapi_v3_pb\": context deadline exceeded"}

Any help / ideas on how to solve it will be much appreciated.

Civil commented 1 year ago

There is no good solution for that at this moment, especially if the backend is go-carbon. If you migrate to graphite-clickhouse and rely on backend-side aggregation, that should work much better.
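A rough, untested sketch of what that could look like in the backendsv2 section (the address, port, and protocol below are placeholders, not a verified setup):

    upstreams:
        backendsv2:
            backends:
              -
                groupName: "graphite-clickhouse"
                # protocol is an assumption; check which protocols your graphite-clickhouse build serves
                protocol: "carbonapi_v3_pb"
                lbMethod: "broadcast"
                maxTries: 3
                servers:
                    # placeholder address for a graphite-clickhouse instance
                    - "http://graphite-clickhouse.example:9090"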

Overall, I only occasionally have chunks of time to work on carbonapi, and therefore I prefer working on issues that are easy to reproduce in a home lab; performance problems that depend on a lot of data are unfortunately not easy to reproduce.

BorisMan9 commented 1 year ago

OK, so an open issue it is. For anyone who suffers from the same problem and is interested in how we bypassed it: I followed @reyjrar's advice and installed Booking's carbonapi fork. The queries take some time, but there are no more timeouts.

ganlvtech commented 7 months ago

I found that the go-carbon self metric "carbon.agents.*.carbonserver.inflight_requests_count" is sometimes 18446744073709552000 (roughly 2^64, which suggests the unsigned counter has wrapped below zero): https://github.com/go-graphite/go-carbon/blob/115958e61f5ee608a871d43e82c31aeafad18c1f/carbonserver/carbonserver.go#L2212-L2222. Restarting go-carbon may help.