go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang
Other
309 stars 142 forks source link

[Performance] High CPU usage due to implicit sorting since 0.14+ #672

Closed Karsonito closed 2 years ago

Karsonito commented 2 years ago

Problem description Implicit sorting of rows for all functions leads to slow requests and high CPU usage For example query duration slows in 1.5 times and cpu usage grows in 4 times graphite-clickhouse-docker benchmark

carbonapi's version 0.15.4 (problem started in 0.14.0)

carbonapi's config

listen: ":8081"
pidFile: ""
tz: ""
cpus: 0
idleConnections: 1000
concurency: 10000
maxBatchSize: 0

expvar:
  enabled: true
  pprofEnabled: true
  listen: ""
cache:
   type: "null"
backendCache:
   type: "null"
graphite:
    host: "carbon-clickhouse-live-test6:2003"
    interval: "60s"
    prefix: "resources.monitoring.carbonapi"
logger:
    - logger: ""
      file: "stderr"
      level: "debug"
      encoding: "json"
upstreams:
    timeouts:
        global: "60s"
        afterStarted: "60s"
    backends:
        - "http://graphite-clickhouse-live-test6:9090"

backend software and config

[common]
listen = ":9090"
max-cpu = 2
max-metrics-per-target = 1000000

[clickhouse]
url = "http://clickhouse-live-test6:8123/?max_query_size=268435456&max_ast_elements=1000000&max_execution_time=60&log_queries=1"
index-table = "graphite.index"
tagged-table = "graphite.tagged"
data-timeout = "60s"
index-timeout = "60s"
rollup-conf = "/etc/graphite-clickhouse/rollup.xml"

[[data-table]]
table = "graphite.data"
reverse = false
rollup-conf = "auto"

[[data-table]]
table = "graphite.data_reverse"
max-age = "48h"
reverse = true

[[logging]]
file = "stdout"
level = "debug"

carbonapi performance metrics

  1. /debug/vars
    {
    "BuildVersion": "v0.15.4",
    "GoVersion": "go1.18.1",
    "backend_cache_hits": 0,
    "backend_cache_misses": 14,
    "cmdline": ["carbonapi","-config","/etc/carbonapi.yml"],
    "config": {"ExtrapolateExperiment":false,"Logger":[{"logger":"","file":"stderr","level":"debug","encoding":"json","encoding-time":"iso8601","encoding-duration":"seconds","sample-tick":"","sample-initial":0,"sample-thereafter":0}],"Listen":":8081","Listeners":[],"Buckets":10,"Concurency":10000,"ResponseCacheConfig":{"Type":"null","Size":0,"MemcachedServers":[],"DefaultTimeoutSec":60},"BackendCacheConfig":{"Type":"null","Size":0,"MemcachedServers":null,"DefaultTimeoutSec":0},"Cpus":0,"TimezoneString":"","UnicodeRangeTables":null,"Graphite":{"Pattern":"{prefix}.{fqdn}","Host":"carbon-clickhouse-live-test6:2003","Interval":60000000000,"Prefix":"resources.monitoring.carbonapi"},"IdleConnections":1000,"PidFile":"","SendGlobsAsIs":null,"AlwaysSendGlobsAsIs":null,"MaxBatchSize":0,"Zipper":"","Upstreams":{"Buckets":10,"SlowLogThreshold":1000000000,"ConcurrencyLimitPerServer":0,"MaxIdleConnsPerHost":100,"Backends":["http://graphite-clickhouse-live-test6:9090"],"BackendsV2":{"Backends":[{"GroupName":"backends","Protocol":"carbonapi_v2_pb","LBMethod":"broadcast","Servers":["http://graphite-clickhouse-live-test6:9090"],"Timeouts":{"Find":2000000000,"Render":10000000000000,"Connect":200000000},"ConcurrencyLimit":0,"KeepAliveInterval":30000000000,"MaxIdleConnsPerHost":100,"MaxTries":0,"MaxBatchSize":0,"BackendOptions":null,"ForceAttemptHTTP2":false,"DoMultipleRequestsIfSplit":true}],"MaxIdleConnsPerHost":100,"ConcurrencyLimitPerServer":0,"Timeouts":{"Find":2000000000,"Render":10000000000000,"Connect":200000000},"KeepAliveInterval":30000000000,"MaxTries":0,"MaxBatchSize":0},"MaxBatchSize":0,"FallbackMaxBatchSize":0,"MaxTries":0,"DoMultipleRequestsIfSplit":true,"CarbonSearch":{"Backend":"","Prefix":"virt.v1.*"},"CarbonSearchV2":{"Backends":null,"MaxIdleConnsPerHost":0,"ConcurrencyLimitPerServer":0,"Timeouts":{"Find":0,"Render":0,"Connect":0},"KeepAliveInterval":0,"MaxTries":0,"MaxBatchSize":0,"Prefix":""},"ExpireDelaySec":0,"TLDCacheDisabled":false,"InternalRoutingCache":600000000000,"Timeouts":{"Find":2000000000,"Render":10000000000000,"Connect":200000000},"KeepAliveInterval":30000000000,"ScaleToCommonStep":true},"ExpireDelaySec":600,"GraphiteWeb09Compatibility":false,"IgnoreClientTimeout":false,"DefaultColors":null,"GraphTemplates":"","FunctionsConfigs":{},"HeadersToPass":null,"HeadersToLog":null,"Define":null,"Prefix":"","Expvar":{"Listen":"","Enabled":true,"PProfEnabled":true},"NotFoundStatusCode":200,"HTTPResponseStackTrace":true,"UseCachingDNSResolver":false,"CachingDNSRefreshTime":60000000000},
    "find_requests": 0,
    "memstats": {"Alloc":266042256,"TotalAlloc":23414830544,"Sys":806806728,"Lookups":0,"Mallocs":131416544,"Frees":131292467,"HeapAlloc":266042256,"HeapSys":762707968,"HeapIdle":491175936,"HeapInuse":271532032,"HeapReleased":473702400,"HeapObjects":124077,"StackInuse":557056,"StackSys":557056,"MSpanInuse":1425552,"MSpanSys":4488000,"MCacheInuse":4800,"MCacheSys":15600,"BuckHashSys":3288636,"GCSys":34311680,"OtherSys":1437788,"NextGC":303718928,"LastGC":1653925564152970073,"PauseTotalNs":17554781,"PauseNs":[26435,27361,39479,30400,28403,39294,33166,43002,55317,53054,48541,42668,64775,135945,193711,96871,37314,34822,37173,544151,68062,36440,47362,64642,38353,54968,51682,56575,29845,39626,56338,33719,314389,30746,30914,52951,55824,33378,33022,30560,36337,32439,54603,49465,52420,47875,52515,86333,53234,49024,43136,46386,53180,56155,57214,41607,44587,43734,47159,91584,75275,26236,27899,50798,35539,63020,27124,36066,29765,41291,74281,28699,29114,43805,59082,32334,30339,36772,80176,40588,82185,50177,53839,46152,42617,88646,27291,40539,28871,37399,34428,28089,28459,35382,29919,29636,29106,32381,34442,40164,51865,44542,41882,126311,62498,41491,133676,42810,63752,74136,19834,85271,73271,23279,25590,69194,28339,47245,26127,30602,39548,27410,38617,27373,31473,31159,30406,29075,46880,48677,40274,41398,42690,54549,118097,53439,52638,66871,41580,112407,72070,97882,101459,124323,42817,28362,37875,29045,27127,28466,32887,29080,32771,32151,32546,32644,32525,37092,137623,42837,45884,51114,143704,45249,45341,49621,40783,61052,15905,16264,19280,169506,427547,48004,29118,29966,30508,27549,31053,27346,29297,46498,32467,31865,30146,30277,75046,40323,128728,53457,46617,113937,97370,51986,88386,34632,57358,27148,56379,27002,29941,31170,28076,31737,30213,27348,33539,31751,35581,48596,45444,45664,60870,61017,44478,49540,84848,39940,28464,75815,33210,202385,75833,58070,29051,27589,29396,26632,26984,29844,28546,27605,33908,30478,32600,30833,31459,34912,45264,41842,47899,136234,42515,48111,44780,41723,45568,76526,148492,29169,60189,79802,27079,31395,32305,84750],"PauseEnd":[1653924694209906944,1653924694263883329,1653924694321232921,1653924694386447415,1653924694499949987,1653924694638938597,1653924694780042139,1653924694987485479,1653924696159511099,1653924696798426460,1653924697446410883,1653924698027562868,1653924698121007191,1653924698215556510,1653924698309636971,1653924698404185360,1653924821090915340,1653924941139902979,1653925057918763359,1653925057922163378,1653925057927322068,1653925057929289154,1653925057935106930,1653925057940153667,1653925057943838950,1653925057950773786,1653925057969212964,1653925057973198302,1653925057983782890,1653925057999436419,1653925058010749358,1653925058040476494,1653925058067056697,1653925058101598442,1653925058138353768,1653925058209215631,1653925058275052782,1653925058355885619,1653925058493962971,1653925058649829623,1653925058848906334,1653925059049728746,1653925059735499228,1653925060377419990,1653925061013628429,1653925061656788922,1653925062191526675,1653925062306323998,1653925062423600955,1653925062570007050,1653925062718731663,1653925191256025316,1653925311308322349,1653925431360942603,1653925551411933893,1653925560189474264,1653925560194200431,1653925560198706069,1653925560200772351,1653925560203070478,1653925560204848183,1653925560209456182,1653925560216459222,1653925560222654432,1653925560235544204,1653925560251573993,1653925560266366403,1653925560287804191,1653925560311238136,1653925560341655739,1653925560384251135,1653925560424016365,1653925560501091121,1653925560588658988,1653925560701675201,1653925560832615033,1653925560969225574,1653925561177510402,1653925562277449727,1653925562946146119,1653925563571135458,1653925564152970073,1653924649117209677,1653924652345614074,1653924652352804268,1653924652360930974,1653924652372067737,1653924652383496286,1653924652400626212,1653924652428780021,1653924652457241125,1653924652489955522,1653924652532915344,1653924652627727327,1653924652677045133,1653924652779521852,1653924652876710467,1653924653040206012,1653924653543082541,1653924654151514683,1653924654770089494,1653924655392954149,1653924655991313150,1653924656241361278,1653924656372594126,1653924656519689558,1653924656640899564,1653924659763500689,1653924659770800357,1653924659772223316,1653924659775495949,1653924659781695164,1653924659783389171,1653924659787938266,1653924659793129461,1653924659803920581,1653924659824206979,1653924659835794217,1653924659853266022,1653924659875939068,1653924659904411158,1653924659938752640,1653924659985532813,1653924660060653093,1653924660145273897,1653924660246889946,1653924660352089155,1653924660492014300,1653924660696681038,1653924661308599735,1653924661906763584,1653924662498425087,1653924663096709667,1653924663569366339,1653924663656327186,1653924663737366875,1653924663827326376,1653924666631522614,1653924666638561760,1653924666646762402,1653924666649040173,1653924666652289319,1653924666654897815,1653924666660970187,1653924666672907406,1653924666684551932,1653924666696686503,1653924666714531224,1653924666738305322,1653924666770025534,1653924666824081591,1653924666867301396,1653924666922877395,1653924666991304178,1653924667106512772,1653924667228217491,1653924667390843247,1653924667579325472,1653924668240230968,1653924668837602690,1653924669520554740,1653924670142320972,1653924670620376579,1653924670723920220,1653924670820563295,1653924670921319357,1653924673894535954,1653924673901498015,1653924673903239127,1653924673904575167,1653924673907781093,1653924673917076908,1653924673920713982,1653924673939065538,1653924673951185200,1653924673964370245,1653924673982983237,1653924674009905532,1653924674077300627,1653924674099636406,1653924674152261001,1653924674231572228,1653924674316858720,1653924674407147227,1653924674514235989,1653924674677893900,1653924674864473766,1653924675497231516,1653924676106969982,1653924676725842911,1653924677347232990,1653924677866050165,1653924677961728714,1653924678098925380,1653924678216609671,1653924680842019512,1653924680850440329,1653924680862335799,1653924680879944496,1653924680903139882,1653924680931962327,1653924680969277999,1653924681027700475,1653924681080335943,1653924681166483797,1653924681276070552,1653924681399336178,1653924681560924426,1653924681744590247,1653924682381480295,1653924682992579739,1653924683595761421,1653924684178862009,1653924684653477239,1653924684744683097,1653924684866046283,1653924684956144111,1653924687369089000,1653924687376331047,1653924687378137148,1653924687380525761,1653924687385710126,1653924687391304161,1653924687397366783,1653924687408329178,1653924687419516141,1653924687430647811,1653924687449550348,1653924687469582631,1653924687497842084,1653924687533315064,1653924687590850100,1653924687641202914,1653924687730177508,1653924687838992033,1653924687948092728,1653924688089011234,1653924688279984121,1653924688918069924,1653924689500655921,1653924690099886292,1653924690679127568,1653924691166203388,1653924691275042458,1653924691387430299,1653924691480858078,1653924694048821292,1653924694058487721,1653924694063640827,1653924694067502176,1653924694081110360,1653924694093592474,1653924694106146272,1653924694124819452,1653924694157212860,1653924694174838406],"NumGC":338,"NumForcedGC":0,"GCCPUFraction":0.0006660196042336757,"EnableGC":true,"DebugGC":false,"BySize":[{"Size":0,"Mallocs":0,"Frees":0},{"Size":8,"Mallocs":803,"Frees":534},{"Size":16,"Mallocs":82047281,"Frees":82000070},{"Size":24,"Mallocs":1571885,"Frees":1571380},{"Size":32,"Mallocs":1829929,"Frees":1829048},{"Size":48,"Mallocs":2134625,"Frees":2121520},{"Size":64,"Mallocs":1829965,"Frees":1829078},{"Size":80,"Mallocs":912,"Frees":854},{"Size":96,"Mallocs":1876040,"Frees":1874466},{"Size":112,"Mallocs":715,"Frees":603},{"Size":128,"Mallocs":1829188,"Frees":1828489},{"Size":144,"Mallocs":309,"Frees":293},{"Size":160,"Mallocs":19645,"Frees":19520},{"Size":176,"Mallocs":730972,"Frees":728530},{"Size":192,"Mallocs":3056809,"Frees":3051949},{"Size":208,"Mallocs":1146089,"Frees":1142461},{"Size":224,"Mallocs":699791,"Frees":696426},{"Size":240,"Mallocs":131199,"Frees":131020},{"Size":256,"Mallocs":2797659,"Frees":2796962},{"Size":288,"Mallocs":1600508,"Frees":1587209},{"Size":320,"Mallocs":1322211,"Frees":1320607},{"Size":352,"Mallocs":165,"Frees":127},{"Size":384,"Mallocs":1874415,"Frees":1873005},{"Size":416,"Mallocs":143,"Frees":45},{"Size":448,"Mallocs":26,"Frees":19},{"Size":480,"Mallocs":41,"Frees":38},{"Size":512,"Mallocs":1234453,"Frees":1233765},{"Size":576,"Mallocs":134615,"Frees":123387},{"Size":640,"Mallocs":42,"Frees":32},{"Size":704,"Mallocs":36,"Frees":30},{"Size":768,"Mallocs":6,"Frees":3},{"Size":896,"Mallocs":68,"Frees":53},{"Size":1024,"Mallocs":16469,"Frees":16141},{"Size":1152,"Mallocs":274,"Frees":260},{"Size":1280,"Mallocs":42,"Frees":29},{"Size":1408,"Mallocs":46,"Frees":41},{"Size":1536,"Mallocs":134546,"Frees":134544},{"Size":1792,"Mallocs":17,"Frees":4},{"Size":2048,"Mallocs":16232,"Frees":15921},{"Size":2304,"Mallocs":115,"Frees":107},{"Size":2688,"Mallocs":28,"Frees":17},{"Size":3072,"Mallocs":38,"Frees":36},{"Size":3200,"Mallocs":2,"Frees":0},{"Size":3456,"Mallocs":0,"Frees":0},{"Size":4096,"Mallocs":16295,"Frees":15974},{"Size":4864,"Mallocs":78,"Frees":72},{"Size":5376,"Mallocs":49,"Frees":46},{"Size":6144,"Mallocs":3,"Frees":1},{"Size":6528,"Mallocs":0,"Frees":0},{"Size":6784,"Mallocs":16183,"Frees":15874},{"Size":6912,"Mallocs":34,"Frees":34},{"Size":8192,"Mallocs":5,"Frees":3},{"Size":9472,"Mallocs":54,"Frees":47},{"Size":9728,"Mallocs":37,"Frees":36},{"Size":10240,"Mallocs":16183,"Frees":15874},{"Size":10880,"Mallocs":12,"Frees":12},{"Size":12288,"Mallocs":134579,"Frees":123366},{"Size":13568,"Mallocs":0,"Frees":0},{"Size":14336,"Mallocs":16183,"Frees":15874},{"Size":16384,"Mallocs":23,"Frees":23},{"Size":18432,"Mallocs":2,"Frees":1}]},
    "render_cache_overhead_ns": 12861,
    "render_requests": 14,
    "requestBuckets": [0,0,0,0,0,0,1,0,0,0,12],
    "request_cache_hits": 0,
    "request_cache_misses": 14,
    "requests": 14,
    "zipper_cache_hits": 0,
    "zipper_cache_misses": 0,
    "zipper_find_errors": 0,
    "zipper_find_requests": 0,
    "zipper_find_timeouts": 0,
    "zipper_info_errors": 0,
    "zipper_info_requests": 0,
    "zipper_info_timeouts": 0,
    "zipper_render_errors": 0,
    "zipper_render_requests": 12,
    "zipper_render_timeouts": 0,
    "zipper_search_requests": 0,
    "zipper_timeouts": 0
    }
  2. Hardware has enough free cpu and memory capacity
  3. CPU & Memory profiles pprof.carbonapi.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz pprof.carbonapi.samples.cpu.016.pb.gz
  4. Debug logs could be helpful as well carbonapi_debug.log

Query that causes problems

  1. What's the query?target=aggregate(seriesByTag('hostname=~host_*'),'count')&from=1653350400&until=1653436800&format=json&maxDataPoints=1488
  2. How many metrics does it fetch? 11211 unique time series
  3. What's the resolution of those metrics? If it's not consistent (e.x. different metrics have different resolution) please specify that and provide average metric resolution 1 min
  4. How many datapoints per metric do you have? 1440 per time series

Data was generated in such way:

go get github.com/influxdata/influxdb-comparisons/cmd/bulk_query_gen github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen github.com/influxdata/influxdb-comparisons/cmd/query_benchmarker_graphite
go install github.com/influxdata/influxdb-comparisons/cmd/bulk_query_gen github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen github.com/influxdata/influxdb-comparisons/cmd/query_benchmarker_graphite

bulk_data_gen --format graphite-line -sampling-interval 60s --seed 1 --timestamp-start "2022-05-24T00:00:00Z" --timestamp-end "2022-05-25T00:00:00Z" -cardinality 1000 | gzip -c -3 > /tmp/dur-1d_cardinality-1000_step-60.graphite-line_24.gz

zcat /tmp/dur-1d_cardinality-1000_step-60.graphite-line_24.gz | ncat 127.0.0.1 2003

Additional context We discussed with @Civil in telegram about improvements

Civil commented 2 years ago

There is a commit in a separate branch (based on current master which is code-wise same as 0.15.5) that makes that feature optional (enabled by default, see carbonapi.example.yaml for how to disable it).

It would be great if you can test if that improves performance.

Anyway that is something that I need to properly improve later on. I'll try to make the test setup you've provided if not per-commit, but at least per release and publish some comparisons.

Civil commented 2 years ago

If it doesn't help, can you try to do git bisect to determine the commit that caused the problem?

Karsonito commented 2 years ago

I found the root of a problem: https://github.com/go-graphite/carbonapi/commit/9199a582f3b17aa9ba4bf3b53ff3b1893cc33e26 It's a little different sort after fetch. After reverting this commit cpu usage dramatically decreased image

Karsonito commented 2 years ago

I don't really know how important it is to sort the data before aggregation. May be @Felixoid can helps explain his change in v0.14: "[Fix] Sorting metrics should work now in the same way as in graphite-web (thx to @Felixoid)"

Felixoid commented 2 years ago

Basically, as it described in https://t.me/ru_go_graphite/5076, without sorting some results were complete trash. If functions require data pre-ordered, it should be in the correct order in advance.

Karsonito commented 2 years ago

If functions require data pre-order, it should be in the correct order in advance.

Is it possible to find all such functions and sort only before applying them?

Felixoid commented 2 years ago

In our case it was as well a question of determined results for * targets. Without sorting each request gave us metrics in a random order, and graphs been a bit derpy. For example, CPU metrics (idle, sys, user etc) each time were in different colors in grafana.

So, what basically the changes have done, is implemented the same behavior as in graphite-web.

Civil commented 2 years ago

I think having a switch that disables that behavior might be still useful.

And later on we can try to get a list of such functions and do pre-sorting on a per-function level

Karsonito commented 2 years ago

Without sorting each request gave us metrics in a random order, and graphs been a bit derpy. For example, CPU metrics (idle, sys, user etc) each time were in different colors in grafana.

I understand the goal to follow graphite-web's behavior But it can be solved with sortByName We shouldn't mix correctness and visualize, IMHO. Especially at the cost of performance

Felixoid commented 2 years ago

But it can be solved with sortByName

It would be required for literally every target in myriads of dashboards.

Maybe, something can be improved in https://github.com/go-graphite/carbonapi/blob/main/expr/helper/sort.go#L50-L57? Or simply replacing ByNameNatural with ByName is a way to quick but dirty win.

Karsonito commented 2 years ago

It would be required for literally every target in myriads of dashboards.

What about adding switchable option? I don't see any problem to add sortByName manually if it needed for dashboard

Felixoid commented 2 years ago

I think having a switch that disables that behavior might be still useful.

Yep, this would work too, for sure. What I am just trying to say is, the behavior can be improved. For example, I've found two libraries with nat-sorting and compared them.

The results are in https://github.com/maruel/natural/pull/2

We can use the natsort from https://github.com/maruel/natural/blob/5d883ec/natsort.go#L18

Civil commented 2 years ago

Could you please check that after #673 was merged, performance became better?

Karsonito commented 2 years ago

Nice improvement! Even with implicit sorting last version became as quick as 0.12.6 image

Felixoid commented 2 years ago

Good, than it's done for sure

Karsonito commented 2 years ago

Hi! Do you plan to build new release with this improvement??