helm / chartmuseum

helm chart repository server
https://chartmuseum.com
Apache License 2.0

loadtesting: request rate for uploading charts drops a lot with latest version #396

Closed marchdaisy closed 2 years ago

marchdaisy commented 3 years ago

Chartmuseum version:

latest version with last commit: "ci: migrate release from docker hub into github container registry #395"

Problem

When load testing with Locust, the request rate for uploading charts drops significantly.

Description

Previously

With the chartmuseum release v0.12.0 backed by a PVC, we ran:

locust -f locustfile.py --no-web -L DEBUG -c 500 -r 50 -n 300000 --csv=result

We define posting new charts in locustfile.py. The request rate for uploading charts could reach 300 req/s.

Now

But with the latest version, the request rate drops to 15 req/s, and after the number of uploaded charts reaches 525, the POST requests begin to fail.

 Name                                                          # reqs      # fails     Avg     Min     Max  |  Median   req/s
--------------------------------------------------------------------------------------------------------------------------------------------
 POST /api/mychart/charts                                         525     5(0.94%)   16888    6432   29975  |   16000   14.00
 GET /index.yaml                                                   56     0(0.00%)       3       3      16  |       3    1.70
--------------------------------------------------------------------------------------------------------------------------------------------
 Total                                                            581     5(0.86%)                                      15.70

From the chartmuseum pod, we found that CPU usage is very high.

top - 05:22:03 up 31 days, 23:27,  0 users,  load average: 2.01, 2.33, 1.40
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 22.0 us,  2.0 sy,  0.2 ni, 75.4 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
MiB Mem : 32159.41+total, 7854.578 free, 3791.039 used, 20513.79+buff/cache
MiB Swap:    0.000 total,    0.000 free,    0.000 used. 27340.60+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    1 61357     20   0  824296 167388  31080 S 152.8 0.508  13:02.71 chartmuseum
   33 61357     20   0   14200   3428   2980 S 0.000 0.010   0:00.00 sh
   38 61357     20   0   37124   3496   2968 R 0.000 0.011   0:00.00 top

And below is some information shown by go tool pprof.

1. goroutine

(pprof) top
Showing nodes accounting for 1006, 99.80% of 1008 total
Dropped 47 nodes (cum <= 5)
Showing top 10 nodes out of 25
      flat  flat%   sum%        cum   cum%
      1006 99.80% 99.80%       1006 99.80%  runtime.gopark
         0     0% 99.80%        500 49.60%  github.com/gin-contrib/size.RequestSizeLimiter.func1
         0     0% 99.80%        500 49.60%  github.com/gin-gonic/gin.(*Context).Next
         0     0% 99.80%        500 49.60%  github.com/gin-gonic/gin.(*Engine).ServeHTTP
         0     0% 99.80%        500 49.60%  github.com/gin-gonic/gin.(*Engine).handleHTTPRequest
         0     0% 99.80%        500 49.60%  github.com/gin-gonic/gin.RecoveryWithWriter.func1
         0     0% 99.80%        500 49.60%  github.com/gin-gonic/gin.serveError
         0     0% 99.80%        500 49.60%  github.com/zsais/go-gin-prometheus.(*Prometheus).HandlerFunc.func1
         0     0% 99.80%        500 49.60%  helm.sh/chartmuseum/pkg/chartmuseum/router.(*Router).masterHandler
         0     0% 99.80%        500 49.60%  helm.sh/chartmuseum/pkg/chartmuseum/router.requestWrapper.func1
(pprof) list gopark
Total: 1008
ROUTINE ======================== runtime.gopark in /usr/local/go/src/runtime/proc.go
      1006       1006 (flat, cum) 99.80% of Total
         .          .    299:   gp.waitreason = reason
         .          .    300:   mp.waittraceev = traceEv
         .          .    301:   mp.waittraceskip = traceskip
         .          .    302:   releasem(mp)
         .          .    303:   // can't do anything that might move the G between Ms here.
      1006       1006    304:   mcall(park_m)
         .          .    305:}
         .          .    306:
         .          .    307:// Puts the current goroutine into a waiting state and unlocks the lock.
         .          .    308:// The goroutine can be made runnable again by calling goready(gp).
         .          .    309:func goparkunlock(lock *mutex, reason waitReason, traceEv byte, traceskip int) {
(pprof)
2. profile

(pprof) top
Showing nodes accounting for 14230ms, 37.70% of 37750ms total
Dropped 448 nodes (cum <= 188.75ms)
Showing top 10 nodes out of 194
      flat  flat%   sum%        cum   cum%
    3040ms  8.05%  8.05%     4920ms 13.03%  runtime.scanobject
    2300ms  6.09% 14.15%     2300ms  6.09%  runtime.memclrNoHeapPointers
    1470ms  3.89% 18.04%     3160ms  8.37%  regexp.(*Regexp).tryBacktrack
    1450ms  3.84% 21.88%     1450ms  3.84%  runtime.memmove
    1240ms  3.28% 25.17%     2310ms  6.12%  gopkg.in/yaml%2ev2.yaml_emitter_analyze_scalar
    1110ms  2.94% 28.11%     1340ms  3.55%  runtime.heapBitsSetType
    1010ms  2.68% 30.78%     2640ms  6.99%  gopkg.in/yaml%2ev2.yaml_parser_scan_flow_scalar
    1000ms  2.65% 33.43%     6410ms 16.98%  runtime.mallocgc
     840ms  2.23% 35.66%      840ms  2.23%  gopkg.in/yaml%2ev2.is_blankz
     770ms  2.04% 37.70%      890ms  2.36%  runtime.findObject
(pprof)

3. heap

(pprof) top
Showing nodes accounting for 12423.42kB, 82.91% of 14983.58kB total
Showing top 10 nodes out of 99
      flat  flat%   sum%        cum   cum%
 3584.98kB 23.93% 23.93%     4097kB 27.34%  helm.sh/helm/v3/pkg/chart/loader.LoadFiles
 3081.01kB 20.56% 44.49%  3081.01kB 20.56%  bufio.NewWriterSize
 1536.21kB 10.25% 54.74%  5633.21kB 37.60%  helm.sh/chartmuseum/pkg/repo.ChartVersionFromStorageObject
 1121.44kB  7.48% 62.23%  1121.44kB  7.48%  gopkg.in/yaml%2ev2.yaml_string_write_handler
  528.17kB  3.52% 65.75%   528.17kB  3.52%  regexp.(*bitState).reset
  520.04kB  3.47% 69.22%   520.04kB  3.47%  bytes.makeSlice
     514kB  3.43% 72.65%      514kB  3.43%  bufio.NewReaderSize
  513.31kB  3.43% 76.08%   513.31kB  3.43%  regexp/syntax.(*compiler).inst
  512.14kB  3.42% 79.50%   512.14kB  3.42%  github.com/gin-gonic/gin.(*Context).Set
  512.12kB  3.42% 82.91%   512.12kB  3.42%  github.com/gin-gonic/gin.(*Engine).allocateContext

4. allocs

(pprof) top
Showing nodes accounting for 168.86GB, 91.89% of 183.76GB total
Dropped 324 nodes (cum <= 0.92GB)
Showing top 10 nodes out of 93
      flat  flat%   sum%        cum   cum%
   92.12GB 50.13% 50.13%   100.12GB 54.48%  gopkg.in/yaml%2ev2.yaml_emitter_emit
   19.59GB 10.66% 60.79%    19.98GB 10.87%  regexp.(*Regexp).backtrack
   19.58GB 10.66% 71.45%    39.36GB 21.42%  regexp.(*Regexp).FindStringSubmatch
    9.66GB  5.25% 76.70%    49.02GB 26.67%  github.com/Masterminds/semver/v3.NewVersion
    9.38GB  5.11% 81.81%     9.38GB  5.11%  gopkg.in/yaml%2ev2.(*parser).node
       8GB  4.35% 86.16%        8GB  4.35%  gopkg.in/yaml%2ev2.yaml_string_write_handler
    4.28GB  2.33% 88.49%     4.28GB  2.33%  bytes.makeSlice
    3.05GB  1.66% 90.15%     3.05GB  1.66%  gopkg.in/yaml%2ev2.read
    1.67GB  0.91% 91.06%     1.67GB  0.91%  reflect.mapassign
    1.52GB  0.83% 91.89%     1.52GB  0.83%  reflect.New

jdolitsky commented 3 years ago

@marchdaisy - thank you for opening. Will hold off on release until we can resolve this

jdolitsky commented 3 years ago

@marchdaisy can you share what you are using for storage backend?

scbizu commented 3 years ago

@marchdaisy hihi, can you share your chartmuseum startup command?

marchdaisy commented 3 years ago

@marchdaisy can you share what you are using for storage backend?

@jdolitsky Thank you for your response. We have tested with both a PVC (VolumeMode: Filesystem) and a MinIO backend, and both show the issue I mentioned above.

marchdaisy commented 3 years ago

@marchdaisy hihi, can you share your chartmuseum startup command?

@scbizu Thank you for your response. We start chartmuseum in a container built from a Dockerfile, and the following is the related env:

env:

scbizu commented 3 years ago

It seems weird that your pprof shows requests blocked in fs I/O (chart loading). I tried to run the 0.13.0-pre loadtesting on my local machine (MBP 2018), and the RPS could reach ~500 req/s.

BTW, thanks for your reply. We will check the new release again, and keep this issue as the blocking issue for the new release until we have some more optimization.

marchdaisy commented 3 years ago

Do you have any update on this problem? I am using the latest version, v0.13.1, and the issue still exists in my test scenario. Do you need any more information to help debug this issue?

jdolitsky commented 3 years ago

Hi - are you using the loadtesting suite here:

https://github.com/helm/chartmuseum/tree/main/loadtesting

Or are you creating your own locustfile.py? Can you share this code?

marchdaisy commented 3 years ago

@jdolitsky Thanks for your reply!

The locustfile.py I used is similar to the one in the link above. The difference is as follows:

class UserBehavior(TaskSet):
    tasks = {index: 1, post_new_chart: 10}

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 0
    max_wait = 0
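
For reference, a self-contained version of this modified locustfile might look roughly like the sketch below. It assumes the legacy Locust (pre-1.0) HttpLocust/TaskSet API implied by the --no-web/-c/-r flags, and the in-memory chart-packaging helper (_make_chart_package) is hypothetical, standing in for however the real locustfile builds its chart payloads:

import io
import random
import string
import tarfile
import time

from locust import HttpLocust, TaskSet


def _make_chart_package(name, version):
    # Hypothetical helper: build a minimal chart .tgz in memory (Chart.yaml only),
    # so that every upload is a brand-new chart.
    chart_yaml = "apiVersion: v1\nname: {0}\nversion: {1}\n".format(name, version)
    data = chart_yaml.encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="{0}/Chart.yaml".format(name))
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def index(l):
    # Fetch the repository index (weight 1 in the task set).
    l.client.get("/index.yaml")


def post_new_chart(l):
    # Upload a freshly generated chart (weight 10 in the task set).
    name = "loadtest-" + "".join(random.choice(string.ascii_lowercase) for _ in range(8))
    version = "0.1.{0}".format(int(time.time() * 1000) % 1000000)
    l.client.post(
        "/api/mychart/charts",
        data=_make_chart_package(name, version),
        headers={"Content-Type": "application/octet-stream"},
        name="/api/mychart/charts",
    )


class UserBehavior(TaskSet):
    tasks = {index: 1, post_new_chart: 10}


class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 0
    max_wait = 0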

For the previous chartmuseum version, v0.12.0, the request rate for uploading charts could reach 300 req/s. But now there is a CPU problem when uploading charts.

marchdaisy commented 3 years ago

Hi, @scbizu Could you please update the status of this issue? I saw that there is a related PR under review. Do you plan to merge the solution in the next release, v0.13.2?

scbizu commented 3 years ago

Yeah, once we prove that this PR actually improves the load testing. We haven't tested it widely yet, though.

marchdaisy commented 2 years ago

Hi, @scbizu Have you fixed this issue in v0.14.0? I found that PR https://github.com/helm/chartmuseum/pull/457 is still in the v0.15.0 milestone, so I wonder whether you still have work planned for the performance issue. By the way, I tested the latest chartmuseum version with the scenario of concurrently uploading charts and getting the index, and the problem above no longer appeared. I think some related commits fixed the issue, right?

scbizu commented 2 years ago

@marchdaisy by "the latest version", do you mean canary or v0.14.0?

marchdaisy commented 2 years ago

Hi, @scbizu The latest version I mentioned is v0.14.0. I made a mistake while upgrading chartmuseum to v0.14.0 yesterday. So actually, the result is fine with v0.12.0, but with v0.14.0 the issue still exists. Will you fix it in v0.15.0?

scbizu commented 2 years ago

@marchdaisy Yep, the latency was introduced by the cache mechanism added in 0.13.0. Perhaps you can set cache_interval to 0 and see if the issue still exists. However, our next major milestone will fix this, stay tuned :P

marchdaisy commented 2 years ago

Hi, @scbizu I found that the default cache_interval is already 0s ("set the interval of delta updating the cache (default: 0s)"). Does that mean that on each request the storage backend is scanned for changes and compared to the cache? From my point of view, performance would be worse with a cache_interval of 0s than with 5m or an even longer interval. So I am confused by your suggestion; could you please explain it a bit more? Thank you!

cbuto commented 2 years ago

@marchdaisy 👋 the changes in #556 should help here. If you want to test with the canary or latest image tag (otherwise it will be included in v0.15.0), I’d be interested in the results!

marchdaisy commented 2 years ago

Hi, @cbuto The fix works for my problem! I used the latest code and request rate for uploading charts remained the same as in v0.12.0. Thank you for your fix!