keikoproj / upgrade-manager

Reliable, extensible rolling-upgrades of Autoscaling groups in Kubernetes
Apache License 2.0
141 stars 45 forks source link

only allow maxParallel number of running CRs #351

Closed ZihanJiang96 closed 1 year ago

ZihanJiang96 commented 1 year ago

problem currently max-parallel only controls the number of concurrent reconcile() func, in a short period of time, all the IGs will enter the reconcile() at least once, and start to spin up new nodes / drain the old nodes. This will cause a memory spike in upgrade-manager.

proposal max-parallel should limit the max number of RollingUpgrade CRs in running state.

changes introduced by this PR

  1. make sure there is at most max-parallel number of IGs in the AdmissionMap.
  2. if length of AdmissionMap is zero, then possibly the pod restarted, fetch all CRs in running state by label, add to admission map, continue the reconcilation against these running IGs.

tests

  1. unit test pass.
  2. tested on a cluster with 50 IG, each IG has 3 node. submit 50 rollingupgrade CR at the same time. the current implementation requires 1.2GB memory by limiting max-parallel to 10, the memory consumption is 310MB
  3. tested on a cluster with 100 IG, each IG has 6 node. submit 100 rollingupgrade CR at the same time. the current implementation requires 5GB memory, the upgrade time is > 60min by limiting max-parallel to 10, the memory consumption is 700MB, the upgrade time is 85min
ZihanJiang96 commented 1 year ago

the CI failed, but i don't have permission to rerun it

shreyas-badiger commented 1 year ago

@ZihanJiang96 nice work!

Just two concerns:

codecov[bot] commented 1 year ago

Codecov Report

Merging #351 (343f99d) into master (fc20675) will increase coverage by 5.95%. The diff coverage is 54.05%.

@@            Coverage Diff             @@
##           master     #351      +/-   ##
==========================================
+ Coverage   39.22%   45.17%   +5.95%     
==========================================
  Files           7        7              
  Lines         724      757      +33     
==========================================
+ Hits          284      342      +58     
+ Misses        415      376      -39     
- Partials       25       39      +14     
Flag Coverage Δ
unittests 45.17% <54.05%> (+5.95%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
controllers/rollingupgrade_controller.go 35.71% <42.85%> (+35.71%) :arrow_up:
controllers/upgrade.go 46.07% <66.66%> (+0.16%) :arrow_up:
controllers/common/utils.go 31.57% <100.00%> (+31.57%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

ZihanJiang96 commented 1 year ago

CI green, let me add some unit test to improve test coverage