kruize / autotune

Autonomous Performance Tuning for Kubernetes!
Apache License 2.0

Scalability results for kruize release 0.0.20.3_mvp with Box plots preview #1148

Closed: chandrams closed this issue 2 months ago

chandrams commented 3 months ago

Scalability testing with kruize build kruize/autotune_operator:0.0.20.3_mvp:

Short scalability run: 5K exps / 15 days of results / 2 containers per exp; Kruize replicas: 10; OCP: Scalelab cluster

| Kruize release | Exps / Results / Recos | Execution time | UpdateRecommendations latency, max / avg (s) | UpdateResults latency, max / avg (s) | LoadResultsByExpName latency, max / avg (s) | Postgres DB size (MB) | Kruize max CPU | Kruize max memory (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0.20.2_mvp | 5K / 72L | 3h 49mins | 0.61 / 0.4 | 0.25 / 0.18 | 0.34 / 0.25 | 21 (GB) | 5.5 | 37 |
| 0.0.20.3_mvp | 5K / 72L / 3L | 3h 49mins | 0.62 / 0.39 | 0.24 / 0.17 | 0.34 / 0.25 | 21302.32 | 4.8 | 40.6 |
| 0.0.20.3_mvp (with box plots) | 5K / 72L / 3L | 3h 50mins | 0.61 / 0.39 | 0.25 / 0.18 | 0.34 / 0.25 | 21855.04 | 4.7 | 35.1 |

Long scalability run: 100K exps / 15 days of results / 2 containers per exp; Kruize replicas: 10; OCP: AWS cluster

| Kruize release | Exps / Results / Recos | Execution time | UpdateRecommendations latency, max / avg (s) | UpdateResults latency, max / avg (s) | LoadResultsByExpName latency, max / avg (s) | Postgres DB size (GB) | Kruize max CPU | Kruize max memory (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0.20.1_mvp | 100K / 144M | 142h 17mins | 1.38 / 0.67 | 0.18 / 0.15 | 1.04 / 0.65 | 416 | 5.97 | 52 |
| 0.0.20.3_mvp (with plots) | 100K / 144M / 60L | 166h 22mins | 1.57 / 0.88 | 0.15 / 0.14 | 1.23 / 0.93 | 426 | 6 | 65.8 |

chandrams commented 3 months ago

Hitting 504 Gateway Time-out issues in some of the clients in the 100k scalability run

Current status: upload of 14 days of results is complete; the 15th day is in progress.

POSTGRES_POD=$(kubectl get pods -o=name -n openshift-tuning | grep postgres)
kubectl exec -it "$POSTGRES_POD" -n openshift-tuning -- psql -U admin -d kruizeDB -c "SELECT count(*) FROM public.kruize_experiments;"
kubectl exec -it "$POSTGRES_POD" -n openshift-tuning -- psql -U admin -d kruizeDB -c "SELECT count(*) FROM public.kruize_results;"
kubectl exec -it "$POSTGRES_POD" -n openshift-tuning -- psql -U admin -d kruizeDB -c "SELECT count(*) FROM public.kruize_recommendations;"
kubectl exec -it "$POSTGRES_POD" -n openshift-tuning -- psql -U admin -d kruizeDB -c "SELECT pg_size_pretty(pg_database_size('kruizeDB'));"
 count  
--------
 100000
(1 row)

   count   
-----------
 141001968
(1 row)

  count  
---------
 5875136
(1 row)

 pg_size_pretty 
----------------
 418 GB
(1 row)
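
As a rough cross-check of the counts above, the sketch below compares the kruize_results row count with the total expected for the run. It assumes every experiment posts the same number of results per day; the figure of 96 results per experiment per day is an inference from the 144M total in the description (144,000,000 / 100,000 / 15), not something stated in this thread, and the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Values copied from the psql output above and from the run description.
experiments=100000
results_so_far=141001968
days_total=15
# Assumption: 144M results / 100K experiments / 15 days = 96 per experiment per day.
results_per_exp_per_day=96

results_per_day=$(( experiments * results_per_exp_per_day ))
expected_total=$(( results_per_day * days_total ))
# Integer division: number of complete days already uploaded.
full_days=$(( results_so_far / results_per_day ))
remaining=$(( expected_total - results_so_far ))
# Integer percentage, rounded down.
percent_done=$(( results_so_far * 100 / expected_total ))

echo "Expected total results: ${expected_total}"
echo "Full days uploaded:     ${full_days}"
echo "Results remaining:      ${remaining}"
echo "Progress:               ${percent_done}%"
```

Under that assumption the script reports 14 full days uploaded with 2,998,032 results still outstanding, which is consistent with the 15th day being in progress.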

chandrams commented 2 months ago

@msvinaykumar - I have captured the resource usage with box plots preview for 100K exps in the table in the description. Please review.

msvinaykumar commented 2 months ago

cc : @dinogun @rbadagandi @chandrams

With box plots, I observe a 20-hour increase in execution time, a 10GB rise in Kruize memory, and surprisingly, not much impact on DB size. @chandrams Could you please run 'listRecommendations' and check the available plot data, just to double-check?

chandrams commented 2 months ago

@msvinaykumar - The 10GB increase is in the Postgres DB size.

The run was done a long time ago, so I can't check the plot data.
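
For reference, the differences between the two rows of the 100K table in the description can be recomputed directly. The sketch below is only arithmetic on the figures already reported there; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Figures copied from the 100K-experiment table in the issue description.
base_hours=142;  base_mins=17       # 0.0.20.1_mvp execution time
plots_hours=166; plots_mins=22      # 0.0.20.3_mvp (with plots) execution time
base_db_gb=416;  plots_db_gb=426    # Postgres DB size (GB)
base_mem_gb=52;  plots_mem_gb=65.8  # Kruize max memory (GB)

# Execution-time difference, computed in minutes to avoid carry mistakes.
delta_min=$(( (plots_hours * 60 + plots_mins) - (base_hours * 60 + base_mins) ))
delta_time=$(printf '%dh %02dm' $(( delta_min / 60 )) $(( delta_min % 60 )))

# DB sizes are whole numbers; memory has a fractional part, so use awk for it.
delta_db_gb=$(( plots_db_gb - base_db_gb ))
delta_mem_gb=$(awk -v a="$plots_mem_gb" -v b="$base_mem_gb" 'BEGIN { printf "%.1f", a - b }')

echo "Execution time increase: ${delta_time}"
echo "Postgres DB increase:    ${delta_db_gb} GB"
echo "Kruize memory increase:  ${delta_mem_gb} GB"
```

On the table's figures this gives a run that is 24h 05m longer, a database that is 10 GB larger, and a Kruize memory peak that is 13.8 GB higher. The two rows come from different releases (0.0.20.1_mvp and 0.0.20.3_mvp), so not all of the difference is necessarily due to box plots.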

ddoliver commented 2 months ago

What's the latest on this and its impact on execution time? @dinogun @chandrams

chandrams commented 2 months ago

The short scalability run (5K exps / 15 days) takes around 3 hrs 51 mins with Kruize release 0.0.22_mvp, with the following resources set:

Kruize: mem req 4 Gi, mem limit 8 Gi

Postgres: mem req 10 Gi, mem limit 30 Gi

We have created a JIRA (https://issues.redhat.com/browse/KRUIZE-149) to investigate the 24-hour increase in execution time for the 100K run with 0.0.20.3_mvp (Box plots preview). Closing this issue.