Closed msvinaykumar closed 1 week ago
It would be good to measure the overhead of the notifications in the updateRecommendations API with a scalability run.
I agree. @chandrams, we might need a short scalability run for this. Please ensure that each experiment creates at least one error or critical notification.
@chandrams, this sample results JSON creates some error notifications:
https://privatebin.corp.redhat.com/?3b171fbe1bbb3244#8ZnjimR1QbfKUksj9qGZegAGCPMJUkhSFdidVvrmV3gv
@msvinaykumar - I updated the kruize metrics script to capture the notifications and triggered a short scalability run with the new image that you provided (quay.io/vinakuma/autotune_operator:metrics2) and the results JSON that you shared.
@msvinaykumar - The scalability 5k / 15 days run took 3 hrs 16 mins, which is less than the scale test run on the same cluster with 0.0.22_mvp, which took 3 hrs 50 mins. Does your build contain all the latest changes along with this PR?
Summary of the test run
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21767
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.61 / 0.39
Update Results Latency Max / Avg value: 0.13 / 0.11
LoadResultsByExpName Latency Max / Avg value: 0.2 / 0.16
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 33.11 GB
Kruize cpu Max value: 6.92
Execution time - 03:15:30
The logs have these errors
scaletest250-2.log:psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: sorry, too many clients already
scaletest250-2.log:AN ERROR OCCURED: too many values to unpack (expected 2)
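The second message is the text of a standard Python ValueError raised by tuple unpacking. The logs do not show which statement in the test scripts raised it, so the following is only a minimal sketch of how such an error can arise and one way to guard against it. The sample string and the parse_pair helper are hypothetical and are not taken from the kruize scripts.

```python
# Hypothetical input line that contains the separator more than once.
sample = "name: value: extra"

# Unpacking a three-element split into two names raises ValueError.
try:
    key, value = sample.split(":")
    message = ""
except ValueError as err:
    message = str(err)


def parse_pair(line, sep=":"):
    """Split on the first separator only, so exactly two parts come back."""
    key, value = line.split(sep, maxsplit=1)
    return key.strip(), value.strip()


pair = parse_pair(sample)
print(message)
print(pair)
```

Limiting the split with maxsplit=1 keeps everything after the first separator in the second field instead of failing.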
Summary of the test run with 0.0.23_mvp on the same cluster
Summary of the test run
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21760
python3 parse_metrics.py -d /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results -r 7200000
Directory path - /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.63 / 0.39
Update Results Latency Max / Avg value: 0.24 / 0.17
LoadResultsByExpName Latency Max / Avg value: 0.35 / 0.25
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 32.59 GB
Kruize cpu Max value: 4.52
Execution time - 03:51:01
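To make the comparison between the two summaries easier to read, here is a small sketch that computes the relative change of the PR build against the baseline run. The numbers are copied from the two summaries above; the helper names (parse_hms, percent_change) and the dictionary layout are illustrative only and are not part of parse_metrics.py.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def percent_change(new, baseline):
    """Relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values copied from the two test-run summaries.
pr_build = {"exec_time": "03:15:30", "reco_avg": 0.39, "memory_gb": 33.11, "cpu": 6.92}
baseline = {"exec_time": "03:51:01", "reco_avg": 0.39, "memory_gb": 32.59, "cpu": 4.52}

report = {
    "execution time": percent_change(
        parse_hms(pr_build["exec_time"]).total_seconds(),
        parse_hms(baseline["exec_time"]).total_seconds(),
    ),
    "update reco latency avg": percent_change(pr_build["reco_avg"], baseline["reco_avg"]),
    "memory max": percent_change(pr_build["memory_gb"], baseline["memory_gb"]),
    "cpu max": percent_change(pr_build["cpu"], baseline["cpu"]),
}

for name, value in report.items():
    print(f"{name}: {value:+.1f}%")
```

With these inputs the execution time is roughly 15% shorter and the average Update Reco latency is unchanged, while the CPU peak is roughly 53% higher and the memory peak is about 2% higher.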
@chandrams, can you please confirm the kruizeRecommendation_total counts?
This error occurs when the PostgreSQL server has reached its maximum number of allowed client connections. We can increase the max_connections setting in the PostgreSQL configuration or optimize the application to use fewer connections. However, we can ignore this error because there is no data loss (exp_count / results_count / reco_count = 5000 / 7200000 / 300000), so we can treat it as a warning.
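As an illustration of the "use fewer connections" option, here is a minimal sketch of a bounded pool that caps how many connections a client opens and makes callers wait instead of opening more. The BoundedPool class and the connect factory are hypothetical and are not taken from the Kruize code or the test scripts. With a real driver, the factory would wrap that driver's connect call, and max_size would be chosen to stay under the server's max_connections.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hand out at most max_size connections; callers wait instead of opening more."""

    def __init__(self, connect, max_size):
        # All connections are opened up front and then reused.
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every connection is in use; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


# Demo with a stand-in factory (assumption: no real database is contacted).
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = BoundedPool(fake_connect, max_size=2)
exhausted = False
with pool.connection() as first, pool.connection() as second:
    try:
        with pool.connection(timeout=0.01):
            pass
    except queue.Empty:
        exhausted = True  # a third caller has to wait for a free connection

print(len(opened), exhausted)
```

The trade-off is that callers queue on the client side when the pool is exhausted, rather than being rejected by the server with "too many clients already".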
The build includes this PR's change. Please confirm that we have generated enough KruizeRecommendations metrics.
@msvinaykumar - You can check the last column in this spreadsheet. I was expecting values to be present for all entries, but they stopped after a while.
This looks good. The count is over 500k, so the idea is to generate more notifications without impacting execution time. Based on the results, performance is unaffected, so we're good to proceed. We also have a flag to disable it just in case any issues arise.
Pull Request Description: Add Metrics Logging for Kruize Recommendations
Summary
This pull request introduces the KruizeNotificationCollectionRegistry class, which is responsible for logging and creating metrics for notifications related to Kruize recommendations. The class processes recommendation notifications at various levels (container, timestamp, term, and model) and creates appropriate counters using Micrometer.
Key Features
KruizeNotificationCollectionRegistry: This new class handles the collection and logging of recommendation notifications.
logNotification: Logs notifications from ContainerData by iterating through its recommendation structure and creating counters.
createCounterTag: Creates a counter with tags for the given level, term, model, and list of recommendation notifications.
Detailed Description
Class KruizeNotificationCollectionRegistry:
Constructor: takes experiment_name, interval_end_time, and container_name.
Method logNotification: takes a ContainerData object as input, iterates through the recommendation structure in ContainerData, and calls createCounterTag.
Method createCounterTag: works on a list of RecommendationNotification objects and refers to KruizeDeploymentInfo.log_recommendation_metrics_level and KruizeConstants.KRUIZE_RECOMMENDATION_METRICS.
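The traversal described above can be sketched as follows. This is a language-neutral illustration in Python, not the Java implementation in this PR: the dictionary layout standing in for ContainerData, the tag tuple, and the notification types and codes are all assumptions, and a collections.Counter stands in for the Micrometer counters.

```python
from collections import Counter


def create_counter_tag(registry, level, term, model, notifications):
    """Increment one counter per notification, keyed by its tags."""
    for note in notifications:
        registry[(level, term, model, note["type"], note["code"])] += 1


def log_notification(registry, container):
    """Walk container -> timestamp -> term -> model and count notifications."""
    create_counter_tag(registry, "container", "", "", container.get("notifications", []))
    for ts_data in container.get("timestamps", {}).values():
        create_counter_tag(registry, "timestamp", "", "", ts_data.get("notifications", []))
        for term, term_data in ts_data.get("terms", {}).items():
            create_counter_tag(registry, "term", term, "", term_data.get("notifications", []))
            for model, model_notes in term_data.get("models", {}).items():
                create_counter_tag(registry, "model", term, model, model_notes)


# Hypothetical stand-in for ContainerData; types and codes are placeholders.
container = {
    "notifications": [{"type": "info", "code": "I1"}],
    "timestamps": {
        "2024-06-26T22:00:00Z": {
            "notifications": [{"type": "notice", "code": "N1"}],
            "terms": {
                "short_term": {
                    "notifications": [{"type": "critical", "code": "C1"}],
                    "models": {"cost": [{"type": "error", "code": "E1"}]},
                }
            },
        }
    },
}

registry = Counter()
log_notification(registry, container)
for tags, count in sorted(registry.items()):
    print(tags, count)
```

Keeping the tag set small (level, term, model, type, code) limits the number of distinct counters. Adding high-cardinality values such as experiment names as tags would create many more time series.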
.Benefits
Notes
Testing
Related Issues
Please review the changes and provide feedback. Your input is valuable to ensure that this feature integrates seamlessly and functions as expected.
Test image: quay.io/vinakuma/autotune_operator:metrics