kruize / autotune

Autonomous Performance Tuning for Kubernetes!
Apache License 2.0

Add Metrics Logging for Kruize Recommendations #1206

Closed: msvinaykumar closed this 1 week ago

msvinaykumar commented 1 month ago

Pull Request Description: Add Metrics Logging for Kruize Recommendations

Summary

This pull request introduces the KruizeNotificationCollectionRegistry class, which is responsible for logging and creating metrics for notifications related to Kruize recommendations. The class processes recommendation notifications at various levels (container, timestamp, term, and model) and creates appropriate counters using Micrometer.

Key Features

Detailed Description

  1. Class KruizeNotificationCollectionRegistry:

    • This class is introduced to streamline the process of logging recommendation notifications and creating metrics.
    • It holds information about the experiment name, interval end time, and container name, which are essential for tagging metrics.
  2. Constructor:

    • Initializes the object with necessary parameters: experiment_name, interval_end_time, and container_name.
  3. Method logNotification:

    • Accepts a ContainerData object as input.
    • Iterates through the nested structure of recommendations within the ContainerData.
    • For each level (container, timestamp, term, model), it collects notifications and calls createCounterTag.
  4. Method createCounterTag:

    • Accepts parameters such as level, term, model, and a collection of RecommendationNotification objects.
    • Checks if the notification type is configured to be logged based on KruizeDeploymentInfo.log_recommendation_metrics_level.
    • Creates additional tags using the provided information and formats them according to KruizeConstants.KRUIZE_RECOMMENDATION_METRICS.
    • Finds or creates a counter for the metric and increments it (a simplified sketch follows this list).
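
Below is a minimal, self-contained sketch of the counter-creation step using Micrometer. It is illustrative only: the `NotificationMetricsSketch` class, the `RecommendationNotification` record, the metric name, the tag keys, and the hard-coded set of logged notification types are stand-ins for the actual Kruize classes and the `KruizeDeploymentInfo` / `KruizeConstants` configuration; only the Micrometer calls (`Tags.of`, `MeterRegistry.counter`, `Counter.increment`) reflect the real library API.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.Collection;
import java.util.List;
import java.util.Set;

public class NotificationMetricsSketch {

    // Hypothetical stand-in for a Kruize recommendation notification.
    record RecommendationNotification(String type, String message) {}

    private final MeterRegistry registry;
    private final String experimentName;
    private final String intervalEndTime;
    private final String containerName;
    // Stand-in for the configured log level (KruizeDeploymentInfo.log_recommendation_metrics_level).
    private final Set<String> loggedTypes = Set.of("error", "critical");

    NotificationMetricsSketch(MeterRegistry registry, String experimentName,
                              String intervalEndTime, String containerName) {
        this.registry = registry;
        this.experimentName = experimentName;
        this.intervalEndTime = intervalEndTime;
        this.containerName = containerName;
    }

    // Finds or creates a counter tagged with the experiment/container context plus the
    // level/term/model, and increments it once per notification of a logged type.
    void createCounterTag(String level, String term, String model,
                          Collection<RecommendationNotification> notifications) {
        for (RecommendationNotification n : notifications) {
            if (!loggedTypes.contains(n.type())) {
                continue; // notification type not configured for metrics logging
            }
            Tags tags = Tags.of(
                    "experiment_name", experimentName,
                    "container_name", containerName,
                    "interval_end_time", intervalEndTime,
                    "level", level,
                    "term", term,
                    "model", model,
                    "type", n.type());
            Counter counter = registry.counter("kruizeNotifications", tags);
            counter.increment();
        }
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        NotificationMetricsSketch sketch = new NotificationMetricsSketch(
                registry, "exp-1", "2024-06-26T22:04:00Z", "container-1");
        sketch.createCounterTag("container", "short_term", "cost",
                List.of(new RecommendationNotification("error", "CPU data missing")));
        System.out.println(registry.get("kruizeNotifications").counter().count()); // prints 1.0
    }
}
```

In the actual PR, the level, term, and model values come from `logNotification` walking the nested `ContainerData` recommendation tree; the sketch only shows how each collected notification turns into a tagged Micrometer counter increment.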

Benefits

Notes

Testing

Related Issues

Please review the changes and provide feedback. Your input is valuable to ensure that this feature integrates seamlessly and functions as expected.

Test image: quay.io/vinakuma/autotune_operator:metrics

msvinaykumar commented 3 weeks ago

> Would be good to include the overhead of the notifications in the updateRecommendations API with a scalability run.

I agree, @chandrams, we might need a short scalability run for this... But please ensure each experiment creates at least one error or critical notification.

msvinaykumar commented 3 weeks ago

@chandrams this sample results JSON creates some error notifications:

https://privatebin.corp.redhat.com/?3b171fbe1bbb3244#8ZnjimR1QbfKUksj9qGZegAGCPMJUkhSFdidVvrmV3gv

chandrams commented 1 week ago

@msvinaykumar - I have updated the Kruize metrics script to capture the notifications and triggered a short scalability run with the new image you provided (quay.io/vinakuma/autotune_operator:metrics2) and the results JSON you shared.

chandrams commented 1 week ago

@msvinaykumar - The scalability 5k / 15 days run took 3 hrs 16 mins, which is less than the 3 hrs 50 mins taken by the scale test run with 0.0.22_mvp on the same cluster. Does your build contain all the latest changes along with this PR?

Summary of the test run
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21767
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.61 / 0.39
Update Results Latency Max / Avg value: 0.13 / 0.11
LoadResultsByExpName Latency Max / Avg value: 0.2 / 0.16
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 33.11 GB
Kruize cpu Max value: 6.92
Execution time - 03:15:30

The logs have these errors

scaletest250-2.log:psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  sorry, too many clients already
scaletest250-2.log:AN ERROR OCCURED: too many values to unpack (expected 2)

Summary of the test run with 0.0.23_mvp on the same cluster

Summary of the test run
exp_count / results_count / reco_count = 5000 / 7200000 / 300000
Postgres DB size in MB = 21760
python3 parse_metrics.py -d /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results -r 7200000
Directory path - /home/jenkins/kruize_scale_test_results_0.0.23_mvp_5k_15days/remote-monitoring-scale-test-202406262204/results
Results count - 7200000
total_kruizeMetrics-20.csv
Update Reco Latency Max / Avg value: 0.63 / 0.39
Update Results Latency Max / Avg value: 0.24 / 0.17
LoadResultsByExpName Latency Max / Avg value: 0.35 / 0.25
Generate Plots Latency Max / Avg value: 0.0 / 0.0
Kruize memory Max value: 32.59 GB
Kruize cpu Max value: 4.52
Execution time - 03:51:01

msvinaykumar commented 1 week ago

@chandrams can you please confirm the kruizeRecommendation_total counts?

msvinaykumar commented 1 week ago

This error occurs when the PostgreSQL server has reached the maximum number of allowed client connections. We can increase the max_connections setting in the PostgreSQL configuration or optimize the application to use fewer connections. However, we can ignore this error because there is no data loss (exp_count / results_count / reco_count = 5000 / 7200000 / 300000), so we can treat it as a warning.
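
For reference, raising the cap is a server-side configuration change along these lines; the value 500 is only a placeholder, and the right number depends on the cluster and any connection pooling in front of Postgres:

```sql
-- Example only: raise PostgreSQL's connection limit.
-- Equivalent to setting "max_connections = 500" in postgresql.conf.
ALTER SYSTEM SET max_connections = 500;
-- A server restart is required for this setting to take effect.
```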

msvinaykumar commented 1 week ago

The build includes this PR change. Please confirm that we have generated enough KruizeRecommendations metrics.

chandrams commented 1 week ago

total_kruizeMetrics-20.csv

@msvinaykumar - You can check the last column in this spreadsheet; I was expecting values to be present for all entries, but they stopped appearing after a while.

msvinaykumar commented 1 week ago

> total_kruizeMetrics-20.csv
>
> @msvinaykumar - You can check the last column in this spreadsheet; I was expecting values to be present for all entries, but they stopped appearing after a while.

This looks good. The count is over 500k, and the idea was to generate that many notifications without impacting execution time. Based on the results, performance is unaffected, so we're good to proceed. We also have a flag to disable it in case any issues arise.