opensearch-project / performance-analyzer-rca

The Performance Analyzer RCA is a framework that builds on the Performance Analyzer engine to support root cause analysis (RCA) of performance and reliability problems for OpenSearch instances.
https://opensearch.org/docs/latest/monitoring-plugins/pa/rca/index/
Apache License 2.0

[BUG] SearchBackPressureRCA causing Gauntlet test failure #503

Closed (khushbr closed this 1 year ago)

khushbr commented 1 year ago

What is the bug? SearchBackPressureRCA is causing a Gauntlet test failure. See: https://github.com/opensearch-project/performance-analyzer-rca/actions/runs/6385777195/job/17331223490

 The Runner found the following errors in log: [
  00:20:20.398 [ELECTED_CLUSTER_MANAGER-task-1-] ERROR SearchBackPressureRCA:getHeapStats()::line:403 - Failed to parse metric in FlowUnit from org.opensearch.performanceanalyzer.rca.framework.api.metrics.Heap_Used
  00:20:20.480 [DATA_0-task-1-] ERROR SearchBackPressureRCA:getHeapStats()::line:403 - Failed to parse metric in FlowUnit from org.opensearch.performanceanalyzer.rca.framework.api.metrics.Heap_Used
  00:20:25.402 [ELECTED_CLUSTER_MANAGER-task-1-] ERROR SearchBackPressureRCA:getHeapStats()::line:403 - Failed to parse metric in FlowUnit from org.opensearch.performanceanalyzer.rca.framework.api.metrics.Heap_Used
  00:20:25.488 [DATA_0-task-0-] ERROR SearchBackPressureRCA:getHeapStats()::line:403 - Failed to parse metric in FlowUnit from org.opensearch.performanceanalyzer.rca.framework.api.metrics.Heap_Used

The search backpressure metrics on PA end are:

1696544030000:{"searchbp_shard_stats_cancellationCount":0,"searchbp_shard_stats_limitReachedCount":0,"searchbp_shard_stats_resource_heap_usage_cancellationCount":0,"searchbp_shard_stats_resource_heap_usage_currentMax":0,"searchbp_shard_stats_resource_heap_usage_rollingAvg":0,"searchbp_shard_stats_resource_cpu_usage_cancellationCount":0,"searchbp_shard_stats_resource_cpu_usage_currentMax":0,"searchbp_shard_stats_resource_cpu_usage_currentAvg":0,"searchbp_shard_stats_resource_elaspedtime_usage_cancellationCount":0,"searchbp_shard_stats_resource_elaspedtime_usage_currentMax":0,"searchbp_shard_stats_resource_elaspedtime_usage_currentAvg":0,"searchbp_task_stats_cancellationCount":0,"searchbp_task_stats_limitReachedCount":0,"searchbp_task_stats_resource_heap_usage_cancellationCount":0,"searchbp_task_stats_resource_heap_usage_currentMax":0,"searchbp_task_stats_resource_heap_usage_rollingAvg":0,"searchbp_task_stats_resource_cpu_usage_cancellationCount":0,"searchbp_task_stats_resource_cpu_usage_currentMax":0,"searchbp_task_stats_resource_cpu_usage_currentAvg":0,"searchbp_task_stats_resource_elaspedtime_usage_cancellationCount":0,"searchbp_task_stats_resource_elaspedtime_usage_currentMax":0,"searchbp_task_stats_resource_elaspedtime_usage_currentAvg":0,"searchbp_mode":"MONITOR_ONLY","searchbp_nodeid":"lGFg--rtQ1CRMVK_x8SWmA"}$
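For anyone inspecting these lines by hand, here is a minimal sketch of how the record above could be split into its timestamp and its flat key/value fields. It assumes only the shape visible in the sample (an epoch-millis prefix, a colon, one flat JSON object with integer or string values, and a trailing `$`). The class name `SearchBpLineParser` and its methods are hypothetical and are not part of Performance Analyzer or the RCA framework, which has its own readers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of performance-analyzer-rca.
final class SearchBpLineParser {
    // Matches "key":123 or "key":"text" in a flat JSON object.
    // Assumption: no nested objects, arrays, or escaped quotes, as in the sample line.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\":(?:\"([^\"]*)\"|(-?\\d+(?:\\.\\d+)?))");

    /** Returns the numeric prefix that precedes the first ':' on the line. */
    static long parseTimestamp(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no ':' separator in line");
        }
        return Long.parseLong(line.substring(0, colon).trim());
    }

    /** Returns field name -> raw value, with string values unquoted. */
    static Map<String, String> parseFields(String line) {
        int start = line.indexOf('{');
        int end = line.lastIndexOf('}');
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no JSON object in line");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line.substring(start, end + 1));
        while (m.find()) {
            // group(2) is set for quoted strings, group(3) for numbers.
            fields.put(m.group(1), m.group(2) != null ? m.group(2) : m.group(3));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened copy of the record above, keeping a few representative fields.
        String line = "1696544030000:{\"searchbp_shard_stats_cancellationCount\":0,"
                + "\"searchbp_task_stats_resource_heap_usage_rollingAvg\":0,"
                + "\"searchbp_mode\":\"MONITOR_ONLY\","
                + "\"searchbp_nodeid\":\"lGFg--rtQ1CRMVK_x8SWmA\"}$";
        System.out.println(parseTimestamp(line));
        System.out.println(parseFields(line).get("searchbp_mode"));
    }
}
```

In the sample every counter is 0 and the mode is MONITOR_ONLY, so the payload itself looks well-formed. The regex approach is only adequate for this flat shape; a real JSON parser would be the safer choice for anything beyond quick inspection.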


khushbr commented 1 year ago

The getHeapStats method is failing to parse the metrics: https://github.com/opensearch-project/performance-analyzer-rca/blob/7391dae89d22cd66b96a443764acca735aa7a133/src/main/java/org/opensearch/performanceanalyzer/rca/store/rca/searchbackpressure/SearchBackPressureRCA.java#L399

khushbr commented 1 year ago

The SBP changes have been reverted and the Gauntlet tests are now passing. Re-landing the SBP changes is blocked on the availability of https://github.com/opensearch-project/OpenSearch/pull/10028/