opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Implementation of Error Metric for Rest Actions #4401

Open vzez opened 2 years ago

vzez commented 2 years ago

Is your feature request related to a problem? Please describe. Whenever 4xx or 5xx errors occur for any API call, we don't have a way to know how many errors were caused by which API.
This feature provides customers with per-API metrics for 4xx and 5xx errors, which also helps with monitoring system behaviour across different APIs.

Describe the solution you'd like Whenever an API returns a 4xx or 5xx, we'll keep an in-memory 4xx and 5xx error count for that API. We are going to use an existing API (/_nodes/stats) to return that count; this API is already used to retrieve node statistics. We are going to add a new metric to the response named "rest_actions", which will contain the number of times 4xx and 5xx errors occurred for an API. For now we'll capture the error count for the indexing and search APIs separately; error counts for all other APIs are combined. New metric to be introduced: "rest_actions". For the API call GET /_nodes/stats/rest_actions, the response will look as follows:

{
    "_nodes": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "cluster_name": "docker-cluster",
    "nodes": {
        "hD-meqDVTTm7fcqCP2F7xQ": {
            "rest_actions": {
                "errors": {
                    "indexing": {
                        "5xx": 10,
                        "4xx": 2
                    },
                    "searching": {
                        "5xx": 1,
                        "4xx": 0
                    },
                    "others": {
                        "5xx": 20,
                        "4xx": 4
                    }
                }
            }
        },
        "pHa-gp3ES_K4PfETNUBqzA": {
            "rest_actions": {
                "errors": {
                    "indexing": {
                        "5xx": 10,
                        "4xx": 2
                    },
                    "searching": {
                        "5xx": 1,
                        "4xx": 0
                    },
                    "others": {
                        "5xx": 20,
                        "4xx": 4
                    }
                }
            }
        }
    }
}

This approach is considered a good option for the following reasons:

  1. Easy and quick to implement
  2. Can easily be extended if more info is required
  3. Independent of node type
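
A minimal sketch of what the in-memory counters might look like on a node, assuming hypothetical class and category names (the actual implementation would live behind the existing node stats plumbing):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class RestErrorMetrics {

    public enum ErrorCategory { INDEXING, SEARCHING, OTHERS }

    // One pair of counters per category; LongAdder stays cheap under contention.
    private final ConcurrentHashMap<ErrorCategory, LongAdder> count4xx = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<ErrorCategory, LongAdder> count5xx = new ConcurrentHashMap<>();

    /** Record a response; only 4xx/5xx statuses are counted. */
    public void record(ErrorCategory category, int httpStatus) {
        if (httpStatus >= 400 && httpStatus < 500) {
            count4xx.computeIfAbsent(category, c -> new LongAdder()).increment();
        } else if (httpStatus >= 500 && httpStatus < 600) {
            count5xx.computeIfAbsent(category, c -> new LongAdder()).increment();
        }
    }

    public long get4xx(ErrorCategory category) {
        LongAdder adder = count4xx.get(category);
        return adder == null ? 0L : adder.sum();
    }

    public long get5xx(ErrorCategory category) {
        LongAdder adder = count5xx.get(category);
        return adder == null ? 0L : adder.sum();
    }
}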

Describe alternatives you've considered Another approach was to use the PA (Performance Analyzer) plugin. We would keep the error count the same way as in the approach above, but the PA plugin would collect those metrics instead of exposing them via an API. PA collects the error count every 5 seconds and stores it in metricsdb files. This approach was discarded for the following reasons:

  1. PA is a separate component which runs on every node, so extra work would be needed to collect the metrics every 5 seconds
  2. It stores the metrics on disk, hence additional storage is required
  3. It has some limitations, as it doesn't run on T2 nodes

One more alternative approach was to use /_nodes/usage to return the response, but this API is specifically for usage; for example, it returns the number of times a REST action has been called:

"rest_actions": {
    "nodes_usage_action": 1,
    "get_indices_action": 4,
    "nodes_stats_action": 2,
    "main_action": 2,
    "PerformanceAnalyzerClusterConfigAction": 3
}

Therefore, the error count doesn't fall under this category, and we chose the stats API instead.

shwetathareja commented 2 years ago

This is an interesting proposal. In the past, in the managed service, we have seen customers' clusters experiencing high error rates. They want to know which APIs are impacted, and having this exposed as a stat would help with plotting it over time (by fetching it periodically using the stats API, like other stats) and correlating it with other metrics such as incoming request traffic, latency, change in the number of nodes, etc. A few things you can think about further:

  1. There are many APIs, and tracking metrics for each API might result in too many buckets; we could look into a logical grouping of APIs

dblock commented 2 years ago

This looks like a "performance counter" proposal, and I like the spirit of where this is coming from.

As proposed, I would likely not use this to monitor my system. The main pieces missing from this proposal are time, and the ability to correlate errors to the volume of activity. In active systems we typically emit an event every time there's a success or an error, then slice/dice/aggregate those events in a variety of ways in external systems. So this proposal should be some aggregated subset of that. For example, if we're talking REST actions, I prefer something like the alternative proposed, but one that also counts successes:

rest_actions: {
    'all': {
       200: 6739,
       400: 45,
       500: 2
    },
    '_nodes/stats': {
        200: 1234,
        400: 5
    },
    '_cat/nodes': {
        200: 456,
        500: 1
    },
    ...
}

The API would take ?path=rest_actions/_nodes/stats or ?duration=60s to restrict which counters I am looking at.
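
A brief sketch of that variant, counting every response (success or error) keyed by REST path and HTTP status code; class and method names here are illustrative, not actual OpenSearch code:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class RestActionCounters {

    // path -> (HTTP status code -> count); "all" aggregates across paths
    private final ConcurrentHashMap<String, ConcurrentHashMap<Integer, LongAdder>> counters =
        new ConcurrentHashMap<>();

    /** Record every response, successes included. */
    public void record(String path, int status) {
        increment("all", status);
        increment(path, status);
    }

    private void increment(String path, int status) {
        counters.computeIfAbsent(path, p -> new ConcurrentHashMap<>())
                .computeIfAbsent(status, s -> new LongAdder())
                .increment();
    }

    /** Snapshot for one path, e.g. to serve a hypothetical ?path= filter. */
    public Map<Integer, Long> snapshot(String path) {
        Map<Integer, Long> out = new HashMap<>();
        ConcurrentHashMap<Integer, LongAdder> byStatus = counters.get(path);
        if (byStatus != null) {
            byStatus.forEach((status, adder) -> out.put(status, adder.sum()));
        }
        return out;
    }
}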

mgodwan commented 2 years ago

Thanks for the proposal @vzez. This should prove helpful in monitoring stats around node performance for REST actions.

@dblock I like the idea of counting successes as well, and treating this as an overall performance counter as suggested. It will provide a more holistic view of how the system is actually behaving.

@vzez

reta commented 2 years ago

@dblock @mgodwan @vzez it sounds like we should look at the larger picture that was discussed here [1]. I largely agree with @dblock here: the idea of performance counters over time could (and probably should) be exposed as a metric, sliced at least by API.

Who is the audience for these metrics? Certainly not operations, because it is not possible to extract sufficient context about what is failing and for how long. What these counters tell me is that nodes are (or were?) at least handling some requests, and are (were?) available. Should it maybe be part of extended health checks [2]?

[1] https://github.com/opensearch-project/OpenSearch/issues/1061 [2] https://github.com/opensearch-project/OpenSearch/issues/4244

vzez commented 2 years ago

@dblock @mgodwan @shwetathareja I think we can offer both options: stats for each API and stats for logical sets of APIs. This helps reduce the response size as well. APIs can be grouped as shown in the referenced documentation: https://opensearch.org/docs/latest/opensearch/rest-api/document-apis/delete-document/#:~:text=Popular%20APIs-,REST%20API%20reference,-Index%20APIs

Users can pass a query param to control whether stats are logically grouped or not (?rest_actions_grouped=true).

GET /_nodes/stats/rest_actions

rest_actions: {
    'all': {
       200: 300,
       400: 50,
       500: 3
    },
    'create_index_action': {
        200: 100,
        400: 25,
        500: 1
    },
    'get_indices_action': {
        200: 100,
        400: 10,
        500: 1
    },
    'document_get_action': {
        200: 50,
        400: 10,
        500: 1
    },
    'document_delete_action': {
        200: 50,
        400: 5,
        500: 0
    },
    ...
}

GET /_nodes/stats/rest_actions?rest_actions_grouped=true

rest_actions: {
    'all': {
       200: 300,
       400: 50,
       500: 3
    },
    'index_action': {
       200: 200,
       400: 35,
       500: 2
    },
    'document_action': {
       200: 100,
       400: 15,
       500: 1
    },
    ...
}

@dblock Providing a duration option wouldn't be feasible, as we could support only a limited number of duration options, and that would put unnecessary load on memory. Users can still compute duration-based stats themselves as per their requirements.

dblock commented 2 years ago

So what is the duration of rest_actions numbers? Total since node start? It will become extremely difficult to make those useful across a distributed cluster. For example after running for a year the number of errors will be tiny compared to the total number of requests.

itiyama commented 2 years ago

> So what is the duration of rest_actions numbers? Total since node start? It will become extremely difficult to make those useful across a distributed cluster. For example after running for a year the number of errors will be tiny compared to the total number of requests.

@dblock I have seen users of OpenSearch use these numbers by periodically calling the stats API at a fixed interval and then diffing across successive intervals. The diff is then plotted on a dashboard over time. This helps compute averages and get a sense of how the system is performing; e.g., indexingRate is calculated from the difference in the indexed doc count stat over time.
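
As a sketch of that client-side pattern (a hypothetical helper, not an OpenSearch API), the rate is just the diff of two counter snapshots divided by the polling interval:

import java.util.HashMap;
import java.util.Map;

public class StatsRate {

    /**
     * Given two cumulative counter snapshots taken intervalSeconds apart,
     * return the per-second rate for each counter (e.g. indexed doc count
     * snapshots yield the indexing rate).
     */
    public static Map<String, Double> ratePerSecond(
            Map<String, Long> previous, Map<String, Long> current, long intervalSeconds) {
        Map<String, Double> rates = new HashMap<>();
        current.forEach((key, now) -> {
            long before = previous.getOrDefault(key, 0L);
            rates.put(key, (now - before) / (double) intervalSeconds);
        });
        return rates;
    }
}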

I can see value in doing this, given that polling is the only way OpenSearch users get stats about their cluster today. However, in the long term, I would think about automating the above step for our users, or take it one step further by pushing stat events to a pluggable sink. This was discussed in an earlier thread as well. More details here.

shwetathareja commented 2 years ago

Thanks @itiyama for clarifying.

@dblock: yes, for the long term we should build a metrics framework which publishes timestamped events to a pluggable sink, so that users can write their own consumer plugins and build dashboards on top of it. To answer @reta's question as well:

What is an audience for these metrics?

Today OpenSearch publishes a lot of cumulative stats around indexing, search, API usage, etc. As @itiyama mentioned, plotting the diff over time is useful for understanding how the system is being used and how it is performing. Let's take a couple of examples where these stats would be useful:

  1. A user sees failures and wants to understand which APIs are impacted (is it only search, only indexing, or are all APIs failing?) and can also configure alarms if needed.
  2. A user is upgrading their cluster to a different version (using a rolling restart) and starts seeing errors, but only a few requests are failing, not all. The user sends requests through a load balancer that distributes traffic across all nodes. These node-level, per-API stats give an idea of which nodes are failing requests, allowing faster debugging compared to grepping the logs.

+1 on the suggestion to provide both success and failure metrics.

@vzez: You should also evaluate the /_nodes/usage API for publishing these stats.

Gaganjuneja commented 2 years ago

What is the benefit you see in keeping the whole history at the node level? Keeping an absolute total count could be an issue because it should only ever stay the same or increase, but that breaks when a node goes down. Let's say at a certain point in time there were 100K failures; after one node goes down, the total drops to 90K failures, which can't be corrected. So reading the graph would always require context, or perhaps node-level analysis, which you may not always need. Could we keep some sort of sliding window, maybe just 5/10 minutes, so that the user can query and get the delta for just the last X minutes/seconds?
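
A rough sketch of such a sliding window, assuming hypothetical names and 5-second buckets over a 10-minute ring (a real implementation would need more care around concurrency):

import java.util.concurrent.atomic.AtomicLongArray;

public class SlidingWindowCounter {

    private static final int BUCKET_SECONDS = 5;
    private static final int NUM_BUCKETS = 120; // 120 * 5s = 10 minutes
    private final AtomicLongArray buckets = new AtomicLongArray(NUM_BUCKETS);
    private final AtomicLongArray bucketEpochs = new AtomicLongArray(NUM_BUCKETS);

    public void increment() {
        long epoch = System.currentTimeMillis() / 1000 / BUCKET_SECONDS;
        int idx = (int) (epoch % NUM_BUCKETS);
        // Reset the bucket if it belongs to an older rotation of the ring
        // (not fully race-free; shown only to illustrate the idea).
        if (bucketEpochs.getAndSet(idx, epoch) != epoch) {
            buckets.set(idx, 0);
        }
        buckets.incrementAndGet(idx);
    }

    /** Sum of counts over the last windowSeconds, capped at the ring size. */
    public long sum(int windowSeconds) {
        long nowEpoch = System.currentTimeMillis() / 1000 / BUCKET_SECONDS;
        int wanted = Math.min(windowSeconds / BUCKET_SECONDS, NUM_BUCKETS);
        long total = 0;
        for (int back = 0; back < wanted; back++) {
            long epoch = nowEpoch - back;
            int idx = (int) (epoch % NUM_BUCKETS);
            if (bucketEpochs.get(idx) == epoch) {
                total += buckets.get(idx);
            }
        }
        return total;
    }
}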

I still feel that PA + RCA can do a better job here for dashboard building; otherwise it looks like heavy lifting at the client end.

vzez commented 2 years ago

@Gaganjuneja We are following the same approach as the _nodes/usage and _nodes/stats APIs, which keep data node-wise. Even if a node goes down, we still have relevant data; e.g., the _nodes/usage API returns the number of times a REST action is called per node, so even if a node is not responding, the response data would still be correlated. The goal of this approach is to provide a building block to users, and they can make use of it as per their needs. The trade-offs and limitations of PA+RCA are mentioned in the proposal itself.

@shwetathareja The _nodes/usage API already has a "rest_actions" object in its response; changing its structure would cause backward incompatibility. Also, the _nodes/usage API doesn't have options to return metric data for a specific REST action or a group of them; it returns usage for all REST actions.

Gaganjuneja commented 1 year ago

@vzez Thanks for the clarification. We can build this incrementally. It is not clear how the PA limitations mentioned would actually have an impact; for example, on storage requirements, PA just keeps 5-second aggregated samples over 7 minutes, so how much storage would really be needed? Also consider the strengths of PA: these tools are very powerful at aggregations, windowing, etc. and may help customers build analytics on top of the data. We can also think about how your proposed/preferred design could be integrated with PA, maybe in the next phase, to ease consumption of this data at the customer end.

shwetathareja commented 1 year ago

@CaptainDredge explore the _nodes/usage API to expose these API-level success and failure counts, as the usage API already provides per-API call counts. It would make sense to keep all these counts in one place. Instead of changing the existing rest_actions map, you can think about introducing a new map with all the counts and then deprecating the existing rest_actions in the next major version.
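
A hedged sketch of what that could look like in the /_nodes/usage response, with a hypothetical rest_actions_status_counts field alongside the existing (eventually deprecated) rest_actions map:

"rest_actions": {
    "nodes_stats_action": 2,
    "get_indices_action": 4
},
"rest_actions_status_counts": {
    "nodes_stats_action": {
        "200": 1234,
        "400": 5,
        "500": 0
    },
    "get_indices_action": {
        "200": 456,
        "400": 2,
        "500": 0
    }
}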