Add logging to 4xx responses from API at reduced rate.

mgodwan commented 1 year ago

Is your feature request related to a problem? Please describe. Today, a few errors returned by opensearch are not logged if they are customer side errors. While we may provide the message as part of the response, it is not always possible to determine the cause of the errors. We can explore to determine if it may benefit to have the logging for such errors at a reduced rate (e.g. how it is done today for connection errors in nodes). This will allow to better debug issues without the need to determine which debug/trace logs should be enabled.

Describe the solution you'd like Sample 4xx errors and log at reduced rate

Describe alternatives you've considered Enable debug/trace logs at multiple places. This requires users/operators to still identify the classes for which logs should be enabled and may not be straight forward.

rayshrey commented 1 year ago

Rekindling the discussion by adding to the above points. @mgodwan @sarthakaggarwal97 please take a look.

Is your feature request related to a problem? Please describe.

At the moment, there is no way to rate limit the logging of exceptions at a particular log level. Currently if we enable a lesser specific logging level for a class to gain more insights into the exceptions occurring, we might run into bloating of log files with excessive trivial logs. This will further enable us to add more logging, with the ability to rate limit, and debug root cause of the issues quickly.

Describe the solution you'd like

We will develop a central framework, which will enable a custom log4j filter. In this filter, we will introduce our own filtering logic, and on the basis of our algorithm, we will return the three states (ACCEPT, DENY, NEUTRAL) under different circumstances.

This will also require us to enable dynamic cluster settings, where we can tune the frequency of the threshold at which we want to limit the exceptions, the level at which we want to enable the rate limiting in the class, and the packages/classes where we are enabling the exceptions using their package name like we do for configuring dynamic logging.

{
    "transient": {
        "log_limiting.filter.org.opensearch.index.engine.InternalEngine":"true",
        "log_limiting._settings.threshold":5,
        "log_limiting._settings.level": "DEBUG"
    }
}

This is how we can dynamically toggle the settings to achieve the logging at the reduced rate. In the cases where we have enabled different settings for the parent-child classes/packages, we can give the priority to the more specific package setting.

This approach is considered is good option for below reasons

It would enable us to log the root cause of the issues with more flexibility and without bloating the log files.
It is independent and extensible to all flows in the OpenSearch.

Describe alternatives you've considered

We considered a approach that we can enable rate limiting for few logs like in NodeConnectionsService, by placing a counter for the exception, and increasing the log level for the exception after every nth time. This approach was discarded because of the following reasons.

It will not give us flexibility of the toggling the parameters of rate limiting without a code change.
We will have to make code change to enable/disable rate limiting every time.
We will have to maintain multiple counters for each exception in the class, which is not the best design practice.

mgodwan commented 1 year ago

Thanks @rayshrey for adding details.

Whay criteria do we use to determine the common log within the window?
With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

In the cases where we have enabled different settings for the parent-child classes/packages, we can give the priority to the more specific package setting.

Do you mean we prioritize package over classes here? Should it not be the lowest common ancestor among a class and available settings which gets the higher priority?

"log_limiting._settings.threshold":5, "log_limiting._settings.level": "DEBUG"

Are you proposing this as a common setting for all loggers? Do you think we can have this defined on class/package level as well if needed?

where we can tune the frequency of the threshold at which we want to limit the exceptions

How do we define the window for this threshold?

mgodwan commented 1 year ago

@shwetathareja Thoughts on this?

sarthakaggarwal97 commented 1 year ago

Do you mean we prioritize package over classes here? Should it not be the lowest common ancestor among a class and available settings which gets the higher priority?

@mgodwan I think the log level settings prioritizes the parent over child in applying settings (@rayshrey can verify this once), but I agree with you, we should give higher priority to the lowest common ancestor.

With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

@rayshrey three types of benchmarks will be helpful. One without the filter, one with the filter but not applied via cluster settings (since all logs will pass through the present filter even if it has no affect), and one where we have the filter applied. You can also look into microbenchmarking for this.

rayshrey commented 1 year ago

Thanks for the comments @mgodwan and @sarthakaggarwal97 Please find below the replies to your queries.

What criteria do we use to determine the common log within the window?

For now we have set a expiry time of 1hr for the cache. So the counter for the exception type resets after 1 hr from the last accessed time.

With the proposed approach, what kind of compute/JVM overhead will we incur with the log4j filter in the critical path for indexing/search requests?

Will benchmark the results as described by @sarthakaggarwal97 and let you know about the findings once the implementation is complete

Should it not be the lowest common ancestor among a class and available settings which gets the higher priority

@sarthakaggarwal97 I verified that the logger also follows lowest common ancestor settings. Hence we should also follow similarly. Will update the implementation accordingly.

Are you proposing this as a common setting for all loggers? Do you think we can have this defined on class/package level as well if needed?

Yes, initially though of keeping it common for all loggers. But as suggested by @mgodwan we can make this setting on a modular level. Will update the implementation accordingly.

shwetathareja commented 1 year ago

Thanks @rayshrey for the proposal. There is already WIP for Request Tracing Framework. @Gaganjuneja has started discussion around Sampling https://github.com/opensearch-project/OpenSearch/issues/8918 and one of the option would be to enable Tail Sampling based on response status. Instead of adding specific support for sampling 4xx logs explicitly. It might be useful to onboard to Tracing framework. Thoughts @mgodwan

Gaganjuneja commented 1 year ago

@shwetathareja, Thanks for calling this out. Yes, these use cases will be easily served through sampled traces.

mgodwan commented 1 year ago

@shwetathareja

Logging is still something most customers use today and this sampling will benefit such customers out of the box. We already have a lot of error logs at debug/trace level, and allowing local sampling of them should provide debugging benefits without adding much cost for users interested in using only logging. Based on the infrastructure customers choose to setup, logging or tracing can help with such debugging.
I see https://github.com/opensearch-project/OpenSearch/issues/8918 suggests to use head based sampling for now. This may mean that we may miss not too frequents errors in such cases which may otherwise be caught with sampling on exceptions at a node level itself as logs are discretely written on each node.

opensearch-project / OpenSearch

Add logging to 4xx responses from API at reduced rate. #4770