Logs should include who/what was issuing the request

peternied commented 1 year ago

Is your feature request related to a problem? Please describe.

If you have a large cluster with many different users, it is not possible to determine the upstream source of a slow request. This also applies to task logs, deprecation logs, and application logs.

This can be worked around by enabling audit logging with as many details as possible, then attempting to line up the timestamp, node, index, and request body. This is a complex and potentially expensive workaround.

Example slow log entry

node1 | [2019-10-24T19:48:51,012][WARN][i.i.s.index] [node1]
[some-index/i86iF5kyTyy-PS8zrdDeAA] took[3.4ms], took_millis[3], type[_doc], id[1], routing[],
source[{"title":"Your Name", "Director":"Makoto Shinkai"}]
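
For the audit logging workaround, the fields to line up have to be pulled out of each slow log entry first. A minimal Python sketch of that step, assuming entries follow the layout of the example above; `SLOW_LOG_PATTERN` and `parse_slow_log` are hypothetical names for illustration, not part of OpenSearch:

```python
import re
from datetime import datetime

# Pattern written against the example entry above; other layouts are not handled.
SLOW_LOG_PATTERN = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]"  # timestamp
    r"\[(?P<level>\w+)\s*\]"                                  # log level
    r"\[(?P<logger>[^\]]+)\]\s*"                              # logger name
    r"\[(?P<node>[^\]]+)\]\s*"                                # node name
    r"\[(?P<index>[^/\]]+)/(?P<uuid>[^\]]+)\]\s*"             # index name / index UUID
    r"took\[(?P<took>[^\]]+)\]"                               # human-readable duration
)


def parse_slow_log(entry: str):
    """Return the correlation keys of a slow log entry, or None if it does not match."""
    match = SLOW_LOG_PATTERN.search(entry)
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S,%f"),
        "node": match["node"],
        "index": match["index"],
        "took": match["took"],
    }
```

Matching the result against audit log records by timestamp, node, and index is still a best-effort guess, which is the core of the problem described here.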

Describe the solution you'd like

When log entries are constructed, the identity should be included in these messages, maybe something like subject[name=peternied, domain=amzn.ldap, application[alerting]]

Potential log output

node1 | [2019-10-24T19:48:51,012][WARN][i.i.s.index] [node1]
[some-index/i86iF5kyTyy-PS8zrdDeAA] took[3.4ms], took_millis[3], type[_doc], id[1], routing[],
source[{"title":"Your Name", "Director":"Makoto Shinkai"}],
subject[name=peternied, domain=amzn.ldap, application[alerting]]
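
A rough sketch of how that suffix could be assembled, written in Python purely for illustration (OpenSearch itself is Java). `Subject`, `format_subject`, and `append_subject` are hypothetical names and not existing OpenSearch APIs; the field layout is just the one suggested above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subject:
    """Hypothetical caller identity; not an OpenSearch type."""
    name: str
    domain: str
    application: Optional[str] = None


def format_subject(subject: Subject) -> str:
    """Render the identity in the subject[...] shape proposed above."""
    parts = [f"name={subject.name}", f"domain={subject.domain}"]
    if subject.application is not None:
        parts.append(f"application[{subject.application}]")
    return "subject[" + ", ".join(parts) + "]"


def append_subject(log_line: str, subject: Optional[Subject]) -> str:
    """Append the subject field; leave the line unchanged when no identity is known."""
    if subject is None:
        return log_line
    return f"{log_line}, {format_subject(subject)}"
```

Leaving the line untouched when there is no identity would keep existing log parsers working for unauthenticated requests.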

peternied commented 1 year ago

Note: filing this issue on behalf of an OpenSearch customer who is struggling to find a user that is unknowingly causing stability problems.

manojfaria commented 1 year ago

Thanks peternied. To add, this problem of identifying the user/team of an expensive/bad query is amplified for an OpenSearch cluster that serves multiple consumers/teams (aka a multi-tenant cluster).

For requests that are received from authenticated users, logging the user/team identity along with slow queries can help OpenSearch platform teams identify and track the consumer teams/users that submit expensive/bad queries to the cluster, and line up appropriate next steps.

dblock commented 1 year ago

Will request tracing (#7352) solve this problem?

peternied commented 1 year ago

Thanks for the reference to that other issue. I suppose it depends on whether request tracing will satisfy the same compliance requirements that audit logging is used for today. If so, that sounds like a great angle to unify this information.

wbeckler commented 1 year ago

A high-volume user who is overwhelmed by the volume of data created by audit logging would be even worse off with request tracing on all requests. Slow logs are already filtered down to misbehaving queries, so if the issue is too much data, that's where the metadata should be stored to find the culprit.

On the other hand, request caller details should be disabled by default, because logging PII unexpectedly would be a major issue for some users.
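
A minimal sketch of what an opt-in gate could look like, in Python for illustration only. `SlowLogSettings`, `include_caller`, and `slow_log_fields` are made-up names; no such OpenSearch setting exists today as far as this thread is concerned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlowLogSettings:
    """Hypothetical settings object; caller details are off unless explicitly enabled."""
    include_caller: bool = False


def slow_log_fields(took_ms: float, source: str, caller: Optional[str],
                    settings: SlowLogSettings = SlowLogSettings()) -> str:
    """Build the field list of a slow log entry, adding the caller only when opted in."""
    fields = [f"took[{took_ms}ms]", f"source[{source}]"]
    if settings.include_caller and caller:
        fields.append(f"subject[name={caller}]")
    return ", ".join(fields)
```

With the default settings the output is the same as today, so nobody starts logging user names just by upgrading.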

manojfaria commented 12 months ago

+1 wbeckler@.

Summary: May I request that we provide the option to record request caller details (aka user info) in slow logs as well as in request tracing?

Details:

Per my understanding, request tracing can be used for deeper root cause investigation of slow or misbehaving queries that are logged to the slow log, which may also help to optimize the misbehaving query.

It also seems that request tracing may track slow or misbehaving queries that are not logged to the slow log. For example, if a query causes an out-of-memory exception, the query may not be logged to slow logs, but request tracing may still be able to track it.

In order to make it easier to track slow or misbehaving queries that are not logged to the slow log, it would be helpful to log request caller details as part of request tracing. This would allow developers to identify the user who issued the query, even if the query itself is not logged to slow logs.

ndasari commented 11 months ago

+1 to this feature. Right now there is no unique ID shared between the slow logs and audit logs to find the source of a slow log entry. Slow logs don't include user information, and audit logs include the query, which is sometimes long and hard to parse when matching against slow logs. As a few people from the community mentioned, this is a cluster stability concern, since one user running a long query could potentially cause performance and reliability issues on the OS cluster.

opensearch-project / OpenSearch