opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/

Log slow bulk and search task execution across various node interactions #985

Open Bukhtawar opened 3 years ago

Bukhtawar commented 3 years ago

Problem

If a particular node becomes slow due to slow I/O or a degraded network, the rest of the nodes are also impacted while waiting on its responses. There is no easy way to tell from the logs that a particular node has gone slow, either during live debugging or while root-causing an incident after the fact. The current slow logs only provide shard-level details on slow queries or indexing; at present there is no way to get the coordinator-level view.

Proposal

We can attach a listener to the TransportAction corresponding to bulk and search, and log a warning with the execution time of the task (e.g. shard bulk/primary/replication tasks) whenever it exceeds a reasonable threshold, with the details shown below.

From the details below we can at least make out that there is a slow replication action (time in queue + time spent in replica indexing) on the problematic node. Since we know the task id and the parent task id, we should be able to get a breakdown of the entire bulk request.

Note that since the replica action took longer, the primary and the coordinator actions will each have a corresponding log line. Looking at all of them holistically, we should be able to reason about and pinpoint the slow action.

On Primary

[2021-06-17T01:19:48,596][WARN ][c.a.c.e.logger           ] [dcdce07c4314957e5cc0e43d5cd60511] Action [indices:data/write/bulk[s][p]] on node [dcdce07c4314957e5cc0e43d5cd60511], 
task id : [26786394294], parent task id : [26795282749] took [88s] longer than the threshold of [60]s

On Coordinator

[2021-06-17T01:19:48,596][WARN ][c.a.c.e.logger           ] [hecgeg07c4314957e5cc0e43d5cd22221] Action [indices:data/write/bulk[s]] on node [hecgeg07c4314957e5cc0e43d5cd22221], 
task id : [26795282749], parent task id : [28668306216] took [98s] longer than the threshold of [60]s
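Since each log line carries the task id and parent task id, the shard-level breakdown could then be pulled with the existing Tasks API while the request is still running, along these lines (the node id prefix below is made up; Tasks API ids take the form `<node_id>:<task_number>`):

```
GET _tasks?parent_task_id=oTUltX4IQMOUUVeiohTt8A:26795282749&detailed=true
```

This lists the shard-level tasks fanned out by the coordinator task above, together with their running times.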

This way we know the time spent across each layer, and by comparing parent and child timings we also get the network round-trip delay between tasks.
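For illustration, a minimal sketch of what such a listener wrapper could look like (class, field, and setting names here are illustrative, not a final API, and exact package paths vary across OpenSearch versions):

```java
import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.opensearch.action.ActionListener;
import org.opensearch.common.unit.TimeValue;
import org.opensearch.tasks.Task;

/**
 * Wraps the listener that completes a transport action and emits a WARN line
 * when the action ran longer than the configured threshold.
 * Illustrative sketch only, not a final API.
 */
final class SlowTaskLoggingListener<Response> implements ActionListener<Response> {

    private static final Logger logger = LogManager.getLogger(SlowTaskLoggingListener.class);

    private final ActionListener<Response> delegate;
    private final String actionName;
    private final String nodeId;
    private final Task task;
    private final TimeValue threshold;
    private final long startNanos = System.nanoTime();

    SlowTaskLoggingListener(ActionListener<Response> delegate, String actionName,
                            String nodeId, Task task, TimeValue threshold) {
        this.delegate = delegate;
        this.actionName = actionName;
        this.nodeId = nodeId;
        this.task = task;
        this.threshold = threshold;
    }

    @Override
    public void onResponse(Response response) {
        maybeLogSlowExecution();
        delegate.onResponse(response);
    }

    @Override
    public void onFailure(Exception e) {
        maybeLogSlowExecution();
        delegate.onFailure(e);
    }

    // Logs in the format proposed above when the wall-clock time observed by
    // the caller (queue time + execution time) exceeds the threshold.
    private void maybeLogSlowExecution() {
        long tookMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        if (tookMillis > threshold.millis()) {
            logger.warn("Action [{}] on node [{}], task id : [{}], parent task id : [{}] "
                    + "took [{}ms] longer than the threshold of [{}]",
                actionName, nodeId, task.getId(), task.getParentTaskId(), tookMillis, threshold);
        }
    }
}
```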

getsaurabh02 commented 3 years ago

Thanks. A few high-level thoughts to discuss further:

malpani commented 3 years ago

Today a combination of tracking GC + queue sizes + slow logs would be required to derive similar insights, so a log like this will help pinpoint such issues faster. With ARS (adaptive replica selection) everywhere, the search side of the problem caused by a slow/gray node will shrink, leaving the slow replication/slow bulk problem.

What are your thoughts on modelling this as a separate, generic slow-task log, inspired by the slow query/indexing logs? Given that these response times will be a function of workload (e.g. a burst of requests can lead to queue build-up and expected slowness), I think it will help for the thresholds to be flexible.
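For instance, if the thresholds were exposed as dynamic cluster settings, operators could adjust them to the workload at runtime via the cluster settings API (the setting name below is hypothetical):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.task.slowlog.threshold.bulk": "120s"
  }
}
```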

Bukhtawar commented 3 years ago

Yes @malpani, that's the plan; we will have separate logging thresholds for search and indexing. Not sure if you meant having separate log files for them too, much like the slow logs.
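A sketch of how separate, dynamically updatable thresholds could be registered (the class and setting names are illustrative, not a committed API):

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.unit.TimeValue;

// Hypothetical settings holder; the setting names are illustrative.
public final class SlowTaskLogSettings {

    // One runtime-adjustable threshold per operation type, so each can be
    // tuned to its workload without a node restart.
    public static final Setting<TimeValue> SLOW_BULK_TASK_THRESHOLD =
        Setting.timeSetting("cluster.task.slowlog.threshold.bulk",
            TimeValue.timeValueSeconds(60),
            Setting.Property.Dynamic, Setting.Property.NodeScope);

    public static final Setting<TimeValue> SLOW_SEARCH_TASK_THRESHOLD =
        Setting.timeSetting("cluster.task.slowlog.threshold.search",
            TimeValue.timeValueSeconds(60),
            Setting.Property.Dynamic, Setting.Property.NodeScope);

    private SlowTaskLogSettings() {}
}
```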

malpani commented 3 years ago

Yes. If this could potentially extend in the future to other long-running tasks, e.g. snapshots, then a separate log file is what I was thinking of.
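Modeled on the existing slow-log configuration, a dedicated file could be wired up in `log4j2.properties` roughly like this (the appender and logger names are hypothetical):

```
appender.task_slowlog_rolling.type = RollingFile
appender.task_slowlog_rolling.name = task_slowlog_rolling
appender.task_slowlog_rolling.fileName = ${sys:opensearch.logs.base_path}${sys:file.separator}${sys:opensearch.logs.cluster_name}_task_slowlog.log
appender.task_slowlog_rolling.filePattern = ${sys:opensearch.logs.base_path}${sys:file.separator}${sys:opensearch.logs.cluster_name}_task_slowlog-%i.log.gz
appender.task_slowlog_rolling.layout.type = PatternLayout
appender.task_slowlog_rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] [%node_name]%marker %m%n
appender.task_slowlog_rolling.policies.type = Policies
appender.task_slowlog_rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.task_slowlog_rolling.policies.size.size = 1GB

logger.task_slowlog.name = tasks.slowlog
logger.task_slowlog.level = warn
logger.task_slowlog.appenderRef.task_slowlog_rolling.ref = task_slowlog_rolling
logger.task_slowlog.additivity = false
```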

anasalkouz commented 3 years ago

Hi @Bukhtawar, what is the status of this enhancement? What is the next step?

anasalkouz commented 2 years ago

@Bukhtawar, do you have any update on this?