Bukhtawar opened this issue 3 years ago (status: Open)
Thanks. A few high-level thoughts to discuss further:

- Does this make more sense for root cause analysis than for live debugging? Given the distributed nature of indexing/search requests on a relatively large cluster, any high-cardinality request will make real-time traceability difficult. Using the `_tasks` API to check for stuck tasks (build-up), or similar constructs (such as the Stats API of #478), makes more sense for ease of live debugging.
- Default threshold vs. auto threshold? Thresholds are critical for optimising such logging, similar to what we have for slow logs. How do you suggest we achieve that? Making the threshold too aggressive will miss logging in some cases, while making it too lenient may have an adverse impact through excessive logging. Should we also think about auto-thresholds based on the recent/historical latency of actions, to identify and log degradation?
- How is this different from the current slow logs? Today slow logs capture slow searches (query and fetch phases) for every shard. Can that information be used to reconstruct the same picture and identify which node experienced degradation during root cause analysis?
- Can we miss logging the time actually taken on a failed (problematic) path? For a typical search request, if the primary data shard fails to serve the request, or the coordinator times out on the primary shard, the coordinator can retry on the replica shard to complete the request. In that case the problematic path (the primary shard), which failed the request and accounted for most of the coordinator action's time, will not be logged.
- Should we rather have node-to-node network visibility? Here we are essentially trying to build network-level visibility, where we can isolate/map a particular degraded path during the distributed execution of a request. Having nodes log a warning/info line for remote transport calls to other nodes that breach a soft threshold would be along similar lines. In this case `ActionListenerResponseHandler` could log remote transport calls to other nodes that take longer than a soft, predefined threshold (a dynamic setting). This would directly help point to nodes facing connectivity issues, or taking longer to execute transport requests, from other nodes' perspective (see the sketch after this list). Thoughts?
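To make the last point concrete, here is a minimal sketch in plain Java; it is not the actual OpenSearch transport code, and `Callback` and `SlowRemoteCallLogger` are hypothetical stand-ins for the response-handler types (e.g. `ActionListenerResponseHandler`) that would actually be wrapped. The idea is simply to timestamp the outgoing call and emit a warning when the response, or a failure, arrives after the soft threshold:

```java
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch, plain Java: wraps an async callback so that remote calls
 * which respond (or fail) after a soft threshold are logged with the target
 * node and action. In OpenSearch this would wrap the transport response
 * handler (e.g. ActionListenerResponseHandler); the names here are hypothetical.
 */
public class SlowRemoteCallLogger {

    /** Hypothetical stand-in for an async response callback. */
    public interface Callback<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    private final long thresholdNanos;

    public SlowRemoteCallLogger(long thresholdMillis) {
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    /** Returns a callback that times the call and then delegates to the original one. */
    public <T> Callback<T> wrap(String targetNode, String action, Callback<T> delegate) {
        final long startNanos = System.nanoTime();
        return new Callback<T>() {
            @Override
            public void onResponse(T response) {
                logIfSlow(startNanos, targetNode, action);
                delegate.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                // Failed paths are timed too, so a request retried on a replica
                // still leaves a trace for the slow/failed primary hop.
                logIfSlow(startNanos, targetNode, action);
                delegate.onFailure(e);
            }
        };
    }

    private void logIfSlow(long startNanos, String targetNode, String action) {
        long tookNanos = System.nanoTime() - startNanos;
        if (tookNanos > thresholdNanos) {
            System.err.printf("[WARN] remote call [%s] to node [%s] took [%d ms], over soft threshold [%d ms]%n",
                action, targetNode,
                TimeUnit.NANOSECONDS.toMillis(tookNanos),
                TimeUnit.NANOSECONDS.toMillis(thresholdNanos));
        }
    }
}
```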
Today a combination of tracking GC, queue sizes and slow logs would be required to derive similar insights, and a dedicated log would help pinpoint such issues faster. With ARS (adaptive replica selection) enabled everywhere, the search side of the problem caused by a slow/gray node will shrink, leaving the slow-replication/slow-bulk problem.
What are your thoughts on modelling this as a separate, generic slow task log, inspired by the existing slow query/indexing logs? Given these response times will be a function of workload (e.g. a burst of requests can lead to queue build-up and expected slowness), I think it will help for the thresholds to be flexible.
Yes @malpani, that's the plan; we will have separate logging thresholds for search and indexing. Not sure if you meant having separate log files for them too, much like the slow logs.
Yes, if this could potentially extend in the future to other long-running tasks, e.g. snapshots, then a separate log file is what I was thinking of.
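For the earlier point about separate thresholds for search and indexing, a rough sketch of how these could be exposed as dynamic node settings, mirroring the existing slow-log thresholds, is below. The setting keys and default values are hypothetical, and the imports assume the package layout of the 1.x codebase:

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;
import org.opensearch.common.unit.TimeValue;

/**
 * Sketch only: hypothetical dynamic thresholds for a generic slow task log,
 * one for the indexing path and one for the search path.
 */
public final class SlowTaskLogSettings {

    /** Soft threshold for indexing-side tasks (shard bulk / primary / replication). */
    public static final Setting<TimeValue> INDEX_TASK_SLOWLOG_WARN_THRESHOLD = Setting.timeSetting(
        "cluster.task.slowlog.threshold.index.warn",   // hypothetical key
        TimeValue.timeValueSeconds(10),
        Property.Dynamic,
        Property.NodeScope);

    /** Soft threshold for search-side tasks (per-shard query/fetch, coordinator action). */
    public static final Setting<TimeValue> SEARCH_TASK_SLOWLOG_WARN_THRESHOLD = Setting.timeSetting(
        "cluster.task.slowlog.threshold.search.warn",  // hypothetical key
        TimeValue.timeValueSeconds(5),
        Property.Dynamic,
        Property.NodeScope);

    private SlowTaskLogSettings() {}
}
```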
Hi @Bukhtawar, what is the update on this enhancement? What is the next step?
@Bukhtawar, do you have any update on this?
Problem
If problems occur in the system due to slow I/O or network on a particular node, the rest of the nodes also get impacted while waiting on responses. There is no easy way to tell from the logs that a particular node has gone slow, either while doing live debugging or while figuring out the root cause of an incident. The current slow logs provide shard-level details on slow queries or indexing; at present there is no way to see the coordinator's view.
Proposal
We can attach a listener to the `TransportAction` corresponding to bulk and search, and log a warning with the execution time of the task, e.g. shard bulk/primary/replication tasks, if it takes longer than a reasonable threshold, with the details below. From those details we could at least make out that there is a slow replication action (time in queue + time spent in replica indexing) on the problematic node. Since we know the task id and the parent id, we should be able to get a breakdown of the entire bulk request.
Note that since the replica action took longer, the primary and the coordinator actions will have corresponding log lines. Looking at all of them holistically, we should be able to reason about and pinpoint the slow action.
On Primary
On Coordinator
This way we know the time spent across the layers, and also get the network round-trip delay between tasks.
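A minimal sketch of the proposed listener, in plain Java rather than the actual OpenSearch task-framework classes (all names here are illustrative): it records a start time when a sub-task is registered and emits a WARN line carrying the task id and parent task id when the task completes beyond the threshold. Because the coordinator, primary and replica entries for one request share that parent/child relationship, the breakdown of the whole bulk request can be reconstructed from the log lines:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/**
 * Sketch only: a task listener that logs bulk/search sub-tasks (shard bulk,
 * primary, replication) which run longer than a configurable threshold.
 * The task id and parent task id in each line let the coordinator, primary
 * and replica entries for one request be stitched back together.
 */
public class SlowTaskLogger {

    private final long thresholdMillis;
    // Start times keyed by task id.
    private final Map<Long, Long> startNanosByTaskId = new ConcurrentHashMap<>();

    public SlowTaskLogger(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Called when a task is registered on the node. */
    public void onTaskStarted(long taskId) {
        startNanosByTaskId.put(taskId, System.nanoTime());
    }

    /** Called when the task completes (successfully or not). */
    public void onTaskCompleted(long taskId, long parentTaskId, String action, String nodeId) {
        Long startNanos = startNanosByTaskId.remove(taskId);
        if (startNanos == null) {
            return;
        }
        long tookMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        if (tookMillis > thresholdMillis) {
            // One structured line per slow task; the ids allow reconstructing the request tree.
            System.err.printf(
                "[WARN] slow task: action=%s node=%s task_id=%d parent_task_id=%d took_ms=%d threshold_ms=%d%n",
                action, nodeId, taskId, parentTaskId, tookMillis, thresholdMillis);
        }
    }
}
```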