pingcap / tidb


Enhancement for tidb query diagnose #28937

Open cfzjywxk opened 2 years ago

cfzjywxk commented 2 years ago

Feature Request

Is your feature request related to a problem? Please describe:

In the TiDB architecture, tidb-server manages user connections and processes incoming queries; these queries are converted into different kv requests and sent to tikv-server through the tidb batch client and grpc components. The problem is that after a query is turned into kv requests, much of the execution context and query context is lost, which makes it difficult to diagnose "query-dimension" issues such as slow queries. For example, we often see slow queries like:

4f12266030e202b41c1f0531d03ba799a458f11a88635a3f424506a12e3ee543,SELECT event_time FROM followers WHERE uid = 59322005 AND target_uid = 75161335 LIMIT 1;,10.160.32.142:10080,blued,32214282,1,1632294495.143163,1.004865,0.000028,0.000029,0.000053,0.000000,0.000091,0.000406,0.000000,0.000000,0.000000,0.000000,0,0,427896207863981621,," id task estRows operator info actRows execution info memory disk Point_Get_1 root 1 table:followers, index:PRIMARY(uid, target_uid) 0 time:1s, loops:1, Get:{num_rpc:1, total_time:1s} N/A N/A",0,,,,blued,10.10.96.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0,0,0,0,0,0,0,,,0,0,0,0,0

In the slow log above, we can only tell that the kv get rpc is slow (1s); there is no useful information about the root cause or how to resolve it. Usually we need to check the grafana metrics for more information, but that is the wrong way to diagnose: using server-dimension information to troubleshoot query-dimension issues is inefficient and often inconclusive.
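To make the gap concrete: the only per-request timing that currently reaches the slow log is the aggregate figure shown above. A minimal Go sketch (hypothetical types, not TiDB's actual client code) of what that amounts to:

```go
package main

import (
	"fmt"
	"time"
)

// kvGetStats mirrors the only information about the kv request that
// currently survives into the slow log: the request count and the summed
// RPC duration.
type kvGetStats struct {
	numRPC    int
	totalTime time.Duration
}

func main() {
	stats := kvGetStats{numRPC: 1, totalTime: time.Second}
	// The execution info tells us the rpc took 1s in total, but not whether
	// the time went to grpc queueing, scheduler waits, RocksDB reads, lock
	// waits, or the network.
	fmt.Printf("Get:{num_rpc:%d, total_time:%s}\n", stats.numRPC, stats.totalTime)
}
```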

Describe the feature you'd like:

Pass the needed query context and execution context down to the tikv-server, and record the duration of each stage. Then we can clearly identify the root cause slowing down a specific query without having to check the aggregated data in grafana. This is the key point for improving the diagnosability of TiDB.
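Purely as an illustration of the idea, the context carried with each kv request and the per-stage timing it could report back might look like the sketch below; all field names are hypothetical and do not reflect TiDB's or TiKV's actual request definitions:

```go
package main

import (
	"fmt"
	"time"
)

// queryContext is the query-level information that would travel with every
// kv request so the storage side can associate its work with the statement.
type queryContext struct {
	ConnID      uint64        // originating connection
	SQLDigest   string        // digest of the statement being executed
	StartTS     uint64        // snapshot timestamp of the query
	MaxExecTime time.Duration // remaining execution-time budget
}

// stageDurations is the per-stage timing that would be recorded along the
// request path and returned to tidb-server for the slow log.
type stageDurations struct {
	NetworkLatency time.Duration // TiDB-side RPC time minus TiKV-side time
	GRPCQueueWait  time.Duration // wait inside the batch/grpc framework
	SchedulerWait  time.Duration // wait before TiKV starts processing
	ProcessTime    time.Duration // actual processing time on TiKV
	RocksDBRead    time.Duration // time spent reading blocks from RocksDB
}

func main() {
	qc := queryContext{ConnID: 32214282, SQLDigest: "4f12266030e2...", StartTS: 427896207863981621}
	d := stageDurations{SchedulerWait: 950 * time.Millisecond, ProcessTime: 40 * time.Millisecond}
	fmt.Printf("conn=%d digest=%s wait=%s process=%s\n", qc.ConnID, qc.SQLDigest, d.SchedulerWait, d.ProcessTime)
}
```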

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

breezewish commented 2 years ago

Related: https://github.com/pingcap/tidb/issues/17689 https://github.com/tikv/tikv/issues/8942

TL;DR:

  • [x] TiKV side process and wait time in Coprocessor requests
  • [x] TiKV side RocksDB scan details in Coprocessor requests
  • [ ] TiKV side process and wait time in Get requests
  • [x] TiKV side RocksDB scan details in Get requests

We are still working on delivering the Top SQL feature, which solves some of the more urgent diagnostics issues.

However, there is a very detailed implementation instruction in https://github.com/tikv/tikv/issues/8942#issuecomment-915972826. I would appreciate it if anyone interested could help bring it into our code base :)

Note: after https://github.com/tikv/tikv/issues/8942 is resolved, we still have more work to do on single-query diagnostics. From my current understanding, a complete solution should at least cover the following information:

  • [x] The executor sum elapsed time and sum processed rows

    • [ ] What we really want: the cost (e.g. CPU) of each executor
  • [x] The RPC time from the TiDB side

    • [x] The total time and wait time at the TiKV side for all kinds of requests
    • [ ] What we really want: the real network latency (TiDB side RPC time - TiKV side RPC time - SuperBatch RPC framework time); see the sketch after this list
  • [x] The KV process time at the TiKV side

    • [x] The FS block read numbers at RocksDB for all kinds of requests
    • [ ] What we really want: the real disk read latency
  • [x] The acquire lock time at the TiDB side

    • [ ] What we really want: lock time for all kinds of locks and where these locks come from
  • [ ] Advanced (mostly for debugging purposes):

    • [ ] Sub-procedure level function timing (a.k.a. tracing)
    • [ ] The elapsed time of all queuing operations in the SQL execution lifetime (e.g. the Raftstore FSM queue)
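A minimal sketch of the network-latency arithmetic mentioned in the list above, assuming the three durations were reported separately (the function name is illustrative only):

```go
package main

import (
	"fmt"
	"time"
)

// realNetworkLatency derives the time actually spent on the wire for one
// request: the client-observed RPC time minus the time accounted for on the
// TiKV side and inside the SuperBatch RPC framework.
func realNetworkLatency(tidbRPC, tikvRPC, superBatch time.Duration) time.Duration {
	return tidbRPC - tikvRPC - superBatch
}

func main() {
	// Example numbers: a 1s client-side RPC of which 900ms was TiKV
	// processing and 60ms was framework queueing leaves 40ms on the network.
	fmt.Println(realNetworkLatency(time.Second, 900*time.Millisecond, 60*time.Millisecond))
}
```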

cfzjywxk commented 2 years ago

@breezewish Thanks for the information. The TiDB cluster has a disaggregated architecture: a storage node is a stand-alone kv service. As a result, information such as the session context, execution context, and query-related context is lost along the processing path. Besides the diagnosis issues we are facing, there are other problems caused by this loss; for example, an execution timeout cannot be handled properly on the tikv-server side, or even at the rocksdb level, and the query memory limit only takes effect on the tidb-server side. To make TiDB work more like an integrated database rather than a combination of several components, we may need to think about how to pass the information mentioned above both when we implement a specific execution path and when we design the architecture.
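For the execution-timeout point specifically, one common building block is deadline propagation through the RPC layer; the sketch below only illustrates the idea, with kvGet standing in for a real storage RPC rather than any actual TiDB/TiKV API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// kvGet stands in for a storage-side RPC. Because it receives the caller's
// context, it can stop work as soon as the query's deadline expires instead
// of continuing after tidb-server has already given up on the query.
func kvGet(ctx context.Context, key string) (string, error) {
	select {
	case <-time.After(2 * time.Second): // simulate a slow read
		return "value", nil
	case <-ctx.Done():
		return "", ctx.Err() // deadline exceeded or query cancelled
	}
}

func main() {
	// The per-query execution-time budget becomes a context deadline that
	// travels with every kv request issued for the query.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	if _, err := kvGet(ctx, "followers:59322005:75161335"); err != nil {
		fmt.Println("kv request aborted:", err)
	}
}
```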

breezewish commented 2 years ago

@cfzjywxk Thanks for the recap! Maybe we can separate what you describe into two things that we can work on independently:

a) Troubleshooting a single SQL execution is not well supported. Currently the missing information can only be inferred from the metrics, which are aggregations and may not be helpful for a single execution. @SunRunAway plans to improve this over the next several releases.

b) Components other than TiDB do not process the SQL request in the same way as TiDB does.

I guess there is no architectural difficulty for (b), as some behavior has already been kept identical. For example, the handling of different SQL modes is well implemented on both the TiKV side and the TiDB side; they follow the same behavior for whatever SQL mode the user sets. This indicates that we can do it right. However, I admit that for now we have to implement these behaviors one by one, which is not a good way.
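As a toy illustration of how such behavior can be kept identical, both sides could consult the same SQL mode flags when evaluating an expression; the flag names and helper below are simplified and hypothetical:

```go
package main

import "fmt"

// sqlMode is a bit set shared by the compute and storage sides, so a
// mode-dependent behavior stays identical wherever an expression runs.
type sqlMode uint64

const (
	modeStrictTransTables sqlMode = 1 << iota
	modeNoZeroDate
)

func (m sqlMode) has(flag sqlMode) bool { return m&flag != 0 }

// handleTruncate returns an error in strict mode and a warning otherwise;
// the same rule would be applied in tidb-server and in the coprocessor.
func handleTruncate(mode sqlMode, value string) error {
	if mode.has(modeStrictTransTables) {
		return fmt.Errorf("data too long for value %q", value)
	}
	fmt.Printf("warning: data truncated for value %q\n", value)
	return nil
}

func main() {
	fmt.Println(handleTruncate(modeStrictTransTables, "abc"))
	fmt.Println(handleTruncate(0, "abc"))
}
```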