hongshaoyang opened 1 month ago
There were MPP tasks running on the TiFlash cluster; see https://snapshots.raintank.io/dashboard/snapshot/lVDBv9SP20idxAjt2zjX8PC6Ks5dUdoX?orgId=0&viewPanel=9&from=1720284342000&to=1720318170000 for a metrics snapshot.
Relevant logs:
Aggregation does not support spill because aggregator hash table does not support two level
[2024/07/06 18:55:32.755 +00:00] [WARN] [Aggregator.cpp:346] ["Aggregation does not support spill because aggregator hash table does not support two level"] [source="MPP<gather_id:<gather_id:1, query_ts:1720292129539717116, local_query_id:461048, server_id:511, start_ts:450964260000890920, resource_group: redash_my>,task_id:3>_Aggregation_HashAgg_10"] [thread_id=2321]
The `Key is locked` warnings started here:
[2024/07/06 19:00:03.347 +00:00] [WARN] [CoprocessorHandler.cpp:202] ["LockException: region_id=2246306414, message: Key is locked (1 locks in regions {2246306414})"] [source="Coprocessor, is_remote_read: true, start_ts: 450964331776180266, region_info: {2246306414, 8585, 18346}, resource_group: default"] [thread_id=938977]
and ended here:
[2024/07/06 21:20:34.175 +00:00] [WARN] [CoprocessorHandler.cpp:202] ["LockException: region_id=2246343413, message: Key is locked (1 locks in regions {2246343413})"] [source="Coprocessor, is_remote_read: true, start_ts: 450966541859815581, region_info: {2246343413, 616015, 31527}, resource_group: default"] [thread_id=942206]
The next warnings come from the MPPTaskManager.cpp module; I think the task was aborted here:
[2024/07/06 21:20:34.260 +00:00] [WARN] [<unknown>] ["region {2246343413,616015,31527} find error: epoch_not_match {\n}\n"] [source=pingcap.tikv] [thread_id=976852]
[2024/07/06 21:21:15.659 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:7, query_ts:1720300835869717319, local_query_id:201721, server_id:79, start_ts:450966542148173844, resource_group: redash_id>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=940502]
[2024/07/06 21:21:15.659 +00:00] [WARN] [MPPTaskManager.cpp:288] ["<gather_id:7, query_ts:1720300835869717319, local_query_id:201721, server_id:79, start_ts:450966542148173844, resource_group: redash_id> does not found in task manager, skip abort"] [thread_id=940502]
[2024/07/06 21:23:02.847 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:8, query_ts:1720300937773760016, local_query_id:767227, server_id:215, start_ts:450966568873492484, resource_group: redash_id>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=941679]
[2024/07/06 21:23:02.847 +00:00] [WARN] [MPPTaskManager.cpp:288] ["<gather_id:8, query_ts:1720300937773760016, local_query_id:767227, server_id:215, start_ts:450966568873492484, resource_group: redash_id> does not found in task manager, skip abort"] [thread_id=941679]
The same errors repeat from 2024/07/06 21:23:48.585 to 2024/07/06 21:30:23.676.
DB::Exception: Check added_tokens >= 0 failed
[2024/07/06 21:23:48.585 +00:00] [ERROR] [LocalAdmissionController.cpp:122] ["Code: 49, e.displayText() = DB::Exception: Check added_tokens >= 0 failed, e.what() = DB::Exception, Stack trace:
0x1b68670 DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+28739184]
dbms/src/Common/Exception.h:46
0x7d02dac DB::LocalAdmissionController::fetchTokensFromGAC(std::__1::vector<DB::LocalAdmissionController::AcquireTokenInfo, std::__1::allocator<DB::LocalAdmissionController::AcquireTokenInfo> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) [tiflash+131083692]
dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:267
0x7cffde0 DB::LocalAdmissionController::startBackgroudJob() [tiflash+131071456]
dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:107
0x1c29420 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, DB::LocalAdmissionController::LocalAdmissionController(pingcap::kv::Cluster*, std::__1::shared_ptr<DB::Etcd::Client>)::'lambda'()> >(void*) [tiflash+29529120]
/usr/local/bin/../include/c++/v1/thread:291
0xffff8f55ca28 start_thread [libc.so.6+535080]
0xffff8f505b9c thread_start [libc.so.6+179100]"] [source=LocalAdmissionController] [thread_id=2051]
tidb.ai says: "The provided log snippet shows the error occurring in the LocalAdmissionController.cpp file at line 122. The stack trace indicates that the error is related to the fetchTokensFromGAC function, which is part of the resource control mechanism in TiFlash."
The next error line:
[2024/07/06 21:30:27.469 +00:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Exchange receiver meet error : Receive cancel request from TiDB, current alive connections: 2"] [source="MPP<gather_id:<gather_id:4, query_ts:1720301147440367135, local_query_id:973089, server_id:1013, start_ts:450966623989268517, resource_group: redash_ph>,task_id:26> ExchangeReceiver_490 async tunnel24+26"] [thread_id=4315]
tidb.ai says: "Check the TiFlash logs for entries related to MPP tasks. Look for logs that mention ExchangeSender and ExchangeReceiver, as these operators are responsible for cross-node data exchange in MPP mode."
Analysing further lines:
I think the query was aborted here:
[2024/07/06 21:30:52.773 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=939210]
[2024/07/06 21:30:52.773 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:3, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=940517]
The ExchangeReceiver received the cancel request and the MPP task was aborted here:
[2024/07/06 21:30:52.774 +00:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Receive cancel request from TiDB, current alive connections: 2"] [source="MPP<gather_id:<gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>,task_id:9> ExchangeReceiver_182 local tunnel5+9"] [thread_id=939210]
[2024/07/06 21:30:52.774 +00:00] [WARN] [ExchangeReceiver.cpp:1003] ["Finish receiver channels, meet error: true, error message: Receive cancel request from TiDB"] [source="MPP<gather_id:<gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>,task_id:9> ExchangeReceiver_182"] [thread_id=939210]
Then the store size gradually started increasing, due to `Meet a stale snapshot` warnings:
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=941690] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2246343413, 616015, 31528}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.035349065]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939041] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662646, 616015, 31528}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.027829322]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=940475] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2246340642, 615895, 31524}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.022052769]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=940633] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662406, 615895, 31524}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.018181362]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939034] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662374, 615895, 31523}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.014203116]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939227] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662617, 616015, 31527}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.011231263]"] [source=__global__.meta] [thread_id=79]
This issue is related to #5084.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
The disk size / store size of TiFlash suddenly began to increase steadily.
2. What did you expect to see? (Required)
The store size of TiFlash should remain stable.
3. What did you see instead (Required)
The disk size / store size of TiFlash suddenly began to increase steadily.
See https://snapshots.raintank.io/dashboard/snapshot/moA6q18u4GrG0y5qLgZiXqjd9Lxw7BXP?orgId=0&viewPanel=53&from=1720289323641&to=1720386710155 for a snapshot of Store size metrics.
4. What is your TiFlash version? (Required)
v7.1.2