pingcap / tiflash


Store size suddenly started increasing due to stale snapshot #9331

Open hongshaoyang opened 1 month ago

hongshaoyang commented 1 month ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

The disk / store size of TiFlash suddenly began a steady increase.

2. What did you expect to see? (Required)

3. What did you see instead (Required)

The disk / store size of TiFlash suddenly began a steady increase.

See https://snapshots.raintank.io/dashboard/snapshot/moA6q18u4GrG0y5qLgZiXqjd9Lxw7BXP?orgId=0&viewPanel=53&from=1720289323641&to=1720386710155 for a snapshot of Store size metrics.

(Screenshot: Store size metrics, 2024-08-16 5:24:45 PM)

4. What is your TiFlash version? (Required)

v7.1.2

hongshaoyang commented 1 month ago

Aggregation does not support spill because aggregator hash table does not support two level

[2024/07/06 18:55:32.755 +00:00] [WARN] [Aggregator.cpp:346] ["Aggregation does not support spill because aggregator hash table does not support two level"] [source="MPP<gather_id:<gather_id:1, query_ts:1720292129539717116, local_query_id:461048, server_id:511, start_ts:450964260000890920, resource_group: redash_my>,task_id:3>_Aggregation_HashAgg_10"] [thread_id=2321]

The "Key is locked" warnings started here:

[2024/07/06 19:00:03.347 +00:00] [WARN] [CoprocessorHandler.cpp:202] ["LockException: region_id=2246306414, message: Key is locked (1 locks in regions {2246306414})"] [source="Coprocessor, is_remote_read: true, start_ts: 450964331776180266, region_info: {2246306414, 8585, 18346}, resource_group: default"] [thread_id=938977]

The "Key is locked" warnings ended here:

[2024/07/06 21:20:34.175 +00:00] [WARN] [CoprocessorHandler.cpp:202] ["LockException: region_id=2246343413, message: Key is locked (1 locks in regions {2246343413})"] [source="Coprocessor, is_remote_read: true, start_ts: 450966541859815581, region_info: {2246343413, 616015, 31527}, resource_group: default"] [thread_id=942206]

The next exceptions come from the MPPTaskManager.cpp module. I think the task was aborted here (a quick way to tally these aborts across the log is sketched after the excerpt below):

[2024/07/06 21:20:34.260 +00:00] [WARN] [<unknown>] ["region {2246343413,616015,31527} find error: epoch_not_match {\n}\n"] [source=pingcap.tikv] [thread_id=976852]
[2024/07/06 21:21:15.659 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:7, query_ts:1720300835869717319, local_query_id:201721, server_id:79, start_ts:450966542148173844, resource_group: redash_id>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=940502]
[2024/07/06 21:21:15.659 +00:00] [WARN] [MPPTaskManager.cpp:288] ["<gather_id:7, query_ts:1720300835869717319, local_query_id:201721, server_id:79, start_ts:450966542148173844, resource_group: redash_id> does not found in task manager, skip abort"] [thread_id=940502]
[2024/07/06 21:23:02.847 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:8, query_ts:1720300937773760016, local_query_id:767227, server_id:215, start_ts:450966568873492484, resource_group: redash_id>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=941679]
[2024/07/06 21:23:02.847 +00:00] [WARN] [MPPTaskManager.cpp:288] ["<gather_id:8, query_ts:1720300937773760016, local_query_id:767227, server_id:215, start_ts:450966568873492484, resource_group: redash_id> does not found in task manager, skip abort"] [thread_id=941679]
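
To get a sense of how many gathers were cancelled and for which resource groups, a minimal log-scanning sketch like the one below could be run against the TiFlash log. This is not TiFlash code; the `tiflash.log` path and the `summarize_aborts` helper are placeholders, and the regex is derived from the warning lines quoted above:

```python
import re
from collections import Counter

# Pattern derived from the "Begin to abort gather" warnings quoted above.
ABORT_RE = re.compile(
    r'Begin to abort gather: <gather_id:(\d+), query_ts:(\d+), '
    r'local_query_id:(\d+), server_id:(\d+), start_ts:(\d+), '
    r'resource_group: (\w+)>, abort type: (\w+), reason: ([^"\]]+)'
)

def summarize_aborts(log_path):
    """Count aborted gathers per (resource_group, reason)."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            m = ABORT_RE.search(line)
            if m:
                resource_group, reason = m.group(6), m.group(8)
                counts[(resource_group, reason.strip())] += 1
    return counts

if __name__ == "__main__":
    # "tiflash.log" is a placeholder; point this at the actual TiFlash log file.
    for (group, reason), n in summarize_aborts("tiflash.log").most_common():
        print(f"{n:5d}  {group:<12}  {reason}")
```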

We get the same errors repeatedly after that, from 2024/07/06 21:23:48.585 to 2024/07/06 21:30:23.676.

DB::Exception: Check added_tokens >= 0 failed

[2024/07/06 21:23:48.585 +00:00] [ERROR] [LocalAdmissionController.cpp:122] ["Code: 49, e.displayText() = DB::Exception: Check added_tokens >= 0 failed, e.what() = DB::Exception, Stack trace:

       0x1b68670    DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+28739184]
                    dbms/src/Common/Exception.h:46
       0x7d02dac    DB::LocalAdmissionController::fetchTokensFromGAC(std::__1::vector<DB::LocalAdmissionController::AcquireTokenInfo, std::__1::allocator<DB::LocalAdmissionController::AcquireTokenInfo> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) [tiflash+131083692]
                    dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:267
       0x7cffde0    DB::LocalAdmissionController::startBackgroudJob() [tiflash+131071456]
                    dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:107
       0x1c29420    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, DB::LocalAdmissionController::LocalAdmissionController(pingcap::kv::Cluster*, std::__1::shared_ptr<DB::Etcd::Client>)::'lambda'()> >(void*) [tiflash+29529120]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffff8f55ca28    start_thread [libc.so.6+535080]
  0xffff8f505b9c    thread_start [libc.so.6+179100]"] [source=LocalAdmissionController] [thread_id=2051]

tidb.ai says: "The provided log snippet shows the error occurring in the LocalAdmissionController.cpp file at line 122. The stack trace indicates that the error is related to the fetchTokensFromGAC function, which is part of the resource control mechanism in TiFlash."

The next line of errors:

[2024/07/06 21:30:27.469 +00:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Exchange receiver meet error : Receive cancel request from TiDB, current alive connections: 2"] [source="MPP<gather_id:<gather_id:4, query_ts:1720301147440367135, local_query_id:973089, server_id:1013, start_ts:450966623989268517, resource_group: redash_ph>,task_id:26> ExchangeReceiver_490 async tunnel24+26"] [thread_id=4315]

tidb.ai says: "Check the TiFlash logs for entries related to MPP tasks. Look for logs that mention ExchangeSender and ExchangeReceiver, as these operators are responsible for cross-node data exchange in MPP mode."

Analysing further lines, I think the query was aborted here:

[2024/07/06 21:30:52.773 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=939210]
[2024/07/06 21:30:52.773 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:3, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=940517]

The ExchangeReceiver received the cancel request and aborted the MPP task here:

[2024/07/06 21:30:52.774 +00:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Receive cancel request from TiDB, current alive connections: 2"] [source="MPP<gather_id:<gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>,task_id:9> ExchangeReceiver_182 local tunnel5+9"] [thread_id=939210]

[2024/07/06 21:30:52.774 +00:00] [WARN] [ExchangeReceiver.cpp:1003] ["Finish receiver channels, meet error: true, error message: Receive cancel request from TiDB"] [source="MPP<gather_id:<gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>,task_id:9> ExchangeReceiver_182"] [thread_id=939210]

Then the store size gradually started increasing, presumably because of stale snapshots: a snapshot that stays alive for a long time blocks PageStorage GC from reclaiming outdated pages, so disk usage keeps growing. A sketch that pulls the alive times out of these warnings follows the log excerpt.

Meet a stale snapshot:

[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=941690] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2246343413, 616015, 31528}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.035349065]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939041] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662646, 616015, 31528}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.027829322]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=940475] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2246340642, 615895, 31524}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.022052769]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=940633] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662406, 615895, 31524}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.018181362]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939034] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662374, 615895, 31523}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.014203116]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939227] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662617, 616015, 31527}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.011231263]"] [source=__global__.meta] [thread_id=79]
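
To see which snapshots stay alive the longest (and therefore hold back PageStorage GC), a similar sketch can extract the alive times from these warnings. Again, this is only a sketch; `tiflash.log` and `longest_stale_snapshots` are placeholder names, and the pattern mirrors the PageDirectory warnings above:

```python
import re

# Pattern derived from the PageDirectory "Meet a stale snapshot" warnings above.
STALE_RE = re.compile(
    r'Meet a stale snapshot \[thread id=(\d+)\] \[tracing id=(.*?)\] '
    r'\[seq=(\d+)\] \[alive time\(s\)=([\d.]+)\]'
)

def longest_stale_snapshots(log_path, top_n=10):
    """Return the longest-lived stale snapshots; these are the ones holding back GC."""
    hits = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            m = STALE_RE.search(line)
            if m:
                _thread_id, tracing_id, seq, alive = m.groups()
                hits.append((float(alive), seq, tracing_id))
    return sorted(hits, reverse=True)[:top_n]

if __name__ == "__main__":
    # "tiflash.log" is a placeholder; point this at the actual TiFlash log file.
    for alive, seq, tracing in longest_stale_snapshots("tiflash.log"):
        print(f"alive={alive:10.1f}s  seq={seq}  {tracing}")
```

In our case the snapshots had already been alive for roughly 610 seconds, which lines up with the long-running remote coprocessor reads shown in the tracing ids above.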
hongshaoyang commented 1 month ago

This issue is related to #5084.