pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0
937 stars 409 forks source link

Parallel prehandle snapshot to speed up catch up with TiKV large region #8081

Closed CalvinNeo closed 9 months ago

CalvinNeo commented 10 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

CalvinNeo commented 10 months ago

TiFlash can't catch up, it is pre-handling snapshot

[2023/09/10 14:08:08.876 +08:00] [INFO] [FlashService.cpp:454] ["Handling mpp dispatch request, task meta: start_ts: 444157410608676864\ntask_id: 14\naddress: \"tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-2340453-1-599.svc:3930\"\ngather_id: 1\nquery_ts: 1694326088861873714\nlocal_query_id: 217\nserver_id: 1046\nmpp_version: 2\nkeyspace_id: 4294967295\ncoordinator_address: \"tc-tidb-0.tc-tidb-peer.endless-htap-stale-read-tps-2340453-1-599.svc:10080\"\n"] [source=FlashService] [thread_id=1004]
[2023/09/10 14:08:08.876 +08:00] [INFO] [FlashService.cpp:454] ["Handling mpp dispatch request, task meta: start_ts: 444157410608676864\ntask_id: 7\naddress: \"tc-tiflash-1.tc-tiflash-peer.endless-htap-stale-read-tps-2340453-1-599.svc:3930\"\ngather_id: 1\nquery_ts: 1694326088861873714\nlocal_query_id: 217\nserver_id: 1046\nmpp_version: 2\nkeyspace_id: 4294967295\ncoordinator_address: \"tc-tidb-0.tc-tidb-peer.endless-htap-stale-read-tps-2340453-1-599.svc:10080\"\n"] [source=FlashService] [thread_id=125]
[2023/09/10 14:08:08.876 +08:00] [DEBUG] [MPPTask.cpp:393] ["Handling 1 regions from 1 physical tables in MPP task"] [source="MPP<gather_id:<gather_id:1, query_ts:1694326088861873714, local_query_id:217, server_id:1046, start_ts:444157410608676864, resource_gorup: >,task_id:14>"] [thread_id=1004]
[2023/09/10 14:08:08.883 +08:00] [DEBUG] [Region.cpp:703] ["[region_id=535 applied_term=7 applied_index=732115] need to wait learner index 865644 timeout 300000"] [thread_id=921]
[2023/09/10 14:13:08.883 +08:00] [WARN] [Region.cpp:720] ["[region_id=535] wait learner index 865644 timeout"] [thread_id=921]
[2023/09/10 14:09:20.545 +08:00] [INFO] [snapshot.rs:115] ["pre apply snapshot"] [snapshot="Some(Snapshot { key: SnapKey { region_id: 535, term: 7, idx: 862994 }, display_path: \"/data0/proxy/snap/rev_535_7_862994_(default|lock|write).sst\" })"] [pending=0] [has_snap=true] [snap_key="SnapKey { region_id: 535, term: 7, idx: 862994 }"] [region_id=535] [peer_id=741]
[2023/09/10 14:09:20.867 +08:00] [INFO] [snapshot.rs:257] ["post apply snapshot"] [pending=1] [region="id: 535 start_key: 748000000000000AFF7E5F728000000001FFC2E98E0000000000FA end_key: 748000000000000AFF7E5F728000000003FF858A070000000000FA region_epoch { conf_ver: 22 version: 77 } peers { id: 537 store_id: 8 } peers { id: 538 store_id: 29 } peers { id: 539 store_id: 5 role: Learner } peers { id: 553 store_id: 7 } peers { id: 722 store_id: 4 role: Learner } peers { id: 741 store_id: 6 role: Learner }"] [region_id=535] [snap_key="SnapKey { region_id: 535, term: 7, idx: 862994 }"] [peer_id=741]
[2023/09/10 14:14:32.789 +08:00] [INFO] [peer_storage.rs:605] ["begin to apply snapshot"] [peer_id=741] [region_id=535]
[2023/09/10 14:14:32.789 +08:00] [INFO] [peer_storage.rs:688] ["apply snapshot with state ok"] [for_witness=false] [state="applied_index: 994199 truncated_state { index: 994199 term: 7 }"] [region="id: 535 start_key: 748000000000000AFF7E5F728000000001FFC2E98E0000000000FA end_key: 748000000000000AFF7E5F728000000003FF858A070000000000FA region_epoch { conf_ver: 22 version: 77 } peers { id: 537 store_id: 8 } peers { id: 538 store_id: 29 } peers { id: 539 store_id: 5 role: Learner } peers { id: 553 store_id: 7 } peers { id: 722 store_id: 4 role: Learner } peers { id: 741 store_id: 6 role: Learner }"] [peer_id=741] [region_id=535]
[2023/09/10 14:14:32.789 +08:00] [INFO] [snapshot.rs:115] ["pre apply snapshot"] [snapshot="Some(Snapshot { key: SnapKey { region_id: 535, term: 7, idx: 994199 }, display_path: \"/data0/proxy/snap/rev_535_7_994199_(default|lock|write).sst\" })"] [pending=0] [has_snap=true] [snap_key="SnapKey { region_id: 535, term: 7, idx: 994199 }"] [region_id=535] [peer_id=741]
[2023/09/10 14:14:33.193 +08:00] [INFO] [snapshot.rs:257] ["post apply snapshot"] [pending=1] [region="id: 535 start_key: 748000000000000AFF7E5F728000000001FFC2E98E0000000000FA end_key: 748000000000000AFF7E5F728000000003FF858A070000000000FA region_epoch { conf_ver: 22 version: 77 } peers { id: 537 store_id: 8 } peers { id: 538 store_id: 29 } peers { id: 539 store_id: 5 role: Learner } peers { id: 553 store_id: 7 } peers { id: 722 store_id: 4 role: Learner } peers { id: 741 store_id: 6 role: Learner }"] [region_id=535] [snap_key="SnapKey { region_id: 535, term: 7, idx: 994199 }"] [peer_id=741]

There is only one snapshot in prehandle pool. It takes 4min to prehandle

TiFlash is handling the only snapshot, the next one arrives before the first one is finished.

[2023/09/10 14:09:20.545 +08:00] [INFO] [snapshot.rs:115] ["pre apply snapshot"] [snapshot="Some(Snapshot { key: SnapKey { region_id: 535, term: 7, idx: 862994 }, display_path: \"/data0/proxy/snap/rev_535_7_862994_(default|lock|write).sst\" })"] [pending=0] [has_snap=true] [snap_key="SnapKey { region_id: 535, term: 7, idx: 862994 }"] [region_id=535] [peer_id=741]
[2023/09/10 14:09:20.555 +08:00] [INFO] [apply.rs:3927] ["re-register to apply delegates"] [term=7] [peer_id=741] [region_id=535]
[2023/09/10 14:09:20.867 +08:00] [INFO] [region.rs:455] ["begin apply snap data"] [peer_id=741] [region_id=535]
[2023/09/10 14:09:20.867 +08:00] [INFO] [snapshot.rs:257] ["post apply snapshot"] [pending=1] [region="id: 535 start_key: 748000000000000AFF7E5F728000000001FFC2E98E0000000000FA end_key: 748000000000000AFF7E5F728000000003FF858A070000000000FA region_epoch { conf_ver: 22 version: 77 } peers { id: 537 store_id: 8 } peers { id: 538 store_id: 29 } peers { id: 539 store_id: 5 role: Learner } peers { id: 553 store_id: 7 } peers { id: 722 store_id: 4 role: Learner } peers { id: 741 store_id: 6 role: Learner }"] [region_id=535] [snap_key="SnapKey { region_id: 535, term: 7, idx: 862994 }"] [peer_id=741]
[2023/09/10 14:13:13.421 +08:00] [INFO] [snapshot.rs:331] ["get prehandled snapshot success"] [pending=1] [region_id=535] [snap_key="SnapKey { region_id: 535, term: 7, idx: 862994 }"] [peer_id=741]
[2023/09/10 14:13:13.534 +08:00] [INFO] [snapshot.rs:366] ["apply snapshot finished"] [elapsed=232666] [pending=0] [region="id: 535 start_key: 748000000000000AFF7E5F728000000001FFC2E98E0000000000FA end_key: 748000000000000AFF7E5F728000000003FF858A070000000000FA region_epoch { conf_ver: 22 version: 77 } peers { id: 537 store_id: 8 } peers { id: 538 store_id: 29 } peers { id: 539 store_id: 5 role: Learner } peers { id: 553 store_id: 7 } peers { id: 722 store_id: 4 role: Learner } peers { id: 741 store_id: 6 role: Learner }"] [snap_key="SnapKey { region_id: 535, term: 7, idx: 862994 }"] [peer_id=741] [mode=normal]
[2023/09/10 14:13:13.534 +08:00] [INFO] [region.rs:503] ["apply new data"] [time_takes=232.666819125s] [region_id=535]
[2023/09/10 14:13:14.134 +08:00] [INFO] [tablet_snap.rs:499] ["begin to receive tablet snapshot files"] [temp_exists=false] [region_id=535] [file=/data0/proxy/snap_v2/rev_535_741_7_994199.tmp]
[2023/09/10 14:13:14.134 +08:00] [INFO] [tablet_snap.rs:513] ["not using cache"] [err="Other(\"cache is disabled\")"] [region_id=535]
[2023/09/10 14:14:32.789 +08:00] [INFO] [peer_storage.rs:605] ["begin to apply snapshot"] [peer_id=741] [region_id=535]
[2023/09/10 14:14:32.789 +08:00] [INFO] [peer_storage.rs:1046] ["finish clear peer meta"] [takes=4.602µs] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=535]
[2023/09/10 14:14:32.789 +08:00] [INFO] [peer_storage.rs:688] ["apply snapshot with state ok"] [for_witness=false] [state="applied_index: 994199 truncated_state { index: 994199 term: 7 }"] [region="id: 535 start_key: 748000000000000AFF7E5F728000000001FFC2E98E0000000000FA end_key: 748000000000000AFF7E5F728000000003FF858A070000000000FA region_epoch { conf_ver: 22 version: 77 } peers { id: 537 store_id: 8 } peers { id: 538 store_id: 29 } peers { id: 539 store_id: 5 role: Learner } peers { id: 553 store_id: 7 } peers { id: 722 store_id: 4 role: Learner } peers { id: 741 store_id: 6 role: Learner }"] [peer_id=741] [region_id=535]
[2023/09/10 14:14:32.789 +08:00] [INFO] [peer.rs:4894] ["snapshot is persisted"] [region="id: 535 start_key: 748000000000000AFF7E5F728000000001FFC2E98E0000000000FA end_key: 748000000000000AFF7E5F728000000003FF858A070000000000FA region_epoch { conf_ver: 22 version: 77 } peers { id: 537 store_id: 8 } peers { id: 538 store_id: 29 } peers { id: 539 store_id: 5 role: Learner } peers { id: 553 store_id: 7 } peers { id: 722 store_id: 4 role: Learner } peers { id: 741 store_id: 6 role: Learner }"] [peer_id=741] [region_id=535]