pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

The restart time under disagg is too long if some meta is not yet ready #8946

Open JaySon-Huang opened 5 months ago

JaySon-Huang commented 5 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Deploy a disaggregated-architecture cluster with the following TiFlash write node (WN) config:

              [profiles.default]
              remote_checkpoint_interval_seconds = 600

2. What did you expect to see? (Required)

3. What did you see instead (Required)

In the main thread after restart, we can see "Waiting for restore checkpoint info from S3" blocking for 10 minutes:

[2024/04/15 02:30:56.507 +08:00] [INFO] [PageDirectoryFactory.cpp:68] ["PageDirectory restored, max_page_id=1037521 max_applied_ver=119316445"] [source=uni_write] [thread_id=1]
[2024/04/15 02:30:57.360 +08:00] [INFO] [BackgroundProcessingPool.cpp:90] ["Create BackgroundProcessingPool, prefix=ps-checkpoint n_threads=1"] [thread_id=1]
[2024/04/15 02:30:57.360 +08:00] [INFO] [Context.cpp:1775] ["initialized GlobalUniversalPageStorage(WriteNode)"] [source=Context] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [Server.cpp:1337] ["StoreIdent restored, {cluster_id: 7357357158694045006 store_id: 6}"] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [Disagg.cpp:104] ["Init FAP Context, concurrency=25"] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [RateLimiter.cpp:495] ["storage.io_rate_limit is not found in config, use default config."] [source=IORateLimiter] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [RateLimiter.cpp:499] ["storage.io_rate_limit is not changed."] [source=IORateLimiter] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [RateLimiter.cpp:578] ["bg_thread_ids 32 => [42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73]"] [source=IORateLimiter] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [Server.cpp:1376] ["Detected memory capacity 51539607552 bytes, you have config `max_memory_usage_for_all_queries` to 0, finally limit to 0 bytes."] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [Server.cpp:1464] ["Loading metadata."] [thread_id=1]
[2024/04/15 02:30:57.364 +08:00] [INFO] [DatabaseOrdinary.cpp:98] ["Total 0 tables."] [source="DatabaseOrdinary (system)"] [thread_id=1]
[2024/04/15 02:30:57.365 +08:00] [INFO] [Server.cpp:369] ["update cluster config, ca_path: , cert_path: , key_path: , api_version: 0"] [thread_id=1]
[2024/04/15 02:30:57.365 +08:00] [INFO] [KVStore.cpp:70] ["KVStore inited, eager_raft_log_gc_enabled=true"] [thread_id=1]
[2024/04/15 02:30:57.365 +08:00] [DEBUG] [Server.cpp:380] ["/workspace/source/tiflash/contrib/grpc/src/core/lib/iomgr/ev_posix.cc, line number: 172, log msg : Using polling engine: epollex"] [source=grpc] [thread_id=1]
[2024/04/15 02:30:57.365 +08:00] [DEBUG] [Server.cpp:380] ["/workspace/source/tiflash/contrib/grpc/src/core/ext/filters/client_channel/lb_policy_registry.cc, line number: 42, log msg : registering LB policy factory for \"grpclb\""] [source=grpc] [thread_id=1]
["/workspace/source/tiflash/contrib/grpc/src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line number: 624, log msg : Using ares dns resolver"] [source=grpc] [thread_id=1]
 ...
[2024/04/15 02:30:57.366 +08:00] [DEBUG] [Server.cpp:380]["/workspace/source/tiflash/contrib/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc, line number: 429, log msg : Disabling AF_INET6 sockets because ::1 is not available."] [source=grpc] [thread_id=1]
[2024/04/15 02:30:57.369 +08:00] [INFO] [<unknown>] ["init cluster id done: 7357357158694045006"] [source=pingcap.pd] [thread_id=1]
[2024/04/15 02:30:57.373 +08:00] [INFO] [MinTSOScheduler.cpp:73] ["thread_hard_limit is 10000, thread_soft_limit is 5000, and active_set_soft_limit is 16 in MinTSOScheduler."] [thread_id=1]
[2024/04/15 02:30:57.377 +08:00] [DEBUG] [Client.cpp:420] ["keep alive update deadline, ttl=60 lease_deadline=2024-04-14 18:31:57.376565"] [source="lease=5f788ed7d776071e"] [thread_id=1]
[2024/04/15 02:30:57.377 +08:00] [INFO] [OwnerManager.cpp:157] ["start campaign owner"] [source="owner_id=0.0.0.0:3930"] [thread_id=1]
[2024/04/15 02:30:57.377 +08:00] [INFO] [TMTContext.cpp:182] ["Using remote_gc_method=ScanThenDelete"] [thread_id=1]
[2024/04/15 02:30:57.378 +08:00] [INFO] [KVStore.cpp:390] ["Set store info id: 6"] [thread_id=1]
[2024/04/15 02:30:57.378 +08:00] [INFO] [KVStore.cpp:384] ["Region compact log thresholds, rows=40960 bytes=33554432 gap=200 eager_gc_gap=512"] [thread_id=1]
[2024/04/15 02:30:57.378 +08:00] [INFO] [TMTContext.cpp:408] ["read-index timeout: 10000ms; wait-index timeout: 300000ms; wait-region-ready timeout: 1200s; read-index-worker-tick: 10ms"] [thread_id=1]
[2024/04/15 02:30:57.378 +08:00] [INFO] [Server.cpp:1497] ["dt_enable_read_thread true"] [thread_id=1]

...

[2024/04/15 02:30:57.576 +08:00] [INFO] [RegionTable.cpp:127] ["RegionTable restore end, n_tables=11"] [thread_id=1]
[2024/04/15 02:30:57.576 +08:00] [INFO] [BackgroundService.cpp:55] ["Start background storage gc worker with interval 60 seconds."] [thread_id=1]
[2024/04/15 02:30:57.954 +08:00] [INFO] [BgStorageInit.cpp:31] ["Wait for storage init thread to join"] [thread_id=1]
[2024/04/15 02:30:58.935 +08:00] [INFO] [UniversalPageStorage.cpp:463] ["Waiting for restore checkpoint info from S3"] [source=uni_write] [thread_id=1]
[2024/04/15 02:40:58.710 +08:00] [INFO] [Server.cpp:684] ["No TCP server is created"] [thread_id=1]
[2024/04/15 02:40:58.710 +08:00] [INFO] [Server.cpp:1607] ["Available RAM = 51539607552; physical cores = 24; logical cores = 8."] [thread_id=1]
[2024/04/15 02:40:58.710 +08:00] [INFO] [Server.cpp:1610] ["Ready for connections."] [thread_id=1]
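
The pattern here is that the main thread's restore path blocks until the checkpoint-upload thread has restored the checkpoint info from S3, so whatever delays that thread becomes startup latency. A minimal sketch of that waiting pattern is below; the names (CheckpointInfoHolder, setRestored, waitUntilReady) are hypothetical stand-ins, not the actual TiFlash classes.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>

// Sketch of the blocking pattern seen in the log above: the main thread's
// restore path waits until another thread has restored the checkpoint info
// from S3. All names here are hypothetical, not the real TiFlash code.
struct CheckpointInfoHolder
{
    std::mutex mtx;
    std::condition_variable cv;
    std::optional<int> restored_sequence; // stand-in for the restored checkpoint info

    // Called by the checkpoint-upload thread once the S3 manifest is restored.
    void setRestored(int seq)
    {
        {
            std::lock_guard lock(mtx);
            restored_sequence = seq;
        }
        cv.notify_all();
    }

    // Called by the main thread: "Waiting for restore checkpoint info from S3".
    // It blocks until setRestored() is called, however long that takes.
    int waitUntilReady()
    {
        std::unique_lock lock(mtx);
        cv.wait(lock, [&] { return restored_sequence.has_value(); });
        return *restored_sequence;
    }
};
```

With this structure, the waiter's startup time is bounded by how soon the other thread retries, which is why a 600-second retry interval shows up as a 10-minute restart.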

This is the thread that uploads the WN checkpoint. At 02:30:57.364, the initialization from S3 is skipped because the TMT context is not ready yet. Because profiles.default.remote_checkpoint_interval_seconds is set to 600 seconds, the retry does not happen until 02:40:57, so it blocks the main thread from finishing the restart.

[2024/04/15 02:30:57.364 +08:00] [INFO] [UniversalPageStorageService.cpp:138] ["Skip checkpoint because context is not initialized"] [source=UniPSService] [thread_id=2]
[2024/04/15 02:40:57.485 +08:00] [DEBUG] [S3Common.cpp:949] ["listPrefix prefix=s6/manifest/, total_keys=8, cost=0.05s"] [source="bucket=analytical-storage root=qa/v810/consistency1/"] [thread_id=2]
[2024/04/15 02:40:58.401 +08:00] [INFO] [S3LockLocalManager.cpp:90] ["restore from S3, key=s6/manifest/mf_169 cur_entries=100000 cur_copied_entries=1619"] [thread_id=2]
[2024/04/15 02:40:58.467 +08:00] [INFO] [S3LockLocalManager.cpp:90] ["restore from S3, key=s6/manifest/mf_169 cur_entries=100000 cur_copied_entries=952"] [thread_id=2]
[2024/04/15 02:40:58.535 +08:00] [INFO] [S3LockLocalManager.cpp:90] ["restore from S3, key=s6/manifest/mf_169 cur_entries=100000 cur_copied_entries=18043"] [thread_id=2]
[2024/04/15 02:40:58.707 +08:00] [INFO] [S3LockLocalManager.cpp:90] ["restore from S3, key=s6/manifest/mf_169 cur_entries=30367 cur_copied_entries=21440"] [thread_id=2]
[2024/04/15 02:40:58.709 +08:00] [INFO] [S3LockLocalManager.cpp:109] ["restore from S3 finish, elapsed=1.274s last_upload_sequence=169 num_s3_entries=330367 num_copied_entries=42054 last_prefix=file_format: 1 local_sequence: 116738918 last_local_sequence: 111285200 create_at_ms: 1713119112398 writer_info { store_id: 6 version: \"v8.1.0\" version_git: \"e3f77a2f7504573fb5d845c090ad3416e9f6ae23\" start_at_ms: 1713084841000 remote_info { type_name: \"S3\" root: \"qa/v810/consistency1/\" } }"] [thread_id=2]

The retry of UniversalPageStorageService::uploadCheckpoint after a restart should be more frequent.
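
A minimal sketch of what "more frequent" could mean: until the first checkpoint attempt actually succeeds (i.e. the context is initialized and the S3 restore has completed), poll on a short fixed interval, and only then fall back to the configured remote_checkpoint_interval_seconds. The class name, the 5-second retry interval, and onAttempt() below are assumptions for illustration, not the actual TiFlash API.

```cpp
#include <atomic>
#include <chrono>

// Sketch of the proposed behavior: before the first successful checkpoint
// (e.g. while the TMT context is still not initialized), retry on a short
// fixed interval instead of waiting a full remote_checkpoint_interval_seconds.
class CheckpointScheduler
{
public:
    explicit CheckpointScheduler(std::chrono::seconds configured_interval)
        : configured_interval_(configured_interval)
    {}

    // Returns how long the background pool should sleep before the next attempt.
    std::chrono::seconds onAttempt(bool upload_succeeded)
    {
        if (upload_succeeded)
            first_success_done_ = true;
        // Until the first checkpoint has gone through, poll every few seconds so
        // a restart is not blocked for the whole configured interval (600s here).
        if (!first_success_done_)
            return std::chrono::seconds(5);
        return configured_interval_;
    }

private:
    std::chrono::seconds configured_interval_;
    std::atomic<bool> first_success_done_{false};
};
```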

4. What is your TiFlash version? (Required)

master

JaySon-Huang commented 4 months ago

Changing the severity to moderate because the default value of profiles.default.remote_checkpoint_interval_seconds is 60 seconds, which is acceptable.