pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0

Disagg: failed to read page when gc and dump checkpoint #9098

Open CalvinNeo opened 1 month ago

CalvinNeo commented 1 month ago

Bug Report


The following errors appear with profiles.default.remote_checkpoint_only_upload_manifest = false:

[2024/04/28 10:19:58.878 +00:00] [ERROR] [CPFilesWriter.cpp:210] ["failed to read page, record={type:VAR_ENT, page_id:0x01010200000000000005BB01000000000024A0AC, ori_id:0x.0, version:35064431.0, entry:PageEntry{file: 271, offset: 0x34748B, size: 435, checksum: 0xD9156DC79015E875, tag: 0, field_offsets: [], checkpoint_info: invalid}, being_ref_count:1}"] [thread_id=218]
[2024/04/28 10:20:18.833 +00:00] [ERROR] [Exception.cpp:96] ["Code: 49, e.displayText() = DB::Exception: Check index.has_value() failed: Can not find path for PageFile file_id=271_0, e.what() = DB::Exception, Stack trace:\n\n\n       0x1ecbb3e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+32291646]\n                \tdbms/src/Common/Exception.h:46\n       0x806ff27\tDB::PSDiskDelegatorGlobalMulti::getPageFilePath(std::__1::pair<unsigned long, unsigned int> const&) const [tiflash+134676263]\n                \tdbms/src/Storages/PathPool.cpp:1133\n       0x1e46daf\tDB::PS::V3::BlobStore<DB::PS::V3::universal::BlobStoreTrait>::getBlobFile(unsigned long) [tiflash+31747503]\n                \tdbms/src/Storages/Page/V3/BlobStore.cpp:1506\n       0x1e47d6d\tDB::PS::V3::BlobStore<DB::PS::V3::universal::BlobStoreTrait>::read(DB::UniversalPageId const&, unsigned long, unsigned long, char*, unsigned long, std::__1::shared_ptr<DB::ReadLimiter> const&, bool) [tiflash+31751533]\n                \tdbms/src/Storages/Page/V3/BlobStore.cpp:1146\n       0x1e4f7c1\tDB::PS::V3::BlobStore<DB::PS::V3::universal::BlobStoreTrait>::read(std::__1::pair<DB::UniversalPageId, DB::PS::V3::PageEntryV3> const&, std::__1::shared_ptr<DB::ReadLimiter> const&) [tiflash+31782849]\n                \tdbms/src/Storages/Page/V3/BlobStore.cpp:1099\n       0x86f7147\tDB::PS::V3::CPFilesWriter::writeEditsAndApplyCheckpointInfo(DB::PS::V3::PageEntriesEdit<DB::UniversalPageId>&, DB::PS::V3::CPFilesWriter::CompactOptions const&, bool) [tiflash+141521223]\n                \tdbms/src/Storages/Page/V3/CheckpointFile/CPFilesWriter.cpp:185\n       0x86dda77\tDB::UniversalPageStorage::dumpIncrementalCheckpoint(DB::UniversalPageStorage::DumpCheckpointOptions const&) [tiflash+141417079]\n                \tdbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp:547\n       
0x86ee206\tstd::__1::__function::__func<DB::UniversalPageStorageService::create(DB::Context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::PSDiskDelegator>, DB::PageStorageConfig const&)::$_3, std::__1::allocator<DB::UniversalPageStorageService::create(DB::Context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::PSDiskDelegator>, DB::PageStorageConfig const&)::$_3>, bool ()>::operator()() [tiflash+141484550]\n                \t/usr/local/bin/../include/c++/v1/__functional/function.h:345\n       0x80476ab\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, DB::BackgroundProcessingPool::BackgroundProcessingPool(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >)::$_1> >(void*) [tiflash+134510251]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f0beb597ea5\tstart_thread [libpthread.so.0+32421]\n  0x7f0beaea696d\t__clone [libc.so.6+1042797]"] [source="DB::PS::V3::CPDataDumpStats DB::PS::V3::CPFilesWriter::writeEditsAndApplyCheckpointInfo(universal::PageEntriesEdit &, const CPFilesWriter::CompactOptions &, bool)"] [thread_id=218]

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

CalvinNeo commented 1 month ago

Analysis by @JaySon-Huang. The failure is a race between background GC (Thread B) and the incremental checkpoint dump (Thread A):

Thread B (background GC):
- Finds that blob_id=161 needs a full GC.
- Starts copying the data from blob_id=161 to blob_id=221.

Thread A (DumpIncrSnap):
- Acquires a snapshot, snap-A.
- Calls dumpIncrementalCheckpoint and gets edit_from_mem with page_id_1 -> e1{blob_id=161} [v1].
- Yields.

Thread B (resumes):
- Finishes copying the data from blob_id=161 to blob_id=221; blob_id=161 becomes "ReadOnly".
- gcApply adds a new version, page_id_1 -> e1'{blob_id=221} [v1'], to the PageDirectory.
- In the next GC round, PageDirectory::gcInMemEntries removes [v1] but keeps [v1'] for page_id_1.
- blob_id=161 is "ReadOnly" and, with [v1] removed, has no entries left, so the file is removed from disk.

Thread A (resumes):
- Tries to read the page data through e1{blob_id=161}, but blob_id=161 has already been removed from disk.
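The interleaving above can be replayed deterministically in a small toy model. This is only a sketch, not TiFlash code: `Entry`, `BlobStore`, `PageDirectory`, and `simulate` are hypothetical stand-ins written in Python for brevity, snapshot pinning is not modeled at all, and `gc_in_mem_entries` is assumed to keep only the newest version of each page so that the outcome described in the analysis is reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    blob_id: int
    offset: int
    size: int


class BlobStore:
    """Toy stand-in for the on-disk blob files: blob_id -> bytes."""

    def __init__(self):
        self.files = {}

    def read(self, entry):
        if entry.blob_id not in self.files:
            # Plays the role of "Can not find path for PageFile" in the log.
            raise FileNotFoundError(f"blob_id={entry.blob_id} not found")
        data = self.files[entry.blob_id]
        return data[entry.offset:entry.offset + entry.size]


class PageDirectory:
    """Toy directory: page_id -> list of entry versions, oldest first."""

    def __init__(self):
        self.versions = {}

    def latest(self, page_id):
        return self.versions[page_id][-1]

    def gc_apply(self, page_id, new_entry):
        # Full GC publishes a new version pointing at the rewritten blob.
        self.versions[page_id].append(new_entry)

    def gc_in_mem_entries(self):
        # Assumption: keep only the newest version of every page, without
        # checking whether an older version is still held by a reader.
        for page_id, entries in self.versions.items():
            self.versions[page_id] = entries[-1:]

    def live_blob_ids(self):
        return {e.blob_id for entries in self.versions.values() for e in entries}


def simulate():
    store, directory = BlobStore(), PageDirectory()
    store.files[161] = b"page-data"
    directory.versions["page_id_1"] = [Entry(161, 0, 9)]  # [v1]

    # Thread A: collect the entries to dump, then yield before reading.
    edit_from_mem = {"page_id_1": directory.latest("page_id_1")}

    # Thread B: full GC copies blob 161 -> 221 and publishes [v1'].
    store.files[221] = store.files[161]
    directory.gc_apply("page_id_1", Entry(221, 0, 9))

    # Thread B, next GC round: drop [v1], then delete unreferenced blobs.
    directory.gc_in_mem_entries()
    for blob_id in set(store.files) - directory.live_blob_ids():
        del store.files[blob_id]

    # Thread A resumes and reads through the stale entry e1{blob_id=161}.
    try:
        store.read(edit_from_mem["page_id_1"])
        return "ok"
    except FileNotFoundError as e:
        return str(e)
```

Calling `simulate()` returns `"blob_id=161 not found"`: the entry captured by Thread A outlives the blob file it points to, which matches the read failure reported in the log.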