milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.29k stars 2.91k forks source link

[Bug]:[row cound dismatched] Load collection get stuck #19500

Closed icodersky closed 1 year ago

icodersky commented 2 years ago

Is there an existing issue for this?

Environment

- Milvus version:2.1.2
- Deployment mode(standalone or cluster):k8s cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus v2.1.2
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 256GB
- GPU: 
- Others:

Current Behavior

Operating in the attu, load collection, still loading after 5 hours. seem load collection get stuck.

Had 9 partitions.

dimension: 2048 index: IVFSQ8 collection Entity Count: 31,411,515

image

Expected Behavior

No response

Steps To Reproduce

Operating in the attu, load collection, still loading after 5 hours.

Milvus Log

hx-release-milvus-proxy-5b95cfbcf5-q66xc.log hx-release-milvus-querycoord-78896f65f6-w4jvl.log hx-release-milvus-querynode-847968b458-4q9vs.log

Anything else?

No response

icodersky commented 2 years ago

milvus-log.zip milvus-log.z01.zip milvus-log.z02.zip milvus-log.z03.zip

These are zip separate volumes. "milvus-log.z01.zip, milvus-log.z02.zip, milvus-log.z03.zip" ,this files's suffix name '.zip' needs to be removed. Like this: milvus-log.z01, milvus-log.z02, milvus-log.z03

icodersky commented 2 years ago

By the way, before load collection, I had Changed the milvus from the previous configuration:

helm upgrade --set indexNode.replicas=20 --set dataNode.replicas=20 --set proxy.replicas=9 --set queryNode.replicas=30 --set attu.enabled=true hx-release milvus/milvus

to the current configuration:

helm upgrade --set indexNode.replicas=5 --set dataNode.replicas=5 --set proxy.replicas=9 --set queryNode.replicas=20 --set attu.enabled=true hx-release milvus/milvus

yanliang567 commented 2 years ago

/assign @jiaoew1991

XuanYang-cn commented 2 years ago

129016 [2022/09/28 06:40:03.109 +00:00] [WARN] [querycoord/cluster.go:253] ["loadSegments: queryNode load segments error"] [nodeID=1262] ["error info"="[UnexpectedError] Assert \"row_countopt.value() == size\" at / go/src/github.com/milvus-io/milvus/internal/core/src/segcore/SegmentSealedImpl.cpp:177\n => field (0) data has different row count (19299) than other column's row count (17003)"]

icodersky commented 2 years ago

129016 [2022/09/28 06:40:03.109 +00:00] [WARN] [querycoord/cluster.go:253] ["loadSegments: queryNode load segments error"] [nodeID=1262] ["error info"="[UnexpectedError] Assert "row_countopt.value() == size" at / go/src/github.com/milvus-io/milvus/internal/core/src/segcore/SegmentSealedImpl.cpp:177\n => field (0) data has different row count (19299) than other column's row count (17003)"]

What does this mean? Is there a problem with the inserted data? or problem with the index data?

How to solve this problem?

icodersky commented 2 years ago

Some querynode are now automatically restarted.

image

XuanYang-cn commented 2 years ago

What does this mean? Is there a problem with the inserted data? or problem with the index data?

How to solve this problem?

Seems like a problem with the index data. The numRow of segment 436168328301772801 is 20152, and field 0 has numRow 20152, but the str indexes only contain 20139 numRows.

XuanYang-cn commented 2 years ago

crash stack

18005 fatal error: unexpected signal during runtime execution                                                       
18006 [signal SIGSEGV: segmentation violation code=0x1 addr=0x2f0b pc=0x2f0b]                                       
18007                        
18008 runtime stack:                                                                                                                               
18009 runtime.throw(0x4065a52, 0x2a)                                                                                                               
18010     /usr/local/go/src/runtime/panic.go:1117 +0x72                                                                                            
18011 runtime.sigpanic()                                                                                                                           
18012     /usr/local/go/src/runtime/signal_unix.go:718 +0x2e5                                                                                      
18013                                                                                                                                              
18014 goroutine 77787 [syscall]:                                                                                                                   
18015 runtime.cgocall(0x23483a2, 0xc00076f698, 0x3c19600)                                                                                          
18016     /usr/local/go/src/runtime/cgocall.go:154 +0x5b fp=0xc00076f640 sp=0xc00076f608 pc=0x85265b                                               
18017 github.com/milvus-io/milvus/internal/querynode._Cfunc_LoadFieldData(0x7f15935c5260, 0x64, 0xc004e4c000, 0xb8d5e, 0x4dd3, 0x0, 0x0)           
18018     _cgo_gotypes.go:550 +0x57 fp=0xc00076f698 sp=0xc00076f640 pc=0x227acf7                                    
18019 github.com/milvus-io/milvus/internal/querynode.(*Segment).segmentLoadFieldData.func1.1(0xc0026e09c0, 0x64, 0xc004e4c000, 0xb8d5e, 0x4dd3, 0xc003ca92b8, 0xc001359200)                                                                                            
18020     /go/src/github.com/milvus-io/milvus/internal/querynode/segment.go:771 +0x85 fp=0xc00076f6e8 sp=0xc00076f698 pc=0x229cbe5
18021 github.com/milvus-io/milvus/internal/querynode.(*Segment).segmentLoadFieldData.func1(0xc0007b8e50, 0xca5681c8360fc, 0xc00076f778, 0x85540b)
18022     /go/src/github.com/milvus-io/milvus/internal/querynode/segment.go:771 +0x5b fp=0xc00076f738 sp=0xc00076f6e8 pc=0x229cc7b
18023 github.com/milvus-io/milvus/internal/util/concurrency.(*Pool).Submit.func1()                                  
18024     /go/src/github.com/milvus-io/milvus/internal/util/concurrency/pool.go:46 +0x59 fp=0xc00076f788 sp=0xc00076f738 pc=0x1e209d9              
18025 github.com/panjf2000/ants/v2.(*goWorker).run.func1(0xc000433e30)                                              
18026     /go/pkg/mod/github.com/panjf2000/ants/v2@v2.4.8/worker.go:70 +0x94 fp=0xc00076f7d8 sp=0xc00076f788 pc=0x1e20014                          
18027 runtime.goexit()                                                                                              
18028     /usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc00076f7e0 sp=0xc00076f7d8 pc=0x8c2101                                              
18029 created by github.com/panjf2000/ants/v2.(*goWorker).run                                                       
18030     /go/pkg/mod/github.com/panjf2000/ants/v2@v2.4.8/worker.go:48 +0x4c                                                                       
18031                                                                                                                                              
18032 goroutine 1 [chan receive, 2678 minutes]:                                                                                                    
18033 github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc000148080, 0xc000012000, 0x0, 0x0)                                               
18034     /go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:523 +0x47c                                         
18035 github.com/milvus-io/milvus/cmd/milvus.(*run).execute(0xc0001442d0, 0xc000050090, 0x3, 0x3, 0xc0001421e0)                                    
18036     /go/src/github.com/milvus-io/milvus/cmd/milvus/run.go:111 +0x496                                          
18037 github.com/milvus-io/milvus/cmd/milvus.RunMilvus(0xc000050090, 0x3, 0x3)                                                                     
18038     /go/src/github.com/milvus-io/milvus/cmd/milvus/milvus.go:60 +0x162                                                                       
18039 main.main()                                                                                                                                  
18040     /go/src/github.com/milvus-io/milvus/cmd/main.go:26 +0x45  
XuanYang-cn commented 2 years ago

Binlog meta is correct, looking into string index now

{\"segmentID\":436168328301772801,\"partitionID\":436166538015539201,\"collectionID\":436166537753657345,\"binlog_paths\":[
    {\"fieldID\":105,\"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/105/436168330176364550\",\"log_size\":41541639},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/105/436168336625106950\",\"log_size\":41443335},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/105/436168341487616006\",\"log_size\":40443911},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/105/436168347005747206\",\"log_size\":20774919},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/105/436168347477606406\",\"log_size\":20774919},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/105/436168505394200582\",\"log_size\":106503}]},
    {"fieldID: 0", \"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/0/436168330176364551\",\"log_size\":40576},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/0/436168336625106951\",\"log_size\":40480},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/0/436168341487616007\",\"log_size\":39504},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/0/436168347005747207\",\"log_size\":20296},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/0/436168347477606407\",\"log_size\":20296},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/0/436168505394200583\",\"log_size\":112}]},
    {\"fieldID\":1,\"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/1/436168330176364552\",\"log_size\":40576},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/1/436168336625106952\",\"log_size\":40480},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/1/436168341487616008\",\"log_size\":39504},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/1/436168347005747208\",\"log_size\":20296},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/1/436168347477606408\",\"log_size\":20296},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/1/436168505394200584\",\"log_size\":112}]},
    {\"fieldID\":100,\"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/100/436168330176364545\",\"log_size\":7},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/100/436168336625106945\",\"log_size\":7},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/100/436168341487616001\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/100/436168347005747201\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/100/436168347477606401\",\"log_size\":7},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/100/436168505394200577\",\"log_size\":7}]},
    {\"fieldID\":101,\"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/101/436168330176364546\",\"log_size\":7},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/101/436168336625106946\",\"log_size\":7},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/101/436168341487616002\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/101/436168347005747202\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/101/436168347477606402\",\"log_size\":7},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/101/436168505394200578\",\"log_size\":7}]},
    {\"fieldID\":102,\"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/102/436168330176364547\",\"log_size\":7},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/102/436168336625106947\",\"log_size\":7},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/102/436168341487616003\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/102/436168347005747203\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/102/436168347477606403\",\"log_size\":7},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/102/436168505394200579\",\"log_size\":7}]},
    {\"fieldID\":103,\"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/103/436168330176364548\",\"log_size\":7},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/103/436168336625106948\",\"log_size\":7},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/103/436168341487616004\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/103/436168347005747204\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/103/436168347477606404\",\"log_size\":7},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/103/436168505394200580\",\"log_size\":7}]},
    {\"fieldID\":104,\"binlogs\":[
        {\"entries_num\":5071,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168328249606148,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/104/436168330176364549\",\"log_size\":7},
        {\"entries_num\":5059,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168333885440005,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/104/436168336625106949\",\"log_size\":7},
        {\"entries_num\":4937,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168339691929605,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/104/436168341487616005\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/104/436168347005747205\",\"log_size\":7},
        {\"entries_num\":2536,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/104/436168347477606405\",\"log_size\":7},
        {\"entries_num\":13,\"timestamp_from\":18446744073709551615,\"timestamp_to\":436168345432883201,\"log_path\":\"file/insert_log/436166537753657345/436166538015539201/436168328301772801/104/436168505394200581\",\"log_size\":7}]}]
\"num_of_rows\":20152,
\"statslogs\":[
    {\"fieldID\":100,\"binlogs\":[
        {\"log_path\":\"file/stats_log/436166537753657345/436166538015539201/436168328301772801/100/436168505394200585\",\"log_size\":117960}]}],
\"deltalogs\":[{}],
\"index_infos\":[
    {\"fieldID\":101,\"enable_index\":true,\"index_name\":\"patid\",\"indexID\":436167134655545345,\"buildID\":436168347110080518,
    \"index_params\":[{\"key\":\"index_type\",\"value\":\"Trie\"}],
    \"index_file_paths\":[
        \"file/index_files/436168347110080518/1/436166538015539201/436168328301772801/indexParams\",
        \"file/index_files/436168347110080518/1/436166538015539201/436168328301772801/marisa_trie_index\",
        \"file/index_files/436168347110080518/1/436166538015539201/436168328301772801/marisa_trie_str_ids\"],\"index_size\":213992},

    {\"fieldID\":102,\"enable_index\":true,\"index_name\":\"group\",\"indexID\":436167136032587777,\"buildID\":436168347110080520,
    \"index_params\":[{\"key\":\"index_type\",\"value\":\"Trie\"}],
    \"index_file_paths\":[
        \"file/index_files/436168347110080520/1/436166538015539201/436168328301772801/indexParams\",
        \"file/index_files/436168347110080520/1/436166538015539201/436168328301772801/marisa_trie_index\",
        \"file/index_files/436168347110080520/1/436166538015539201/436168328301772801/marisa_trie_str_ids\"],\"index_size\":165248},

    {\"fieldID\":103,\"enable_index\":true,\"index_name\":\"agentno\",\"indexID\":436167137670201345,\"buildID\":436168347110080522,
    \"index_params\":[{\"key\":\"index_type\",\"value\":\"Trie\"}],
    \"index_file_paths\":[
        \"file/index_files/436168347110080522/1/436166538015539201/436168328301772801/indexParams\",
        \"file/index_files/436168347110080522/1/436166538015539201/436168328301772801/marisa_trie_index\",
        \"file/index_files/436168347110080522/1/436166538015539201/436168328301772801/marisa_trie_str_ids\"],\"index_size\":164544},

    {\"fieldID\":104,\"enable_index\":true,\"index_name\":\"isopen\",\"indexID\":436167139203743745,\"buildID\":436168347123187714,
    \"index_params\":[{\"key\":\"index_type\",\"value\":\"Trie\"}],
    \"index_file_paths\":[
        \"file/index_files/436168347123187714/1/436166538015539201/436168328301772801/indexParams\",
        \"file/index_files/436168347123187714/1/436166538015539201/436168328301772801/marisa_trie_index\",
        \"file/index_files/436168347123187714/1/436166538015539201/436168328301772801/marisa_trie_str_ids\"],\"index_size\":164544},

    {\"fieldID\":105,\"enable_index\":true,\"index_name\":\"img\",\"indexID\":436167142075006977,\"buildID\":436168347123187716,
    \"index_params\":[{\"key\":\"index_type\",\"value\":\"IVF_SQ8\"},{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"params\",\"value\":\"{\\\"nlist\\\":\\\"2048\\\"}\"}],
    \"index_file_paths\":[
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/indexParams\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_0\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_1\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_10\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_2\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_3\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_4\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_5\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_6\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_7\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_8\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/IVF_9\",
        \"file/index_files/436168347123187716/1/436166538015539201/436168328301772801/SLICE_META\"],\"index_size\":45653614}],
\"segment_size\":46802472,\
"insert_channel\":\"by-dev-rootcoord-dml_13_436166537753657345v1\"}
icodersky commented 2 years ago

Binlog meta is correct, looking into string index now

Did you find the problem now? @XuanYang-cn

XuanYang-cn commented 2 years ago

@icodersky Could you use birdwatcher to get a copy of etcd so that I can exam the meta? https://github.com/milvus-io/birdwatcher

Milvus(by-dev): backup --component all
found 37 keys, at revision 533816, starting backup...
Backing up ... 100%(37/37)
backup etcd for prefix by-dev/meta done, stored in file: bw_etcd_ALL.220707-152246.bak.gz
icodersky commented 2 years ago

Can't build the birdwatcher.

image

icodersky commented 2 years ago

go install github.com/milvus-io/birdwatcher

go install also the same timeout. @XuanYang-cn

image

icodersky commented 2 years ago

@icodersky Could you use birdwatcher to get a copy of etcd so that I can exam the meta? https://github.com/milvus-io/birdwatcher

Milvus(by-dev): backup --component all
found 37 keys, at revision 533816, starting backup...
Backing up ... 100%(37/37)
backup etcd for prefix by-dev/meta done, stored in file: bw_etcd_ALL.220707-152246.bak.gz

bw_etcd_ALL.220930-004115.bak.gz

XuanYang-cn commented 2 years ago

@icodersky Did you create index before the insert done or after?

icodersky commented 2 years ago

@icodersky Did you create index before the insert done or after?

Create index before the insert start.

XuanYang-cn commented 2 years ago

@icodersky Do you still have logs before this operation? BTW did datanode/datacoord restart during insert?

By the way, before load collection, I had Changed the milvus from the previous configuration:

helm upgrade --set indexNode.replicas=20 --set dataNode.replicas=20 --set proxy.replicas=9 --set queryNode.replicas=30 --set attu.enabled=true hx-release milvus/milvus

to the current configuration:

helm upgrade --set indexNode.replicas=5 --set dataNode.replicas=5 --set proxy.replicas=9 --set queryNode.replicas=20 --set attu.enabled=true hx-release milvus/milvus

XuanYang-cn commented 2 years ago

Create Index Data Paths are less than real data paths

Create index data Paths

SegmentID: 436181007276703745    State: Flushed
    Index name: patid, Index build ID: 436181033831104514, state: Finished, numRows: 19369, 
data paths: [
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181011064422402 
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181020030009346 
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181028353081346
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181034618322946]

Actual Data Paths:

Field 101: [
entries_num:4966 timestamp_from:18446744073709551615 timestamp_to:436181007276703747
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181011064422402" log_size:7
entries_num:4965 timestamp_from:18446744073709551615 timestamp_to:436181017093996545 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181020030009346" log_size:7
entries_num:5000 timestamp_from:18446744073709551615 timestamp_to:436181024643481604 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181028353081346" log_size:7
entries_num:1902 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181034618322946" log_size:7
entries_num:2536 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181035129503746" log_size:7  
entries_num:554 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181192219295746" log_size:7 ]
icodersky commented 2 years ago

@icodersky Do you still have logs before this operation? BTW did datanode/datacoord restart during insert?

By the way, before load collection, I had Changed the milvus from the previous configuration: helm upgrade --set indexNode.replicas=20 --set dataNode.replicas=20 --set proxy.replicas=9 --set queryNode.replicas=30 --set attu.enabled=true hx-release milvus/milvus to the current configuration: helm upgrade --set indexNode.replicas=5 --set dataNode.replicas=5 --set proxy.replicas=9 --set queryNode.replicas=20 --set attu.enabled=true hx-release milvus/milvus

We don't have that time's logs now. It's been reinstalled several times. The datanode/datacoord didn't restart during insert.

icodersky commented 2 years ago

Create Index Data Paths are less than real data paths

Create index data Paths

SegmentID: 436181007276703745  State: Flushed
  Index name: patid, Index build ID: 436181033831104514, state: Finished, numRows: 19369, 
data paths: [
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181011064422402 
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181020030009346 
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181028353081346
    file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181034618322946]

Actual Data Paths:

Field 101: [
entries_num:4966 timestamp_from:18446744073709551615 timestamp_to:436181007276703747
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181011064422402" log_size:7
entries_num:4965 timestamp_from:18446744073709551615 timestamp_to:436181017093996545 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181020030009346" log_size:7
entries_num:5000 timestamp_from:18446744073709551615 timestamp_to:436181024643481604 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181028353081346" log_size:7
entries_num:1902 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181034618322946" log_size:7
entries_num:2536 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181035129503746" log_size:7  
entries_num:554 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 
    log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181192219295746" log_size:7 ]

What causes it? Is it because the data we inserted is abnormal?

czs007 commented 2 years ago

segmentID: 436181007276703745 create time 2022-09-23 10:01:35.776 +0800 CST IndexBuildID: 436181033831104514 create time 2022-09-23 10:03:17.073 +0800 CST

IndexRequest And Binlog both have logIDs: logID: 436181011064422402 create time 2022-09-23 10:01:50.225 +0800 CST ts_to 2022-09-23 10:01:35.776 +0800 CST logID: 436181020030009346 create time 2022-09-23 10:02:24.426 +0800 CST ts_to 2022-09-23 10:02:13.226 +0800 CST logID: 436181028353081346 create time 2022-09-23 10:02:56.176 +0800 CST ts_to 2022-09-23 10:02:42.025 +0800 CST logID: 436181034618322946 create time 2022-09-23 10:03:20.076 +0800 CST ts_to 2022-09-23 10:03:14.326 +0800 CST

Only Binlog have: logID: 436181035129503746 create time 2022-09-23 10:03:22.026 +0800 CST ts_to 2022-09-23 10:03:14.326 +0800 CST logID: 436181192219295746 create time 2022-09-23 10:13:21.276 +0800 CST ts_to 2022-09-23 10:03:14.326 +0800 CST

@XuanYang-cn DataCoord updated its binlog path 10 minutes after IndexCoord queried the binlog path.

xiaofan-luan commented 2 years ago

segmentID: 436181007276703745 create time 2022-09-23 10:01:35.776 +0800 CST IndexBuildID: 436181033831104514 create time 2022-09-23 10:03:17.073 +0800 CST

IndexRequest And Binlog both have logIDs: logID: 436181011064422402 create time 2022-09-23 10:01:50.225 +0800 CST ts_to 2022-09-23 10:01:35.776 +0800 CST logID: 436181020030009346 create time 2022-09-23 10:02:24.426 +0800 CST ts_to 2022-09-23 10:02:13.226 +0800 CST logID: 436181028353081346 create time 2022-09-23 10:02:56.176 +0800 CST ts_to 2022-09-23 10:02:42.025 +0800 CST logID: 436181034618322946 create time 2022-09-23 10:03:20.076 +0800 CST ts_to 2022-09-23 10:03:14.326 +0800 CST

Only Binlog have: logID: 436181035129503746 create time 2022-09-23 10:03:22.026 +0800 CST ts_to 2022-09-23 10:03:14.326 +0800 CST logID: 436181192219295746 create time 2022-09-23 10:13:21.276 +0800 CST ts_to 2022-09-23 10:03:14.326 +0800 CST

@XuanYang-cn DataCoord updated its binlog path 10 minutes after IndexCoord queried the binlog path.

The interesting part is logID 436181035129503746 and 436181192219295746 have same ts_to as 436181034618322946

XuanYang-cn commented 2 years ago

The scale-down triggered the recovery logic.

If the recovery happens right when a segment is about to flush but not yet, DataCoord will send this flushSegment request to the New DataNode, which just started and didn't finish consuming the data.

So data for the create-index is less than the original data.

This is a known bug and has been fixed by #19985 at the master branch

/unassign /assign @yanliang567

XuanYang-cn commented 2 years ago

To be more clarify:

  1. DataNode rescaled, channels are rebalances to new DNs
  2. DC call sealed&flush to New DataNode(bug, fixed at #19985 in master),
  3. DN generates entries_num:1902 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181034618322946" log_size:7

DN is still consuming data from pulsar, consumes 2536 rows

  1. DN auto flush, generates entries_num:2536 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181035129503746" log_size:7

Now the flush of 3 succeeds and DN changes this segment to Flushed

DN is still consuming data from pulsar, consumes 554 rows, but the rows are not updated into segment replica(channel meta)

  1. After 10 minutes, DC flush staled segments, then the DN flushes the last 554 num rows and generates entries_num:554 timestamp_from:18446744073709551615 timestamp_to:436181033110994946 log_path:"file/insert_log/436166537753657345/436166538238623745/436181007276703745/101/436181192219295746" log_size:7 ]

Meta in DataCoord counts 554 rows less because all the 5th numRows info from DN is wrong.

XuanYang-cn commented 1 year ago

19985

jiaoew1991 commented 1 year ago

/unassign

yanliang567 commented 1 year ago

@czs007 so was it fixed