Open zhuwenxing opened 2 weeks ago
/assign @xiaocai2333 /unassign
After offline discussion with @foxspy, it looks like an issue with the index engine version. /assign @foxspy /unassign @xiaocai2333
@congqixia what about datanode memory? In @weiliu1031's opinion, it was caused by the large number of collections.
@zhuwenxing pls verify with the latest master, thanks~
/assign @zhuwenxing /unassign
querynode crash was fixed in milvus-io-master-d159629-20241115
However, after this fix, search/query interruptions during the rolling upgrade have become much longer: they previously lasted around 10 seconds, but now can last up to 10 minutes.
log artifacts-kafka-mixcoord-5489-server-logs.tar.gz
/assign @weiliu1031 /unassign
/assign @bigsheeper
With 5000 empty collections, the datanode OOMs with 16G of memory.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5491/pipeline
log: artifacts-kafka-mixcoord-5491-server-logs.tar.gz
@foxspy
querynode crash was fixed in milvus-io-master-d159629-20241115
This statement is incorrect: the querynode still crashes during the upgrade process.
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:383] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="start reduce query result, traceID = bf1049c9822b3b71d5c551fc5a39372e, vChannel = pulsar-mixcoord-5493-rootcoord-dml_8_453946215647690701v0, segmentIDs = []"] [duration=388.270221ms]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:399] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="do search with channel done , vChannel = pulsar-mixcoord-5493-rootcoord-dml_11_453946215647690701v3, segmentIDs = []"] [duration=388.358458ms]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:74] ["shard leader get valid search results"] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [numbers=2]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:77] [reduceSearchResultData] [traceID=bf1049c9822b3b71d5c551fc5a39372e] ["result No."=0] [nq=5] [topk=1]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:77] [reduceSearchResultData] [traceID=bf1049c9822b3b71d5c551fc5a39372e] ["result No."=1] [nq=5] [topk=1]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:301] ["skip duplicated search result"] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [count=0]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:399] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="do search with channel done , vChannel = pulsar-mixcoord-5493-rootcoord-dml_8_453946215647690701v0, segmentIDs = []"] [duration=388.388494ms]
_ZNKSt14default_deleteIN6milvus5index9IndexBaseEEclEPS2_
/usr/include/c++/12/bits/unique_ptr.h:95 pc=0x7fadab6cc814
_ZNSt10unique_ptrIN6milvus5index9IndexBaseESt14default_deleteIS2_EED4Ev
/usr/include/c++/12/bits/unique_ptr.h:396 pc=0x7fadab6cc814
_ZN6milvus7segcore13LoadIndexInfoD4Ev
/workspace/source/internal/core/src/segcore/Types.h:32 pc=0x7fadab6cc814
DeleteLoadIndexInfo
/workspace/source/internal/core/src/segcore/load_index_c.cpp:60 pc=0x7fadab6cc814
runtime.asmcgocall
/usr/local/go/src/runtime/asm_amd64.s:872 pc=0x1eb40c7
SIGSEGV: segmentation violation
PC=0x7fada783f47d m=233 sigcode=1
signal arrived during cgo execution
goroutine 3688 [syscall, locked to thread]:
_ZN7hnswlib15HierarchicalNSWIffLNS_9QuantTypeE0EED4Ev
/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/thirdparty/hnswlib/hnswlib/hnswalg.h:194 pc=0x7fada783f47d
_ZN7hnswlib15HierarchicalNSWIffLNS_9QuantTypeE0EED4Ev
/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/thirdparty/hnswlib/hnswlib/hnswalg.h:195 pc=0x7fada783f47d
_ZN8knowhere13HnswIndexNodeIfLN7hnswlib9QuantTypeE0EED4Ev
/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/index/hnsw/hnsw.cc:579 pc=0x7fada783f47d
_ZN8knowhere13HnswIndexNodeIfLN7hnswlib9QuantTypeE0EED0Ev
/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/index/hnsw/hnsw.cc:581 pc=0x7fada783f47d
_ZN8knowhere5IndexINS_9IndexNodeEED4Ev
/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/include/knowhere/index/index.h:207 pc=0x7fadab39d8b2
_ZN6milvus5index14VectorMemIndexIfED4Ev
/workspace/source/internal/core/src/index/VectorMemIndex.h:34 pc=0x7fadab39d8b2
_ZN6milvus5index14VectorMemIndexIfED0Ev
/workspace/source/internal/core/src/index/VectorMemIndex.h:34 pc=0x7fadab39d8b2
_ZNKSt14default_deleteIN6milvus5index9IndexBaseEEclEPS2_
/usr/include/c++/12/bits/unique_ptr.h:95 pc=0x7fadab6cc814
_ZNSt10unique_ptrIN6milvus5index9IndexBaseESt14default_deleteIS2_EED4Ev
/usr/include/c++/12/bits/unique_ptr.h:396 pc=0x7fadab6cc814
_ZN6milvus7segcore13LoadIndexInfoD4Ev
/workspace/source/internal/core/src/segcore/Types.h:32 pc=0x7fadab6cc814
DeleteLoadIndexInfo
/workspace/source/internal/core/src/segcore/load_index_c.cpp:60 pc=0x7fadab6cc814
runtime.asmcgocall
/usr/local/go/src/runtime/asm_amd64.s:872 pc=0x1eb40c7
runtime.cgocall(0x4fd22a0, 0xc001a1eed8)
/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc001a1eeb0 sp=0xc001a1ee78 pc=0x1e4444b
github.com/milvus-io/milvus/internal/querynodev2/segments._Cfunc_DeleteLoadIndexInfo(0x7faae68a0000)
_cgo_gotypes.go:510 +0x3f fp=0xc001a1eed8 sp=0xc001a1eeb0 pc=0x4d862ff
github.com/milvus-io/milvus/internal/querynodev2/segments.deleteLoadIndexInfo.func1.1(0x10000c0039ec060?)
/workspace/source/internal/querynodev2/segments/load_index_info.go:62 +0x34 fp=0xc001a1ef10 sp=0xc001a1eed8 pc=0x4d8d2f4
github.com/milvus-io/milvus/internal/querynodev2/segments.deleteLoadIndexInfo.func1()
/workspace/source/internal/querynodev2/segments/load_index_info.go:62 +0x17 fp=0xc001a1ef28 sp=0xc001a1ef10 pc=0x4d8d297
github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1()
/workspace/source/pkg/util/conc/pool.go:81 +0xb3 fp=0xc001a1ef88 sp=0xc001a1ef28 pc=0x4d60953
github.com/panjf2000/ants/v2.(*goWorker).run.func1()
/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 +0x8d fp=0xc001a1efe0 sp=0xc001a1ef88 pc=0x3b211ad
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001a1efe8 sp=0xc001a1efe0 pc=0x1eb4441
created by github.com/panjf2000/ants/v2.(*goWorker).run in goroutine 3698
/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:48 +0x5c
@zhuwenxing give me the loki link, pls
cluster: 4am, ns: chaos-testing, pod info:
[2024-11-15T11:18:20.053Z] [2024-11-15 11:18:19 - INFO - ci_test]: kubectl get pod|grep pulsar-mixcoord-5493
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-etcd-0 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-etcd-1 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-etcd-2 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-datanode-69dc484dd9-7dxhk 1/1 Running 0 35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-indexnode-79d978fff7-bvrcn 1/1 Running 0 28m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-indexnode-79d978fff7-mdlks 1/1 Running 0 27m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-indexnode-79d978fff7-tgr9b 1/1 Running 0 26m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-mixcoord-68848ddc5c-h4zkd 1/1 Running 0 23m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-proxy-6fd6c7d64b-6wt8w 1/1 Running 0 35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-0-78fd8db695-h6mvj 0/1 CrashLoopBackOff 6 (10s ago) 35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-0-78fd8db695-lp88q 0/1 CrashLoopBackOff 6 (68s ago) 35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-1-77d8965ccc-99hqp 1/1 Running 0 67s
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-1-77d8965ccc-vnfbw 1/1 Running 0 15m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-0 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-1 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-2 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-3 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-bookie-0 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-bookie-1 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-bookie-init-q8gb9 0/1 Completed 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-broker-0 1/1 Running 4 (86s ago) 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-proxy-0 1/1 Running 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-pulsar-init-sj4kv 0/1 Completed 0 38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-zookeeper-0 1/1 Running 0 38m
here is the pod info @foxspy
This PR (https://github.com/milvus-io/milvus/pull/34278) accelerates the subscription speed of the dispatcher. However, subscriptions are now created faster than the dispatcher can merge them, which causes the DataNode OOM.
Test with PR https://github.com/milvus-io/milvus/pull/34278: consumer num = 1.5k
After reverting PR https://github.com/milvus-io/milvus/pull/34278: consumer num = 214
I think we should limit the concurrency of DataNode subscriptions.
The index was built by faiss_hnsw but loaded with hnswlib. Once the indexNode is upgraded, it builds HNSW indexes directly with faiss_hnsw, and a queryNode that has not been upgraded yet cannot parse them. The build path needs to be isolated by version, so that indexes created with a version less than or equal to 5 are still built through hnswlib. After the upgrade is completed, the version will reach 6, and index construction will be done by faiss_hnsw.
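The version isolation described above can be sketched as a simple gate. This is an illustration under assumptions, not the actual knowhere or Milvus code: `hnswBuilder`, `clusterIndexVersion`, and the idea of taking the minimum version across nodes are hypothetical; only the version numbers 5 and 6 and the names hnswlib and faiss_hnsw come from the comment.

```go
package main

import "fmt"

// faissHNSWMinVersion is the first index engine version at which HNSW is
// built by faiss_hnsw (6, per the discussion). The constant name is hypothetical.
const faissHNSWMinVersion int32 = 6

// hnswBuilder picks the HNSW implementation for a given index version:
// versions <= 5 keep using hnswlib so that old queryNodes can still load them.
func hnswBuilder(indexVersion int32) string {
	if indexVersion >= faissHNSWMinVersion {
		return "faiss_hnsw"
	}
	return "hnswlib"
}

// clusterIndexVersion returns the lowest version reported by any node.
// ASSUMPTION: building at the minimum keeps indexes readable by every node
// during a rolling upgrade. An empty slice yields 0, which falls back to hnswlib.
func clusterIndexVersion(nodeVersions []int32) int32 {
	if len(nodeVersions) == 0 {
		return 0
	}
	lowest := nodeVersions[0]
	for _, v := range nodeVersions[1:] {
		if v < lowest {
			lowest = v
		}
	}
	return lowest
}

func main() {
	midUpgrade := []int32{6, 5, 6} // one queryNode not upgraded yet
	afterUpgrade := []int32{6, 6, 6}
	fmt.Println(hnswBuilder(clusterIndexVersion(midUpgrade)))
	fmt.Println(hnswBuilder(clusterIndexVersion(afterUpgrade)))
}
```

The point of the gate is that an upgraded indexNode keeps producing hnswlib-format indexes until every reader can parse the faiss_hnsw format.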
@zhuwenxing Fixed and republished knowhere, please re-verify, thanks~
@foxspy
The querynode crash fix was verified in master-00edec2-20241118.
@bigsheeper we still need a fix for datanode high memory usage issue
As mentioned, this is for the 5K-collection scenario.
Is there an existing issue for this?
Environment
Current Behavior
The time point when the memory usage surged drastically coincided with the time when mixcoord started to upgrade.
Logs when the query node crashes
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5485/pipeline
log: artifacts-kafka-mixcoord-5485-server-logs.tar.gz
cluster: 4am ns: chaos-testing pod info
Anything else?
This issue reproduces consistently.