milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.48k stars 2.92k forks source link

[Bug]: During the rolling upgrade process, the datanode experiences a sudden surge in memory usage, leading to OOM kills, and the querynode also continues to crash #37665

Open zhuwenxing opened 15 hours ago

zhuwenxing commented 15 hours ago

Is there an existing issue for this?

Environment

- Milvus version:v2.4.14 --> master-20241114-cd181e4c-amd64
- Deployment mode(standalone or cluster):mixcoord
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024-11-14T06:05:08.223Z] + kubectl get pods -o wide

[2024-11-14T06:05:08.225Z] + grep kafka-mixcoord-5485

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-0                                       1/1     Running            0                 46m     10.104.17.54    4am-node23   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-1                                       1/1     Running            0                 46m     10.104.25.192   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-2                                       1/1     Running            0                 46m     10.104.32.84    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-0                                      2/2     Running            0                 46m     10.104.33.144   4am-node36   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-1                                      2/2     Running            0                 46m     10.104.25.195   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-2                                      2/2     Running            0                 46m     10.104.34.150   4am-node37   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-exporter-598459d88d-s6rnl              1/1     Running            3 (46m ago)       46m     10.104.1.52     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-0                            1/1     Running            0                 46m     10.104.25.194   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-1                            1/1     Running            0                 46m     10.104.16.205   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-2                            1/1     Running            0                 46m     10.104.32.87    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-flrm4             1/1     Running            1 (7m34s ago)     24m     10.104.16.233   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-fqflz             1/1     Running            1 (7m38s ago)     23m     10.104.32.94    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-p9kbm             1/1     Running            1 (7m35s ago)     23m     10.104.25.200   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-7mjb6              0/1     Terminating        0                 45m     10.104.1.54     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-f56rp              0/1     Terminating        0                 45m     10.104.9.10     4am-node14   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-s8n8q              0/1     Terminating        0                 45m     10.104.26.215   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-77mfr            1/1     Running            0                 36m     10.104.32.91    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-ckwq6            1/1     Running            0                 38m     10.104.16.207   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-v9wzt            1/1     Running            0                 37m     10.104.26.243   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-mixcoord-785c66488f-cpdxl             1/1     Running            0                 33m     10.104.16.223   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-proxy-5b4469bfd9-qn2q5                1/1     Running            0                 22m     10.104.16.234   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-6g6bt          0/1     CrashLoopBackOff   9 (3m48s ago)     45m     10.104.26.219   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-j6ckw          0/1     CrashLoopBackOff   9 (4m18s ago)     45m     10.104.9.11     4am-node14   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-kc2nh          0/1     CrashLoopBackOff   10 (3m19s ago)    45m     10.104.1.55     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-0                                      1/1     Running            0                 46m     10.104.33.143   4am-node36   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-1                                      1/1     Running            0                 46m     10.104.25.191   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-2                                      1/1     Running            0                 46m     10.104.32.85    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-3                                      1/1     Running            0                 46m     10.104.34.148   4am-node37   <none>           <none>

image image

The time point when the memory usage surged drastically coincided with the time when mixcoord started to upgrade.

Logs when the query node crashes

[2024/11/14 06:00:50.517 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=kafka-mixcoord-5485-rootcoord-dml_8_453918478401953022v0] [collectionID=453918478401953022] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/manager.go:198] ["start merging..."] [role=querynode] [nodeID=48] [vchannel="{\"kafka-mixcoord-5485-rootcoord-dml_2_453918478401952460v2\":{}}"]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:178] ["get signal"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:211] ["stop working"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:201] ["handle signal done"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:178] ["get signal"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:211] ["stop working"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:201] ["handle signal done"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:150] ["add new target"] [vchannel=kafka-mixcoord-5485-rootcoord-dml_2_453918478401952460v2] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:178] ["get signal"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=terminate] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgstream/mq_msgstream.go:218] ["start to close mq msg stream"] ["producer num"=0] ["consumer num"=1]
[2024/11/14 06:00:50.541 +00:00] [INFO] [pipeline/delete_node.go:57] ["pipeline fetch delete msg"] [collectionID=453918478401350763] [partitionID=-1] [deleteRowNum=1] [timestampMin=453919126392406017] [timestampMax=453919126392406017]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [delegator/delegator_data.go:182] ["start to process delete"] [collectionID=453918478401350763] [channel=kafka-mixcoord-5485-rootcoord-dml_0_453918478401350763v0] [replicaID=453918478559870985] [ts=453919126418620419]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=kafka-mixcoord-5485-rootcoord-dml_10_453918478401350747v0] [collectionID=453918478401350747] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=kafka-mixcoord-5485-rootcoord-dml_10_453918478401350747v0] [collectionID=453918478401350747] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [gc/gc_tuner.go:90] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=43] ["total memory"=850] ["next GC"=88] ["new GOGC"=200] [gc-pause=134.246µs] [gc-pause-end=1731564050542301578]
I20241114 06:00:50.547470   527 VectorMemIndex.cpp:405] [SERVER][Load][milvus] construct binary set...
I20241114 06:00:50.547560   527 VectorMemIndex.cpp:408] [SERVER][Load][milvus] add index data to binary set: HNSW
I20241114 06:00:50.547580   527 VectorMemIndex.cpp:421] [SERVER][Load][milvus] load index into Knowhere...
W20241114 06:00:50.547721   527 hnsw.cc:509] [KNOWHERE][Deserialize][milvus] hnsw inner error: Invalid metric type of float type(float32, float16 and bfloat16):551472220233
I20241114 06:00:50.547762   527 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Load index: done (0.056755 ms)
 => failed to Deserialize index: hnsw inner error at /workspace/source/internal/core/src/index/VectorMemIndex.cpp:192

[2024/11/14 06:00:50.547 +00:00] [WARN] [segments/load_index_info.go:210] ["CStatus returns err"] [traceID=74963be0f3628827c3e3a2d3c9e1fd06] [error=" => failed to Deserialize index: hnsw inner error at /workspace/source/internal/core/src/index/VectorMemIndex.cpp:192\n"] [extra="AppendIndex failed"]

SIGNAL CATCH BY NON-GO SIGNAL HANDLER

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5485/pipeline log: artifacts-kafka-mixcoord-5485-server-logs.tar.gz

cluster: 4am ns: chaos-testing pod info

[2024-11-14T06:05:08.223Z] + kubectl get pods -o wide

[2024-11-14T06:05:08.225Z] + grep kafka-mixcoord-5485

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-0                                       1/1     Running            0                 46m     10.104.17.54    4am-node23   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-1                                       1/1     Running            0                 46m     10.104.25.192   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-2                                       1/1     Running            0                 46m     10.104.32.84    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-0                                      2/2     Running            0                 46m     10.104.33.144   4am-node36   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-1                                      2/2     Running            0                 46m     10.104.25.195   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-2                                      2/2     Running            0                 46m     10.104.34.150   4am-node37   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-exporter-598459d88d-s6rnl              1/1     Running            3 (46m ago)       46m     10.104.1.52     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-0                            1/1     Running            0                 46m     10.104.25.194   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-1                            1/1     Running            0                 46m     10.104.16.205   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-2                            1/1     Running            0                 46m     10.104.32.87    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-flrm4             1/1     Running            1 (7m34s ago)     24m     10.104.16.233   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-fqflz             1/1     Running            1 (7m38s ago)     23m     10.104.32.94    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-p9kbm             1/1     Running            1 (7m35s ago)     23m     10.104.25.200   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-7mjb6              0/1     Terminating        0                 45m     10.104.1.54     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-f56rp              0/1     Terminating        0                 45m     10.104.9.10     4am-node14   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-s8n8q              0/1     Terminating        0                 45m     10.104.26.215   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-77mfr            1/1     Running            0                 36m     10.104.32.91    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-ckwq6            1/1     Running            0                 38m     10.104.16.207   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-v9wzt            1/1     Running            0                 37m     10.104.26.243   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-mixcoord-785c66488f-cpdxl             1/1     Running            0                 33m     10.104.16.223   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-proxy-5b4469bfd9-qn2q5                1/1     Running            0                 22m     10.104.16.234   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-6g6bt          0/1     CrashLoopBackOff   9 (3m48s ago)     45m     10.104.26.219   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-j6ckw          0/1     CrashLoopBackOff   9 (4m18s ago)     45m     10.104.9.11     4am-node14   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-kc2nh          0/1     CrashLoopBackOff   10 (3m19s ago)    45m     10.104.1.55     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-0                                      1/1     Running            0                 46m     10.104.33.143   4am-node36   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-1                                      1/1     Running            0                 46m     10.104.25.191   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-2                                      1/1     Running            0                 46m     10.104.32.85    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-3                                      1/1     Running            0                 46m     10.104.34.148   4am-node37   <none>           <none>

Anything else?

it is a stable reproduced issue

yanliang567 commented 14 hours ago

/assign @xiaocai2333 /unassign

congqixia commented 13 hours ago

after offline discussion with @foxspy , it looks like an issue about index engine version /assign @foxspy /unassign @xiaocai2333

zhuwenxing commented 12 hours ago

@congqixia what about datanode memory? in @weiliu1031 opinion, it was caused by the large number of collections