Closed zhuwenxing closed 5 months ago
another failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2206/pipeline
I20240424 09:49:01.114799 2813 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (108.956747 ms)
[2024/04/24 09:49:01.129 +00:00] [DEBUG] [delegator/delegator_data.go:158] ["insert into growing segment"] [collectionID=449302046281116611] [channel=by-dev-rootcoord-dml_12_449302046281116611v1] [replicaID=449302069506736136] [collectionID=449302046281116611] [segmentID=449302046285937186] [rowCount=1549] [maxTimestamp=449302263546511362]
I20240424 09:49:01.144289 2875 thread_pool.h:53] [KNOWHERE][operator()][knowhere_build1] Successfully set priority of knowhere thread.
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046284927885v0] [collectionID=449302046284927885] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046284324584v0] [collectionID=449302046284324584] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046284123108v0] [collectionID=449302046284123108] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046284324584v0] [collectionID=449302046284324584] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046284123108v0] [collectionID=449302046284123108] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046283721085v0] [collectionID=449302046283721085] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046283720200v0] [collectionID=449302046283720200] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046283721085v0] [collectionID=449302046283721085] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046283720200v0] [collectionID=449302046283720200] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046284927885v0] [collectionID=449302046284927885] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
I20240424 09:49:01.159169 2814 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (146.126908 ms)
I20240424 09:49:01.178300 2849 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (110.236858 ms)
I20240424 09:49:01.218554 2813 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (98.580036 ms)
[2024/04/24 09:49:01.242 +00:00] [WARN] [sessionutil/session_util.go:553] ["fail to retry keepAliveOnce"] [serverName=querynode] [LeaseID=9064777446526199336] [error="etcdserver: requested lease not found"]
[2024/04/24 09:49:01.242 +00:00] [WARN] [sessionutil/session_util.go:882] ["connection lost detected, shuting down"]
[2024/04/24 09:49:01.242 +00:00] [ERROR] [querynodev2/server.go:170] ["Query Node disconnected from etcd, process will exit"] ["Server Id"=7] [stack="github.com/milvus-io/milvus/internal/querynodev2.(*QueryNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynodev2/server.go:170"]
I do want to do a big modification on cgo, to make sure each Go pthread is never blocked.
/assign @congqixia /unassign
@zhuwenxing where can we see this log with "requested lease not found"? I checked the artifacts but didn't find a clue.
Right before the etcd lease expired, an abnormal thread burst was observed:
[Figure: Go thread]
[Figure: Container thread]
With help from @cqy123456, we confirmed that the KNOWHERE search/build pool sizes were very high: build pool size = 32 and search pool size = 256. However, the pod shared the node with many other services and could not actually use 64 cores. With 288 C++ threads competing for CPU, keeping the etcd lease alive becomes almost impossible.
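To make the arithmetic above concrete, here is a rough sketch of the oversubscription. The pool sizes (32 and 256) and the 64-core figure come from this thread; the 8-core value is purely a hypothetical stand-in for what the pod might really get on a shared node, not a measured number.

```python
def oversubscription(build_pool: int, search_pool: int, usable_cores: float):
    """Return (total native threads, runnable threads per usable core)."""
    if usable_cores <= 0:
        raise ValueError("usable_cores must be positive")
    total = build_pool + search_pool
    return total, total / usable_cores


# Pool sizes reported in this thread, against the 64 cores the host exposes.
total, ratio_full_host = oversubscription(32, 256, 64)

# Hypothetical: assume the pod effectively gets only 8 cores on the shared node.
_, ratio_shared = oversubscription(32, 256, 8)

print(total, ratio_full_host, ratio_shared)
```

The higher the threads-per-core ratio, the longer any single thread (including the one renewing the lease) may wait to be scheduled when all pool threads are busy.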
@zhuwenxing let's add resource limits and requests for these chaos pods and retest /assign @zhuwenxing
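Once a CPU limit is set on the pod, a process can read it from the cgroup and size its thread pools from that instead of from the host core count. The snippet below is only a sketch of that idea, assuming cgroup v2 (where `cpu.max` holds `<quota> <period>` or `max <period>`); `parse_cpu_max` and `effective_cores` are hypothetical helper names, and this is not how Milvus or KNOWHERE actually size their pools.

```python
import os
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse the contents of a cgroup v2 cpu.max file.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set. Returns the limit in cores, or None if unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cores(path: str = "/sys/fs/cgroup/cpu.max") -> float:
    """Cores the process can really use: the cgroup limit if set, else host cores."""
    try:
        with open(path) as f:
            limit = parse_cpu_max(f.read())
    except OSError:
        limit = None  # not cgroup v2, or the file is not readable
    return limit if limit is not None else float(os.cpu_count() or 1)
```

Without a limit on the pod, this falls back to the host core count, which matches the situation described above where pools were sized for all 64 cores.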
Not reproduced for a long time.
Is there an existing issue for this?
Environment
Current Behavior
Perhaps it was due to the restart of the querynode that caused a search failure.
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/13671/pipeline
log: artifacts-pulsar-pod-kill-13671-server-logs.tar.gz
Anything else?
No response