milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [Nightly]Cluster pulsar failed for 6h timeout with milvus panic #27010

Closed NicoYuan1986 closed 1 year ago

NicoYuan1986 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230911-ac45af58
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.0b0.post1.dev127
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The nightly cluster run with Pulsar hit the 6h timeout, and Milvus panicked.

link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/494/pipeline/242/
log: artifacts-milvus-distributed-pulsar-nightly-494-pymilvus-e2e-logs.tar.gz

2023-09-12T02:39:20.451506535+08:00 stderr F panic: runtime error: invalid memory address or nil pointer dereference
2023-09-12T02:39:20.451527054+08:00 stderr F [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x3980c2c]
2023-09-12T02:39:20.451531954+08:00 stderr F
2023-09-12T02:39:20.451538067+08:00 stderr F goroutine 28349 [running]:
2023-09-12T02:39:20.45154535+08:00 stderr F panic({0x3f73060, 0x6268450})
2023-09-12T02:39:20.451555393+08:00 stderr F    /usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc001a0d788 sp=0xc001a0d6c8 pc=0x169fbdb
2023-09-12T02:39:20.451560925+08:00 stderr F runtime.panicmem(...)
2023-09-12T02:39:20.45156462+08:00 stderr F     /usr/local/go/src/runtime/panic.go:260
2023-09-12T02:39:20.45156807+08:00 stderr F runtime.sigpanic()
2023-09-12T02:39:20.451572073+08:00 stderr F    /usr/local/go/src/runtime/signal_unix.go:841 +0x37d fp=0xc001a0d7e8 sp=0xc001a0d788 pc=0x16b7e1d
2023-09-12T02:39:20.451575848+08:00 stderr F github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).DescribeIndex(0xc0001b5740, {0x494d998, 0xc002078dc0}, 0x62a15945710d185)
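For triage, the first non-runtime frame of a trace like the one above can be pulled out mechanically. The following is only a minimal sketch: it assumes the container log prefix shown above (timestamp, stream, `F`/`P` tag), and the helper name `first_app_frame` is hypothetical, not part of Milvus or any CI tooling.

```python
import re
from typing import Iterable, Optional

# Container log prefix as seen above: "<timestamp> stderr F <content>".
PREFIX = re.compile(r"^\S+ (?:stdout|stderr) [FP](?: (.*))?$")
# A Go stack frame line: "<function>(<args>)", where args contain no parentheses.
FRAME = re.compile(r"^(.+)\(([^()]*)\)$")


def first_app_frame(lines: Iterable[str]) -> Optional[str]:
    """Return the first function in the trace that is not panic/runtime code."""
    for raw in lines:
        m = PREFIX.match(raw.rstrip("\n"))
        content = (m.group(1) or "") if m else raw
        # Indented lines are "file:line" locations, not function frames.
        if not content or content[0] in " \t":
            continue
        frame = FRAME.match(content.strip())
        if not frame:
            continue
        func = frame.group(1)
        if func == "panic" or func.startswith("runtime."):
            continue
        return func
    return None


TRACE = """\
2023-09-12T02:39:20.451506535+08:00 stderr F panic: runtime error: invalid memory address or nil pointer dereference
2023-09-12T02:39:20.451538067+08:00 stderr F goroutine 28349 [running]:
2023-09-12T02:39:20.45154535+08:00 stderr F panic({0x3f73060, 0x6268450})
2023-09-12T02:39:20.451560925+08:00 stderr F runtime.panicmem(...)
2023-09-12T02:39:20.45156807+08:00 stderr F runtime.sigpanic()
2023-09-12T02:39:20.451575848+08:00 stderr F github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).DescribeIndex(0xc0001b5740, {0x494d998, 0xc002078dc0}, 0x62a15945710d185)
"""

if __name__ == "__main__":
    print(first_app_frame(TRACE.splitlines()))
```

On the excerpt above this points at `(*CoordinatorBroker).DescribeIndex` in `querycoordv2/meta`, which is where the nil pointer dereference surfaced. It does not by itself identify which value was nil.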

Datacoord restarted 53 times:

[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-datacoord-66dcb76658-4cnsc restarts 53, last terminateed reason is "Error"

[2023-09-12T00:01:37.231Z] mdp-494-n-etcd-0                                 1/1     Running       0                6h7m    10.105.7.191   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-etcd-1                                 1/1     Running       0                6h7m    10.105.5.234   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-etcd-2                                 1/1     Running       0                6h7m    10.105.1.209   ci-node10   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-datacoord-66dcb76658-4cnsc      1/1     Running       53 (6m21s ago)   6h7m    10.105.7.177   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-datanode-755cf99576-b8r57       1/1     Running       1 (6h2m ago)     6h7m    10.105.5.212   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-datanode-755cf99576-bqw94       1/1     Running       1 (6h2m ago)     6h7m    10.105.7.179   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-indexcoord-f5567d6f-frf5z       1/1     Running       0                6h7m    10.105.5.217   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-indexnode-7c9d85c475-2vsml      1/1     Running       0                6h7m    10.105.1.197   ci-node10   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-indexnode-7c9d85c475-9wb72      1/1     Running       0                6h7m    10.105.5.220   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-proxy-5b5c5f9cd-jj8vw           1/1     Running       1 (6h2m ago)     6h7m    10.105.7.180   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-proxy-5b5c5f9cd-k6j7l           1/1     Running       1 (6h2m ago)     6h7m    10.105.5.216   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-querycoord-74b64949bb-nfbkm     1/1     Running       3 (5h2m ago)     6h7m    10.105.5.213   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-querynode-6855cbc4c7-dsvqp      1/1     Running       0                6h7m    10.105.1.198   ci-node10   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-querynode-6855cbc4c7-zj99l      1/1     Running       0                6h7m    10.105.7.182   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-rootcoord-57c77cfd84-nnvdr      1/1     Running       1 (6h3m ago)     6h7m    10.105.5.214   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-minio-c6d64ff87-rmbbp                  1/1     Running       0                6h7m    10.105.1.205   ci-node10   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-bookie-0                        1/1     Running       0                6h7m    10.105.5.235   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-bookie-1                        1/1     Running       0                6h7m    10.105.1.212   ci-node10   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-bookie-2                        1/1     Running       0                6h7m    10.105.7.194   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-bookie-init-lchcf               0/1     Completed     0                6h7m    10.105.5.219   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-broker-0                        1/1     Running       0                6h7m    10.105.1.196   ci-node10   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-broker-1                        1/1     Running       0                6h7m    10.105.7.181   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-proxy-0                         1/1     Running       0                6h7m    10.105.5.215   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-pulsar-init-j42fl               0/1     Completed     0                6h7m    10.105.5.218   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-recovery-0                      1/1     Running       0                6h7m    10.105.7.178   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-zookeeper-0                     1/1     Running       0                6h7m    10.105.1.206   ci-node10   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-zookeeper-1                     1/1     Running       0                6h6m    10.105.5.243   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-pulsar-zookeeper-2                     1/1     Running       0                6h5m    10.105.7.202   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-datacoord-66dcb76658-4cnsc restarts 53, last terminateed reason is "Error"
[2023-09-12T00:01:37.489Z] mdp-494-n-milvus-datanode-755cf99576-b8r57 restarts 1, last terminateed reason is "Error"
[2023-09-12T00:01:37.489Z] mdp-494-n-milvus-datanode-755cf99576-bqw94 restarts 1, last terminateed reason is "Error"
[2023-09-12T00:01:37.489Z] mdp-494-n-milvus-proxy-5b5c5f9cd-jj8vw restarts 1, last terminateed reason is "Error"
[2023-09-12T00:01:37.489Z] mdp-494-n-milvus-proxy-5b5c5f9cd-k6j7l restarts 1, last terminateed reason is "Error"
[2023-09-12T00:01:37.745Z] mdp-494-n-milvus-querycoord-74b64949bb-nfbkm restarts 3, last terminateed reason is "Error"
[2023-09-12T00:01:37.745Z] mdp-494-n-milvus-rootcoord-57c77cfd84-nnvdr restarts 1, last terminateed reason is "Error"
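The restart summary above can also be derived straight from the pod listing. Below is a rough sketch, assuming the column order shown in the listing (name, READY, STATUS, RESTARTS) and an optional CI timestamp prefix; `restarted_pods` is a hypothetical helper name, not something from the Milvus CI scripts.

```python
import re
from typing import List, Tuple

# One row of the pod listing: optional "[timestamp]" prefix, pod name,
# READY (e.g. 1/1), STATUS, then the restart count.
ROW = re.compile(
    r"^(?:\[[^\]]+\]\s+)?"
    r"(?P<pod>\S+)\s+"
    r"(?P<ready>\d+/\d+)\s+"
    r"(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)"
)


def restarted_pods(listing: str) -> List[Tuple[str, int]]:
    """Return (pod, restarts) for pods with at least one restart, highest first."""
    pods = []
    for line in listing.splitlines():
        m = ROW.match(line.strip())
        if m and int(m.group("restarts")) > 0:
            pods.append((m.group("pod"), int(m.group("restarts"))))
    return sorted(pods, key=lambda p: p[1], reverse=True)


SAMPLE = """\
[2023-09-12T00:01:37.231Z] mdp-494-n-etcd-0                                 1/1     Running       0                6h7m    10.105.7.191   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-datacoord-66dcb76658-4cnsc      1/1     Running       53 (6m21s ago)   6h7m    10.105.7.177   ci-node12   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-querycoord-74b64949bb-nfbkm     1/1     Running       3 (5h2m ago)     6h7m    10.105.5.213   ci-node11   <none>           <none>
[2023-09-12T00:01:37.231Z] mdp-494-n-milvus-datacoord-66dcb76658-4cnsc restarts 53, last terminateed reason is "Error"
"""

if __name__ == "__main__":
    for pod, restarts in restarted_pods(SAMPLE):
        print(f"{pod}: {restarts}")
```

Sorting by restart count puts datacoord (53 restarts) ahead of querycoord (3), which matches the order in which the components are worth inspecting here. Lines that are not table rows, such as the "restarts N, last terminateed reason" summaries, are ignored.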

Expected Behavior

pass

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 1 year ago

/assign @smellthemoon could you help investigate?

NicoYuan1986 commented 1 year ago

kafka: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/494/pipeline/205

Datacoord restarts many times:

[2023-09-11T18:16:08.959Z] mdk-494-n-milvus-datacoord-8577d865c7-wd6fq      0/1     CrashLoopBackOff   4 (92s ago)   21m   10.105.5.221   ci-node11   <none>           <none>

yanliang567 commented 1 year ago

/unassign

smellthemoon commented 1 year ago

1

smellthemoon commented 1 year ago

Related to #26485. Please check again after #27045 and #27013 are merged.

NicoYuan1986 commented 1 year ago

The Pulsar run now completes in 1h 35min.

NicoYuan1986 commented 1 year ago

Seems fixed in 2.3.0-20230915-3f550cee.