milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
31.06k stars 2.95k forks source link

[Bug]: streaming node keeps crash after pod kill chaos test #36016

Closed zhuwenxing closed 1 month ago

zhuwenxing commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version:master-20240905-8593c458-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024/09/05 09:30:25.565 +00:00] [INFO] [segment/segment_assign_interceptor.go:239] ["recover PChannel Assignment Manager success"] [pchannel="{\"Name\":\"by-dev-rootcoord-dml_6\",\"Term\":59}"]
[2024/09/05 09:30:25.565 +00:00] [ERROR] [conc/options.go:54] ["Conc pool panicked"] [panic="proto: required field \"ledgerId\" not set"] [stack="github.com/milvus-io/milvus/pkg/util/conc.(*poolOption).antsOptions.func1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/options.go:54\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1.1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:54\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:914\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1.1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:74\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:914\ngithub.com/milvus-io/milvus/pkg/streaming/util/message/adaptor.MustGetMessageIDFromMQWrapperIDBytes\n\t/go/src/github.com/milvus-io/milvus/pkg/streaming/util/message/adaptor/message_id.go:65\ngithub.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl.(*channelLifetime).Run\n\t/go/src/github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl/channel_lifetime.go:86\ngithub.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl.(*flusherImpl).Start.func1.1.1\n\t/go/src/github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl/flusher_impl.go:141\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67"]
panic: proto: required field "ledgerId" not set [recovered]
    panic: proto: required field "ledgerId" not set [recovered]
    panic: proto: required field "ledgerId" not set

goroutine 608 [running]:
panic({0x6331ee0?, 0xc0019b4b40?})
    /usr/local/go/src/runtime/panic.go:1017 +0x3ac fp=0xc001a297e8 sp=0xc001a29738 pc=0x213bc8c
github.com/milvus-io/milvus/pkg/util/conc.(*poolOption).antsOptions.func1({0x6331ee0, 0xc0019b4b40})
    /go/src/github.com/milvus-io/milvus/pkg/util/conc/options.go:56 +0x146 fp=0xc001a298b0 sp=0xc001a297e8 pc=0x3e17926
github.com/panjf2000/ants/v2.(*goWorker).run.func1.1()
    /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:54 +0x6d fp=0xc001a29928 sp=0xc001a298b0 pc=0x3e14f8d
runtime.deferCallSave(0xc001a299e0, 0xc001a29fb8?)
    /usr/local/go/src/runtime/panic.go:798 +0x84 fp=0xc001a29938 sp=0xc001a29928 pc=0x213b844
runtime.runOpenDeferFrame(0xc0019d67d0)
    /usr/local/go/src/runtime/panic.go:771 +0x1b8 fp=0xc001a29978 sp=0xc001a29938 pc=0x213b678
panic({0x6331ee0?, 0xc0019b4b40?})
    /usr/local/go/src/runtime/panic.go:914 +0x21f fp=0xc001a29a28 sp=0xc001a29978 pc=0x213baff
github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1.1()
    /go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:74 +0x8d fp=0xc001a29a88 sp=0xc001a29a28 pc=0x4a9fcad
runtime.deferCallSave(0xc001a29b40, 0xc001a29f50?)
    /usr/local/go/src/runtime/panic.go:798 +0x84 fp=0xc001a29a98 sp=0xc001a29a88 pc=0x213b844
runtime.runOpenDeferFrame(0xc0019d6820)
    /usr/local/go/src/runtime/panic.go:771 +0x1b8 fp=0xc001a29ad8 sp=0xc001a29a98 pc=0x213b678
panic({0x6331ee0?, 0xc0019b4b40?})
    /usr/local/go/src/runtime/panic.go:914 +0x21f fp=0xc001a29b88 sp=0xc001a29ad8 pc=0x213baff
github.com/milvus-io/milvus/pkg/streaming/util/message/adaptor.MustGetMessageIDFromMQWrapperIDBytes({0x69bdcaa?, 0xc001a0ab00?}, {0x0?, 0x0?, 0x0?})
    /go/src/github.com/milvus-io/milvus/pkg/streaming/util/message/adaptor/message_id.go:65 +0x11c fp=0xc001a29bc8 sp=0xc001a29b88 pc=0x4a544dc
github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl.(*channelLifetime).Run(0xc0017a6460)
    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl/channel_lifetime.go:86 +0x30a fp=0xc001a29e00 sp=0xc001a29bc8 pc=0x5d7ffca
github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl.(*flusherImpl).Start.func1.1.1()
    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl/flusher_impl.go:141 +0x46 fp=0xc001a29f28 sp=0xc001a29e00 pc=0x5d81ae6
github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1()
    /go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 +0xb3 fp=0xc001a29f88 sp=0xc001a29f28 pc=0x4a9fb73
github.com/panjf2000/ants/v2.(*goWorker).run.func1()
    /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 +0x8d fp=0xc001a29fe0 sp=0xc001a29f88 pc=0x3e14ead
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001a29fe8 sp=0xc001a29fe0 pc=0x21754e1
created by github.com/panjf2000/ants/v2.(*goWorker).run in goroutine 403
    /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:48 +0x5c

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-straming-node-cron/detail/chaos-test-straming-node-cron/3/pipeline

log: artifacts-streamingnode-pod-kill-3-server-logs.tar.gz

pod info


[2024-09-05T09:15:54.301Z] + kubectl get pods -o wide

[2024-09-05T09:15:54.302Z] + grep streamingnode-pod-kill-3

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-etcd-0                                   1/1     Running            0               38m     10.104.23.111   4am-node27   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-etcd-1                                   1/1     Running            0               38m     10.104.21.84    4am-node24   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-etcd-2                                   1/1     Running            0               38m     10.104.17.254   4am-node23   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-datanode-9d86bb8c4-jbq65          1/1     Running            2 (37m ago)     38m     10.104.32.3     4am-node39   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-datanode-9d86bb8c4-vwgft          1/1     Running            2 (37m ago)     38m     10.104.13.181   4am-node16   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-indexnode-7c9dbd75bc-j47pv        1/1     Running            2 (37m ago)     38m     10.104.20.82    4am-node22   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-indexnode-7c9dbd75bc-jc476        1/1     Running            2 (37m ago)     38m     10.104.32.4     4am-node39   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-indexnode-7c9dbd75bc-n845m        1/1     Running            0               38m     10.104.1.132    4am-node10   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-mixcoord-55b4579575-hhk4t         1/1     Running            2 (37m ago)     38m     10.104.14.62    4am-node18   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-proxy-5f84b84669-bl2b2            1/1     Running            2 (37m ago)     38m     10.104.34.37    4am-node37   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-querynode-574d85dc7-6ng6q         1/1     Running            2 (37m ago)     38m     10.104.4.201    4am-node11   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-querynode-574d85dc7-msrrr         1/1     Running            2 (37m ago)     38m     10.104.16.31    4am-node21   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-querynode-574d85dc7-ptngz         1/1     Running            2 (37m ago)     38m     10.104.34.38    4am-node37   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-streamingnode-54cb645df5-2hljs    0/1     CrashLoopBackOff   6 (2m52s ago)   9m53s   10.104.1.163    4am-node10   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-milvus-streamingnode-54cb645df5-56kzh    0/1     CrashLoopBackOff   6 (57s ago)     9m53s   10.104.34.80    4am-node37   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-minio-0                                  1/1     Running            0               38m     10.104.21.85    4am-node24   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-minio-1                                  1/1     Running            0               38m     10.104.17.252   4am-node23   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-minio-2                                  1/1     Running            0               38m     10.104.23.118   4am-node27   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-minio-3                                  1/1     Running            0               38m     10.104.30.134   4am-node38   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-bookie-0                          1/1     Running            0               38m     10.104.30.132   4am-node38   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-bookie-1                          1/1     Running            0               38m     10.104.21.87    4am-node24   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-bookie-init-6nhx4                 0/1     Completed          0               38m     10.104.14.59    4am-node18   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-broker-0                          1/1     Running            0               38m     10.104.14.58    4am-node18   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-proxy-0                           1/1     Running            0               38m     10.104.6.247    4am-node13   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-pulsar-init-gp5b5                 0/1     Completed          0               38m     10.104.14.61    4am-node18   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-recovery-0                        1/1     Running            0               38m     10.104.17.246   4am-node23   <none>           <none>

[2024-09-05T09:15:54.558Z] streamingnode-pod-kill-3-pulsar-zookeeper-0                       1/1     Running            0               38m     10.104.17.251   4am-node23   <none>           <none>

Anything else?

No response

zhuwenxing commented 2 months ago

/assign @chyezh PTAL

chyezh commented 2 months ago
[2024/09/05 09:00:55.774 +00:00] [INFO] [datacoord/services.go:641] ["DropVChannel plan to remove"] [traceID=0254935119240ad2ea375eecacf7dad4] [channel=by-dev-rootcoord-dml_0_452336160083542224v0]

[2024/09/05 09:00:55.775 +00:00] [WARN] [datacoord/services.go:644] ["DropVChannel failed to ReleaseAndRemove"] [traceID=0254935119240ad2ea375eecacf7dad4] [channel=by-dev-rootcoord-dml_0_452336160083542224v0] [error="fail to find matching nodeID: 13 with channelName: by-dev-rootcoord-dml_0_452336160083542224v0"]
yanliang567 commented 2 months ago

/unassign

chyezh commented 2 months ago

@zhuwenxing please verify it with latest master image

zhuwenxing commented 2 months ago

@chyezh

all pods can be healthy after the pod kill chaos test. but many test cases failed due to load and flush failure after recovery.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-straming-node-cron/detail/chaos-test-straming-node-cron/8/pipeline log: artifacts-streamingnode-pod-kill-8-server-logs.tar.gz

chyezh commented 2 months ago

There's an error-handling bug at Txn of streaming service. Cause a dead lock forever.

            defer func() {
                if err != nil {
                    txnSession.AddNewMessageFail()
                }
                // perform keepalive for the transaction session if append success.
                txnSession.AddNewMessageDoneAndKeepalive(msg.TimeTick())
            }()
zhuwenxing commented 1 month ago

verified and fixed in master-20241008-ca9842e4-amd64