milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Rocksmq panic with (send on closed channel) at runtime #29101

Open chyezh opened 10 months ago

chyezh commented 10 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.3-on-dev
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): macOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Rocksmq has a data race that can cause a panic (send on closed channel) at runtime.

Expected Behavior

No panic at runtime.

Steps To Reproduce

1. Frequently create consumers and produce messages on a Milvus cluster.

Milvus Log

panic: send on closed channel

goroutine 2599 [running]:
panic({0x105cd9740?, 0x1062026d0?})
    /opt/homebrew/Cellar/go/1.21.4/libexec/src/runtime/panic.go:1017 +0x388 fp=0x14008e12960 sp=0x14008e128b0 pc=0x1025034b8
runtime.chansend(0x140032bb140, 0x14008e12a52, 0x0, 0x140061c08d0?)
    /opt/homebrew/Cellar/go/1.21.4/libexec/src/runtime/chan.go:206 +0x3d4 fp=0x14008e129d0 sp=0x14008e12960 pc=0x1024ccdf4
runtime.selectnbsend(0x1400138b3a8?, 0x105b416a0?)
    /opt/homebrew/Cellar/go/1.21.4/libexec/src/runtime/chan.go:694 +0x24 fp=0x14008e12a00 sp=0x14008e129d0 pc=0x1024cd9e4
github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/server.(*rocksmq).Produce(0x1400138b340, {0x14002eb43f0, 0x16}, {0x1400729bf00, 0x1, 0x1})
    /Users/zilliz/repo/github/chyezh/milvus/internal/mq/mqimpl/rocksmq/server/rocksmq_impl.go:666 +0x161c fp=0x14008e134a0 sp=0x14008e12a00 pc=0x10446682c
github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*producer).Send(0x14003e250c8, 0x1400729b8c0)
    /Users/zilliz/repo/github/chyezh/milvus/internal/mq/mqimpl/rocksmq/client/producer_impl.go:54 +0x138 fp=0x14008e135a0 sp=0x14008e134a0 pc=0x104474cc8
github.com/milvus-io/milvus/internal/mq/msgstream/mqwrapper/rmq.(*rmqProducer).Send(0x14002a10bd0, {0x106236a80, 0x140040e3ef0}, 0x1400729b880)
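
The runtime.selectnbsend frame in the trace indicates a non-blocking send (a select with a default branch). As a minimal standalone sketch, not taken from the Milvus code, the following shows that such a send still panics once the channel is closed, whether or not its buffer is full. The helper name trySend is hypothetical, and the recover is there only so the demo can report the panic.

```go
package main

import "fmt"

// trySend performs a non-blocking send (select with a default branch),
// the pattern the Go runtime implements via runtime.selectnbsend.
// It reports whether the value was delivered and whether the send panicked.
// trySend is a hypothetical helper for this demo, not a Milvus function.
func trySend(ch chan struct{}) (sent bool, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			// A send on a closed channel panics even in a select with default.
			panicked = true
		}
	}()
	select {
	case ch <- struct{}{}:
		return true, false
	default:
		return false, false
	}
}

func main() {
	ch := make(chan struct{}, 1)
	fmt.Println(trySend(ch)) // buffer has room: true false
	fmt.Println(trySend(ch)) // buffer full, default branch taken: false false
	close(ch)
	fmt.Println(trySend(ch)) // channel closed: false true
}
```

The default branch only protects against blocking; it does not protect against a concurrent close, which is why the sender and the closer need to be synchronized.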

Anything else?

No response

chyezh commented 10 months ago

  1. RegisterConsumer runs without the topic lock, so it may observe a half-finished modification of consumers.
  2. Closed consumers can be stored back into consumers.
  3. All functions that read or write the consumers variable, such as RegisterConsumer, DestroyConsumerGroup, and Produce, have a data race.

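
One way to picture a race-free version of the three points above is to guard every access to the consumer list with a single lock, and to close a consumer's channel in the same critical section that removes it from the list. The sketch below is only an illustration under that assumption, not the actual Milvus fix; the types registry and consumer and the methods register, destroy, and notify are hypothetical names that stand in for RegisterConsumer, DestroyConsumerGroup, and the notification step of Produce.

```go
package main

import (
	"fmt"
	"sync"
)

// consumer is a hypothetical stand-in for a rocksmq consumer;
// only its notification channel matters for this sketch.
type consumer struct {
	name  string
	msgCh chan struct{}
}

// registry guards the per-topic consumer lists with one lock.
type registry struct {
	mu        sync.RWMutex
	consumers map[string][]*consumer // keyed by topic
}

func newRegistry() *registry {
	return &registry{consumers: make(map[string][]*consumer)}
}

// register appends a consumer under the write lock.
func (r *registry) register(topic string, c *consumer) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.consumers[topic] = append(r.consumers[topic], c)
}

// destroy removes the named consumer and closes its channel in the same
// critical section, so a closed channel is never reachable from the list.
func (r *registry) destroy(topic, name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	kept := r.consumers[topic][:0]
	for _, c := range r.consumers[topic] {
		if c.name == name {
			close(c.msgCh)
			continue
		}
		kept = append(kept, c)
	}
	r.consumers[topic] = kept
}

// notify does a non-blocking send to every consumer under the read lock
// and returns how many consumers accepted the notification.
func (r *registry) notify(topic string) int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	n := 0
	for _, c := range r.consumers[topic] {
		select {
		case c.msgCh <- struct{}{}:
			n++
		default: // consumer already has a pending notification
		}
	}
	return n
}

func main() {
	r := newRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		name := fmt.Sprintf("c%d", i)
		wg.Add(2)
		go func() {
			defer wg.Done()
			r.register("t", &consumer{name: name, msgCh: make(chan struct{}, 1)})
			r.destroy("t", name)
		}()
		go func() {
			defer wg.Done()
			r.notify("t")
		}()
	}
	wg.Wait()
	fmt.Println("remaining consumers:", len(r.consumers["t"]))
}
```

Running this with `go run -race` is a quick way to check that the register, destroy, and notify paths no longer race with each other. Holding the read lock during the send is acceptable here only because the send is non-blocking.
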
yanliang567 commented 10 months ago

/unassign

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 4 months ago

related issue: #33285

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 3 months ago

keep it

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 1 month ago

keep it

stale[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 3 weeks ago

/reopen