milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: rocksmq consumer gets stuck if no more messages are produced #34045

Open chyezh opened 3 months ago

chyezh commented 3 months ago

Is there an existing issue for this?

Environment

- Milvus version: v2.4.4
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):   rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The rocksmq consumer's deliver routine may miss a producer signal, so the consumer can get stuck if no further messages are produced.

https://github.com/milvus-io/milvus/blob/66710008d63c42633acbb785d4b9313513f07819/internal/mq/mqimpl/rocksmq/client/client_impl.go#L149

Expected Behavior

No response

Steps To Reproduce

1. Start rocksmq and create a consumer with `msgChannel=16` on a topic.
2. Produce 20 messages to it.
3. Consume the messages. The consumer gets stuck: the last 4 messages are never delivered unless more messages are produced.
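The steps above can be modelled without Milvus at all. The following is a minimal, single-threaded sketch, not the actual rocksmq code: the `consumer` type, `notify`, `deliver` and `simulate` are hypothetical names, and it assumes that the wake-up channel has capacity 1, that the producer notifies with a non-blocking send, and that delivery stops as soon as the message buffer is full.

```go
package main

import "fmt"

// consumer is a simplified, hypothetical model of a rocksmq consumer.
type consumer struct {
	msgMutex  chan struct{} // wake-up signal; capacity 1 is an assumption
	messageCh chan int      // delivery buffer, sized like msgChannel
}

// notify models the producer's wake-up as a non-blocking send (assumed):
// the signal is dropped when one is already pending.
func (c *consumer) notify() {
	select {
	case c.msgMutex <- struct{}{}:
	default:
	}
}

// deliver moves stored messages into messageCh and stops as soon as the
// buffer is full (assumed). It returns the new read position.
func (c *consumer) deliver(stored []int, cursor int) int {
	for cursor < len(stored) && len(c.messageCh) < cap(c.messageCh) {
		c.messageCh <- stored[cursor]
		cursor++
	}
	return cursor
}

// simulate produces total messages up front, then alternates between the
// consume loop and the application until neither a signal nor a buffered
// message remains. It returns how many messages the application received.
func simulate(total, bufSize int) int {
	c := &consumer{
		msgMutex:  make(chan struct{}, 1),
		messageCh: make(chan int, bufSize),
	}
	var stored []int
	for i := 0; i < total; i++ {
		stored = append(stored, i)
		c.notify()
	}

	cursor, received := 0, 0
	for {
		select {
		case <-c.msgMutex:
			cursor = c.deliver(stored, cursor)
		default:
			// No pending signal: a real consume goroutine would block here.
			n := len(c.messageCh)
			if n == 0 {
				return received
			}
			for i := 0; i < n; i++ {
				<-c.messageCh
				received++
			}
		}
	}
}

func main() {
	got := simulate(20, 16)
	fmt.Printf("received %d of 20, %d stuck\n", got, 20-got)
}
```

Under these assumptions the program prints `received 16 of 20, 4 stuck`: the single retained signal triggers one delivery that fills the buffer, and nothing wakes the consumer again after the application drains it.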

Milvus Log

No response

Anything else?

This works in current Milvus because there is always a timetick message, although the consuming rate is limited. Current Milvus also uses a large msgChannel buffer by default, so the problem is hard to reproduce in a Milvus environment.

However, we will introduce a WAL interface in the future (#33285), so the bug should be fixed.

chyezh commented 3 months ago

/assign

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 2 months ago

keep it

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 1 month ago

keep it

stale[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 4 days ago

/reopen

xiaofan-luan commented 4 days ago

/assign @aoiasd

chyezh commented 4 days ago

Here's an explanation of the bug:

The rocksmq consumer's delivery logic works as follows:

func (c *client) consume(consumer *consumer) {
...
        case _, ok := <-consumer.MsgMutex():
            if !ok {
                // consumer MsgMutex closed, goroutine exit
                log.Debug("Consumer MsgMutex closed")
                return
            }
            c.deliver(consumer)
        }
...
}

The consumer is notified through MsgMutex, a channel that is signaled by the producer. A notification can be lost if MsgMutex is already full.

msgMutex  chan struct{}

So once the producer stops sending messages, the trailing messages may never be consumed, because MsgMutex is never signaled again.
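One possible way to avoid depending on a later signal is to drain the whole log on every wake-up and block on the delivery buffer while it is full. The sketch below illustrates that idea only; it is not the fix adopted in Milvus, and `topic`, `consume` and `run` are hypothetical names used for illustration. It keeps the assumed capacity-1 wake-up channel and non-blocking notify.

```go
package main

import (
	"fmt"
	"sync"
)

// topic is a hypothetical in-memory stand-in for the stored message log.
type topic struct {
	mu     sync.Mutex
	msgs   []int
	cursor int
}

// produce appends a message to the log.
func (t *topic) produce(m int) {
	t.mu.Lock()
	t.msgs = append(t.msgs, m)
	t.mu.Unlock()
}

// next returns the next unread message, or false once the log is drained.
func (t *topic) next() (int, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.cursor == len(t.msgs) {
		return 0, false
	}
	m := t.msgs[t.cursor]
	t.cursor++
	return m, true
}

// consume drains the whole log on every wake-up. The send to out blocks
// while the buffer is full instead of returning, so no later signal is
// needed to pick up the trailing messages.
func consume(t *topic, signal <-chan struct{}, out chan<- int, done <-chan struct{}) {
	for {
		select {
		case <-done:
			return
		case <-signal:
			for {
				m, ok := t.next()
				if !ok {
					break // log drained, wait for the next signal
				}
				select {
				case out <- m:
				case <-done:
					return
				}
			}
		}
	}
}

// run produces total messages, then reads them all back through a
// delivery buffer of bufSize and returns the number received.
func run(total, bufSize int) int {
	t := &topic{}
	signal := make(chan struct{}, 1)
	out := make(chan int, bufSize)
	done := make(chan struct{})
	defer close(done)
	go consume(t, signal, out, done)

	for i := 0; i < total; i++ {
		t.produce(i)
		select { // non-blocking notify: dropped if a signal is already pending
		case signal <- struct{}{}:
		default:
		}
	}
	received := 0
	for received < total {
		<-out
		received++
	}
	return received
}

func main() {
	fmt.Println(run(20, 16))
}
```

Dropped signals are harmless here because each message is appended before the notify attempt: whenever a signal is dropped, another one is still pending, and the drain it triggers will see the message.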