milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: cluster cannot insert data, and data node restarts when inserting data #33012

Open 1271653627 opened 6 months ago

1271653627 commented 6 months ago

Is there an existing issue for this?

Environment

- Milvus version: v2.3.13
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The cluster is unable to insert data, and every time data insertion is attempted, the data node restarts. The following error is reported in the logs. However, the cluster can create collections and load them normally.

```
milvus_data1.1.ddkurhxaadx8@gp22aitppap92xj | [2024/05/11 17:16:34.629 +00:00] [ERROR] [retry/retry.go:46] ["retry func failed"] ["retry time"=8] [error="server error: ServiceNotReady: Namespace bundle for topic (persistent://public/default/cpic-milvus-rootcoord-dml_3) not served by this instance:broker:8080. Please redo the lookup. Request is denied: namespace=public/default"] [stack="github.com/milvus-io/milvus/pkg/util/retry.Do
	/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:46
github.com/milvus-io/milvus/pkg/mq/msgstream.(MqTtMsgStream).AsConsumer
	/go/src/github.com/milvus-io/milvus/pkg/mq/msgstream/mq_msgstream.go:586
github.com/milvus-io/milvus/pkg/mq/msgdispatcher.NewDispatcher
	/go/src/github.com/milvus-io/milvus/pkg/mq/msgdispatcher/dispatcher.go:100
github.com/milvus-io/milvus/pkg/mq/msgdispatcher.(dispatcherManager).Add
	/go/src/github.com/milvus-io/milvus/pkg/mq/msgdispatcher/manager.go:93
github.com/milvus-io/milvus/pkg/mq/msgdispatcher.(client).Register
	/go/src/github.com/milvus-io/milvus/pkg/mq/msgdispatcher/client.go:77
github.com/milvus-io/milvus/internal/datanode.newDmInputNode
	/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_dmstream_input_node.go:49
github.com/milvus-io/milvus/internal/datanode.getServiceWithChannel
	/go/src/github.com/milvus-io/milvus/internal/datanode/data_sync_service.go:361
github.com/milvus-io/milvus/internal/datanode.newServiceWithEtcdTickler
	/go/src/github.com/milvus-io/milvus/internal/datanode/data_sync_service.go:431
github.com/milvus-io/milvus/internal/datanode.(flowgraphManager).addAndStartWithEtcdTickler
	/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_manager.go:131
github.com/milvus-io/milvus/internal/datanode.(DataNode).handlePutEvent
	/go/src/github.com/milvus-io/milvus/internal/datanode/event_manager.go:179
github.com/milvus-io/milvus/internal/datanode.(channelEventManager).Run.func1
	/go/src/github.com/milvus-io/milvus/internal/datanode/event_manager.go:268"]
```

Expected Behavior

Data should be inserted normally.

Steps To Reproduce

1. Deploy cluster
2. Create collection
3. Load
4. Insert

Milvus Log

This is the data node log: milvus_data.log

Anything else?

No response

1271653627 commented 6 months ago

@yanliang567 May I ask: to fix this issue in a Milvus cluster deployed with Docker Swarm, can I restart the Milvus service by first stopping the Milvus-related services (the coordinator nodes, worker nodes, and proxy nodes) and then restarting them, while leaving minio, etcd, and pulsar untouched?

yanliang567 commented 6 months ago

@1271653627 You can restart in that way, but you should know that Docker Swarm is not a tested deployment method in the community.

/assign @congqixia
Looks like an mq issue, please help to confirm.
/unassign

1271653627 commented 6 months ago

@congqixia @yanliang567 Below is the log for the coord node that I've supplemented: coordnode.zip

I noticed issue #25267, and I've encountered a similar situation before where the number of entities shown in attu is incorrect after inserting data. Referring to their method, would changing the number of datanode replicas to 1 and increasing the rootCoord.dmlChannelNum parameter solve the current problem?

Also, I deployed a Milvus cluster with the same configuration in a test environment, and everything worked fine there; in the production environment, however, I couldn't insert data. The test environment runs on Red Hat Enterprise Linux Server 7.4 Maipo (64-bit), while the production environment is on UOS 20 Fuyu (64-bit). I wonder if it's related to the operating system. Looking forward to your response. Thanks for your support.
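For reference, the rootCoord.dmlChannelNum parameter mentioned above lives under the rootCoord section of milvus.yaml. A hedged sketch of that override follows; the value 256 is only an example taken from the spirit of issue #25267, not a recommendation, and I believe (but have not verified for every release) that the 2.3 default is 16.

```yaml
# milvus.yaml fragment (sketch; value is illustrative, not a recommendation)
rootCoord:
  dmlChannelNum: 256  # number of DML channels; default is 16 in Milvus 2.3
```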

congqixia commented 6 months ago

@1271653627 After some inspection of the log, it looks like the datanode failed to query the topic from the pulsar broker for a long period. The datanode session id went past 100, so it repeatedly tried to subscribe in order to serve insert data. Did your pulsar cluster go abnormal while the problem was occurring? And could you please provide the mq section of your configuration file? The port 8080 seems strange here, according to @LoveEachDay.

1271653627 commented 6 months ago

@congqixia I set the pulsar webport to 8080 because of the picture below. image This is my milvus config file: milvus-config.txt
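For completeness, the mq-related part of the Milvus config being asked about is the pulsar section of milvus.yaml. A sketch of what it typically looks like with the web port changed to 8080 follows; the hostname is a placeholder, not taken from the attached config.

```yaml
# milvus.yaml fragment (sketch; hostname is a placeholder)
pulsar:
  address: pulsar-proxy  # hostname of the pulsar service
  port: 6650             # binary protocol port used by producers/consumers
  webport: 8080          # HTTP lookup/admin port; the Milvus default is 80
```

Note that the error log above mentions `broker:8080`, which is consistent with the web port having been changed from its default.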

1271653627 commented 6 months ago

@LoveEachDay Please help check the comments above, thank you.