milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Milvus keep popping TimeTick log and cannot recover #36569

Open zeropoint5 opened 3 weeks ago

zeropoint5 commented 3 weeks ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.4.5
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): langchain-milvus==0.1.1
pymilvus==2.4.3
- OS(Ubuntu or CentOS): Ubuntu 20.04.6 LTS
- CPU/Memory: 4c 8G
- GPU:
- Others:

Current Behavior

The Milvus docker log keeps popping this "skip msg" message, and from the SDK side Milvus is unusable.

Milvus log:

[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
[2024/09/23 08:14:51.317 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=450775850492102034] [replica=450775850778624001] [channel=by-dev-rootcoord-dml_0_450775850492102034v0] [from=-1] [to=95]
[2024/09/23 08:14:51.353 +00:00] [INFO] [task/scheduler.go:604] ["processed tasks"] [nodeID=95] [toProcessNum=1] [committedNum=0] [toRemoveNum=0]
[2024/09/23 08:14:51.353 +00:00] [INFO] [task/scheduler.go:610] ["process tasks related to node done"] [nodeID=95] [processingTaskNum=1] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=1]

From my Python Flask backend service error log:

2024-09-23 15:36:39,468 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>
2024-09-23 15:36:39,532 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>
2024-09-23 15:36:39,544 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>
2024-09-23 15:36:39,702 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>

Expected Behavior

Below is the connection part of my Python code; the expected behavior is that it returns a valid Milvus object, which MilvusRepository needs for initialization. The system was running well and only occasionally encountered issues. In the past two months, this is the second time it has occurred. The first time, I had to delete the entire external Docker volume for Milvus and start from scratch to recover, as restarting or updating the Milvus version didn't resolve the issue. The second time was similar: I had to abandon the entire Milvus database before I could bring it back up successfully. Restarting alone was not effective, nor was updating the Milvus version.

import os

import dotenv
from pymilvus import connections, utility
from langchain_milvus import Milvus

from src.utils.logger import logger
from src.utils.langchain_toolkit import embedding
from src.repository.milvus_repository import MilvusRepository

dotenv.load_dotenv()

MILVUS_HOST = os.getenv("MILVUS_HOST")
MILVUS_PORT = os.getenv("MILVUS_PORT")
MILVUS_COLLECTION = os.getenv("KS_MILVUS_COLLECTION")


def get_vector_repository():
    try:
        # Connect and confirm the server responds before building the vector store.
        connections.connect(alias="default", host=MILVUS_HOST, port=MILVUS_PORT)
        assert utility.get_server_version()
        milvus_db = Milvus(
            embedding_function=embedding,
            connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
            collection_name=MILVUS_COLLECTION,
            auto_id=True,
        )
        return MilvusRepository(milvus_db)
    except Exception as e:
        logger.error(f"Error initializing MilvusVectorStore: {e}")
        raise e
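
As an aside (an editor's illustration, not part of the original report): since the backend log above shows the initializer failing immediately and repeatedly, a minimal client-side retry sketch could look like the following, assuming the same pymilvus 2.4 connections/utility API used above; the helper name connect_with_retry and the retry parameters are hypothetical.

import time

from pymilvus import connections, utility


def connect_with_retry(host, port, attempts=5, delay_seconds=2.0):
    # Hypothetical helper (editor's sketch): retry the connection a few times
    # before giving up, so transient unavailability does not immediately
    # bubble up as an initialization error.
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            connections.connect(alias="default", host=host, port=port)
            return utility.get_server_version()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds * attempt)
    raise last_error

This would only keep the Flask service from erroring out instantly while Milvus restarts; it does not address the server-side problem described in this thread.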

Steps To Reproduce

No response

Milvus Log

milvus_debug.zip

I have attached the latest rdb_data/LOG and a LOG.old file in the appendix, along with a 000661.log file. Additionally, I've included a list of all files in the external volume. If any other files could help with bug localization, I can upload them as well. I've also attached my docker-compose file: infra.yml, as other Docker components might have an impact, so I've included everything. There's also a milvus.yaml configuration file attached.

Anything else?

No response

yanliang567 commented 3 weeks ago

@zeropoint5 please export the Milvus logs and attach them to the issue. It seems that the Milvus service became unhealthy and your client fails to connect to it. Please refer to this doc to export the whole Milvus logs. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @zeropoint5 /unassign

xiaofan-luan commented 3 weeks ago

The search seems to come with a lot of in expressions, is this expected? e.g. file_list.txt in [xxxx......] with many ids.
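
For context (an editor's sketch, not from this thread): in pymilvus a filter over many IDs is typically written as an in expression, and a very long ID list can be split into batches instead of one huge expression. The collection name, primary key field name "pk", and batch size below are assumptions, and an active default connection is assumed.

from pymilvus import Collection


def query_ids_in_batches(collection_name, ids, batch_size=1000):
    # Editor's sketch: split a very long "in [...]" filter into smaller
    # batches instead of issuing one huge expression.
    # Assumes the collection's primary key field is named "pk".
    collection = Collection(collection_name)
    results = []
    for start in range(0, len(ids), batch_size):
        batch = list(ids[start:start + batch_size])
        expr = f"pk in {batch}"
        results.extend(collection.query(expr=expr, output_fields=["pk"]))
    return results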

xiaofan-luan commented 3 weeks ago

@yanliang567 we need to set up a meeting for this. A lot of people running standalone run into this issue.

xiaofan-luan commented 3 weeks ago

BTW, how large is your /var/lib/milvus/rdb_data? Can you dump this directory for us? We can try to bootstrap a RocksMQ with this directory.
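
As a side note (an editor's sketch, not part of the thread): a small Python helper to check the size of the rdb_data directory before uploading it; the path below is the default standalone location mentioned above.

import os


def directory_size_mb(path="/var/lib/milvus/rdb_data"):
    # Walk the directory tree and sum regular file sizes.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total / (1024 * 1024)


print(f"{directory_size_mb():.1f} MB")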

yanliang567 commented 3 weeks ago

No problem. @zeropoint5, please send me an email at yanliang.qiao@zilliz.com with your available times for a meeting.

xiaofan-luan commented 3 weeks ago

/assign @aoiasd could you help investigate the "skip msg" log? I think this happens a lot with standalone + rocksmq.

zeropoint5 commented 3 weeks ago

rdb_data.1.zip rdb_data.4.zip rdb_data.3.zip rdb_data.2.zip

I have attached 4 zip files for rdb_data since the file size limit is 25 MB. Please use the following command to combine them, then unzip:

cat rdb_data.*.zip > rdb_data_restored.zip

xiaofan-luan commented 3 weeks ago

@aoiasd please help on it

aoiasd commented 3 weeks ago

The RocksMQ consumer seems to seek to an early position, so it has to consume and skip a lot of invalid messages, which causes the watch to time out. We need more logs or a reproduction to verify this guess.

To recover the Milvus instance

xiaofan-luan commented 1 week ago

@aoiasd can we cherry pick this to 2.4 as well?