Open zeropoint5 opened 3 weeks ago
@zeropoint5 please export the Milvus logs and attach them to the issue. It seems that the Milvus service became unhealthy and your client fails to connect to it. Please refer to this doc to export the whole Milvus logs. For Milvus installed with docker-compose, you can use `docker-compose logs > milvus.log` to export the logs.
/assign @zeropoint5 /unassign
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: 2.4.5
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): langchain-milvus==0.1.1, pymilvus==2.4.3
- OS(Ubuntu or CentOS): Ubuntu 20.04.6 LTS
- CPU/Memory: 4c 8G
- GPU:
- Others:
Current Behavior
The Milvus docker log keeps emitting this "skip msg" entry, and from the SDK side Milvus is unusable.

Milvus log:

```
[2024/09/23 08:14:51.274 +00:00] [INFO] [msgstream/mq_msgstream.go:938] ["skip msg"] [source=66] [type=TimeTick] [size=17] [position=]
... (the line above repeats ten times) ...
[2024/09/23 08:14:51.317 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=450775850492102034] [replica=450775850778624001] [channel=by-dev-rootcoord-dml_0_450775850492102034v0] [from=-1] [to=95]
[2024/09/23 08:14:51.353 +00:00] [INFO] [task/scheduler.go:604] ["processed tasks"] [nodeID=95] [toProcessNum=1] [committedNum=0] [toRemoveNum=0]
[2024/09/23 08:14:51.353 +00:00] [INFO] [task/scheduler.go:610] ["process tasks related to node done"] [nodeID=95] [processingTaskNum=1] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=1]
```
From my Python Flask backend service error log:

```
2024-09-23 15:36:39,468 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>
2024-09-23 15:36:39,532 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>
2024-09-23 15:36:39,544 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>
2024-09-23 15:36:39,702 - vector_db_helper - get_vector_repository - ERROR : Error initializing MilvusVectorStore: <MilvusException: (code=2, message=Fail connecting to server on milvus_standalone:19530, illegal connection params or server unavailable)>
```
Expected Behavior
Below is the connection part of my Python code; the expected behavior is that it returns a valid Milvus object, which MilvusRepository needs for initialization. The system was running pretty well and only occasionally encountered issues. In the past two months, this is the second time it has occurred. The first time it happened, I had to delete the entire external docker volume for Milvus and start from scratch to recover, as restarting or updating the Milvus version didn't resolve the issue. The second time was similar; I had to abandon the entire Milvus database before I could bring it back up successfully. Restarting alone was not effective, nor was updating the Milvus version.

```python
import os

import dotenv
from pymilvus import connections, utility
from langchain_milvus import Milvus

from src.utils.logger import logger
from src.utils.langchain_toolkit import embedding
from src.repository.milvus_repository import MilvusRepository

dotenv.load_dotenv()

MILVUS_HOST = os.getenv("MILVUS_HOST")
MILVUS_PORT = os.getenv("MILVUS_PORT")
MILVUS_COLLECTION = os.getenv("KS_MILVUS_COLLECTION")


def get_vector_repository():
    try:
        connections.connect(alias="default", host=MILVUS_HOST, port=MILVUS_PORT)
        assert utility.get_server_version()
        milvus_db = Milvus(
            embedding_function=embedding,
            connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
            collection_name=MILVUS_COLLECTION,
            auto_id=True,
        )
        return MilvusRepository(milvus_db)
    except Exception as e:
        logger.error(f"Error initializing MilvusVectorStore: {e}")
        raise
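Since the connect call fails repeatedly while the server is recovering, a small generic retry wrapper can smooth over transient failures (a sketch — `connect_with_retry`, `attempts`, and `delay` are illustrative names of my own, not part of pymilvus):

```python
import time


def connect_with_retry(connect_fn, attempts=3, delay=1.0):
    """Call connect_fn until it succeeds, retrying on failure.

    connect_fn: zero-arg callable that raises on failure (e.g. a lambda
    around connections.connect). Returns connect_fn's result; re-raises
    the last error if all attempts fail.
    """
    last_err = None
    for i in range(attempts):
        try:
            return connect_fn()
        except Exception as e:  # pymilvus raises MilvusException on connect failure
            last_err = e
            if i < attempts - 1:
                time.sleep(delay)
    raise last_err


# usage (sketch):
# connect_with_retry(lambda: connections.connect(alias="default",
#                                                host=MILVUS_HOST,
#                                                port=MILVUS_PORT))
```

Note this only helps with transient outages; in the incident described here the server never recovered, so retries alone would not have been enough.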
Steps To Reproduce
No response
Milvus Log
I have attached the latest rdb_data/LOG and a LOG.old file in the appendix, along with a 000661.log file. Additionally, I've included a list of all files in the external volume. If any other files could help with bug localization, I can upload them as well. I've also attached my docker-compose file: infra.yml, as other Docker components might have an impact, so I've included everything. There's also a milvus.yaml configuration file attached.
Anything else?
No response
The search expression seems to contain a lot of `in` clauses — is this expected? file_list.txt shows `in [xxxx......]` with many ids.
@yanliang567 we need to set up a meeting for this. There are a lot of people running standalone who have run into this issue.
BTW, how large is your /var/lib/milvus/rdb_data? Can you dump this directory for us? We can try to bootstrap a rocksmq with this directory.
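For reference, the directory size can be checked with `du -sh /var/lib/milvus/rdb_data`, or with a small Python walk (a sketch; `dir_size_bytes` is an illustrative name):

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):  # skip broken symlinks etc.
                total += os.path.getsize(fp)
    return total


# usage (sketch):
# print(dir_size_bytes("/var/lib/milvus/rdb_data"))
```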
No problem. @zeropoint5 please send me an email at yanliang.qiao@zilliz.com with your available time for a meeting.
/assign @aoiasd could you help investigate the "skip msg" log? I think this happens a lot with standalone + rocksmq.
rdb_data.1.zip rdb_data.4.zip rdb_data.3.zip rdb_data.2.zip
I have attached 4 zip files for rdb_data since the file size limit is 25 MB; please use the command below to combine them, then unzip:

```
cat rdb_data.*.zip > rdb_data_restored.zip
```
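Equivalently, the split parts can be joined in Python (a sketch; `join_parts` is an illustrative helper name — the parts must sort correctly by filename):

```python
from pathlib import Path


def join_parts(pattern, dest, directory="."):
    """Concatenate split archive parts (sorted by name) into dest."""
    parts = sorted(Path(directory).glob(pattern))
    with open(dest, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return dest


# usage (sketch):
# join_parts("rdb_data.*.zip", "rdb_data_restored.zip")
```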
@aoiasd please help on it
The Rocksmq consumer seems to seek to an early position, so it needs to consume and skip a lot of invalid messages, which causes the watch timeout. We need more logs or a reproduction to verify this guess.
To recover the Milvus instance:
We could set dataCoord.channel.watchTimeoutInterval (default 5 min) to 30 min, or even longer, then restart Milvus and wait for the consumer to consume and skip all the invalid messages.
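In milvus.yaml this override would look roughly like the following (a sketch — the 1800-second value is the 30-minute suggestion above; please check the config reference for your Milvus version to confirm the exact key location and units):

```yaml
dataCoord:
  channel:
    watchTimeoutInterval: 1800  # seconds; default is 300 (5 min)
```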
If the method above does not work, or a quick fix is needed, you can clean the rocksdb data to drop all the invalid data. WARNING: MAKE SURE ALL DATA HAS BEEN FLUSHED, OR THIS MAY CAUSE DATA LOSS. With Milvus stopped, remove the rocksdb data file and meta file (by default /var/lib/milvus/rdb_data and /var/lib/milvus/rdb_data_meta_kv).
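The removal step can be scripted defensively (a sketch — the paths are the defaults mentioned above, and `drop_rocksmq_data` is an illustrative name; only run this with Milvus stopped and all data flushed):

```python
import shutil
from pathlib import Path

# Default rocksmq data locations for a standalone deployment (see above).
RDB_PATHS = [
    Path("/var/lib/milvus/rdb_data"),
    Path("/var/lib/milvus/rdb_data_meta_kv"),
]


def drop_rocksmq_data(paths=RDB_PATHS, dry_run=True):
    """Remove rocksmq data directories.

    With dry_run=True (the default) nothing is deleted; the function only
    returns the list of existing paths that WOULD be removed.
    """
    removed = []
    for p in paths:
        if p.exists():
            if not dry_run:
                shutil.rmtree(p)
            removed.append(str(p))
    return removed


# usage (sketch): inspect first, then delete deliberately.
# print(drop_rocksmq_data())                 # dry run
# drop_rocksmq_data(dry_run=False)           # actually delete
```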
@aoiasd can we cherry pick this to 2.4 as well?
Attachment: milvus_debug.zip