milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Unable to load the collection #35441

Open · kish5430 opened this issue 2 months ago

kish5430 commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version:2.4.4
- Deployment mode(standalone or cluster): Cluster
- MQ type(rocksmq, pulsar or kafka): Kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

We created a collection and inserted 1.5 billion vectors. When we attempted to load the collection, the querynode and datanode pods crashed with OOM errors at 19% load progress due to resource limits. We unloaded the collection, increased the memory for the querynode and datanode pods, and restarted all pods. On retrying the load, progress jumped directly from 0% to 19% but then stalled. We noticed that one datanode and querynode reached 90% usage and crashed, while the other nodes were only using around 20%.

Then we tried to load the collection with the collection.load(replica_number=50) method. There were no pod crashes, but we were still unable to load the collection.
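
For reference, this is roughly how such a load can be issued asynchronously and its progress polled, so a stall is visible without blocking; a minimal sketch, assuming a local endpoint and pymilvus's _async load flag (the collection name is taken from later in this thread):

import time
from pymilvus import Collection, connections, utility

connections.connect(host="localhost", port="19530")  # endpoint is an assumption
collection = Collection("CVL_image_vfm")

collection.load(_async=True)  # return immediately rather than blocking until loaded
for _ in range(120):
    print(utility.loading_progress("CVL_image_vfm"))  # e.g. {'loading_progress': '19%'}
    time.sleep(60)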

Expected Behavior

No response

Steps To Reproduce

collection.load(replica_number=50)

RPC error: [get_loading_progress], <MilvusException: (code=101, message=collection not loaded[collection=451400828653517657])>, <Time:{'RPC start': '2024-08-12 22:39:38.687809', 'RPC error': '2024-08-12 22:39:38.852918'}>
RPC error: [wait_for_loading_collection], <MilvusException: (code=101, message=collection not loaded[collection=451400828653517657])>, <Time:{'RPC start': '2024-08-12 22:29:37.771889', 'RPC error': '2024-08-12 22:39:38.858449'}>
RPC error: [load_collection], <MilvusException: (code=101, message=collection not loaded[collection=451400828653517657])>, <Time:{'RPC start': '2024-08-12 22:29:32.098361', 'RPC error': '2024-08-12 22:39:38.858811'}>

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/orm/collection.py", line 424, in load
    conn.load_collection(
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 147, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 143, in handler
    return func(*args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 182, in handler
    return func(self, *args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 122, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 87, in handler
    return func(*args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/client/grpc_handler.py", line 1148, in load_collection
    self.wait_for_loading_collection(collection_name, timeout, is_refresh=_refresh)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 147, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 143, in handler
    return func(*args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 182, in handler
    return func(self, *args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 122, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 87, in handler
    return func(*args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/client/grpc_handler.py", line 1168, in wait_for_loading_collection
    progress = self.get_loading_progress(
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 147, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 143, in handler
    return func(*args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 182, in handler
    return func(self, *args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 122, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 87, in handler
    return func(*args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/client/grpc_handler.py", line 1267, in get_loading_progress
    check_status(response.status)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/client/utils.py", line 62, in check_status
    raise MilvusException(status.code, status.reason, status.error_code)
pymilvus.exceptions.MilvusException: <MilvusException: (code=101, message=collection not loaded[collection=451400828653517657])>


Milvus Log

No response

Anything else?

No response

congqixia commented 2 months ago

replica_number=50 means Milvus shall load the same collection 50 times on different sets of querynodes. Quick question: does your cluster have enough querynodes to hold 50 replicas?
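
Each replica is a complete in-memory copy of the collection placed on a disjoint set of querynodes, so the replica count has to fit the querynode pool. A hedged sketch (the node count comes from later in this thread; the replica value is illustrative):

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")  # endpoint is an assumption
collection = Collection("CVL_image_vfm")

QUERYNODES = 30  # cluster size mentioned later in this thread
replicas = 2     # each replica needs enough querynode memory for the whole collection
assert replicas <= QUERYNODES, "cannot place more replicas than there are querynodes"
collection.load(replica_number=replicas)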

kish5430 commented 2 months ago

> replica_number=50 means Milvus shall load the same collection 50 times on different sets of querynodes. Quick question: does your cluster have enough querynodes to hold 50 replicas?

Please ignore the above process.

I attempted to load the collection without using replicas. During the process, one DataNode's memory usage reached 92%, causing it to restart. This pattern continued with other DataNodes, each reaching over 90% memory usage and then restarting. The load is not being evenly distributed across all DataNodes simultaneously, leading to the collection load failing.

congqixia commented 2 months ago

Can I assume that your collection has only one shard? That would mean the whole workload can be handled by only a single datanode. The current solution might be to scale up your datanode memory, not scale out.
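
Since the shard count is fixed at creation time, spreading the write path over more datanodes means recreating the collection with more shards. A minimal sketch with an abbreviated schema, assuming pymilvus 2.4 (num_shards is the current keyword; older releases used shards_num):

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

fields = [
    FieldSchema("uuid", DataType.VARCHAR, max_length=50, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=256),
]
schema = CollectionSchema(fields)
# Inserts into this collection are spread across 8 DML channels,
# i.e. across up to 8 datanode pipelines instead of 1.
sharded = Collection("CVL_image_vfm_sharded", schema=schema, num_shards=8)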

kish5430 commented 2 months ago

> Can I assume that your collection has only one shard? That would mean the whole workload can be handled by only a single datanode. The current solution might be to scale up your datanode memory, not scale out.

Yes, this collection has only one shard, but no OOM errors have been found in the data nodes. Are you still recommending increasing the data node memory and reducing the number of data node replicas?

kish5430 commented 2 months ago

> Can I assume that your collection has only one shard? That would mean the whole workload can be handled by only a single datanode. The current solution might be to scale up your datanode memory, not scale out.
>
> Yes, this collection has only one shard, but no OOM errors have been found in the data nodes. Are you still recommending increasing the data node memory and reducing the number of data node replicas?

I increased the memory for the data node and reduced the number of replicas, but loading is still failing. Also, I can see the segment failure messages below during the load process.

Error: [2024/08/14 02:06:53.281 +00:00] [WARN] [querynodev2/services.go:315] ["failed to load growing segments"] [traceID=1777f0303ebb3f44ca3c721448bc9665] [collectionID=451400828653517657] [channel=isds-milvus-prod-east2-2-vista-rootcoord-dml_7_451400828653517657v0] [currentNodeID=3130] [error="load segment failed, OOM if load, maxSegmentSize = 1663.657211303711 MB, memUsage = 20965.7265625 MB, predictMemUsage = 366486.11928367615 MB, totalMem = 102400 MB thresholdFactor = 0.900000"]

congqixia commented 2 months ago

> Error: [2024/08/14 02:06:53.281 +00:00] [WARN] [querynodev2/services.go:315] ["failed to load growing segments"] [traceID=1777f0303ebb3f44ca3c721448bc9665] [collectionID=451400828653517657] [channel=isds-milvus-prod-east2-2-vista-rootcoord-dml_7_451400828653517657v0] [currentNodeID=3130] [error="load segment failed, OOM if load, maxSegmentSize = 1663.657211303711 MB, memUsage = 20965.7265625 MB, predictMemUsage = 366486.11928367615 MB, totalMem = 102400 MB thresholdFactor = 0.900000"]

This error means the querynode does not have enough memory to load even the growing segments. The problem is caused by too many growing segments being left unflushed. It should be fixed once the datanodes come up and start serving. You could trigger a Flush operation to make sure all growing segments are turned into flushed ones.
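
A minimal sketch of triggering that flush for this one collection (the timeout value is an assumption; sealing this many growing segments can take a while):

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")  # endpoint is an assumption
collection = Collection("CVL_image_vfm")
# flush() seals the collection's growing segments so they can be indexed
# and then loaded as sealed segments.
collection.flush(timeout=3600)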

yanliang567 commented 2 months ago

Please also check the index building progress with utility.index_building_progress(collection_name, index_name). I guess you could add more resources to the index nodes as well to speed up index building.
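
A small polling sketch of that check, using the collection and index names from this thread:

import time
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # endpoint is an assumption

# Poll until no rows are left pending; slow movement here points at
# under-provisioned index nodes.
while True:
    progress = utility.index_building_progress("CVL_image_vfm", index_name="embedding_index")
    print(progress)  # {'total_rows': ..., 'indexed_rows': ..., 'pending_index_rows': ...}
    if progress["indexed_rows"] >= progress["total_rows"]:
        break
    time.sleep(60)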

kish5430 commented 2 months ago

> Error: [2024/08/14 02:06:53.281 +00:00] [WARN] [querynodev2/services.go:315] ["failed to load growing segments"] [traceID=1777f0303ebb3f44ca3c721448bc9665] [collectionID=451400828653517657] [channel=isds-milvus-prod-east2-2-vista-rootcoord-dml_7_451400828653517657v0] [currentNodeID=3130] [error="load segment failed, OOM if load, maxSegmentSize = 1663.657211303711 MB, memUsage = 20965.7265625 MB, predictMemUsage = 366486.11928367615 MB, totalMem = 102400 MB thresholdFactor = 0.900000"]
>
> This error means the querynode does not have enough memory to load even the growing segments. The problem is caused by too many growing segments being left unflushed. It should be fixed once the datanodes come up and start serving. You could trigger a Flush operation to make sure all growing segments are turned into flushed ones.

When I tried to execute utility.flush_all(), it kept displaying the following messages without making any progress and then terminated. No pods crashed. Please find the resource usage chart and error logs below.

utility.flush_all()

E0814 08:02:36.829494000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0814 08:03:36.955125000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0814 08:04:37.102384000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0814 08:05:37.351339000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
E0814 08:06:37.734810000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:5, cost: 0.81s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
E0814 08:07:38.662704000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:6, cost: 2.43s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
E0814 08:08:41.195734000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:7, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
[... the same hpack_parser error / retry pair repeats about once per minute, retry:8 through retry:75, until 09:21:17 ...]
RPC error: [flush_all], <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Stream removed" debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T09:21:17.905191-06:00", grpc_status:2, grpc_message:"Stream removed"}">, message=Retry run out of 75 retry times, message=Stream removed)>, <Time:{'RPC start': '2024-08-14 08:01:36.717121', 'RPC error': '2024-08-14 09:21:17.907205'}>

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/orm/utility.py", line 1256, in flush_all
    return _get_connection(using).flush_all(timeout=timeout, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 147, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 143, in handler
    return func(*args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 182, in handler
    return func(self, *args, **kwargs)
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 93, in handler
    raise MilvusException(e.code, f"{to_msg}, message={e.details()}") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Stream removed" debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T09:21:17.905191-06:00", grpc_status:2, grpc_message:"Stream removed"}">, message=Retry run out of 75 retry times, message=Stream removed)>

kish5430 commented 2 months ago

> Please also check the index building progress with utility.index_building_progress(collection_name, index_name). I guess you could add more resources to the index nodes as well to speed up index building.

Please find the result below and advise.

utility.index_building_progress("CVL_image_vfm","embedding_index") {'total_rows': 1344094667, 'indexed_rows': 1344094667, 'pending_index_rows': 0} collection.num_entities 1466464075

kish5430 commented 2 months ago

Currently we have configured 30 querynodes, each with 180 GB of memory, but we are still getting the OOM issue below.

Error: [2024/08/15 14:53:47.268 +00:00] [INFO] [task/executor.go:136] ["execute action done, remove it"] [taskID=1723522881759] [step=0] [error="load segment failed, OOM if load, maxSegmentSize = 3628.4139099121094 MB, memUsage = 34450.08984375 MB, predictMemUsage = 278614.65684223175 MB, totalMem = 184320 MB thresholdFactor = 0.900000"]
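
The numbers in that log line explain the refusal: the planner predicts what memory usage would be after the load and refuses when the prediction exceeds totalMem × thresholdFactor. Reconstructed as plain arithmetic (a sketch of the inequality in the message, not Milvus source):

mem_usage = 34450.08984375    # MB already in use on this querynode
predict = 278614.65684223175  # MB predicted once the pending segments are loaded
total_mem = 184320            # MB per querynode (180 GB)
threshold = 0.9               # thresholdFactor from the log

budget = total_mem * threshold  # 165888 MB
print(predict > budget)         # True: 278614 MB > 165888 MB, so the load is refused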

yanliang567 commented 2 months ago

Could you please share the metrics screenshots for the querynodes? Also please share the collection schema info; I guess the delegator has a much heavier workload than the others. @kish5430

kish5430 commented 2 months ago

@yanliang567 Please find below details.

Collection Schema and entities:

collection.schema
{'auto_id': False,
 'description': 'Vista embeddings: 256 dimensions',
 'fields': [{'name': 'uuid', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}, 'is_primary': True, 'auto_id': False},
            {'name': 'phash', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}},
            {'name': 'embedding', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 256}},
            {'name': 'path', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512}},
            {'name': 'meta_data', 'description': '', 'type': <DataType.JSON: 23>},
            {'name': 'dataset_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}}],
 'enable_dynamic_field': False}

collection.num_entities
1685465315

utility.index_building_progress("CVL_image_vfm", "embedding_index")
{'total_rows': 1528380301, 'indexed_rows': 1528380301, 'pending_index_rows': 0}

Querynode Metrics:

[three querynode metrics screenshots]
yanliang567 commented 2 months ago

Please hover the cursor over Segment Loaded Num so we can check the growing/sealed segments. Also please share the Runtime metrics screenshots.

kish5430 commented 2 months ago

Segment Loaded Num:

[screenshot: Segment Loaded Num panel]

Runtime Metrics:

[three runtime metrics screenshots]
xiaofan-luan commented 2 months ago
[screenshot]
xiaofan-luan commented 2 months ago

Memory seems to be really well balanced in this case.

I guess what you said, "We noticed that one datanode and querynode reached 90% usage and crashed, while the other nodes were only using around 20%.", refers to CPU usage.

I think you need to consider the following to support 1.5 billion vectors (the dimension also needs to be taken into account):

  1. How much memory do you need? With HNSW you will roughly need 10 TB of memory, which is huge; with DiskANN you need at least 1-2 TB. (See the back-of-envelope sketch after this list.)
  2. For 1.5 billion vectors, we would recommend creating 4-8 shards in advance. One shard might be too few.
  3. You need more datanodes and indexnodes (4 datanodes are recommended; for indexnodes, the more the better).
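
The memory figures in item 1 follow from the raw vector payload and assume 768-dimensional float32 vectors; the collection in this thread is 256-dimensional, so its raw footprint is about a third of that. A back-of-envelope sketch:

def raw_vector_gib(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector payload in GiB; index overhead comes on top of this."""
    return n_vectors * dim * bytes_per_value / 1024**3

print(raw_vector_gib(1_500_000_000, 768))  # ~4291 GiB raw; with HNSW graph overhead, ~10 TB ballpark
print(raw_vector_gib(1_500_000_000, 256))  # ~1430 GiB for this thread's 256-d vectors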

I'd like to give you extra help if necessary. Contact me at james.luan@zilliz.com

kish5430 commented 2 months ago

Hi @xiaofan-luan, we are no longer seeing the issue where a datanode or querynode reaches 90% and crashes. Please find the details below and advise.

  1. We are using 'IVF_FLAT'.
  2. We currently have only the default shard, which is set to 1.
  3. We have configured the recommended index nodes. Please see the screenshot below for details. [screenshot]
xiaofan-luan commented 2 months ago

> Hi @xiaofan-luan, we are no longer seeing the issue where a datanode or querynode reaches 90% and crashes. Please find the details below and advise.
>
>   1. We are using 'IVF_FLAT'.
>   2. We currently have only the default shard, which is set to 1.
>   3. We have configured the recommended index nodes. Please see the screenshot below for details. [screenshot]

What is the vector dimension? Let me assume we are using 768 (OpenAI embedding).

Quick math

You need roughly 5 TB of memory for IVF_FLAT. You have 30 × 180 GB ≈ 5.4 TB of memory, so I'm assuming memory is tight but it should work.

I would suggest you rebuild the cluster with 8 shards and 32 querynodes.

Other than that, try

  1. Increase the segment size to 4 GB to improve performance (see the config sketch below).
  2. If you still have 20-30% extra memory, try the HNSW index.
  3. You probably need more CPUs on the index nodes.
  4. You don't really need 5 datanodes unless you have more shards; once you have 8 shards, 4 datanodes are recommended.
  5. Increase the resources of datacoord to at least 2c8g.
  6. Take care of the proxy. Sometimes the proxy can become a performance bottleneck, especially under high QPS or large topk; you probably need HPA on top of it.

Good luck with it!
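
On item 1: segment size is a server-side setting rather than a pymilvus call. A hedged sketch of the override, written here as the key/value pair a milvus.yaml or Helm values patch would carry (the key name follows the Milvus configuration docs; verify it against your 2.4.x release before applying):

# Not pymilvus code: a dict standing in for a milvus.yaml override.
# dataCoord.segment.maxSize is expressed in MB; 4096 MB matches the
# ~4 GB segment size suggested in item 1 above.
segment_tuning = {
    "dataCoord.segment.maxSize": 4096,
}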

kish5430 commented 2 months ago

> Hi @xiaofan-luan, we are no longer seeing the issue where a datanode or querynode reaches 90% and crashes. Please find the details below and advise.
>
>   1. We are using 'IVF_FLAT'.
>   2. We currently have only the default shard, which is set to 1.
>   3. We have configured the recommended index nodes. Please see the screenshot below for details. [screenshot]
>
> What is the vector dimension? Let me assume we are using 768 (OpenAI embedding).
>
> Quick math
>
> You need roughly 5 TB of memory for IVF_FLAT. You have 30 × 180 GB ≈ 5.4 TB of memory, so I'm assuming memory is tight but it should work.
>
> I would suggest you rebuild the cluster with 8 shards and 32 querynodes.
>
> Other than that, try
>
>   1. Increase the segment size to 4 GB to improve performance (see the config sketch below).
>   2. If you still have 20-30% extra memory, try the HNSW index.
>   3. You probably need more CPUs on the index nodes.
>   4. You don't really need 5 datanodes unless you have more shards; once you have 8 shards, 4 datanodes are recommended.
>   5. Increase the resources of datacoord to at least 2c8g.
>   6. Take care of the proxy. Sometimes the proxy can become a performance bottleneck, especially under high QPS or large topk; you probably need HPA on top of it.
>
> Good luck with it!

@xiaofan-luan Here are some points for your review and suggestions:

  1. We're using 256 dimensions for this collection.
  2. You suggested rebuilding the cluster with 8 shards and 32 query nodes. Should this be a new cluster or a new collection?
  3. I plan to add more CPUs to the index nodes.
  4. If we have multiple collections with different shard configurations, what's the recommended number of data nodes?
  5. I'll increase the CPUs for the data coordinator.
  6. The current proxy resources are sufficient, but I'll scale them up if needed.
  7. Additionally, I see multiple segment properties under the data coordinator and data node. Which properties should I tune to support a 4 GB segment size?
  8. We have 3 other collections in the same cluster that are already loaded. Is updating the existing cluster sufficient, or would it be better to build a new cluster?

Thanks

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

yanliang567 commented 1 month ago

@kish5430 Is this still an issue for you, or can we just track it in #36318?