Open kish5430 opened 2 months ago
replica_number=50
means Milvus will load the same collection 50 times, each copy on a different set of querynodes. Quick question: does your cluster have enough querynodes to hold 50 replicas?
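A rough way to sanity-check a replica count before loading: every replica needs its own group of querynodes and its own full in-memory copy of the collection. The helper below is only a sketch; the function name and all the figures passed to it are hypothetical, not measurements from this cluster.

```python
def check_replicas(replica_number: int, num_querynodes: int,
                   collection_gb: float, node_mem_gb: float) -> dict:
    """Rough feasibility check for an in-memory replica count."""
    # Each replica is placed on its own set of querynodes.
    nodes_per_replica = num_querynodes // replica_number
    # Each replica holds a full copy of the loaded collection.
    needed_gb = replica_number * collection_gb
    available_gb = num_querynodes * node_mem_gb
    return {
        "nodes_per_replica": nodes_per_replica,
        "needed_gb": needed_gb,
        "available_gb": available_gb,
        "feasible": nodes_per_replica >= 1 and needed_gb <= available_gb,
    }


# Hypothetical cluster: 30 querynodes with 180 GB each, 1500 GB collection.
print(check_replicas(50, 30, 1500.0, 180.0))
print(check_replicas(2, 30, 1500.0, 180.0))
```

With these made-up numbers, 50 replicas fail both checks (fewer querynodes than replicas, and far more memory needed than available), while 2 replicas pass.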
Please ignore the process described above.
I attempted to load the collection without using replicas. During the process, one DataNode's memory usage reached 92%, causing it to restart. This pattern continued with other DataNodes, each reaching over 90% memory usage and then restarting. The load is not being evenly distributed across all DataNodes simultaneously, leading to the collection load failing.
Can I assume that your collection has only one shard? That means the whole workload can be handled by only a single datanode. For now, the solution might be to scale up your datanode memory rather than scale out.
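To illustrate why a single shard concentrates the write path: rows are routed to a shard channel based on a hash of the primary key, so with one shard every row lands on the same channel. The sketch below uses `zlib.crc32` purely as a stand-in hash (the hash Milvus actually uses is not shown in this thread), and the keys are made up.

```python
import zlib
from collections import Counter


def rows_per_shard(primary_keys, shards_num):
    """Count rows per shard when routed by hash(pk) % shards_num."""
    counts = Counter({shard: 0 for shard in range(shards_num)})
    for pk in primary_keys:
        # crc32 is an illustrative, deterministic stand-in for the real hash.
        counts[zlib.crc32(pk.encode("utf-8")) % shards_num] += 1
    return dict(counts)


keys = [f"uuid-{i}" for i in range(10_000)]
print(rows_per_shard(keys, 1))  # {0: 10000}
print(rows_per_shard(keys, 8))  # spread across 8 shards
```

With `shards_num=1` there is nothing to spread, so adding datanodes does not help; with 8 shards the same rows are divided among 8 channels.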
Yes, this collection has only one shard, but no OOM errors have been found in the data nodes. Are you still recommending increasing the data node memory and reducing the number of data node replicas?
I increased the memory for the data node and reduced the number of replicas, but loading is still failing. I can also see the segment failure messages below during the load process.
Error: [2024/08/14 02:06:53.281 +00:00] [WARN] [querynodev2/services.go:315] ["failed to load growing segments"] [traceID=1777f0303ebb3f44ca3c721448bc9665] [collectionID=451400828653517657] [channel=isds-milvus-prod-east2-2-vista-rootcoord-dml_7_451400828653517657v0] [currentNodeID=3130] [error="load segment failed, OOM if load, maxSegmentSize = 1663.657211303711 MB, memUsage = 20965.7265625 MB, predictMemUsage = 366486.11928367615 MB, totalMem = 102400 MB thresholdFactor = 0.900000"]
This error means the querynode does not have enough memory to load even the growing segments. The problem is caused by too many growing segments being left unflushed. It should be resolved once the datanodes come back up and start serving. You could trigger a Flush operation to make sure all growing segments are turned into flushed ones.
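The numbers in the OOM message can be checked by hand. The sketch below parses the `name = value` pairs out of the error string quoted above and compares `predictMemUsage` with `totalMem * thresholdFactor`; that this is exactly the comparison the querynode performs is an assumption based on the field names, and the helper names are hypothetical.

```python
import re


def parse_oom_message(message: str) -> dict:
    """Extract the 'name = number' pairs from a querynode OOM error string."""
    pairs = re.findall(r"(\w+) = ([0-9.]+)", message)
    return {name: float(value) for name, value in pairs}


def would_exceed_limit(fields: dict) -> bool:
    """Assumed check: predicted usage above totalMem * thresholdFactor."""
    limit_mb = fields["totalMem"] * fields["thresholdFactor"]
    return fields["predictMemUsage"] > limit_mb


msg = ("load segment failed, OOM if load, maxSegmentSize = 1663.657211303711 MB, "
       "memUsage = 20965.7265625 MB, predictMemUsage = 366486.11928367615 MB, "
       "totalMem = 102400 MB thresholdFactor = 0.900000")
fields = parse_oom_message(msg)
print(fields["totalMem"] * fields["thresholdFactor"])  # limit in MB
print(would_exceed_limit(fields))
```

For this message the limit works out to about 92,160 MB, while the predicted usage is about 366,486 MB, so the load is rejected before any pod actually runs out of memory.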
Please also check the index building progress:
utility.index_building_progress(collection_name, index_name)
I guess you could also add more resources to the index nodes to speed up index building.
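A minimal sketch of the two suggestions above (flush, then check index progress) using pymilvus. It assumes a connection has already been opened with `connections.connect(...)`; the function names `indexed_fraction` and `flush_and_report` are hypothetical, and the example dict uses made-up row counts.

```python
def indexed_fraction(progress: dict) -> float:
    """Fraction of rows already indexed, from an index_building_progress result."""
    total = progress["total_rows"]
    return progress["indexed_rows"] / total if total else 1.0


def flush_and_report(collection_name: str, index_name: str) -> float:
    """Flush one collection, then return the fraction of rows indexed."""
    # Imported lazily so indexed_fraction works without pymilvus installed.
    from pymilvus import Collection, utility

    # Assumes connections.connect(...) was already called for the default alias.
    collection = Collection(collection_name)
    collection.flush()  # turn this collection's growing segments into flushed ones
    progress = utility.index_building_progress(collection_name, index_name=index_name)
    return indexed_fraction(progress)


# Made-up progress dict with the same keys pymilvus returns.
example = {"total_rows": 1000, "indexed_rows": 750, "pending_index_rows": 250}
print(indexed_fraction(example))  # 0.75
```

Flushing a single collection with `Collection.flush()` is narrower than `utility.flush_all()`, which may matter on a cluster that hosts several collections.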
When I tried to execute utility.flush_all(), it kept displaying the following messages without making any progress and then terminated. No pods crashed. Please find the resource usage chart and error logs below.
utility.flush_all()
E0814 08:02:36.829494000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0814 08:03:36.955125000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0814 08:04:37.102384000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0814 08:05:37.351339000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
E0814 08:06:37.734810000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:5, cost: 0.81s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
E0814 08:07:38.662704000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:6, cost: 2.43s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
E0814 08:08:41.195734000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:7, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
[... retries 8 through 74 omitted: each repeats the same hpack_parser.cc:999 error and the same "Stream removed" reason with cost: 3.00s, roughly every 63 seconds ...]
E0814 09:20:14.789971000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[flush_all] retry:75, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
E0814 09:21:17.904751000 6200356864 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
RPC error: [flush_all], <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Stream removed" debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T09:21:17.905191-06:00", grpc_status:2, grpc_message:"Stream removed"}" , message=Retry run out of 75 retry times, message=Stream removed)>, <Time:{'RPC start': '2024-08-14 08:01:36.717121', 'RPC error': '2024-08-14 09:21:17.907205'}>
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/orm/utility.py", line 1256, in flush_all
    return _get_connection(using).flush_all(timeout=timeout, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 147, in handler
    raise e from e
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 143, in handler
    return func(*args, *kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 182, in handler
    return func(self, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kishorereddyannadi/Library/Python/3.12/lib/python/site-packages/pymilvus/decorators.py", line 93, in handler
    raise MilvusException(e.code, f"{to_msg}, message={e.details()}") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Stream removed" debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T09:21:17.905191-06:00", grpc_status:2, grpc_message:"Stream removed"}" , message=Retry run out of 75 retry times, message=Stream removed)>
Please find the results below and advise.
utility.index_building_progress("CVL_image_vfm","embedding_index")
{'total_rows': 1344094667, 'indexed_rows': 1344094667, 'pending_index_rows': 0}
collection.num_entities
1466464075
We have currently configured 30 querynodes, each with 180 GB of memory, but we are still getting the OOM errors below.
Error: [2024/08/15 14:53:47.268 +00:00] [INFO] [task/executor.go:136] ["execute action done, remove it"] [taskID=1723522881759] [step=0] [error="load segment failed, OOM if load, maxSegmentSize = 3628.4139099121094 MB, memUsage = 34450.08984375 MB, predictMemUsage = 278614.65684223175 MB, totalMem = 184320 MB thresholdFactor = 0.900000"]
Could you please share the metrics screenshots for the querynodes? Please also share the collection schema info. I guess the delegator has a much heavier workload than the other nodes. @kish5430
@yanliang567 Please find the details below.
Collection Schema and entities:
collection.schema
{'auto_id': False, 'description': 'Vista embeddings: 256 dimensions', 'fields': [
{'name': 'uuid', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}, 'is_primary': True, 'auto_id': False},
{'name': 'phash', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}},
{'name': 'embedding', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 256}},
{'name': 'path', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512}},
{'name': 'meta_data', 'description': '', 'type': <DataType.JSON: 23>},
{'name': 'dataset_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}}
], 'enable_dynamic_field': False}
collection.num_entities
1685465315
utility.index_building_progress("CVL_image_vfm","embedding_index")
{'total_rows': 1528380301, 'indexed_rows': 1528380301, 'pending_index_rows': 0}
Querynode Metrics:
Please hover the cursor over Segment Loaded Num so we can check the growing/sealed segments. Please also share the Runtime metrics screenshots.
Segment Loaded Num:
Runtime Metrics:
Memory seems to be well balanced in this case.
I guess that when you said "We noticed that one datanode and querynode reached 90% usage and crashed, while the other nodes were only using around 20%," you meant CPU usage.
I think you need to consider the following things to support 1.5 billion vectors (the dimension also needs to be considered).
I'd like to give you extra help if necessary. Contact me at james.luan@zilliz.com
Hi @xiaofan-luan, we are no longer seeing the issue where a datanode or querynode reaches 90% and crashes. Please find the details below and advise.
- We are using 'IVF_FLAT'.
- We currently have only the default shard, which is set to 1.
- We have configured the recommended index nodes. Please see the screenshot below for details.
What is the vector dimension? Let me assume you are using 768 (OpenAI embedding).
Quick math:
You need roughly 5 TB of memory for IVF_FLAT, and you have 30 × 180 GB ≈ 5.4 TB of memory, so I'm assuming memory is tight but it should work.
I would suggest rebuilding the cluster with 8 shards and 32 query nodes.
Other than that, try the following:
- Increase the segment size to 4 GB to improve performance.
- If you still have 20-30% extra memory, try the HNSW index.
- You probably need more CPUs on the index nodes.
- You don't really need 5 datanodes unless you have more shards. Once you have 8 shards, 4 datanodes are recommended.
- Increase the datacoord resources to at least 2c8g (2 CPU cores, 8 GB of memory).
- Take care of the proxy. Sometimes the proxy can become a performance bottleneck, especially under high QPS or large topk; you probably need HPA on top of it.
Good luck with it!
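The quick math above can be reproduced as follows. This is only a sketch: it counts raw float32 vector data (4 bytes per value), which is what IVF_FLAT keeps uncompressed, and ignores index overhead, scalar fields, and runtime headroom. The function name is hypothetical. The second dimension, 256, is the one shown in the schema posted earlier in the thread.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw float32 vectors alone."""
    return num_vectors * dim * bytes_per_value


TB = 10**12
num_vectors = 1_500_000_000          # "1.5 billion vectors" from the issue
cluster_bytes = 30 * 180 * 10**9     # 30 querynodes x 180 GB

for dim in (768, 256):
    need = raw_vector_bytes(num_vectors, dim)
    print(f"dim={dim}: {need / TB:.2f} TB of raw vectors, "
          f"cluster has {cluster_bytes / TB:.1f} TB")
```

At dimension 768 the raw vectors alone come to about 4.61 TB against 5.4 TB of querynode memory, which matches the "tight but should work" estimate; at dimension 256 the same count needs about 1.54 TB.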
@xiaofan-luan Here are some points for your review and suggestions:
Thanks
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
@kish5430 is this still an issue for you, or can we just track it in #36318?
Is there an existing issue for this?
Environment
Current Behavior
We created a collection and inserted 1.5 billion vectors. When attempting to load the collection, the querynode and datanode pods crashed with OOM errors at 19% due to resource limitations. We unloaded the collection, increased the memory for the querynode and datanode pods, and restarted all pods. Upon retrying to load the collection, the progress jumped directly from 0% to 19% but then stalled. We noticed that one datanode and querynode reached 90% usage and crashed, while the other nodes were only using around 20%.
Then we tried to load the collection with the 'collection.load(replica_number=50)' method. There were no pod crashes, but we were still unable to load the collection.
Expected Behavior
No response
Steps To Reproduce
Milvus Log
No response
Anything else?
No response