milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.89k stars 2.95k forks source link

[Bug]: MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering #36598

Open barzan-hayati opened 1 month ago

barzan-hayati commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.4
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.4.3
- OS(Ubuntu or CentOS): Ubuntu 18.04
- CPU/Memory: Xeon(R) CPU E5-2630 v4 2.20GHZ / 300 G 
- GPU: 
- Others:

Current Behavior

I added 50 million vectors to a collection. Subsequently, I wanted to conduct a benchmark to evaluate various consistency levels and different batch sizes of vectors to assess the improvements gained from using a partition key. The vector batch sizes chosen were [1, 2, 4, 8, 16, 32, 64].

However, after some time (duration unknown), I encountered this error:

<MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>

Expected Behavior

No response

Steps To Reproduce

1. milvus configuration:

M = 64
ef = 128
efconstruction = 256
vector dimension = 512
metric = "COSINE"
index_type = "HNSW"


### Milvus Log

_No response_

### Anything else?

I also look into these issues:

1. [[Bug]: Milvus v2.4.10 failed to search: loaded collection do not found any channel in target, may be in recovery](https://github.com/milvus-io/milvus/issues/35877)
2. [[Bug]: search failed with error failed to search: node offline[node=-1]: channel not available[channel=by-dev-rootcoord-dml_xxv1] after minio pod kill chaos test](https://github.com/milvus-io/milvus/issues/31714)
3. [[Bug]: failed to search: collection not loaded](https://github.com/milvus-io/milvus/issues/30403)
4. [[Bug]: Search failed when we insert data into Milvus, both datanode and querynode keep crashing](https://github.com/milvus-io/milvus/issues/28588)
yanliang567 commented 1 month ago

@barzan-hayati please refer this doc to export the whole Milvus logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs. Also if would be very helpful if you could attach the etcd backup for investigation. Check this: https://github.com/milvus-io/birdwatcher for details about how to backup etcd with birdwatcher

/assign @barzan-hayati

barzan-hayati commented 1 month ago

Because the log file is too large to upload here, I have selected a portion that contains the relevant error.

milvus_log.zip

I haven't installed Go yet, so I was unable to log etcd through Birdwatcher.

xiaofan-luan commented 1 month ago

seems that you just don't have enough memory [2024/09/29 17:59:05.843 +00:00] [WARN] [delegator/delegator_data.go:472] ["worker failed to load segments"] [traceID=1a316b2e60e620141d3a6c576f755d0e] [collectionID=452785967535053061] [channel=by-dev-rootcoord-dml_1_452785967535053061v0] [replicaID=452785967859761153] [workID=3] [segments="[452785967535289170]"] [error="memory limit exceeded[predict=3.9962037e+11][limit=3.041807e+11]"]

xiaofan-luan commented 1 month ago

it seems you have 300GB memory(Weird number but you try to load 400GB data)

xiaofan-luan commented 1 month ago

a weird thing is that any of each segment should not take 100GB or more memory.

you need to use birdwatcher to check if there are large segment(more than 50GB in size)Then you will need to upgrade to 2.4.12 and there is a compaction help to reduce the maximum size of segments

barzan-hayati commented 1 month ago

seems that you just don't have enough memory [2024/09/29 17:59:05.843 +00:00] [WARN] [delegator/delegator_data.go:472] ["worker failed to load segments"] [traceID=1a316b2e60e620141d3a6c576f755d0e] [collectionID=452785967535053061] [channel=by-dev-rootcoord-dml_1_452785967535053061v0] [replicaID=452785967859761153] [workID=3] [segments="[452785967535289170]"] [error="memory limit exceeded[predict=3.9962037e+11][limit=3.041807e+11]"]

Milvus uses about half of the available memory, and I currently have 140 GB of free memory. Why should Milvus require additional memory during search operations? The collection has already been created, and indexing has been completed.

Please note that Milvus occasionally goes down after 8 hours and sometimes after just 4 hours.

I conducted another test, and when I encountered that error, I had to restart the Milvus Docker instance. It took approximately 3 minutes and 40 seconds to load the collection into RAM, after which I was able to perform searches again.

barzan-hayati commented 1 month ago

it seems you have 300GB memory(Weird number but you try to load 400GB data)

Why 400 G?

it seems you have 300GB memory(Weird number but you try to load 400GB data)

When I load the collection, it uses approximately 140 GB of RAM.

xiaofan-luan commented 1 month ago

3.9962037e+11][limit=3.041807e+11

based on the log it said 400G is requirement but your pod limit is 300GB.

This is my suggestion

  1. install birdwatcher
  2. check segmentinfo, see how many segments you have in your collection and what is the largest segment. My guess is that there is a large segment (More than 100G, while normal segment is usually )
  3. if that happens, you need to upgrade to 2.4.12 and trigger an compaction, it will split large segment into smaller ones https://github.com/milvus-io/milvus/issues/35584
  4. then the load overhead will be decreased dramatically
barzan-hayati commented 1 month ago

based on the log it said 400G is requirement but your pod limit is 300GB.


Let me present a scenario that demonstrates that memory usage remains relatively constant. I monitor the memory usage and log the details.

I conduct searches with various consistency levels and batch sizes. At the beginning of my program, the available memory is approximately 143 GB.

2024/09/30 13:46:26.659 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  100.000 total  283.290 available  143.899 percent 49.2 used  136.981 free  57.547 active  212.856 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.379 available_virtual_memory  50.796
2024/09/30 13:46:36.670 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.700 total  283.290 available  143.907 percent 49.2 used  136.973 free  57.555 active  212.850 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.379 available_virtual_memory  50.799
2024/09/30 13:46:46.682 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.200 total  283.290 available  143.902 percent 49.2 used  136.978 free  57.550 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.379 available_virtual_memory  50.797
2024/09/30 13:46:56.694 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.700 total  283.290 available  143.904 percent 49.2 used  136.977 free  57.551 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.797
2024/09/30 13:47:06.706 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  86.300 total  283.290 available  143.902 percent 49.2 used  136.979 free  57.549 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.797
2024/09/30 13:47:16.718 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.500 total  283.290 available  143.903 percent 49.2 used  136.978 free  57.550 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.797
2024/09/30 13:47:26.728 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.600 total  283.290 available  143.907 percent 49.2 used  136.974 free  57.554 active  212.853 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.798
2024/09/30 13:47:36.738 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  86.700 total  283.290 available  143.908 percent 49.2 used  136.972 free  57.555 active  212.851 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.799
2024/09/30 13:47:46.750 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.900 total  283.290 available  143.915 percent 49.2 used  136.966 free  57.562 active  212.846 inactive  6.359 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.801
2024/09/30 13:47:56.762 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.600 total  283.290 available  143.906 percent 49.2 used  136.974 free  57.553 active  212.850 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.798
2024/09/30 13:48:06.773 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  0.500 total  283.290 available  143.916 percent 49.2 used  136.964 free  57.564 active  212.837 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.802
2024/09/30 13:48:16.784 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  0.700 total  283.290 available  143.929 percent 49.2 used  136.951 free  57.576 active  212.831 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.806

After some time, as I mentioned, a search request made at 15:03:12 did not complete,

2024/09/30 15:03:12.264 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:22.272 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:32.285 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:42.296 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:52.308 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:02.320 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:12.336 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:22.349 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:32.364 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:42.379 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:52.395 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:02.410 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:12.426 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:22.442 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:32.472 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:42.486 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:52.501 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:02.518 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:12.532 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:22.548 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:32.561 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:42.577 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search

indicating that the Docker container had restarted (as confirmed by the Docker container uptime). I then checked the memory usage at that point. According to the log, the available memory before that time was around 143 GB, but after the Docker container restart, the available memory increased to 221 GB, then to 272 GB. Once the collection was fully loaded, the available memory returned to approximately 143 GB.

2024/09/30 15:02:01.973 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  80.900 total  283.290 available  143.891 percent 49.2 used  136.990 free  57.524 active  212.882 inactive  6.367 buffers  2.040 cached  86.736 shared  0.028 slab  5.383 available_virtual_memory  50.793
2024/09/30 15:02:11.985 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.500 total  283.290 available  143.893 percent 49.2 used  136.987 free  57.526 active  212.880 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.794
2024/09/30 15:02:21.998 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.900 total  283.290 available  143.897 percent 49.2 used  136.983 free  57.530 active  212.876 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.795
2024/09/30 15:02:32.010 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  78.500 total  283.290 available  143.890 percent 49.2 used  136.990 free  57.523 active  212.878 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.793
2024/09/30 15:02:42.022 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  80.600 total  283.290 available  143.892 percent 49.2 used  136.989 free  57.525 active  212.875 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.793
2024/09/30 15:02:52.034 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  46.200 total  283.290 available  221.237 percent 21.9 used  59.644 free  134.870 active  135.530 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.382 available_virtual_memory  78.098
2024/09/30 15:03:02.046 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  2.800 total  283.290 available  272.184 percent 3.9 used  8.696 free  185.825 active  84.930 inactive  6.360 buffers  2.040 cached  86.729 shared  0.028 slab  5.358 available_virtual_memory  96.080
2024/09/30 15:03:12.056 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  57.600 total  283.290 available  216.895 percent 23.4 used  63.985 free  130.532 active  140.056 inactive  6.363 buffers  2.040 cached  86.732 shared  0.028 slab  5.365 available_virtual_memory  76.563
2024/09/30 15:03:22.067 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  33.100 total  283.290 available  204.029 percent 28.0 used  76.852 free  117.655 active  152.924 inactive  6.374 buffers  2.040 cached  86.744 shared  0.028 slab  5.366 available_virtual_memory  72.021
2024/09/30 15:03:32.078 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.200 total  283.290 available  198.230 percent 30.0 used  82.651 free  111.850 active  158.705 inactive  6.380 buffers  2.040 cached  86.750 shared  0.028 slab  5.366 available_virtual_memory  69.974
2024/09/30 15:03:42.089 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.500 total  283.290 available  192.940 percent 31.9 used  87.940 free  106.554 active  163.984 inactive  6.386 buffers  2.040 cached  86.755 shared  0.028 slab  5.367 available_virtual_memory  68.107
2024/09/30 15:03:52.101 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  44.400 total  283.290 available  161.248 percent 43.1 used  119.632 free  74.858 active  195.619 inactive  6.391 buffers  2.040 cached  86.760 shared  0.028 slab  5.368 available_virtual_memory  56.920
2024/09/30 15:04:02.108 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.500 total  283.290 available  155.949 percent 45.0 used  124.931 free  69.554 active  200.900 inactive  6.396 buffers  2.040 cached  86.765 shared  0.028 slab  5.369 available_virtual_memory  55.049
2024/09/30 15:04:12.116 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.400 total  283.290 available  150.179 percent 47.0 used  130.701 free  63.779 active  206.665 inactive  6.400 buffers  2.040 cached  86.770 shared  0.028 slab  5.369 available_virtual_memory  53.012
2024/09/30 15:04:22.124 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.400 total  283.290 available  144.853 percent 48.9 used  136.027 free  58.448 active  211.974 inactive  6.405 buffers  2.040 cached  86.775 shared  0.028 slab  5.369 available_virtual_memory  51.133
2024/09/30 15:04:32.135 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.500 total  283.290 available  139.952 percent 50.6 used  140.929 free  53.542 active  216.870 inactive  6.409 buffers  2.040 cached  86.779 shared  0.028 slab  5.369 available_virtual_memory  49.402
2024/09/30 15:04:42.144 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  29.300 total  283.290 available  135.010 percent 52.3 used  145.869 free  48.596 active  221.795 inactive  6.414 buffers  2.040 cached  86.785 shared  0.028 slab  5.370 available_virtual_memory  47.658
2024/09/30 15:04:52.156 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  40.600 total  283.290 available  136.097 percent 52.0 used  144.783 free  49.678 active  220.683 inactive  6.419 buffers  2.040 cached  86.790 shared  0.028 slab  5.371 available_virtual_memory  48.041
2024/09/30 15:05:02.168 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  37.700 total  283.290 available  133.251 percent 53.0 used  147.629 free  46.826 active  223.525 inactive  6.424 buffers  2.040 cached  86.795 shared  0.028 slab  5.372 available_virtual_memory  47.037
2024/09/30 15:05:12.176 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  89.000 total  283.290 available  149.217 percent 47.3 used  131.664 free  62.787 active  207.548 inactive  6.429 buffers  2.040 cached  86.799 shared  0.028 slab  5.373 available_virtual_memory  52.674
2024/09/30 15:05:22.187 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  32.600 total  283.290 available  148.834 percent 47.5 used  132.046 free  62.399 active  207.931 inactive  6.434 buffers  2.040 cached  86.804 shared  0.028 slab  5.374 available_virtual_memory  52.537
2024/09/30 15:05:32.196 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  54.400 total  283.290 available  150.256 percent 47.0 used  130.624 free  63.816 active  206.511 inactive  6.439 buffers  2.040 cached  86.810 shared  0.028 slab  5.375 available_virtual_memory  53.040
2024/09/30 15:05:42.205 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.000 total  283.290 available  169.188 percent 40.3 used  111.692 free  82.744 active  187.592 inactive  6.443 buffers  2.040 cached  86.814 shared  0.028 slab  5.375 available_virtual_memory  59.724
2024/09/30 15:05:52.216 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  64.200 total  283.290 available  155.908 percent 45.0 used  124.973 free  69.457 active  200.870 inactive  6.448 buffers  2.040 cached  86.820 shared  0.028 slab  5.376 available_virtual_memory  55.034
2024/09/30 15:06:02.224 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  29.000 total  283.290 available  138.815 percent 51.0 used  142.065 free  52.365 active  217.930 inactive  6.449 buffers  2.040 cached  86.820 shared  0.028 slab  5.376 available_virtual_memory  49.001
2024/09/30 15:06:12.232 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  22.900 total  283.290 available  132.824 percent 53.1 used  148.056 free  46.373 active  223.928 inactive  6.449 buffers  2.040 cached  86.821 shared  0.028 slab  5.376 available_virtual_memory  46.886
2024/09/30 15:06:22.240 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  22.700 total  283.290 available  126.778 percent 55.2 used  154.102 free  40.327 active  229.964 inactive  6.450 buffers  2.040 cached  86.821 shared  0.028 slab  5.376 available_virtual_memory  44.752
2024/09/30 15:06:32.251 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  29.400 total  283.290 available  130.444 percent 54.0 used  150.436 free  43.992 active  226.291 inactive  6.451 buffers  2.040 cached  86.822 shared  0.028 slab  5.376 available_virtual_memory  46.046
2024/09/30 15:06:42.260 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  34.000 total  283.290 available  135.483 percent 52.2 used  145.397 free  49.030 active  221.250 inactive  6.451 buffers  2.040 cached  86.822 shared  0.028 slab  5.376 available_virtual_memory  47.824
2024/09/30 15:06:52.272 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.400 total  283.290 available  144.153 percent 49.1 used  136.727 free  57.699 active  212.572 inactive  6.452 buffers  2.040 cached  86.824 shared  0.028 slab  5.376 available_virtual_memory  50.885
2024/09/30 15:07:02.280 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.700 total  283.290 available  144.350 percent 49.0 used  136.531 free  57.895 active  212.398 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.376 available_virtual_memory  50.955
2024/09/30 15:07:12.292 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.300 total  283.290 available  144.145 percent 49.1 used  136.736 free  57.690 active  212.607 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.375 available_virtual_memory  50.882
2024/09/30 15:07:22.301 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  78.700 total  283.290 available  144.009 percent 49.2 used  136.872 free  57.554 active  212.742 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.375 available_virtual_memory  50.834
2024/09/30 15:07:32.313 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  78.600 total  283.290 available  144.003 percent 49.2 used  136.877 free  57.549 active  212.752 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.375 available_virtual_memory  50.832

This log file clearly shows that memory usage never utilizes the entire RAM. I also attempted to install Birdwatcher to examine the segment information.

xiaofan-luan commented 1 month ago

based on the log it said 400G is requirement but your pod limit is 300GB.

Let me present a scenario that demonstrates that memory usage remains relatively constant. I monitor the memory usage and log the details.

I conduct searches with various consistency levels and batch sizes. At the beginning of my program, the available memory is approximately 143 GB.

2024/09/30 13:46:26.659 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  100.000 total  283.290 available  143.899 percent 49.2 used  136.981 free  57.547 active  212.856 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.379 available_virtual_memory  50.796
2024/09/30 13:46:36.670 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.700 total  283.290 available  143.907 percent 49.2 used  136.973 free  57.555 active  212.850 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.379 available_virtual_memory  50.799
2024/09/30 13:46:46.682 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.200 total  283.290 available  143.902 percent 49.2 used  136.978 free  57.550 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.379 available_virtual_memory  50.797
2024/09/30 13:46:56.694 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.700 total  283.290 available  143.904 percent 49.2 used  136.977 free  57.551 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.797
2024/09/30 13:47:06.706 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  86.300 total  283.290 available  143.902 percent 49.2 used  136.979 free  57.549 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.797
2024/09/30 13:47:16.718 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.500 total  283.290 available  143.903 percent 49.2 used  136.978 free  57.550 active  212.854 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.797
2024/09/30 13:47:26.728 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.600 total  283.290 available  143.907 percent 49.2 used  136.974 free  57.554 active  212.853 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.798
2024/09/30 13:47:36.738 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  86.700 total  283.290 available  143.908 percent 49.2 used  136.972 free  57.555 active  212.851 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.799
2024/09/30 13:47:46.750 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  85.900 total  283.290 available  143.915 percent 49.2 used  136.966 free  57.562 active  212.846 inactive  6.359 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.801
2024/09/30 13:47:56.762 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.600 total  283.290 available  143.906 percent 49.2 used  136.974 free  57.553 active  212.850 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.798
2024/09/30 13:48:06.773 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  0.500 total  283.290 available  143.916 percent 49.2 used  136.964 free  57.564 active  212.837 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.802
2024/09/30 13:48:16.784 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  0.700 total  283.290 available  143.929 percent 49.2 used  136.951 free  57.576 active  212.831 inactive  6.358 buffers  2.038 cached  86.724 shared  0.028 slab  5.380 available_virtual_memory  50.806

After some time, as I mentioned, a search request made at 15:03:12 did not complete,

2024/09/30 15:03:12.264 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:22.272 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:32.285 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:42.296 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:03:52.308 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:02.320 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:12.336 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:22.349 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:32.364 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:42.379 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:04:52.395 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:02.410 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:12.426 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:22.442 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:32.472 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:42.486 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:05:52.501 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:02.518 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:12.532 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:22.548 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:32.561 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search
2024/09/30 15:06:42.577 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(694): male_female_student has been down in collection_search

indicating that the Docker container had restarted (as confirmed by the Docker container uptime). I then checked the memory usage at that point. According to the log, the available memory before that time was around 143 GB, but after the Docker container restart, the available memory increased to 221 GB, then to 272 GB. Once the collection was fully loaded, the available memory returned to approximately 143 GB.

2024/09/30 15:02:01.973 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  80.900 total  283.290 available  143.891 percent 49.2 used  136.990 free  57.524 active  212.882 inactive  6.367 buffers  2.040 cached  86.736 shared  0.028 slab  5.383 available_virtual_memory  50.793
2024/09/30 15:02:11.985 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.500 total  283.290 available  143.893 percent 49.2 used  136.987 free  57.526 active  212.880 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.794
2024/09/30 15:02:21.998 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.900 total  283.290 available  143.897 percent 49.2 used  136.983 free  57.530 active  212.876 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.795
2024/09/30 15:02:32.010 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  78.500 total  283.290 available  143.890 percent 49.2 used  136.990 free  57.523 active  212.878 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.793
2024/09/30 15:02:42.022 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  80.600 total  283.290 available  143.892 percent 49.2 used  136.989 free  57.525 active  212.875 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.383 available_virtual_memory  50.793
2024/09/30 15:02:52.034 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  46.200 total  283.290 available  221.237 percent 21.9 used  59.644 free  134.870 active  135.530 inactive  6.367 buffers  2.040 cached  86.737 shared  0.028 slab  5.382 available_virtual_memory  78.098
2024/09/30 15:03:02.046 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  2.800 total  283.290 available  272.184 percent 3.9 used  8.696 free  185.825 active  84.930 inactive  6.360 buffers  2.040 cached  86.729 shared  0.028 slab  5.358 available_virtual_memory  96.080
2024/09/30 15:03:12.056 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  57.600 total  283.290 available  216.895 percent 23.4 used  63.985 free  130.532 active  140.056 inactive  6.363 buffers  2.040 cached  86.732 shared  0.028 slab  5.365 available_virtual_memory  76.563
2024/09/30 15:03:22.067 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  33.100 total  283.290 available  204.029 percent 28.0 used  76.852 free  117.655 active  152.924 inactive  6.374 buffers  2.040 cached  86.744 shared  0.028 slab  5.366 available_virtual_memory  72.021
2024/09/30 15:03:32.078 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.200 total  283.290 available  198.230 percent 30.0 used  82.651 free  111.850 active  158.705 inactive  6.380 buffers  2.040 cached  86.750 shared  0.028 slab  5.366 available_virtual_memory  69.974
2024/09/30 15:03:42.089 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.500 total  283.290 available  192.940 percent 31.9 used  87.940 free  106.554 active  163.984 inactive  6.386 buffers  2.040 cached  86.755 shared  0.028 slab  5.367 available_virtual_memory  68.107
2024/09/30 15:03:52.101 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  44.400 total  283.290 available  161.248 percent 43.1 used  119.632 free  74.858 active  195.619 inactive  6.391 buffers  2.040 cached  86.760 shared  0.028 slab  5.368 available_virtual_memory  56.920
2024/09/30 15:04:02.108 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.500 total  283.290 available  155.949 percent 45.0 used  124.931 free  69.554 active  200.900 inactive  6.396 buffers  2.040 cached  86.765 shared  0.028 slab  5.369 available_virtual_memory  55.049
2024/09/30 15:04:12.116 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.400 total  283.290 available  150.179 percent 47.0 used  130.701 free  63.779 active  206.665 inactive  6.400 buffers  2.040 cached  86.770 shared  0.028 slab  5.369 available_virtual_memory  53.012
2024/09/30 15:04:22.124 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.400 total  283.290 available  144.853 percent 48.9 used  136.027 free  58.448 active  211.974 inactive  6.405 buffers  2.040 cached  86.775 shared  0.028 slab  5.369 available_virtual_memory  51.133
2024/09/30 15:04:32.135 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.500 total  283.290 available  139.952 percent 50.6 used  140.929 free  53.542 active  216.870 inactive  6.409 buffers  2.040 cached  86.779 shared  0.028 slab  5.369 available_virtual_memory  49.402
2024/09/30 15:04:42.144 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  29.300 total  283.290 available  135.010 percent 52.3 used  145.869 free  48.596 active  221.795 inactive  6.414 buffers  2.040 cached  86.785 shared  0.028 slab  5.370 available_virtual_memory  47.658
2024/09/30 15:04:52.156 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  40.600 total  283.290 available  136.097 percent 52.0 used  144.783 free  49.678 active  220.683 inactive  6.419 buffers  2.040 cached  86.790 shared  0.028 slab  5.371 available_virtual_memory  48.041
2024/09/30 15:05:02.168 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  37.700 total  283.290 available  133.251 percent 53.0 used  147.629 free  46.826 active  223.525 inactive  6.424 buffers  2.040 cached  86.795 shared  0.028 slab  5.372 available_virtual_memory  47.037
2024/09/30 15:05:12.176 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  89.000 total  283.290 available  149.217 percent 47.3 used  131.664 free  62.787 active  207.548 inactive  6.429 buffers  2.040 cached  86.799 shared  0.028 slab  5.373 available_virtual_memory  52.674
2024/09/30 15:05:22.187 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  32.600 total  283.290 available  148.834 percent 47.5 used  132.046 free  62.399 active  207.931 inactive  6.434 buffers  2.040 cached  86.804 shared  0.028 slab  5.374 available_virtual_memory  52.537
2024/09/30 15:05:32.196 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  54.400 total  283.290 available  150.256 percent 47.0 used  130.624 free  63.816 active  206.511 inactive  6.439 buffers  2.040 cached  86.810 shared  0.028 slab  5.375 available_virtual_memory  53.040
2024/09/30 15:05:42.205 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.000 total  283.290 available  169.188 percent 40.3 used  111.692 free  82.744 active  187.592 inactive  6.443 buffers  2.040 cached  86.814 shared  0.028 slab  5.375 available_virtual_memory  59.724
2024/09/30 15:05:52.216 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  64.200 total  283.290 available  155.908 percent 45.0 used  124.973 free  69.457 active  200.870 inactive  6.448 buffers  2.040 cached  86.820 shared  0.028 slab  5.376 available_virtual_memory  55.034
2024/09/30 15:06:02.224 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  29.000 total  283.290 available  138.815 percent 51.0 used  142.065 free  52.365 active  217.930 inactive  6.449 buffers  2.040 cached  86.820 shared  0.028 slab  5.376 available_virtual_memory  49.001
2024/09/30 15:06:12.232 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  22.900 total  283.290 available  132.824 percent 53.1 used  148.056 free  46.373 active  223.928 inactive  6.449 buffers  2.040 cached  86.821 shared  0.028 slab  5.376 available_virtual_memory  46.886
2024/09/30 15:06:22.240 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  22.700 total  283.290 available  126.778 percent 55.2 used  154.102 free  40.327 active  229.964 inactive  6.450 buffers  2.040 cached  86.821 shared  0.028 slab  5.376 available_virtual_memory  44.752
2024/09/30 15:06:32.251 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  29.400 total  283.290 available  130.444 percent 54.0 used  150.436 free  43.992 active  226.291 inactive  6.451 buffers  2.040 cached  86.822 shared  0.028 slab  5.376 available_virtual_memory  46.046
2024/09/30 15:06:42.260 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  34.000 total  283.290 available  135.483 percent 52.2 used  145.397 free  49.030 active  221.250 inactive  6.451 buffers  2.040 cached  86.822 shared  0.028 slab  5.376 available_virtual_memory  47.824
2024/09/30 15:06:52.272 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  28.400 total  283.290 available  144.153 percent 49.1 used  136.727 free  57.699 active  212.572 inactive  6.452 buffers  2.040 cached  86.824 shared  0.028 slab  5.376 available_virtual_memory  50.885
2024/09/30 15:07:02.280 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.700 total  283.290 available  144.350 percent 49.0 used  136.531 free  57.895 active  212.398 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.376 available_virtual_memory  50.955
2024/09/30 15:07:12.292 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  79.300 total  283.290 available  144.145 percent 49.1 used  136.736 free  57.690 active  212.607 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.375 available_virtual_memory  50.882
2024/09/30 15:07:22.301 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  78.700 total  283.290 available  144.009 percent 49.2 used  136.872 free  57.554 active  212.742 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.375 available_virtual_memory  50.834
2024/09/30 15:07:32.313 [INFO] [__main__] cpu_memory.check_cpu_memory_usage(21): cpu_percentage  78.600 total  283.290 available  144.003 percent 49.2 used  136.877 free  57.549 active  212.752 inactive  6.453 buffers  2.040 cached  86.824 shared  0.028 slab  5.375 available_virtual_memory  50.832

This log file clearly shows that memory usage never utilizes the entire RAM. I also attempted to install Birdwatcher to examine the segment information.

what error did you get? is there any pod restart? any error message here?

  1. are you tryin go issue many concurrent batch request? what is nq, topk and concurrency? Did you calcuclate how much memory it needs?
  2. if there any pod down, can you check there are some etcd session timeout log?

I would suggest to upgrade to 2.4.12 and retry

barzan-hayati commented 1 month ago

what error did you get?

I forgot to include the error log: here is the error log from that time.

2024/09/30 15:03:12.264 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:22.272 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:32.284 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:42.296 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:52.308 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:02.320 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:12.336 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:22.349 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:32.364 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:42.379 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:52.395 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:02.409 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:12.426 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:22.441 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:32.472 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:42.486 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:52.501 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:02.518 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:12.532 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:22.547 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:32.561 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:42.577 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
  1. what is nq, topk and concurrency?

I am looking for various nq values, which could be one from the set [1, 2, 4, 8, 16, 32, 64] with a topk value of 50 for all searches. I don't have any knowledge about the concurrency value, but a search is initiated each time.

  1. Did you calcuclate how much memory it needs?

I utilized Milvus Sizing Tool to calculate the necessary RAM and hard disk space, and it appears that I have sufficient memory and storage.

image

  1. if there any pod down, can you check there are some etcd session timeout log?

I don't utilize clusters or Kubernetes, and I think that pods are not applicable to my situation.

However, I'm attempting to check segment size using Birdwatcher, but it was unable to connect. I've also created an issue regarding this failed to connect to etcd: context deadline exceeded.

xiaofan-luan commented 1 month ago

what error did you get?

I forgot to include the error log: here is the error log from that time.

2024/09/30 15:03:12.264 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:22.272 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:32.284 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:42.296 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:03:52.308 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:02.320 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:12.336 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:22.349 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:32.364 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:42.379 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:04:52.395 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:02.409 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:12.426 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:22.441 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:32.472 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:42.486 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:05:52.501 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:02.518 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:12.532 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:22.547 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:32.561 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
2024/09/30 15:06:42.577 [INFO] [sqlite_db] insert_data_into_partitions_key.collection_search(693): Error: <MilvusException: (code=106, message=failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452785967535053061])>
  1. what is nq, topk and concurrency?

I am looking for various nq values, which could be one from the set [1, 2, 4, 8, 16, 32, 64] with a topk value of 50 for all searches. I don't have any knowledge about the concurrency value, but a search is initiated each time.

  1. Did you calcuclate how much memory it needs?

I utilized Milvus Sizing Tool to calculate the necessary RAM and hard disk space, and it appears that I have sufficient memory and storage.

image

  1. if there any pod down, can you check there are some etcd session timeout log?

I don't utilize clusters or Kubernetes, and I think that pods are not applicable to my situation.

However, I'm attempting to check segment size using Birdwatcher, but it was unable to connect. I've also created an issue regarding this failed to connect to etcd: context deadline exceeded.

So you are running on a standalone node with 283.29 GB memory?

is it a docker deployment or a docker compose deployment?

If it's docker compose you should be able to connect with birdwatcher and see segment info.

It's weird that your memory usage is 150GB GB(From the log you offered) but the estimation is more than 300GB.

Here is some guess of what happened:

  1. Your collection has many deletes
  2. Your segment is huge (more than 50GB)

If you can list S3 directory then we can probably do some further analysis.

My recommendation is:

  1. Try to increase memory to 500GB see if load can succeed
  2. upgrade to 2.4.12
  3. analyze with birdwatcher. the memory usage function is at getResourceUsageEstimateOfSegment @ Segment loader.go
barzan-hayati commented 1 month ago

So you are running on a standalone node with 283.29 GB memory?

Yes.

is it a docker deployment or a docker compose deployment?

I am using this docker compose file.

If it's docker compose you should be able to connect with birdwatcher and see segment info.

But I could not connect in birwatcher.

Your collection has many deletes

I don't perform any delete operations on the Milvus collection; I only add data to it at the moment.

Your segment is huge (more than 50GB)

I have no information on that. I was unable to check the segment size using Birdwatcher or obtain query segment information via get_query_segment_info() function.

If you can list S3 directory then we can probably do some further analysis.

Where can I find the S3 directory? I couldn't locate it.

upgrade to 2.4.12

Currently, I am using this version.

It's weird that your memory usage is 150GB GB(From the log you offered) but the estimation is more than 300GB.

It's strange. I have 10 files, each containing 5 million rows with 512 dimensions. When I try to add them all at once, I run into issues. However, if I add them one by one, they can be successfully added to the collection. After adding each set of 5 million records, I perform indexing before attempting to add the next one.

xiaofan-luan commented 1 month ago

g before attempting to add the next one.

So you are using import and try to import 10 files, each with 5million 512 dim data?

We need segment distribution info, probably the reason you can not connect to etcd is you need to export 2379 port of etcd to your host machine or your laptop.

barzan-hayati commented 1 month ago

So you are using import and try to import 10 files, each with 5million 512 dim data?

Thats right.

We need segment distribution info, probably the reason you can not connect to etcd is you need to export 2379 port of etcd to your host machine or your laptop.

let me chack it.

xiaofan-luan commented 1 month ago

we need more evidence on how large is segment, even if you are trying to import multiple large files milvus should split all the files into small segments

"When I try to add them all at once, I run into issues. However, if I add them one by one, they can be successfully added to the collection" Another guess might be since you import too much file at the same time, index build can not catch up. And we you try to load multiple segment without index, the resource estimation maybe higher than actual usage.

try to change queryNode.segcore.interimIndex.enableIndex = false see if it helps

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.