milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [Nightly] Milvus cluster compaction frequently failed for timeout #33511

Closed NicoYuan1986 closed 4 weeks ago

NicoYuan1986 commented 3 months ago

Is there an existing issue for this?

Environment

- Milvus version: 23dedc2
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar && kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Compaction on a Milvus cluster frequently fails with a timeout.

[2024-05-30T19:42:24.457Z]         collection_w.load()
[2024-05-30T19:42:24.457Z]         cost = 180
[2024-05-30T19:42:24.457Z]         start = time()
[2024-05-30T19:42:24.457Z]         while True:
[2024-05-30T19:42:24.457Z]             sleep(1)
[2024-05-30T19:42:24.457Z]             segments_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]
[2024-05-30T19:42:24.457Z]     
[2024-05-30T19:42:24.457Z]             # verify segments reaches threshold, auto-merge ten segments into one
[2024-05-30T19:42:24.457Z]             if len(segments_info) == 1:
[2024-05-30T19:42:24.457Z]                 break
[2024-05-30T19:42:24.457Z]             end = time()
[2024-05-30T19:42:24.457Z]             if end - start > cost:
[2024-05-30T19:42:24.457Z] >               raise MilvusException(1, "Compact merge multiple segments more than 180s")
[2024-05-30T19:42:24.457Z] E               pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Compact merge multiple segments more than 180s)>
[2024-05-30T19:42:24.457Z] 
[2024-05-30T19:42:24.457Z] testcases/test_compaction.py:928: MilvusException
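
For readers following the traceback, the wait loop it shows can be sketched as a standalone helper. This is only an illustration, not the code in testcases/test_compaction.py: `wait_for_segment_count` and `fetch_segments` are hypothetical names, and `fetch_segments` stands in for a call such as `get_query_segment_info`. The sketch reports the last observed segment count on timeout, which the failure message above does not include.

```python
import time
from typing import Callable, Sequence


def wait_for_segment_count(
    fetch_segments: Callable[[], Sequence],
    expected: int,
    timeout_s: float = 180.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until fetch_segments() returns `expected` items.

    Returns the elapsed seconds on success. Raises TimeoutError carrying
    the last observed segment count once the deadline has passed.
    `clock` and `sleep` are injectable so the loop can be unit tested.
    """
    start = clock()
    while True:
        segments = fetch_segments()  # hypothetical stand-in for the segment query
        if len(segments) == expected:
            return clock() - start
        elapsed = clock() - start
        if elapsed > timeout_s:
            raise TimeoutError(
                f"expected {expected} segment(s), still saw {len(segments)} "
                f"after {elapsed:.0f}s"
            )
        sleep(interval_s)
```

Using a monotonic clock avoids false timeouts if the wall clock is adjusted during a long nightly run.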

Expected Behavior

Compaction completes successfully.

Steps To Reproduce

1. Create a collection with shard_num=1
2. Insert one and flush (less than the threshold)
3. Compact
4. Load and search

Milvus Log

  1. link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/758/pipeline/275
  2. log: artifacts-milvus-distributed-kafka-nightly-758-pymilvus-e2e-logs.tar.gz
  3. collection name: compact_juqe5eND
  4. failed time: [2024-05-30T18:14:51.267Z] [gw2] [ 21%] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_merge_multi_segments
  5. other failed cases:
    [2024-05-30T19:42:24.492Z] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_merge_multi_segments - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Compact merge multiple segments more than 180s)>
    [2024-05-30T19:42:24.492Z] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_insert - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Waiting more than 180s for the new target segment to load)>
    [2024-05-30T19:42:24.492Z] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_merge_two_segments - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Compact merge two segments more than 180s)>

Anything else?

No response

xiaofan-luan commented 3 months ago

How fast does the flush happen? My guess is that this is because there are a lot of segments and compaction is blocked. I think e2e tests are better not designed around performance issues.

yanliang567 commented 3 months ago

This is not a performance test; it just expects a compaction task to complete and then verifies the results.

/assign @XuanYang-cn
/unassign

NicoYuan1986 commented 2 months ago

Reproduced. Failed cases:

[2024-06-25T22:22:36.995Z] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_after_binary_index - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Handoff after compact and index cost more than 180s)>
[2024-06-25T22:22:36.995Z] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_merge_multi_segments - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Compact merge multiple segments more than 180s)>
[2024-06-25T22:22:36.995Z] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_during_insert - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Waiting more than 240s for the new segment indexed)>
[2024-06-25T22:22:36.995Z] FAILED testcases/test_compaction.py::TestCompactionOperation::test_compact_merge_two_segments - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Compact merge two segments more than 180s)>
[2024-06-25T22:22:36.995Z] FAILED testcases/test_query.py::TestQueryCount::test_count_compact_merge - assert 2 == 1
[2024-06-25T22:22:36.996Z]  +  where 2 = len([segmentID: 450716539565349536\ncollectionID: 450716539565349525\npartitionID: 450716539565349526\nnum_rows: 100\nstate: Sealed\nnodeIds: 3\n, segmentID: 450716539565349700\ncollectionID: 450716539565349525\npartitionID: 450716539565349526\nnum_rows: 100\nstate: Sealed\nnodeIds: 9\n])

Take the case test_compact_during_insert as an example:

  1. insert entities into multi segments
  2. start a thread to load and search
  3. compact collection -> timeout
[2024-06-25T22:22:36.962Z]         # waitting for new segment index and compact
[2024-06-25T22:22:36.962Z]         index_cost = 240
[2024-06-25T22:22:36.962Z]         start = time()
[2024-06-25T22:22:36.962Z]         while True:
[2024-06-25T22:22:36.962Z]             sleep(10)
[2024-06-25T22:22:36.962Z]             collection_w.load()
[2024-06-25T22:22:36.962Z]             # new segment compacted
[2024-06-25T22:22:36.962Z]             seg_info = self.utility_wrap.get_query_segment_info(collection_w.name)[0]
[2024-06-25T22:22:36.962Z]             if len(seg_info) == 2:
[2024-06-25T22:22:36.962Z]                 break
[2024-06-25T22:22:36.962Z]             end = time()
[2024-06-25T22:22:36.962Z]             collection_w.release()
[2024-06-25T22:22:36.963Z]             if end - start > index_cost:
[2024-06-25T22:22:36.963Z] >               raise MilvusException(1, f"Waiting more than {index_cost}s for the new segment indexed")
[2024-06-25T22:22:36.963Z] E               pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Waiting more than 240s for the new segment indexed)>
  1. link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/2.4/81/pipeline/271/
  2. log: artifacts-milvus-distributed-kafka-nightly-81-pymilvus-e2e-logs.tar.gz
  3. collection name: compact_F0vo8ehh
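
The failure messages above mix several phases (compaction itself, indexing, handoff/load of the new segment), so a single 180s or 240s budget does not show which phase is slow. A minimal sketch of timing the phases separately follows. It assumes `collection` exposes `compact()` and `wait_for_compaction_completed(timeout=...)` like the pymilvus ORM `Collection`; `timed_compaction` and `fetch_segments` are hypothetical names introduced only for illustration.

```python
import time
from typing import Callable, Dict, Sequence


def timed_compaction(
    collection,
    fetch_segments: Callable[[], Sequence],
    expected: int,
    compaction_timeout_s: float = 180.0,
    handoff_timeout_s: float = 180.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, float]:
    """Run a compaction and report how long each phase took, in seconds."""
    timings: Dict[str, float] = {}

    # Phase 1: the compaction task itself (assumed pymilvus-style methods).
    start = clock()
    collection.compact()
    collection.wait_for_compaction_completed(timeout=compaction_timeout_s)
    timings["compaction_s"] = clock() - start

    # Phase 2: wait until the queried segment list reflects the compacted result.
    handoff_start = clock()
    while len(fetch_segments()) != expected:
        if clock() - handoff_start > handoff_timeout_s:
            raise TimeoutError(
                f"compaction finished in {timings['compaction_s']:.0f}s, but the "
                f"segment count did not reach {expected} within {handoff_timeout_s:.0f}s"
            )
        sleep(interval_s)
    timings["handoff_s"] = clock() - handoff_start
    return timings
```

If phase 1 returns quickly and phase 2 times out, the delay is after compaction rather than in it, which would help narrow down whether compaction is actually blocked.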

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.