milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: When deleting a varchar collection, deleteBufferSize expands 7 times and compaction is not triggered in time #37582

Open ThreadDao opened 2 days ago

ThreadDao commented 2 days ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20241106-20534a3f-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server config

querynode: 5 replicas × 8c32g (8 CPU / 32 GiB memory each)

    dataCoord:
      compaction:
        taskPrioritizer: default
      enableActiveStandby: true
      segment:
        expansionRate: 1.15
        maxSize: 2048
        sealProportion: 0.12
    queryNode:
      levelZeroForwardPolicy: RemoteLoad
      streamingDeltaForwardPolicy: Direct
    quotaAndLimits:
      dml:
        deleteRate:
          max: 2
        enabled: true
        insertRate:
          max: 16
      limitWriting:
        deleteBufferRowCountProtection:
          enabled: true
          highWaterLevel: 25000000
          lowWaterLevel: 12000000
        deleteBufferSizeProtection:
          enabled: true
          highWaterLevel: 1073741824  # 1 GiB
          lowWaterLevel: 268435456  # 256 MiB
        growingSegmentsSizeProtection:
          enabled: true
          highWaterLevel: 0.2
          lowWaterLevel: 0.1
          minRateRatio: 0.5
        l0SegmentsRowCountProtection:
          enabled: true
          highWaterLevel: 50000000
          lowWaterLevel: 25000000
        memProtection:
          dataNodeMemoryHighWaterLevel: 0.85
          dataNodeMemoryLowWaterLevel: 0.75
          queryNodeMemoryHighWaterLevel: 0.85
          queryNodeMemoryLowWaterLevel: 0.75
      limits:
        complexDeleteLimitEnable: true
        maxOutputSize: 209715200
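
For reference, the deleteBufferSizeProtection low/high water levels above are the kind of back-pressure setting where the delete rate is reduced once the buffer passes the low water level and blocked above the high one. Below is a minimal standalone sketch of that water-level idea; it is only an illustration under that assumption, with made-up names, not Milvus's actual quota/limit code.

    package main

    import "fmt"

    // throttleFactor returns a multiplier in [0, 1] applied to the configured
    // delete rate, based on how full the delete buffer is relative to the
    // low/high water levels. Linear scaling between the two levels is an
    // illustrative choice, not necessarily what Milvus does.
    func throttleFactor(bufBytes, lowWater, highWater int64) float64 {
        switch {
        case bufBytes <= lowWater:
            return 1.0 // below the low water level: no throttling
        case bufBytes >= highWater:
            return 0.0 // above the high water level: deletes are blocked
        default:
            return 1.0 - float64(bufBytes-lowWater)/float64(highWater-lowWater)
        }
    }

    func main() {
        const (
            lowWater  = 268435456  // 256 MiB, as configured above
            highWater = 1073741824 // 1 GiB, as configured above
        )
        for _, buf := range []int64{100 << 20, 512 << 20, 2 << 30} {
            fmt.Printf("buffer = %4d MiB -> rate factor %.2f\n",
                buf>>20, throttleFactor(buf, lowWater, highWater))
        }
    }

If each buffered varchar PK is overcounted by roughly 7× (see the last comment below), the buffer would reach these levels about 7× sooner than the raw delete payload suggests.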

test steps

  1. The collection has a varchar field of length 64 and a vector field of dim 128.
  2. Delete 60 million PKs out of the 100 million total, in batches of 60,000, while running concurrent searches (a minimal sketch of this workload follows the list).
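
A minimal, self-contained sketch of the delete workload in step 2, assuming string primary keys; `deleteBatch` and the PK format are hypothetical placeholders for the actual SDK delete call and key scheme, not real Milvus APIs. The point is only the shape of the load: 60,000,000 / 60,000 = 1,000 delete requests.

    package main

    import "fmt"

    // deleteBatch is a hypothetical stand-in for the SDK call that deletes one
    // batch of primary keys (e.g. delete by PK expression); it is not a real
    // Milvus API and only marks where the client call would go.
    func deleteBatch(pks []string) error {
        _ = pks
        return nil
    }

    func main() {
        const (
            totalPKs  = 60_000_000 // delete 60M of the 100M inserted PKs
            batchSize = 60_000     // => 1,000 delete requests in total
        )
        batch := make([]string, 0, batchSize)
        for i := 0; i < totalPKs; i++ {
            batch = append(batch, fmt.Sprintf("pk-%010d", i)) // placeholder varchar PKs
            if len(batch) == batchSize {
                if err := deleteBatch(batch); err != nil {
                    panic(err)
                }
                batch = batch[:0]
            }
        }
        fmt.Println("issued", totalPKs/batchSize, "delete batches")
    }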

test results

  1. querynode memory usage: during the target update, QN memory fluctuated by about 30% (roughly 10 GiB). Please help confirm whether this is in line with expectations, and whether it can be optimized. FYI, levelZeroForwardPolicy is RemoteLoad and segment maxSize is 2048. [image]

  2. compaction trigger: compaction was triggered only 18 minutes after the deletion started. Why does it take so long? [image]

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

pods:

compact-opt-rate-100m-1-milvus-datanode-677f6dfd9f-2s7fd          1/1     Running                  0                5d3h    10.104.24.14    4am-node29   <none>           <none>
compact-opt-rate-100m-1-milvus-indexnode-86d4bbc5f6-kmxt7         1/1     Running                  0                5d3h    10.104.4.224    4am-node11   <none>           <none>
compact-opt-rate-100m-1-milvus-indexnode-86d4bbc5f6-nsw9r         1/1     Running                  0                5d3h    10.104.15.2     4am-node20   <none>           <none>
compact-opt-rate-100m-1-milvus-mixcoord-78fd7d5865-skv2v          1/1     Running                  0                5d3h    10.104.13.63    4am-node16   <none>           <none>
compact-opt-rate-100m-1-milvus-proxy-567b6694bf-l4ttz             1/1     Running                  0                5d3h    10.104.1.97     4am-node10   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-5qxl8       1/1     Running                  1 (3d15h ago)    5d3h    10.104.30.179   4am-node38   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-82ddx       1/1     Running                  0                26h     10.104.18.10    4am-node25   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-c9pw4       1/1     Running                  0                2d20h   10.104.16.219   4am-node21   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-gg5n2       1/1     Running                  12 (3d15h ago)   5d3h    10.104.25.45    4am-node30   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-npv9j       1/1     Running                  4 (3d15h ago)    5d3h    10.104.17.169   4am-node23   <none>           <none>

Anything else?

No response

ThreadDao commented 2 days ago

/assign @XuanYang-cn Please help investigate

XuanYang-cn commented 2 days ago

Actually, L0 compaction executes so fast that the compaction task num metric increments and decrements within one 30s scrape interval. That's why it doesn't show up in the compaction task num metric, but the compaction latency metric and the logs prove it ran. [images]
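
A toy sketch of why a fast task is invisible to a sampled gauge: with a 30-second scrape interval, a task count that goes up and back down between two samples leaves every sampled value unchanged, so only latency metrics and logs show the work. The scraper below is a stand-in (with the interval scaled down so it runs quickly), not Milvus's metrics pipeline.

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    func main() {
        var taskNum atomic.Int64 // stands in for the compaction task num gauge

        // The "scraper" samples the gauge on a fixed interval (30s in practice,
        // scaled down to 30ms here so the sketch finishes quickly).
        scrape := time.NewTicker(30 * time.Millisecond)
        defer scrape.Stop()

        // An L0 compaction that starts and finishes well inside one interval.
        go func() {
            taskNum.Add(1)                   // task submitted
            time.Sleep(5 * time.Millisecond) // executes "too fast"
            taskNum.Add(-1)                  // task finished
        }()

        for i := 1; i <= 3; i++ {
            <-scrape.C
            // Every sample reads 0: the +1/-1 happened between samples.
            fmt.Printf("sample %d: task num = %d\n", i, taskNum.Load())
        }
    }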

XuanYang-cn commented 2 days ago

It triggers and executes quickly, but the number of L0 segments still cannot be controlled.

Only 2 segments were picked out of 37. Some config changes might be needed for varchar collections. [image]

XuanYang-cn commented 1 day ago

For a UUID string (36 characters), the actual reported size of the PrimaryKey is about 7 times the expected size.

=== RUN   TestVarCharPrimaryKey/size
    primary_key_test.go:19: 
            Error Trace:    /home/yangxuan/Github/milvus/internal/storage/primary_key_test.go:19
            Error:          Not equal: 
                            expected: int(44)
                            actual  : int64(296)
            Test:           TestVarCharPrimaryKey/size
            Messages:       uuid: f99f07ce-b546-4639-a24a-013929475a99
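
The two numbers in the test output line up with a per-character overcount: the expected size is 44 = 8 + 36 (an 8-byte overhead plus one byte per character of the 36-character UUID), while the reported size is 296 = 8 + 36 × 8, i.e. 8 bytes charged per character, which matches the roughly 7× expansion described above. The sketch below only reproduces that arithmetic; it is a reconstruction from the test numbers, not the actual `Size()` implementation in `internal/storage`.

    package main

    import "fmt"

    // expectedSize is the intuitive accounting for a varchar primary key:
    // a fixed 8-byte overhead plus one byte per character.
    func expectedSize(pk string) int64 {
        return 8 + int64(len(pk))
    }

    // reportedSize reproduces the number seen in the failing test: the same
    // 8-byte overhead, but 8 bytes charged per character. This is only a
    // reconstruction from the 44-vs-296 output, not the real Milvus code.
    func reportedSize(pk string) int64 {
        return 8 + int64(len(pk))*8
    }

    func main() {
        uuid := "f99f07ce-b546-4639-a24a-013929475a99" // 36 characters, as in the test
        exp, got := expectedSize(uuid), reportedSize(uuid)
        fmt.Println("expected:", exp) // 44
        fmt.Println("reported:", got) // 296
        fmt.Printf("expansion: %.1fx\n", float64(got)/float64(exp)) // ~6.7x
    }

If the delete buffer accounting charges 8 bytes per character like this, every buffered varchar delete is counted at roughly 7× its real payload, which would explain the deleteBufferSize expansion in the issue title and the buffer crossing the 1 GiB high water level much earlier than expected.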