milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: When deleting a varchar collection, deleteBufferSize expands 7 times and compaction is not triggered in time #37582

Open ThreadDao opened 2 days ago

ThreadDao commented 2 days ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20241106-20534a3f-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server config

querynode: 5 replicas × 8c32g (8 CPU / 32 GiB memory each)

    dataCoord:
      compaction:
        taskPrioritizer: default
      enableActiveStandby: true
      segment:
        expansionRate: 1.15
        maxSize: 2048
        sealProportion: 0.12
    queryNode:
      levelZeroForwardPolicy: RemoteLoad
      streamingDeltaForwardPolicy: Direct
    quotaAndLimits:
      dml:
        deleteRate:
          max: 2
        enabled: true
        insertRate:
          max: 16
      limitWriting:
        deleteBufferRowCountProtection:
          enabled: true
          highWaterLevel: 25000000
          lowWaterLevel: 12000000
        deleteBufferSizeProtection:
          enabled: true
          highWaterLevel: 1073741824  # 1 GiB
          lowWaterLevel: 268435456  # 256 MiB
        growingSegmentsSizeProtection:
          enabled: true
          highWaterLevel: 0.2
          lowWaterLevel: 0.1
          minRateRatio: 0.5
        l0SegmentsRowCountProtection:
          enabled: true
          highWaterLevel: 50000000
          lowWaterLevel: 25000000
        memProtection:
          dataNodeMemoryHighWaterLevel: 0.85
          dataNodeMemoryLowWaterLevel: 0.75
          queryNodeMemoryHighWaterLevel: 0.85
          queryNodeMemoryLowWaterLevel: 0.75
      limits:
        complexDeleteLimitEnable: true
        maxOutputSize: 209715200
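
For reference, the deleteBufferSizeProtection low/high water levels above are the kind of back-pressure setting where the delete rate is reduced once the buffer passes the low water level and blocked above the high one. Below is a minimal standalone sketch of that water-level idea; it is only an illustration under that assumption, with made-up names, not Milvus's actual quota/limit code.

    package main

    import "fmt"

    // throttleFactor returns a multiplier in [0, 1] applied to the configured
    // delete rate, based on how full the delete buffer is relative to the
    // low/high water levels. Linear scaling between the two levels is an
    // illustrative choice, not necessarily what Milvus does.
    func throttleFactor(bufBytes, lowWater, highWater int64) float64 {
        switch {
        case bufBytes <= lowWater:
            return 1.0 // below the low water level: no throttling
        case bufBytes >= highWater:
            return 0.0 // above the high water level: deletes are blocked
        default:
            return 1.0 - float64(bufBytes-lowWater)/float64(highWater-lowWater)
        }
    }

    func main() {
        const (
            lowWater  = 268435456  // 256 MiB, as configured above
            highWater = 1073741824 // 1 GiB, as configured above
        )
        for _, buf := range []int64{100 << 20, 512 << 20, 2 << 30} {
            fmt.Printf("buffer = %4d MiB -> rate factor %.2f\n",
                buf>>20, throttleFactor(buf, lowWater, highWater))
        }
    }

If each buffered varchar PK is overcounted by roughly 7× (see the last comment below), the buffer would reach these levels about 7× sooner than the raw delete payload suggests.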

test steps

  1. The collection has a varchar field of length 64 and a vector field of dim 128.
  2. Delete 60 million PKs out of the 100 million total, in batches of 60,000, while running concurrent searches (a minimal sketch of this workload follows the list).
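
A minimal, self-contained sketch of the delete workload in step 2, assuming string primary keys; `deleteBatch` and the PK format are hypothetical placeholders for the actual SDK delete call and key scheme, not real Milvus APIs. The point is only the shape of the load: 60,000,000 / 60,000 = 1,000 delete requests.

    package main

    import "fmt"

    // deleteBatch is a hypothetical stand-in for the SDK call that deletes one
    // batch of primary keys (e.g. delete by PK expression); it is not a real
    // Milvus API and only marks where the client call would go.
    func deleteBatch(pks []string) error {
        _ = pks
        return nil
    }

    func main() {
        const (
            totalPKs  = 60_000_000 // delete 60M of the 100M inserted PKs
            batchSize = 60_000     // => 1,000 delete requests in total
        )
        batch := make([]string, 0, batchSize)
        for i := 0; i < totalPKs; i++ {
            batch = append(batch, fmt.Sprintf("pk-%010d", i)) // placeholder varchar PKs
            if len(batch) == batchSize {
                if err := deleteBatch(batch); err != nil {
                    panic(err)
                }
                batch = batch[:0]
            }
        }
        fmt.Println("issued", totalPKs/batchSize, "delete batches")
    }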

test results

  1. querynode memory usage: during the target update, QN memory fluctuated by about 30% (roughly 10 GiB). Please help confirm whether this is in line with expectations, and whether it can be optimized. FYI, levelZeroForwardPolicy is RemoteLoad and segment maxSize is 2048. [image]

  2. compaction trigger: compaction was triggered only 18 minutes after the deletion started. Why does it take so long? [image]

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

pods:

compact-opt-rate-100m-1-milvus-datanode-677f6dfd9f-2s7fd          1/1     Running                  0                5d3h    10.104.24.14    4am-node29   <none>           <none>
compact-opt-rate-100m-1-milvus-indexnode-86d4bbc5f6-kmxt7         1/1     Running                  0                5d3h    10.104.4.224    4am-node11   <none>           <none>
compact-opt-rate-100m-1-milvus-indexnode-86d4bbc5f6-nsw9r         1/1     Running                  0                5d3h    10.104.15.2     4am-node20   <none>           <none>
compact-opt-rate-100m-1-milvus-mixcoord-78fd7d5865-skv2v          1/1     Running                  0                5d3h    10.104.13.63    4am-node16   <none>           <none>
compact-opt-rate-100m-1-milvus-proxy-567b6694bf-l4ttz             1/1     Running                  0                5d3h    10.104.1.97     4am-node10   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-5qxl8       1/1     Running                  1 (3d15h ago)    5d3h    10.104.30.179   4am-node38   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-82ddx       1/1     Running                  0                26h     10.104.18.10    4am-node25   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-c9pw4       1/1     Running                  0                2d20h   10.104.16.219   4am-node21   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-gg5n2       1/1     Running                  12 (3d15h ago)   5d3h    10.104.25.45    4am-node30   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-npv9j       1/1     Running                  4 (3d15h ago)    5d3h    10.104.17.169   4am-node23   <none>           <none>

Anything else?

No response

ThreadDao commented 2 days ago

/assign @XuanYang-cn Please help investigate

XuanYang-cn commented 2 days ago

Actually, L0 compaction executes so fast that the compaction task num metric increments and decrements within one 30s scrape interval. That's why it doesn't show up in the compaction task num metric, but the compaction latency metric and the logs prove it ran. [images]
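
A toy sketch of why a fast task is invisible to a sampled gauge: with a 30-second scrape interval, a task count that goes up and back down between two samples leaves every sampled value unchanged, so only latency metrics and logs show the work. The scraper below is a stand-in (with the interval scaled down so it runs quickly), not Milvus's metrics pipeline.

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    func main() {
        var taskNum atomic.Int64 // stands in for the compaction task num gauge

        // The "scraper" samples the gauge on a fixed interval (30s in practice,
        // scaled down to 30ms here so the sketch finishes quickly).
        scrape := time.NewTicker(30 * time.Millisecond)
        defer scrape.Stop()

        // An L0 compaction that starts and finishes well inside one interval.
        go func() {
            taskNum.Add(1)                   // task submitted
            time.Sleep(5 * time.Millisecond) // executes "too fast"
            taskNum.Add(-1)                  // task finished
        }()

        for i := 1; i <= 3; i++ {
            <-scrape.C
            // Every sample reads 0: the +1/-1 happened between samples.
            fmt.Printf("sample %d: task num = %d\n", i, taskNum.Load())
        }
    }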

XuanYang-cn commented 2 days ago

It triggers and executes quickly, but the number of L0 segments still cannot be controlled.

Only 2 segments were picked out of 37. Some config changes might be needed for varchar collections. [image]

XuanYang-cn commented 1 day ago

For a UUID string (36 characters), the actual reported size of the PrimaryKey is about 7 times the expected size.

=== RUN   TestVarCharPrimaryKey/size
    primary_key_test.go:19: 
            Error Trace:    /home/yangxuan/Github/milvus/internal/storage/primary_key_test.go:19
            Error:          Not equal: 
                            expected: int(44)
                            actual  : int64(296)
            Test:           TestVarCharPrimaryKey/size
            Messages:       uuid: f99f07ce-b546-4639-a24a-013929475a99
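
The two numbers in the test output line up with a per-character overcount: the expected size is 44 = 8 + 36 (an 8-byte overhead plus one byte per character of the 36-character UUID), while the reported size is 296 = 8 + 36 × 8, i.e. 8 bytes charged per character, which matches the roughly 7× expansion described above. The sketch below only reproduces that arithmetic; it is a reconstruction from the test numbers, not the actual `Size()` implementation in `internal/storage`.

    package main

    import "fmt"

    // expectedSize is the intuitive accounting for a varchar primary key:
    // a fixed 8-byte overhead plus one byte per character.
    func expectedSize(pk string) int64 {
        return 8 + int64(len(pk))
    }

    // reportedSize reproduces the number seen in the failing test: the same
    // 8-byte overhead, but 8 bytes charged per character. This is only a
    // reconstruction from the 44-vs-296 output, not the real Milvus code.
    func reportedSize(pk string) int64 {
        return 8 + int64(len(pk))*8
    }

    func main() {
        uuid := "f99f07ce-b546-4639-a24a-013929475a99" // 36 characters, as in the test
        exp, got := expectedSize(uuid), reportedSize(uuid)
        fmt.Println("expected:", exp) // 44
        fmt.Println("reported:", got) // 296
        fmt.Printf("expansion: %.1fx\n", float64(got)/float64(exp)) // ~6.7x
    }

If the delete buffer accounting charges 8 bytes per character like this, every buffered varchar delete is counted at roughly 7× its real payload, which would explain the deleteBufferSize expansion in the issue title and the buffer crossing the 1 GiB high water level much earlier than expected.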