milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: DataNode keeps restarting due to error: failed to serialize merged stats log: shall not serialize zero length statslog list #36723

Open ThreadDao opened 1 month ago

ThreadDao commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: cardinal-milvus-io-2.4-9a07c1bca9-20240929
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server

A Milvus cluster that has been running for a long time, with 2 dataNodes and 4 queryNodes.

test steps

  1. collection laion_stable_9 has 100m-768d entities,
    {'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}], 'enable_dynamic_field': True}
  2. In the previous test, many L0 segments (1.8k) were left behind and there was no time to perform L0 compaction. In addition, queryNode memory was tight.
  3. concurrent requests: Flush + load + search + query + upsert
  4. dataNode keeps restarting with the error "failed to serialize merged stats log: shall not serialize zero length statslog list" (image)
  5. L0 compaction no longer seems to be triggered (image)
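
For a rough sense of why queryNode memory was tight in step 2, the raw vector data alone can be estimated from the schema. This is only a back-of-the-envelope sketch: it assumes "100m-768d" means 100 million entities with 768-dimensional float32 vectors, a single replica, and an even spread over the 4 queryNodes. It ignores index overhead, scalar fields, and the L0 delete data.

```python
# Back-of-the-envelope size of the raw float vectors in laion_stable_9.
# Assumptions (not stated explicitly in the report): 100 million entities,
# one replica, data spread evenly over the 4 queryNodes.
NUM_ENTITIES = 100_000_000
DIM = 768                 # from the schema: 'params': {'dim': 768}
BYTES_PER_FLOAT32 = 4     # FLOAT_VECTOR stores 32-bit floats
NUM_QUERY_NODES = 4


def raw_vector_bytes(num_entities, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Return the size in bytes of the uncompressed vector column."""
    return num_entities * dim * bytes_per_value


total = raw_vector_bytes(NUM_ENTITIES, DIM)
per_node = total / NUM_QUERY_NODES
print(f"raw vectors: {total / 1e9:.1f} GB total, "
      f"{per_node / 1e9:.1f} GB per queryNode")
# prints: raw vectors: 307.2 GB total, 76.8 GB per queryNode
```

The real footprint depends on the index type and on how many segments each queryNode actually loads, so treat these numbers as a lower bound on the vector payload, not a measurement.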

links

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

pods:

laion1b-test-2-etcd-0                                             1/1     Running     1 (4d23h ago)    38d    10.104.25.207   4am-node30   <none>           <none>
laion1b-test-2-etcd-1                                             1/1     Running     0                97d    10.104.30.186   4am-node38   <none>           <none>
laion1b-test-2-etcd-2                                             1/1     Running     0                299d   10.104.34.225   4am-node37   <none>           <none>
laion1b-test-2-milvus-datanode-7b8f94796b-9wb45                   1/1     Running     95 (11h ago)     10d    10.104.1.226    4am-node10   <none>           <none>
laion1b-test-2-milvus-datanode-7b8f94796b-hrvq2                   1/1     Running     54 (4d12h ago)   10d    10.104.20.103   4am-node22   <none>           <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-cfgkh                  1/1     Running     2 (8d ago)       10d    10.104.19.46    4am-node28   <none>           <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-m7xcf                  1/1     Running     0                10d    10.104.30.115   4am-node38   <none>           <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-mf42d                  1/1     Running     0                10d    10.104.32.231   4am-node39   <none>           <none>
laion1b-test-2-milvus-indexnode-7fc94494bd-q88hf                  1/1     Running     0                10d    10.104.16.70    4am-node21   <none>           <none>
laion1b-test-2-milvus-mixcoord-b484b7777-ggxcn                    1/1     Running     1 (8d ago)       10d    10.104.30.114   4am-node38   <none>           <none>
laion1b-test-2-milvus-proxy-787965c494-kzlrp                      1/1     Running     0                10d    10.104.32.230   4am-node39   <none>           <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-24wr8                1/1     Running     3 (5d15h ago)    10d    10.104.26.61    4am-node32   <none>           <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-5wp66                1/1     Running     2 (4d12h ago)    10d    10.104.24.214   4am-node29   <none>           <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-8gztc                1/1     Running     2 (4d23h ago)    10d    10.104.15.164   4am-node20   <none>           <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-g8lsb                1/1     Running     3 (4d23h ago)    10d    10.104.27.130   4am-node31   <none>           <none>
laion1b-test-2-pulsar-bookie-0                                    1/1     Running     0                299d   10.104.33.107   4am-node36   <none>           <none>
laion1b-test-2-pulsar-bookie-1                                    1/1     Running     0                102d   10.104.18.97    4am-node25   <none>           <none>
laion1b-test-2-pulsar-bookie-2                                    1/1     Running     0                38d    10.104.25.206   4am-node30   <none>           <none>
laion1b-test-2-pulsar-broker-0                                    1/1     Running     1 (171d ago)     180d   10.104.1.147    4am-node10   <none>           <none>
laion1b-test-2-pulsar-proxy-0                                     1/1     Running     0                168d   10.104.32.209   4am-node39   <none>           <none>
laion1b-test-2-pulsar-recovery-0                                  1/1     Running     1 (168d ago)     200d   10.104.31.87    4am-node34   <none>           <none>
laion1b-test-2-pulsar-zookeeper-0                                 1/1     Running     0                299d   10.104.29.87    4am-node35   <none>           <none>
laion1b-test-2-pulsar-zookeeper-1                                 1/1     Running     0                180d   10.104.21.196   4am-node24   <none>           <none>
laion1b-test-2-pulsar-zookeeper-2                                 1/1     Running     0                299d   10.104.34.229   4am-node37   <none>           <none>
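
The restart counts in this listing are what point at the dataNodes. As a small sketch, the fourth column can be pulled out and sorted with a few lines of Python. The column meaning (name, ready, status, restarts, age, ...) is assumed from the usual `kubectl get pods -o wide` layout, and only a handful of rows from the listing are repeated here.

```python
import re

# A few rows copied from the pod listing above. Column order is assumed to be
# name, ready, status, restarts, age, ip, node, ...
POD_LISTING = """\
laion1b-test-2-milvus-datanode-7b8f94796b-9wb45                   1/1     Running     95 (11h ago)     10d    10.104.1.226    4am-node10   <none>           <none>
laion1b-test-2-milvus-datanode-7b8f94796b-hrvq2                   1/1     Running     54 (4d12h ago)   10d    10.104.20.103   4am-node22   <none>           <none>
laion1b-test-2-milvus-mixcoord-b484b7777-ggxcn                    1/1     Running     1 (8d ago)       10d    10.104.30.114   4am-node38   <none>           <none>
laion1b-test-2-milvus-proxy-787965c494-kzlrp                      1/1     Running     0                10d    10.104.32.230   4am-node39   <none>           <none>
laion1b-test-2-milvus-querynode-1-7b7568c78b-24wr8                1/1     Running     3 (5d15h ago)    10d    10.104.26.61    4am-node32   <none>           <none>
"""

# name, ready (skipped), status, then the integer restart count.
ROW = re.compile(r"^(?P<name>\S+)\s+\S+\s+(?P<status>\S+)\s+(?P<restarts>\d+)")


def restart_counts(listing):
    """Map pod name -> restart count for every row that matches ROW."""
    counts = {}
    for line in listing.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[match.group("name")] = int(match.group("restarts"))
    return counts


counts = restart_counts(POD_LISTING)
for name, restarts in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{restarts:4d}  {name}")
```

On the rows above, the two dataNode pods come out on top with 95 and 54 restarts, while every other component has at most 3.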

Anything else?

No response

yanliang567 commented 1 month ago

/assign @XuanYang-cn /unassign

XuanYang-cn commented 1 month ago

Some segments stayed in the "Sealed" state even though the checkpoint (cp) had already advanced. DataNode failed to flush those sealed segments and kept panicking.

After rebooting DataCoord, those sealed segments became flushed, and there has been no panic since.
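
The restart loop described above can be illustrated with a toy model. This is not Milvus code: every name below (`serialize_merged_stats`, `flush_segment`, `NodePanic`, `run_with_restarts`) is hypothetical, and the only thing taken from the report is the error message. The sketch assumes that a restarted node retries the same sealed segment with the same empty stats-log list, which would explain why restarting the DataNode alone never clears the failure.

```python
class NodePanic(RuntimeError):
    """Hypothetical stand-in for a process-level panic."""


def serialize_merged_stats(stats_logs):
    # Mirrors the reported error text; the real serializer is not shown here.
    if not stats_logs:
        raise ValueError("shall not serialize zero length statslog list")
    return b"".join(stats_logs)


def flush_segment(segment_id, stats_logs):
    # In this toy model any serialization failure is treated as fatal.
    try:
        return serialize_merged_stats(stats_logs)
    except ValueError as err:
        raise NodePanic(
            f"segment {segment_id}: failed to serialize merged stats log: {err}"
        ) from err


def run_with_restarts(pending, max_restarts=3):
    """Retry flushing the same pending segments after every panic.

    Returns how many restarts happened before success or giving up.
    """
    restarts = 0
    while True:
        try:
            for segment_id, stats_logs in pending.items():
                flush_segment(segment_id, stats_logs)
            return restarts
        except NodePanic:
            restarts += 1
            if restarts >= max_restarts:
                return restarts


# A sealed segment with no stats logs fails identically on every restart.
print(run_with_restarts({1: []}))              # prints 3
# Once the segment has data to serialize, the loop ends without restarts.
print(run_with_restarts({1: [b"a", b"b"]}))    # prints 0
```

The point of the sketch is only that a deterministic failure on unchanged input loops until something outside the node changes the input, which fits the observation that the panics stopped once DataCoord was rebooted and the segments became flushed.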

XuanYang-cn commented 1 month ago

image

ThreadDao commented 1 week ago

hard to find the root cause