milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
29.74k stars 2.85k forks source link

[Bug]: datanode is keeping OOM, starting, OOM #24640

Closed altonchuzhan closed 1 year ago

altonchuzhan commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:
2.2.8, using chart 4.0.22
- Deployment mode(standalone or cluster):
cluster
- MQ type(rocksmq, pulsar or kafka):    
pulsar 
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
100G Limit, 80G required for datanode
- GPU: 
none
- Others:

Current Behavior

the datanode is keeping OOM (> 100G).

every query hangs to timeout after this issue happened.

the problem starts after I do some data inserting. I have tried restart datanode, reload collection. and I double checked every other component is healthy and didnt find suspicious log.

Expected Behavior

world peace

is there a way to clear the message storage? I can afford the newly data loss.

Steps To Reproduce

not sure how to reproduce it

Milvus Log

the end of oom killed datanode log:

.... [2023/06/02 19:03:46.978 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203155864783] ["stats bin logs"=1] [2023/06/02 19:03:46.985 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203124604278] [time=268.333275ms] [size=1256086] [2023/06/02 19:03:46.985 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=439652203217157725] [numRows=113576] [2023/06/02 19:03:46.985 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203169329753] [collectionID=439652203113153329] [partitionID=439652203144399841] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:46.985 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203169329753] ["stats bin logs"=1] [2023/06/02 19:03:46.993 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203124403758] [time=245.403582ms] [size=1250617] [2023/06/02 19:03:46.993 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=440633149184427333] [numRows=113853] [2023/06/02 19:03:46.993 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203194246102] [collectionID=439652203113153329] [partitionID=439652203174754870] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:46.993 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203194246102] ["stats bin logs"=1] [2023/06/02 19:03:47.022 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203168927997] [time=163.419482ms] [size=1251146] [2023/06/02 19:03:47.022 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=440633149176606994] [numRows=113316] [2023/06/02 19:03:47.022 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203217157725] [collectionID=439652203113153329] [partitionID=439652203174754914] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.022 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203217157725] ["stats bin logs"=1] [2023/06/02 19:03:47.043 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203224592172] [time=216.219345ms] [size=1246845] [2023/06/02 19:03:47.043 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203224592392] [numRows=113313] [2023/06/02 19:03:47.043 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=440633149184427333] [collectionID=439652203113153329] [partitionID=439652203221176155] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.043 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=440633149184427333] ["stats bin logs"=1] [2023/06/02 19:03:47.089 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203187815593] [time=173.266836ms] [size=1248554] [2023/06/02 19:03:47.089 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=441893573302587254] [numRows=137446] [2023/06/02 19:03:47.089 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=440633149176606994] [collectionID=439652203113153329] [partitionID=439652203221176111] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.089 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=440633149176606994] ["stats bin logs"=1] [2023/06/02 19:03:47.098 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203147817657] [time=209.936665ms] [size=1251852] [2023/06/02 19:03:47.098 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203174353192] [numRows=113277] [2023/06/02 19:03:47.098 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203224592392] [collectionID=439652203113153329] [partitionID=439652203221176220] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.099 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203224592392] ["stats bin logs"=1] [2023/06/02 19:03:47.117 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149197862091] [time=164.506157ms] [size=1253219] [2023/06/02 19:03:47.117 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=439652203198065412] [numRows=113751] [2023/06/02 19:03:47.117 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=441893573302587254] [collectionID=439652203113153329] [partitionID=439652203135753710] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.117 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=441893573302587254] ["stats bin logs"=1] [2023/06/02 19:03:47.139 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203130030523] [time=206.226141ms] [size=1248345] [2023/06/02 19:03:47.139 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=441893573302582326] [numRows=97733] [2023/06/02 19:03:47.139 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203174353192] [collectionID=439652203113153329] [partitionID=439652203144399812] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.139 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203174353192] ["stats bin logs"=1] [2023/06/02 19:03:47.164 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203139374573] [time=245.167139ms] [size=1255050] [2023/06/02 19:03:47.164 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=440633149194452631] [numRows=113748] [2023/06/02 19:03:47.164 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203198065412] [collectionID=439652203113153329] [partitionID=439652203174754894] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.164 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203198065412] ["stats bin logs"=1] [2023/06/02 19:03:47.172 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203169329753] [time=186.825724ms] [size=1249911] [2023/06/02 19:03:47.172 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203140982928] [numRows=113213] [2023/06/02 19:03:47.172 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=441893573302582326] [collectionID=439652203113153329] [partitionID=439652203128020577] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.172 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=441893573302582326] ["stats bin logs"=1] [2023/06/02 19:03:47.174 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203155864783] [time=196.229443ms] [size=1246713] [2023/06/02 19:03:47.174 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=440633149205281463] [numRows=113334] [2023/06/02 19:03:47.174 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=440633149194452631] [collectionID=439652203113153329] [partitionID=439652203221176101] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.174 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=440633149194452631] ["stats bin logs"=1] [2023/06/02 19:03:47.186 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149184427333] [time=143.103271ms] [size=1255546] [2023/06/02 19:03:47.186 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203192437809] [numRows=113255] [2023/06/02 19:03:47.186 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203140982928] [collectionID=439652203113153329] [partitionID=439652203135753710] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.186 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203140982928] ["stats bin logs"=1] [2023/06/02 19:03:47.192 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203194246102] [time=198.462651ms] [size=1247860] [2023/06/02 19:03:47.192 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=439652203164306134] [numRows=113351] [2023/06/02 19:03:47.192 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=440633149205281463] [collectionID=439652203113153329] [partitionID=440633149199466349] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.192 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=440633149205281463] ["stats bin logs"=1] [2023/06/02 19:03:47.256 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203217157725] [time=233.759768ms] [size=1252491] [2023/06/02 19:03:47.256 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203199270735] [numRows=113826] [2023/06/02 19:03:47.256 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203192437809] [collectionID=439652203113153329] [partitionID=439652203174754870] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.256 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203192437809] ["stats bin logs"=1] [2023/06/02 19:03:47.274 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149176606994] [time=185.152807ms] [size=1249624] [2023/06/02 19:03:47.274 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=439652203113154286] [numRows=60] [2023/06/02 19:03:47.274 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203164306134] [collectionID=439652203113153329] [partitionID=439652203144399774] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.274 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203164306134] ["stats bin logs"=1] [2023/06/02 19:03:47.311 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149194452631] [time=136.927388ms] [size=1254388] [2023/06/02 19:03:47.311 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=441890336846682663] [numRows=151186] [2023/06/02 19:03:47.311 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203199270735] [collectionID=439652203113153329] [partitionID=439652203174754870] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.311 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203199270735] ["stats bin logs"=1] [2023/06/02 19:03:47.322 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=441893573302582326] [time=149.664934ms] [size=1077778] [2023/06/02 19:03:47.322 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=440633149189640635] [numRows=113358] [2023/06/02 19:03:47.322 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203113154286] [collectionID=439652203113153329] [partitionID=439652203113154277] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.322 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203113154286] ["stats bin logs"=1] [2023/06/02 19:03:47.324 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203224592392] [time=225.852233ms] [size=1249591] [2023/06/02 19:03:47.324 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203214746646] [numRows=113431] [2023/06/02 19:03:47.324 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=441890336846682663] [collectionID=439652203113153329] [partitionID=440633149199466391] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.324 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=441890336846682663] ["stats bin logs"=1] [2023/06/02 19:03:47.333 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=441893573302587254] [time=215.73827ms] [size=1515724] [2023/06/02 19:03:47.333 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [segmentID=439652203215349565] [numRows=113501] [2023/06/02 19:03:47.333 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=440633149189640635] [collectionID=439652203113153329] [partitionID=439652203221176175] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.333 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=440633149189640635] ["stats bin logs"=1] [2023/06/02 19:03:47.340 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203113154286] [time=18.239296ms] [size=662] [2023/06/02 19:03:47.340 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=440633149170388339] [numRows=96193] [2023/06/02 19:03:47.340 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203214746646] [collectionID=439652203113153329] [partitionID=439652203174754851] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.341 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203214746646] ["stats bin logs"=1] [2023/06/02 19:03:47.349 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203174353192] [time=210.054169ms] [size=1249194] [2023/06/02 19:03:47.349 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203215349565] [collectionID=439652203113153329] [partitionID=439652203174754807] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [startPosition=] [endPosition=] [recoverTs=441904113015586817] [importing=false] [2023/06/02 19:03:47.349 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203215349565] ["stats bin logs"=1] [2023/06/02 19:03:47.373 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203140982928] [time=186.429922ms] [size=1248488] [2023/06/02 19:03:47.373 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203213138891] [numRows=113654] [2023/06/02 19:03:47.373 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=440633149170388339] [collectionID=439652203113153329] [partitionID=439652203221176135] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.373 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=440633149170388339] ["stats bin logs"=1] [2023/06/02 19:03:47.386 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149205281463] [time=193.955631ms] [size=1249823] [2023/06/02 19:03:47.386 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203146209619] [numRows=113176] [2023/06/02 19:03:47.386 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203213138891] [collectionID=439652203113153329] [partitionID=439652203174754793] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.386 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203213138891] ["stats bin logs"=1] [2023/06/02 19:03:47.422 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203164306134] [time=147.750548ms] [size=1250010] [2023/06/02 19:03:47.422 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203191030572] [numRows=113177] [2023/06/02 19:03:47.422 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203146209619] [collectionID=439652203113153329] [partitionID=439652203144399841] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.422 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203146209619] ["stats bin logs"=1] [2023/06/02 19:03:47.457 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203198065412] [time=292.417849ms] [size=1254421] [2023/06/02 19:03:47.457 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=441893573302565962] [numRows=98215] [2023/06/02 19:03:47.457 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203191030572] [collectionID=439652203113153329] [partitionID=439652203174754914] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.457 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203191030572] ["stats bin logs"=1] [2023/06/02 19:03:47.459 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203192437809] [time=202.725354ms] [size=1248951] [2023/06/02 19:03:47.459 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=439652203197462357] [numRows=112846] [2023/06/02 19:03:47.459 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=441893573302565962] [collectionID=439652203113153329] [partitionID=439652203128020563] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.459 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=441893573302565962] ["stats bin logs"=1] [2023/06/02 19:03:47.482 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149170388339] [time=109.417318ms] [size=1060795] [2023/06/02 19:03:47.482 +00:00] [INFO] [datanode/data_sync_service.go:249] ["recover sealed segments form checkpoints"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [segmentID=440633149179414343] [numRows=113524] [2023/06/02 19:03:47.482 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=439652203197462357] [collectionID=439652203113153329] [partitionID=439652203174754914] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.482 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=439652203197462357] ["stats bin logs"=1] [2023/06/02 19:03:47.482 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149189640635] [time=149.049023ms] [size=1250087] [2023/06/02 19:03:47.482 +00:00] [INFO] [datanode/channel_meta.go:199] ["adding segment"] [type=Flushed] [segmentID=440633149179414343] [collectionID=439652203113153329] [partitionID=439652203221176200] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [startPosition=] [endPosition=] [recoverTs=441904101481512961] [importing=false] [2023/06/02 19:03:47.482 +00:00] [INFO] [datanode/channel_meta.go:299] ["begin to init pk bloom filter"] [segmentID=440633149179414343] ["stats bin logs"=1] [2023/06/02 19:03:47.509 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203214746646] [time=168.161764ms] [size=1250892] [2023/06/02 19:03:47.515 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203199270735] [time=203.922071ms] [size=1255248] [2023/06/02 19:03:47.522 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203215349565] [time=172.385263ms] [size=1251664] [2023/06/02 19:03:47.531 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203213138891] [time=145.321288ms] [size=1253351] [2023/06/02 19:03:47.586 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=441890336846682663] [time=261.747424ms] [size=1667246] time="2023-06-02T19:03:47Z" level=info msg="[Connected consumer]" consumerID=5 name=hlgto subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" time="2023-06-02T19:03:47Z" level=info msg="[Created consumer]" consumerID=5 name=hlgto subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" [2023/06/02 19:03:47.609 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:56] ["datanode begin to seek"] ["seek msgID"="\u0008\ufffd\ufffd\u0006\u0010\ufffdH\u0018\u0000 \u0000"] [pchannel=by-dev-rootcoord-dml_13] [vchannel=by-dev-rootcoord-dml_13_439652203113153329v1] [position=2023/06/02 18:26:51.589 +00:00] [tsLag=36m56.02024883s] ["collection ID"=439652203113153329] [2023/06/02 19:03:47.609 +00:00] [INFO] [msgstream/mq_msgstream.go:880] ["MsgStream begin to seek start msg: "] [channel=by-dev-rootcoord-dml_13] [MessageID="CJjmBhDaSBgAIAA="] time="2023-06-02T19:03:47Z" level=info msg="Broker notification of Closed consumer: 5" local_addr="172.20.35.67:34516" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-02T19:03:47Z" level=info msg="[Reconnecting to broker in 107.189728ms]" consumerID=5 name=hlgto subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" [2023/06/02 19:03:47.615 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203146209619] [time=192.886442ms] [size=1248080] [2023/06/02 19:03:47.631 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=441893573302565962] [time=172.415622ms] [size=1083093] [2023/06/02 19:03:47.660 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203197462357] [time=178.206527ms] [size=1244441] [2023/06/02 19:03:47.697 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=440633149179414343] [time=214.33075ms] [size=1251918] time="2023-06-02T19:03:47Z" level=info msg="[Connected consumer]" consumerID=5 name=hlgto subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" time="2023-06-02T19:03:47Z" level=info msg="[Reconnected consumer to broker]" consumerID=5 name=hlgto subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" [2023/06/02 19:03:47.719 +00:00] [INFO] [msgstream/mq_msgstream.go:890] ["MsgStream seek finished"] [channel=by-dev-rootcoord-dml_13] [2023/06/02 19:03:47.725 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:67] ["datanode seek successfully"] ["seek msgID"="\u0008\ufffd\ufffd\u0006\u0010\ufffdH\u0018\u0000 \u0000"] [pchannel=by-dev-rootcoord-dml_13] [vchannel=by-dev-rootcoord-dml_13_439652203113153329v1] [position=2023/06/02 18:26:51.589 +00:00] [tsLag=36m56.136456496s] ["collection ID"=439652203113153329] [elapse=116.209618ms] [2023/06/02 19:03:47.725 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:79] ["datanode AsConsumer"] ["physical channel"=by-dev-rootcoord-dml_13] [subName=by-dev-dataNode-2557-by-dev-rootcoord-dml_13_439652203113153329v1] ["collection ID"=439652203113153329] [2023/06/02 19:03:47.725 +00:00] [INFO] [datanode/flow_graph_dd_node.go:376] ["ddNode convert vChannel to pChannel"] [vChannelName=by-dev-rootcoord-dml_13_439652203113153329v1] [pChannelName=by-dev-rootcoord-dml_13] time="2023-06-02T19:03:47Z" level=info msg="[Created producer]" cnx="172.20.35.67:34516 -> 192.168.220.30:6650" producerID=5 producer_name=milvus-mining-pulsar-19-13666 topic="persistent://public/default/by-dev-rootcoord-delta_13" [2023/06/02 19:03:47.729 +00:00] [INFO] [datanode/flow_graph_dd_node.go:388] ["datanode AsProducer"] [DeltaChannelName=by-dev-rootcoord-delta_13] [2023/06/02 19:03:47.729 +00:00] [INFO] [datanode/flow_graph_dd_node.go:413] ["ddNode add sealed and growing segments"] [collectionID=439652203113153329] ["No. sealed segments"=663] ["No. growing segments"=0] time="2023-06-02T19:03:47Z" level=info msg="[Created producer]" cnx="172.20.35.67:34516 -> 192.168.220.30:6650" producerID=6 producer_name=milvus-mining-pulsar-19-13667 topic="persistent://public/default/by-dev-datacoord-timetick-channel" [2023/06/02 19:03:47.731 +00:00] [INFO] [datanode/flow_graph_insert_buffer_node.go:683] ["datanode AsProducer"] [TimeTickChannelName=by-dev-datacoord-timetick-channel] [2023/06/02 19:03:47.731 +00:00] [INFO] [datanode/data_sync_service.go:141] ["dataSyncService starting flow graph"] [collectionID=439652203113153329] [vChanName=by-dev-rootcoord-dml_13_439652203113153329v1] [2023/06/02 19:03:47.731 +00:00] [INFO] [datanode/data_node.go:410] ["handle put event: new data sync service success"] [vChanName=by-dev-rootcoord-dml_13_439652203113153329v1] [2023/06/02 19:03:47.739 +00:00] [INFO] [datanode/data_node.go:442] ["handle put event success"] [key=channelwatch/2557/by-dev-rootcoord-dml_13_439652203113153329v1] [state=WatchSuccess] [vChanName=by-dev-rootcoord-dml_13_439652203113153329v1] [2023/06/02 19:03:47.739 +00:00] [INFO] [datanode/data_node.go:345] ["DataNode received a PUT event with an end State"] [state=WatchSuccess] [2023/06/02 19:03:47.812 +00:00] [INFO] [datanode/channel_meta.go:358] ["Successfully load pk stats"] [segmentID=439652203191030572] [time=355.601121ms] [size=1248091] time="2023-06-02T19:03:47Z" level=info msg="[Connected consumer]" consumerID=6 name=felne subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-02T19:03:47Z" level=info msg="[Created consumer]" consumerID=6 name=felne subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" [2023/06/02 19:03:47.845 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:56] ["datanode begin to seek"] ["seek msgID"="\u0008\ufffd\ufffd\u0006\u0010\ufffdG\u0018\u0000 \u0000"] [pchannel=by-dev-rootcoord-dml_12] [vchannel=by-dev-rootcoord-dml_12_439652203113153329v0] [position=2023/06/02 18:26:07.590 +00:00] [tsLag=37m40.255337754s] ["collection ID"=439652203113153329] [2023/06/02 19:03:47.845 +00:00] [INFO] [msgstream/mq_msgstream.go:880] ["MsgStream begin to seek start msg: "] [channel=by-dev-rootcoord-dml_12] [MessageID="CI7mBhCERxgAIAA="] time="2023-06-02T19:03:47Z" level=info msg="Broker notification of Closed consumer: 6" local_addr="172.20.35.67:34516" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-02T19:03:47Z" level=info msg="[Reconnecting to broker in 112.017806ms]" consumerID=6 name=felne subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-02T19:03:47Z" level=info msg="[Connected consumer]" consumerID=6 name=felne subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-02T19:03:47Z" level=info msg="[Reconnected consumer to broker]" consumerID=6 name=felne subscription=by-dev-dataNode-2557-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" [2023/06/02 19:03:47.976 +00:00] [INFO] [msgstream/mq_msgstream.go:890] ["MsgStream seek finished"] [channel=by-dev-rootcoord-dml_12] [2023/06/02 19:03:48.018 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:67] ["datanode seek successfully"] ["seek msgID"="\u0008\ufffd\ufffd\u0006\u0010\ufffdG\u0018\u0000 \u0000"] [pchannel=by-dev-rootcoord-dml_12] [vchannel=by-dev-rootcoord-dml_12_439652203113153329v0] [position=2023/06/02 18:26:07.590 +00:00] [tsLag=37m40.428154485s] ["collection ID"=439652203113153329] [elapse=172.818703ms] [2023/06/02 19:03:48.018 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:79] ["datanode AsConsumer"] ["physical channel"=by-dev-rootcoord-dml_12] [subName=by-dev-dataNode-2557-by-dev-rootcoord-dml_12_439652203113153329v0] ["collection ID"=439652203113153329] [2023/06/02 19:03:48.018 +00:00] [INFO] [datanode/flow_graph_dd_node.go:376] ["ddNode convert vChannel to pChannel"] [vChannelName=by-dev-rootcoord-dml_12_439652203113153329v0] [pChannelName=by-dev-rootcoord-dml_12] time="2023-06-02T19:03:48Z" level=info msg="[Created producer]" cnx="172.20.35.67:34516 -> 192.168.220.30:6650" producerID=7 producer_name=milvus-mining-pulsar-19-13668 topic="persistent://public/default/by-dev-rootcoord-delta_12" [2023/06/02 19:03:48.075 +00:00] [INFO] [datanode/flow_graph_dd_node.go:388] ["datanode AsProducer"] [DeltaChannelName=by-dev-rootcoord-delta_12] [2023/06/02 19:03:48.075 +00:00] [INFO] [datanode/flow_graph_dd_node.go:413] ["ddNode add sealed and growing segments"] [collectionID=439652203113153329] ["No. sealed segments"=660] ["No. growing segments"=0] time="2023-06-02T19:03:48Z" level=info msg="[Created producer]" cnx="172.20.35.67:34516 -> 192.168.220.30:6650" producerID=8 producer_name=milvus-mining-pulsar-19-13669 topic="persistent://public/default/by-dev-datacoord-timetick-channel" [2023/06/02 19:03:48.130 +00:00] [INFO] [datanode/flow_graph_insert_buffer_node.go:683] ["datanode AsProducer"] [TimeTickChannelName=by-dev-datacoord-timetick-channel] [2023/06/02 19:03:48.130 +00:00] [INFO] [datanode/data_sync_service.go:141] ["dataSyncService starting flow graph"] [collectionID=439652203113153329] [vChanName=by-dev-rootcoord-dml_12_439652203113153329v0] [2023/06/02 19:03:48.130 +00:00] [INFO] [datanode/data_node.go:410] ["handle put event: new data sync service success"] [vChanName=by-dev-rootcoord-dml_12_439652203113153329v0] [2023/06/02 19:03:48.137 +00:00] [INFO] [datanode/data_node.go:442] ["handle put event success"] [key=channelwatch/2557/by-dev-rootcoord-dml_12_439652203113153329v0] [state=WatchSuccess] [vChanName=by-dev-rootcoord-dml_12_439652203113153329v0] [2023/06/02 19:03:48.137 +00:00] [INFO] [datanode/data_node.go:345] ["DataNode received a PUT event with an end State"] [state=WatchSuccess] [2023/06/02 19:03:48.221 +00:00] [INFO] [datanode/flow_graph_time_tick_node.go:115] ["UpdateChannelCheckpoint success"] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [cpTs=441904101533941761] [cpTime=2023/06/02 18:26:07.790 +00:00]

Anything else?

No response

altonchuzhan commented 1 year ago

add a graph of memory: image

altonchuzhan commented 1 year ago

add pprof image heap.zip

altonchuzhan commented 1 year ago

image

xiaofan-luan commented 1 year ago

@altonchuzhan

  1. How many collections/partitions
  2. How do you create pulsar topics?
  3. Can we have logs of datacoord and datanode to check why flush is not triggered? By default flush is triggered if: segment reaches 0.23 * segment size every 10 minutes under global memory pressue.

To figure out this issue, you should look at the code at dataSyncService at datanode

xiaofan-luan commented 1 year ago

/assign @bigsheeper

xiaofan-luan commented 1 year ago

from the datanode log, everything seems to be working as expected. I'm doubting it is caused by too many collections/partitions

altonchuzhan commented 1 year ago

from the datanode log, everything seems to be working as expected. I'm doubting it is caused by too many collections/partitions

one big collection with 78 partition with 1~10 million vector in each partition

about pulsar topic, you remind me of one action I have done to solve "time="2023-06-02T02:00:51Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" topic="persistent://public/default/by-dev-rootcoord-dml_12"

the command is bin/pulsar-admin topics unsubscribe "persistent://public/default/by-dev-rootcoord-dml_1" -s by-dev-dataNode-2131-by-dev-rootcoord-dml_1_440633149338133674v1

maybe I should have a try to restart rootcoord or datacoord

altonchuzhan commented 1 year ago

summary the log, after restart , the "Cannot create producer on topic with backlog quota exceeded" log back and accompany with many connection error.

rootcoord: time="2023-06-04T05:16:51Z" level=info msg="Broker notification of Closed producer: 1" local_addr="172.20.35.245:49102" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:16:51Z" level=warning msg="[Connection was closed]" cnx="172.20.35.245:49102 -> 192.168.220.30:6650" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:51Z" level=info msg="[Reconnecting to broker in 108.824695ms]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:51Z" level=info msg="[Reconnecting to broker in 109.140546ms]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:51Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:51Z" level=info msg="[Reconnecting to broker in 223.465899ms]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:51Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:51Z" level=info msg="[Reconnecting to broker in 208.253412ms]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:52Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:52Z" level=info msg="[Reconnecting to broker in 453.367203ms]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:52Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:52Z" level=info msg="[Reconnecting to broker in 440.545209ms]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:52Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:52Z" level=info msg="[Reconnecting to broker in 889.086429ms]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:52Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:52Z" level=info msg="[Reconnecting to broker in 883.781628ms]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:53Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:53Z" level=info msg="[Reconnecting to broker in 1.767344256s]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:53Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:53Z" level=info msg="[Reconnecting to broker in 1.744770488s]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" [2023/06/04 05:16:54.582 +00:00] [INFO] [rootcoord/root_coord.go:988] ["received request to describe collection"] [traceID=1532067647936bbf] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [2023/06/04 05:16:54.582 +00:00] [INFO] [rootcoord/root_coord.go:1021] ["done to describe collection"] [traceID=1532067647936bbf] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [collection_id=440633149273728439] time="2023-06-04T05:16:55Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:55Z" level=info msg="[Reconnecting to broker in 3.741955681s]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:55Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:55Z" level=info msg="[Reconnecting to broker in 3.446566097s]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:58Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:58Z" level=info msg="[Reconnecting to broker in 7.040909526s]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:58Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:16:58Z" level=info msg="[Reconnecting to broker in 6.549790947s]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:05Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:05Z" level=info msg="[Reconnecting to broker in 14.259479049s]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:05Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:17:05Z" level=info msg="[Reconnecting to broker in 14.816015229s]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:17:19Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:19Z" level=info msg="[Reconnecting to broker in 26.521887314s]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:20Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:17:20Z" level=info msg="[Reconnecting to broker in 28.902466475s]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" [2023/06/04 05:17:24.582 +00:00] [INFO] [rootcoord/root_coord.go:988] ["received request to describe collection"] [traceID=265fbdab0c3c1ba0] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [2023/06/04 05:17:24.582 +00:00] [INFO] [rootcoord/root_coord.go:1021] ["done to describe collection"] [traceID=265fbdab0c3c1ba0] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [collection_id=440633149273728439] [2023/06/04 05:17:37.664 +00:00] [INFO] [rootcoord/import_manager.go:177] ["(in cleanupLoop) trying to expire old tasks from memory and Etcd"] [2023/06/04 05:17:37.668 +00:00] [INFO] [rootcoord/import_manager.go:180] ["(in cleanupLoop) start removing bad import segments"] [2023/06/04 05:17:37.670 +00:00] [INFO] [rootcoord/import_manager.go:182] ["(in cleanupLoop) start cleaning hanging busy DataNode"] time="2023-06-04T05:17:46Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:46Z" level=info msg="[Reconnecting to broker in 54.198483753s]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:49Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:17:49Z" level=info msg="[Reconnecting to broker in 52.753638008s]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" [2023/06/04 05:17:54.581 +00:00] [INFO] [rootcoord/root_coord.go:988] ["received request to describe collection"] [traceID=679ba61455826d3] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [2023/06/04 05:17:54.581 +00:00] [INFO] [rootcoord/root_coord.go:1021] ["done to describe collection"] [traceID=679ba61455826d3] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [collection_id=440633149273728439] [2023/06/04 05:18:24.582 +00:00] [INFO] [rootcoord/root_coord.go:988] ["received request to describe collection"] [traceID=7827ec310da29039] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [2023/06/04 05:18:24.582 +00:00] [INFO] [rootcoord/root_coord.go:1021] ["done to describe collection"] [traceID=7827ec310da29039] ["collection name"=] [id=440633149273728439] [msgID=0] [ts=18446744073709551615] [allowUnavailable=true] [collection_id=440633149273728439] time="2023-06-04T05:18:40Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:18:40Z" level=info msg="[Reconnecting to broker in 1m9.280931565s]" producerID=1 producer_name=milvus-mining-pulsar-19-16277 topic="persistent://public/default/by-dev-rootcoord-dml_0" [2023/06/04 05:18:40.672 +00:00] [WARN] [rootcoord/quota_center.go:608] ["QuotaCenter: DataNode memory to low water level, limit writing rate"] [Node=datanode-2860] [UsedMem=95216676864] [TotalMem=107374182400] [memoryWaterLevel=0.8867744064331055] [memoryLowWaterLevel=0.85] time="2023-06-04T05:18:42Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:18:42Z" level=info msg="[Reconnecting to broker in 1m2.347003086s]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" [2023/06/04 05:18:43.672 +00:00] [WARN] [rootcoord/quota_center.go:608] ["QuotaCenter: DataNode memory to low water level, limit writing rate"] [Node=datanode-2860] [UsedMem=99093413888] [TotalMem=107374182400] [memoryWaterLevel=0.9228793334960937] [memoryLowWaterLevel=0.85]

datacoord:

[2023/06/04 05:17:53.824 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:17:53.825 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=440633149273728439] [partitionID=-1] [2023/06/04 05:17:54.580 +00:00] [INFO] [datacoord/services.go:784] ["get recovery info request received"] [collectionID=440633149273728439] [partitionIDs="[]"] [2023/06/04 05:17:54.583 +00:00] [INFO] [datacoord/handler.go:119] [GetQueryVChanPositions] [collectionID=440633149273728439] [channel=by-dev-rootcoord-dml_2_440633149273728439v0] [numOfSegments=21] [2023/06/04 05:17:54.583 +00:00] [INFO] [datacoord/handler.go:261] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_2_440633149273728439v0] [posTs=441907383838703617] [posTime=2023/06/02 21:54:48.789 +00:00] [2023/06/04 05:17:54.583 +00:00] [INFO] [datacoord/services.go:819] ["datacoord append channelInfo in GetRecoveryInfo"] [collectionID=440633149273728439] [partitionIDs="[]"] [channelInfo="collectionID:440633149273728439 channelName:\"by-dev-rootcoord-dml_2_440633149273728439v0\" seek_position:<channel_name:\"by-dev-rootcoord-dml_2_440633149273728439v0\" msgID:\"\010\364\353\006\020\370\324\001\030\000 \000\" msgGroup:\"by-dev-dataNode-2860-by-dev-rootcoord-dml_2_440633149273728439v0\" timestamp:441907383838703617 > flushedSegmentIds:440633149274231606 flushedSegmentIds:440633149274232803 flushedSegmentIds:440633149274239744 flushedSegmentIds:440633149274224900 flushedSegmentIds:440633149274241602 flushedSegmentIds:440633149274240590 flushedSegmentIds:440633149274223740 flushedSegmentIds:440633149274229761 flushedSegmentIds:440633149274226364 flushedSegmentIds:440633149274239596 flushedSegmentIds:440633149274228660 flushedSegmentIds:440633149274229377 flushedSegmentIds:440633149274235170 flushedSegmentIds:440633149274238653 flushedSegmentIds:440633149274231458 flushedSegmentIds:440633149274222372 flushedSegmentIds:440633149274221592 flushedSegmentIds:440633149274221459 flushedSegmentIds:440633149274242302 flushedSegmentIds:440633149274230796 flushedSegmentIds:440633149274224334 "] [2023/06/04 05:17:54.584 +00:00] [INFO] [datacoord/handler.go:119] [GetQueryVChanPositions] [collectionID=440633149273728439] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [numOfSegments=21] [2023/06/04 05:17:54.584 +00:00] [INFO] [datacoord/handler.go:261] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [posTs=441907320871976961] [posTime=2023/06/02 21:50:48.590 +00:00] [2023/06/04 05:17:54.584 +00:00] [INFO] [datacoord/services.go:819] ["datacoord append channelInfo in GetRecoveryInfo"] [collectionID=440633149273728439] [partitionIDs="[]"] [channelInfo="collectionID:440633149273728439 channelName:\"by-dev-rootcoord-dml_3_440633149273728439v1\" seek_position:<channel_name:\"by-dev-rootcoord-dml_3_440633149273728439v1\" msgID:\"\010\377\353\006\020\323\301\001\030\000 \000\" msgGroup:\"by-dev-dataNode-2860-by-dev-rootcoord-dml_3_440633149273728439v1\" timestamp:441907320871976961 > flushedSegmentIds:440633149274231596 flushedSegmentIds:440633149274230795 flushedSegmentIds:440633149274238663 flushedSegmentIds:440633149274221460 flushedSegmentIds:440633149274238650 flushedSegmentIds:440633149274233402 flushedSegmentIds:440633149274226880 flushedSegmentIds:440633149274222371 flushedSegmentIds:440633149274240589 flushedSegmentIds:440633149274242132 flushedSegmentIds:440633149274239743 flushedSegmentIds:440633149274224340 flushedSegmentIds:440633149274229378 flushedSegmentIds:440633149274231605 flushedSegmentIds:440633149274223711 flushedSegmentIds:440633149274240599 flushedSegmentIds:440633149274230198 flushedSegmentIds:440633149274234882 flushedSegmentIds:440633149274224901 flushedSegmentIds:440633149274228821 flushedSegmentIds:440633149274221591 "] [2023/06/04 05:17:59.894 +00:00] [INFO] [datacoord/channel_checker.go:152] ["remove timer for channel"] [channel=by-dev-rootcoord-dml_0_440633149338133674v0] [timerCount=0] [2023/06/04 05:17:59.894 +00:00] [WARN] [datacoord/channel_checker.go:121] ["timeout and stop timer: wait for channel ACK timeout"] ["watch state"=ToWatch] [nodeID=2860] ["channel name"=by-dev-rootcoord-dml_0_440633149338133674v0] ["timeout interval"=2m0s] [runningTimerCount=0] [2023/06/04 05:17:59.894 +00:00] [INFO] [datacoord/channel_manager.go:677] ["receive timeout acks from state watcher"] [state=3] [nodeID=2860] ["channel name"=by-dev-rootcoord-dml_0_440633149338133674v0] [2023/06/04 05:17:59.894 +00:00] [WARN] [datacoord/channel_manager.go:627] ["datanode watch channel failed or timeout, will release"] [nodeID=2860] [channel=by-dev-rootcoord-dml_0_440633149338133674v0] [2023/06/04 05:17:59.894 +00:00] [INFO] [datacoord/channel_checker.go:152] ["remove timer for channel"] [channel=by-dev-rootcoord-dml_1_440633149338133674v1] [timerCount=1] [2023/06/04 05:17:59.894 +00:00] [WARN] [datacoord/channel_checker.go:121] ["timeout and stop timer: wait for channel ACK timeout"] ["watch state"=ToWatch] [nodeID=2860] ["channel name"=by-dev-rootcoord-dml_1_440633149338133674v1] ["timeout interval"=2m0s] [runningTimerCount=0] [2023/06/04 05:17:59.894 +00:00] [INFO] [datacoord/handler.go:61] [GetDataVChanPositions] [collectionID=440633149338133674] [channel=by-dev-rootcoord-dml_0_440633149338133674v0] [numOfSegments=84] [2023/06/04 05:17:59.895 +00:00] [INFO] [datacoord/handler.go:235] ["NEITHER segment position or channel start position are found, setting channel seek position to collection start position"] [channel=by-dev-rootcoord-dml_0_440633149338133674v0] [posTs=441873538680094725] [posTime=2023/06/01 10:02:59.755 +00:00] [2023/06/04 05:17:59.895 +00:00] [INFO] [datacoord/handler.go:279] ["channel seek position set from collection start position"] [channel=by-dev-rootcoord-dml_0_440633149338133674v0] [posTs=441873538680094725] [posTime=2023/06/01 10:02:59.755 +00:00] [2023/06/04 05:17:59.895 +00:00] [INFO] [datacoord/channel_checker.go:110] ["timer started"] ["watch state"=ToRelease] [nodeID=2860] ["channel name"=by-dev-rootcoord-dml_0_440633149338133674v0] ["check interval"=2m0s] [2023/06/04 05:17:59.900 +00:00] [INFO] [datacoord/channel_manager.go:677] ["receive timeout acks from state watcher"] [state=3] [nodeID=2860] ["channel name"=by-dev-rootcoord-dml_1_440633149338133674v1] [2023/06/04 05:17:59.900 +00:00] [WARN] [datacoord/channel_manager.go:627] ["datanode watch channel failed or timeout, will release"] [nodeID=2860] [channel=by-dev-rootcoord-dml_1_440633149338133674v1] [2023/06/04 05:17:59.900 +00:00] [INFO] [datacoord/handler.go:61] [GetDataVChanPositions] [collectionID=440633149338133674] [channel=by-dev-rootcoord-dml_1_440633149338133674v1] [numOfSegments=84] [2023/06/04 05:17:59.900 +00:00] [INFO] [datacoord/handler.go:235] ["NEITHER segment position or channel start position are found, setting channel seek position to collection start position"] [channel=by-dev-rootcoord-dml_1_440633149338133674v1] [posTs=441873538680094725] [posTime=2023/06/01 10:02:59.755 +00:00] [2023/06/04 05:17:59.900 +00:00] [INFO] [datacoord/handler.go:279] ["channel seek position set from collection start position"] [channel=by-dev-rootcoord-dml_1_440633149338133674v1] [posTs=441873538680094725] [posTime=2023/06/01 10:02:59.755 +00:00] [2023/06/04 05:17:59.900 +00:00] [INFO] [datacoord/channel_checker.go:110] ["timer started"] ["watch state"=ToRelease] [nodeID=2860] ["channel name"=by-dev-rootcoord-dml_1_440633149338133674v1] ["check interval"=2m0s] [2023/06/04 05:17:59.904 +00:00] [INFO] [datacoord/channel_manager.go:720] ["tickle update, timer delay"] [channel=by-dev-rootcoord-dml_0_440633149338133674v0] [progress=0] [2023/06/04 05:17:59.904 +00:00] [INFO] [datacoord/channel_manager.go:720] ["tickle update, timer delay"] [channel=by-dev-rootcoord-dml_1_440633149338133674v1] [progress=0] [2023/06/04 05:18:24.581 +00:00] [INFO] [datacoord/services.go:784] ["get recovery info request received"] [collectionID=440633149273728439] [partitionIDs="[]"] [2023/06/04 05:18:24.583 +00:00] [INFO] [datacoord/handler.go:119] [GetQueryVChanPositions] [collectionID=440633149273728439] [channel=by-dev-rootcoord-dml_2_440633149273728439v0] [numOfSegments=21] [2023/06/04 05:18:24.583 +00:00] [INFO] [datacoord/handler.go:261] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_2_440633149273728439v0] [posTs=441907383838703617] [posTime=2023/06/02 21:54:48.789 +00:00] [2023/06/04 05:18:24.583 +00:00] [INFO] [datacoord/services.go:819] ["datacoord append channelInfo in GetRecoveryInfo"] [collectionID=440633149273728439] [partitionIDs="[]"] [channelInfo="collectionID:440633149273728439 channelName:\"by-dev-rootcoord-dml_2_440633149273728439v0\" seek_position:<channel_name:\"by-dev-rootcoord-dml_2_440633149273728439v0\" msgID:\"\010\364\353\006\020\370\324\001\030\000 \000\" msgGroup:\"by-dev-dataNode-2860-by-dev-rootcoord-dml_2_440633149273728439v0\" timestamp:441907383838703617 > flushedSegmentIds:440633149274224900 flushedSegmentIds:440633149274221592 flushedSegmentIds:440633149274232803 flushedSegmentIds:440633149274240590 flushedSegmentIds:440633149274228660 flushedSegmentIds:440633149274242302 flushedSegmentIds:440633149274241602 flushedSegmentIds:440633149274231606 flushedSegmentIds:440633149274221459 flushedSegmentIds:440633149274239596 flushedSegmentIds:440633149274229377 flushedSegmentIds:440633149274231458 flushedSegmentIds:440633149274223740 flushedSegmentIds:440633149274230796 flushedSegmentIds:440633149274229761 flushedSegmentIds:440633149274239744 flushedSegmentIds:440633149274224334 flushedSegmentIds:440633149274222372 flushedSegmentIds:440633149274235170 flushedSegmentIds:440633149274226364 flushedSegmentIds:440633149274238653 "] [2023/06/04 05:18:24.585 +00:00] [INFO] [datacoord/handler.go:119] [GetQueryVChanPositions] [collectionID=440633149273728439] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [numOfSegments=21] [2023/06/04 05:18:24.585 +00:00] [INFO] [datacoord/handler.go:261] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [posTs=441907320871976961] [posTime=2023/06/02 21:50:48.590 +00:00] [2023/06/04 05:18:24.585 +00:00] [INFO] [datacoord/services.go:819] ["datacoord append channelInfo in GetRecoveryInfo"] [collectionID=440633149273728439] [partitionIDs="[]"] [channelInfo="collectionID:440633149273728439 channelName:\"by-dev-rootcoord-dml_3_440633149273728439v1\" seek_position:<channel_name:\"by-dev-rootcoord-dml_3_440633149273728439v1\" msgID:\"\010\377\353\006\020\323\301\001\030\000 \000\" msgGroup:\"by-dev-dataNode-2860-by-dev-rootcoord-dml_3_440633149273728439v1\" timestamp:441907320871976961 > flushedSegmentIds:440633149274222371 flushedSegmentIds:440633149274239743 flushedSegmentIds:440633149274238663 flushedSegmentIds:440633149274228821 flushedSegmentIds:440633149274221591 flushedSegmentIds:440633149274224340 flushedSegmentIds:440633149274231596 flushedSegmentIds:440633149274231605 flushedSegmentIds:440633149274240599 flushedSegmentIds:440633149274224901 flushedSegmentIds:440633149274233402 flushedSegmentIds:440633149274242132 flushedSegmentIds:440633149274238650 flushedSegmentIds:440633149274240589 flushedSegmentIds:440633149274230198 flushedSegmentIds:440633149274234882 flushedSegmentIds:440633149274229378 flushedSegmentIds:440633149274230795 flushedSegmentIds:440633149274223711 flushedSegmentIds:440633149274221460 flushedSegmentIds:440633149274226880 "] [2023/06/04 05:18:28.188 +00:00] [INFO] [datacoord/channel_manager.go:273] ["ChannelManager is not silent, skip channel balance this round"] [2023/06/04 05:18:28.418 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.475 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.529 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.579 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.658 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.707 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.756 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.813 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.862 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.910 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:28.961 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:29.010 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:29.060 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:29.113 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1] [2023/06/04 05:18:29.171 +00:00] [INFO] [datacoord/services.go:885] ["received get flushed segments request"] [collectionID=439652203113153329] [partitionID=-1]

datanode: [2023/06/04 05:16:30.096 +00:00] [INFO] [datanode/data_node.go:350] ["DataNode received a PUT event with tickler update progress"] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [version=3] [2023/06/04 05:16:30.097 +00:00] [INFO] [datanode/data_node.go:350] ["DataNode received a PUT event with tickler update progress"] [channel=by-dev-rootcoord-dml_12_439652203113153329v0] [version=3] time="2023-06-04T05:16:31Z" level=info msg="Failing 1 messages" producerID=2 producer_name=milvus-mining-pulsar-19-16307 topic="persistent://public/default/by-dev-rootcoord-delta_2" time="2023-06-04T05:16:31Z" level=info msg="Failing 1 messages" producerID=1 producer_name=milvus-mining-pulsar-19-16308 topic="persistent://public/default/by-dev-rootcoord-delta_3" [2023/06/04 05:16:31.673 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="message send timeout: TimeoutError"] [2023/06/04 05:16:31.673 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="message send timeout: TimeoutError"] time="2023-06-04T05:16:31Z" level=info msg="Failing 1 messages" producerID=4 producer_name=milvus-mining-pulsar-19-16309 topic="persistent://public/default/by-dev-datacoord-timetick-channel" time="2023-06-04T05:16:31Z" level=info msg="Failing 1 messages" producerID=3 producer_name=milvus-mining-pulsar-19-16310 topic="persistent://public/default/by-dev-datacoord-timetick-channel" [2023/06/04 05:16:31.688 +00:00] [ERROR] [datanode/flow_graph_time_ticker.go:130] ["send hard time tick failed"] [error="message send timeout: TimeoutError"] [stack="github.com/milvus-io/milvus/internal/datanode.(mergedTimeTickerSender).work\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_time_ticker.go:130"] [2023/06/04 05:16:31.688 +00:00] [ERROR] [datanode/flow_graph_time_ticker.go:130] ["send hard time tick failed"] [error="message send timeout: TimeoutError"] [stack="github.com/milvus-io/milvus/internal/datanode.(mergedTimeTickerSender).work\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_time_ticker.go:130"] time="2023-06-04T05:16:53Z" level=error msg="[Failed to create consumer]" consumerID=5 error="request timed out" name=ebftb subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" [2023/06/04 05:16:53.016 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="request timed out"] time="2023-06-04T05:16:53Z" level=error msg="[Failed to create consumer]" consumerID=5 error="request timed out" name=ebftb subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" time="2023-06-04T05:16:53Z" level=error msg="[Failed to create consumer]" consumerID=6 error="request timed out" name=ijjgq subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-04T05:16:53Z" level=error msg="[Failed to create consumer]" consumerID=6 error="request timed out" name=ijjgq subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" [2023/06/04 05:16:53.188 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="request timed out"] time="2023-06-04T05:17:01Z" level=info msg="Failing 1 messages" producerID=1 producer_name=milvus-mining-pulsar-19-16308 topic="persistent://public/default/by-dev-rootcoord-delta_3" time="2023-06-04T05:17:01Z" level=info msg="Failing 1 messages" producerID=2 producer_name=milvus-mining-pulsar-19-16307 topic="persistent://public/default/by-dev-rootcoord-delta_2" time="2023-06-04T05:17:23Z" level=error msg="[Failed to create consumer]" consumerID=7 error="request timed out" name=phemg subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" time="2023-06-04T05:17:23Z" level=error msg="[Failed to create consumer]" consumerID=7 error="request timed out" name=phemg subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" time="2023-06-04T05:17:23Z" level=error msg="[Failed to create consumer]" consumerID=8 error="request timed out" name=uexgh subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-04T05:17:23Z" level=error msg="[Failed to create consumer]" consumerID=8 error="request timed out" name=uexgh subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-04T05:17:29Z" level=warning msg="[Detected stale connection to broker]" local_addr="172.20.34.13:46652" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:17:29Z" level=warning msg="[Connection was closed]" cnx="172.20.34.13:46652 -> 192.168.220.30:6650" producerID=2 producer_name=milvus-mining-pulsar-19-16307 topic="persistent://public/default/by-dev-rootcoord-delta_2" time="2023-06-04T05:17:29Z" level=warning msg="[Connection was closed]" cnx="172.20.34.13:46652 -> 192.168.220.30:6650" producerID=1 producer_name=milvus-mining-pulsar-19-16308 topic="persistent://public/default/by-dev-rootcoord-delta_3" time="2023-06-04T05:17:29Z" level=warning msg="[Connection was closed]" cnx="172.20.34.13:46652 -> 192.168.220.30:6650" producerID=4 producer_name=milvus-mining-pulsar-19-16309 topic="persistent://public/default/by-dev-datacoord-timetick-channel" time="2023-06-04T05:17:29Z" level=warning msg="[Connection was closed]" cnx="172.20.34.13:46652 -> 192.168.220.30:6650" producerID=3 producer_name=milvus-mining-pulsar-19-16310 topic="persistent://public/default/by-dev-datacoord-timetick-channel" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 114.957814ms]" consumerID=4 name=sbdlf subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_3_440633149273728439v1 topic="persistent://public/default/by-dev-rootcoord-dml_3" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 117.93067ms]" consumerID=3 name=qnctf subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_2_440633149273728439v0 topic="persistent://public/default/by-dev-rootcoord-dml_2" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 118.14378ms]" producerID=1 producer_name=milvus-mining-pulsar-19-16308 topic="persistent://public/default/by-dev-rootcoord-delta_3" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 118.780906ms]" consumerID=1 name=wkdbk subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_1_440633149338133674v1 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 112.124947ms]" producerID=2 producer_name=milvus-mining-pulsar-19-16307 topic="persistent://public/default/by-dev-rootcoord-delta_2" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 109.349966ms]" consumerID=2 name=fqkun subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_0_440633149338133674v0 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 104.605582ms]" producerID=3 producer_name=milvus-mining-pulsar-19-16310 topic="persistent://public/default/by-dev-datacoord-timetick-channel" time="2023-06-04T05:17:29Z" level=info msg="[Reconnecting to broker in 119.557477ms]" producerID=4 producer_name=milvus-mining-pulsar-19-16309 topic="persistent://public/default/by-dev-datacoord-timetick-channel" time="2023-06-04T05:17:30Z" level=info msg="[Connecting to broker]" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:17:30Z" level=info msg="[TCP connection established]" local_addr="172.20.34.13:46852" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:17:30Z" level=info msg="[Connection is ready]" local_addr="172.20.34.13:46852" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:17:30Z" level=info msg="[Connected consumer]" consumerID=2 name=fqkun subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_0_440633149338133674v0 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:30Z" level=info msg="[Reconnected producer to broker]" cnx="172.20.34.13:46852 -> 192.168.220.30:6650" producerID=3 producer_name=milvus-mining-pulsar-19-16310 topic="persistent://public/default/by-dev-datacoord-timetick-channel" time="2023-06-04T05:17:30Z" level=info msg="[Reconnected consumer to broker]" consumerID=2 name=fqkun subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_0_440633149338133674v0 topic="persistent://public/default/by-dev-rootcoord-dml_0" time="2023-06-04T05:17:30Z" level=info msg="Resending 1 pending batches" producerID=2 producer_name=milvus-mining-pulsar-19-16307 topic="persistent://public/default/by-dev-rootcoord-delta_2" time="2023-06-04T05:17:30Z" level=info msg="[Reconnected producer to broker]" cnx="172.20.34.13:46852 -> 192.168.220.30:6650" producerID=2 producer_name=milvus-mining-pulsar-19-16307 topic="persistent://public/default/by-dev-rootcoord-delta_2" time="2023-06-04T05:17:30Z" level=warning msg="[Got unexpected message: ledgerId:112127 entryId:29094 partition:-1 ]" consumerID=4 local_addr="172.20.34.13:46652" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:17:30Z" level=warning msg="[Got unexpected message: ledgerId:112127 entryId:29095 partition:-1 ]" consumerID=4 local_addr="172.20.34.13:46652" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:17:30Z" level=warning msg="[Failed to write on connection]" error="write tcp 172.20.34.13:46652->192.168.220.30:6650: use of closed network connection" local_addr="172.20.34.13:46652" remote_addr="pulsar://milvus-mining-pulsar-proxy:6650" time="2023-06-04T05:17:30Z" level=error msg="[Failed to create consumer]" consumerID=10 error="connection closed" name=usmrf subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-04T05:17:30Z" level=error msg="[Failed to create consumer]" consumerID=10 error="connection closed" name=usmrf subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_12_439652203113153329v0 topic="persistent://public/default/by-dev-rootcoord-dml_12" time="2023-06-04T05:17:30Z" level=error msg="[Failed to create consumer]" consumerID=9 error="connection closed" name=gyapr subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13" time="2023-06-04T05:17:30Z" level=error msg="[Failed to create consumer]" consumerID=9 error="connection closed" name=gyapr subscription=by-dev-dataNode-2860-by-dev-rootcoord-dml_13_439652203113153329v1 topic="persistent://public/default/by-dev-rootcoord-dml_13"

altonchuzhan commented 1 year ago

I am quite sure no one is writing/inserting any data, who is the producer, why it keep creating producer, and also why backlog quota exceeded.

the remaining data should be in the pulsar as committed log(which didn't covert to the sealed segment), why datanode need that much memory, the inserting data should be put in brand new partition.

Is there a "reset" button to clear the pulsar state, I can afford lossing the newly inserted data. the search service is blocking now, else I may have to restart the world.

xiaofan-luan commented 1 year ago

Seems like this is a reason

time="2023-06-04T05:16:51Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1" time="2023-06-04T05:16:51Z" level=info msg="[Reconnecting to broker in 223.465899ms]" producerID=2 producer_name=milvus-mining-pulsar-19-16278 topic="persistent://public/default/by-dev-rootcoord-dml_1"

xiaofan-luan commented 1 year ago

@LoveEachDay @congqixia I remember I add some expire time for pulsar consumer right? How do we verify the expire time works?

altonchuzhan commented 1 year ago

Good news: I deleted all topic of pulsar in public/default, previous errors gone. and my test collection is ready for search.

Bad news: the big collection is facing a new problem: fail to search on all shard leaders, err=All attempts results: attempt #1:code: UnexpectedError, error: fail to Search, QueryNode ID=2450, reason=Search 2450 failed, reason Timestamp lag too large lag(36h8m1.814s) max(24h0m0s) err %!w() attempt #2:context canceled

xiaofan-luan commented 1 year ago

Seems that milvus is still catching up to recover from previous logs.

To recover, change dataCoord.channel.watchTimeoutInterval to 600 ms should be good enough

altonchuzhan commented 1 year ago

the dataCoord.channel.watchTimeoutInterval is int (seconds), I use config like this: image

then a lot of timeout error..

[2023/06/04 14:33:28.414 +00:00] [INFO] [datacoord/channel_manager.go:677] ["receive timeout acks from state watcher"] [state=3] [nodeID=2921] ["channel name"=by-dev-rootcoord-dml_13_439652203113153329v1]
[2023/06/04 14:33:28.414 +00:00] [WARN] [datacoord/channel_manager.go:627] ["datanode watch channel failed or timeout, will release"] [nodeID=2921] [channel=by-dev-rootcoord-dml_13_439652203113153329v1]
[2023/06/04 14:33:28.414 +00:00] [INFO] [datacoord/handler.go:61] [GetDataVChanPositions] [collectionID=439652203113153329] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [numOfSegments=663]
[2023/06/04 14:33:28.414 +00:00] [INFO] [datacoord/handler.go:261] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [posTs=441945553313464322] [posTime=2023/06/04 14:21:33.780 +00:00]
[2023/06/04 14:33:28.414 +00:00] [INFO] [datacoord/channel_checker.go:110] ["timer started"] ["watch state"=ToRelease] [nodeID=2921] ["channel name"=by-dev-rootcoord-dml_13_439652203113153329v1] ["check interval"=1s]
[2023/06/04 14:33:28.422 +00:00] [INFO] [datacoord/channel_manager.go:720] ["tickle update, timer delay"] [channel=by-dev-rootcoord-dml_13_439652203113153329v1] [progress=0]
[2023/06/04 14:33:28.739 +00:00] [INFO] [datacoord/channel_checker.go:152] ["remove timer for channel"] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [timerCount=3]
[2023/06/04 14:33:28.739 +00:00] [WARN] [datacoord/channel_checker.go:121] ["timeout and stop timer: wait for channel ACK timeout"] ["watch state"=ToWatch] [nodeID=2921] ["channel name"=by-dev-rootcoord-dml_3_440633149273728439v1] ["timeout interval"=1s] [runningTimerCount=3]
[2023/06/04 14:33:28.739 +00:00] [INFO] [datacoord/channel_manager.go:677] ["receive timeout acks from state watcher"] [state=3] [nodeID=2921] ["channel name"=by-dev-rootcoord-dml_3_440633149273728439v1]
[2023/06/04 14:33:28.739 +00:00] [WARN] [datacoord/channel_manager.go:627] ["datanode watch channel failed or timeout, will release"] [nodeID=2921] [channel=by-dev-rootcoord-dml_3_440633149273728439v1]
[2023/06/04 14:33:28.739 +00:00] [INFO] [datacoord/handler.go:61] [GetDataVChanPositions] [collectionID=440633149273728439] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [numOfSegments=21]
[2023/06/04 14:33:28.739 +00:00] [INFO] [datacoord/handler.go:261] ["channel seek position set from channel checkpoint meta"] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [posTs=441945553313464322] [posTime=2023/06/04 14:21:33.780 +00:00]
[2023/06/04 14:33:28.739 +00:00] [INFO] [datacoord/channel_checker.go:110] ["timer started"] ["watch state"=ToRelease] [nodeID=2921] ["channel name"=by-dev-rootcoord-dml_3_440633149273728439v1] ["check interval"=1s]
[2023/06/04 14:33:28.779 +00:00] [INFO] [datacoord/channel_manager.go:720] ["tickle update, timer delay"] [channel=by-dev-rootcoord-dml_3_440633149273728439v1] [progress=0]
altonchuzhan commented 1 year ago

changed it back, then everything seems working as expect

Could you elaborate more about this parameter, I am confusing about the timeout logic.

xiaofan-luan commented 1 year ago

changed it back, then everything seems working as expect

Could you elaborate more about this parameter, I am confusing about the timeout logic.

Good to know. During milvus crash, it take some time to read from kafka and load data from S3. if timeout reaches coordinator will reassign to a new node and restart. So the timeout better to be larger.

xiaofan-luan commented 1 year ago

But the default 10s should be good for most cases.

bigsheeper commented 1 year ago

https://github.com/milvus-io/milvus/pull/24656 with go payload wirter, datanode oom issue may be resolved.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.