[Bug]: [chaos]Flush hangs at e2e test after deleting pulsar pod kill chaos

zhuwenxing commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: master-20211225-132b3c2
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.0.0rc9.dev24
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Flush hangs at e2e test after deleting datacoord or rootcoord pod kill chaos

Expected Behavior

all operations work well after deleting chaos

Steps To Reproduce

# reproduce on local k8s env
cd `tests/python_client/chaos`
modify milvus tag in `cluster-values.yaml` to `master-20211225-132b3c2`
run command `bash chaos_test.sh rootcoord pod_kill`

# run GitHub action
https://github.com/milvus-io/milvus/actions/workflows/pod-kill-chaos-test.yaml

Anything else?

datacoord log: https://github.com/milvus-io/milvus/suites/4764821710/artifacts/132476087

rootcoord log: https://github.com/milvus-io/milvus/suites/4764821710/artifacts/132476097

log path is k8s_logs/chaos_test

zhuwenxing commented 2 years ago

/assign @sunby

sunby commented 2 years ago

duplicate with #13256 ?

zhuwenxing commented 2 years ago

Not the same as it! In #13256, it will raise errors and fail directly. but in this issue, it was hanging at the flush step.

sunby commented 2 years ago

/assign @zhuwenxing please verify it.

zhuwenxing commented 2 years ago

@sunby In https://github.com/milvus-io/milvus/actions/runs/1627935493, datacoord, indexcoord,querycoord work well. Rootcoord failed,but it was failed at search when running hello_milvus.py. Please check out this job https://github.com/milvus-io/milvus/runs/4644257749?check_suite_focus=true It has the same behavior (hangs at flush)as this issue, but it was caused by killing pulsar pod.

zhuwenxing commented 2 years ago

/assign @sunby

XuanYang-cn commented 2 years ago

/assign /unassign @sunby

XuanYang-cn commented 2 years ago

stuck at Seek

zhuwenxing commented 2 years ago

hangs at flush when running test_e2e.py

hangs at flush when running hello_milvus.py

zhuwenxing commented 2 years ago

hangs at flush when running test_e2e.py caused by minio pod kill https://github.com/milvus-io/milvus/runs/4661088464?check_suite_focus=true

But we can also see that rootcoord pod restarted once.

zhuwenxing commented 2 years ago

hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/milvus-io/milvus/runs/4669003081?check_suite_focus=true

zhuwenxing commented 2 years ago

hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/zhuwenxing/milvus/runs/4671387919?check_suite_focus=true

hangs at flush when running test_e2e.py caused by querynode pod kill https://github.com/zhuwenxing/milvus/runs/4671387853?check_suite_focus=true

zhuwenxing commented 2 years ago

hangs at flush when running test_e2e.py caused by indexcoord pod kill https://github.com/milvus-io/milvus/runs/4685511374?check_suite_focus=true hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/milvus-io/milvus/runs/4685511462?check_suite_focus=true

zhuwenxing commented 2 years ago

hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/milvus-io/milvus/runs/4718761096?check_suite_focus=true

XuanYang-cn commented 2 years ago

For pulsar pod kill:

Why flush hang: Timetick proceeding very very slow. 30s a tt.

Why tt slow: rootcoord broadcast tt 30s a time

Why rootcoord broadcast tt slow: there are 2 dml_channel seemed never recover from pulsar, producing on those are always timeout, the default produce timeout is 30s.

https://github.com/milvus-io/milvus/blob/8df158a9c4e1736d0a7e1fe05020c5c2546f7136/internal/rootcoord/timeticksync.go#L335

XuanYang-cn commented 2 years ago

4457 [2022/01/05 19:01:44.361 +00:00] [DEBUG] [timeticksync.go:330] ["SendTimeTickToChannel fail"] [error="message send timeout: TimeoutError"]

4458 [2022/01/05 19:01:44.361 +00:00] [ERROR] [dml_channels.go:109] ["Broadcast failed"] [chanName=by-dev-rootcoord-dml_10] [stack="github.com/milvus-io/milvus/internal/rootcoord.(dmlChannels).broadcast\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/dml_channe ls.go:109\ngithub.com/milvus-io/milvus/internal/rootcoord.(timetickSync).sendTimeTickToChannel\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/timeticksync.go:368\ngithub.com/milvus-io/milvus/internal/rootcoord.(*timetickSync).startWatch.func1\n\t/go/src/g ithub.com/milvus-io/milvus/internal/rootcoord/timeticksync.go:329"]

XuanYang-cn commented 2 years ago

2 roads:

why pulsar recover yet the topic is unavailable( produce timeout )
Why indexnode-pod-kill causes flush hang

XuanYang-cn commented 2 years ago

https://github.com/milvus-io/milvus/runs/4730662241?check_suite_focus=true

This is also caused by producing timeout: hello_milvus's time tick channel && delta channel producing timeout.

This timeout will block flowgraph, 22 tt costs 11 mins, thus DN seemed stucked

XuanYang-cn commented 2 years ago

[x] https://github.com/milvus-io/milvus/runs/4730662365?check_suite_focus=true @sunby

RoorCoord not available yet, when DN handles a flushed segment recovery, this etcd-put event failed.

DataCoord put a WatchDmChannel Etcd event
DataNode detected the put event and started to new data Sync Service
new data sync service failed because
- DN restarted, there are flushed segment.
- DN failed to get the collection schema of the flushed segment, because RootCoord is in Chaos
  1. DataNode handleChannelInfo stopped with error and didn't put this event as Completed
  2. DataCoordinator will never put again, even the event will never be set to Completed.
  3. DataNode couldn't subscribe to the flushed segment vChannel forever, causing flush hanging.

Solutions: DataCoord needs to check repeatedly if a WatchDmChannel Etcd event status becomes Completed. If it's not completed for a long time, DataCoord needs to re-put this event.

zhuwenxing commented 2 years ago

https://github.com/milvus-io/milvus/runs/4779791136?check_suite_focus=true Flush hang caused by rootcoord pod kill. Note that datanode restart once for some reason.

XuanYang-cn commented 2 years ago

https://github.com/milvus-io/milvus/runs/4779791136?check_suite_focus=true Flush hang caused by rootcoord pod kill. Note that datanode restart once for some reason.

This will be fixed by #15102

zhuwenxing commented 2 years ago

The last three pod kill chaos tests have all passed, so remove the critical label.

zhuwenxing commented 2 years ago

The last three pod kill chaos tests have all passed, so remove the critical label.

The action all passed is caused by chaos injection failure, not because of the fix. After fixing the chaos injection failure, the issue still exists. See: https://github.com/milvus-io/milvus/runs/4844812108?check_suite_focus=true

XuanYang-cn commented 2 years ago

  1 [2022/01/17 18:50:26.760 +00:00] [INFO] [data_node.go:310] ["DataNode is handling watchInfo put event"] [key=by-dev/meta/channelwatch/6/by-dev-rootcoord-dml_54_430557200390553601v0]
...
  5 [2022/01/17 18:50:26.839 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:47] ["datanode AsConsumer"] ["physical channel"=by-dev-rootcoord-dml_54] [subName=by-dev-dataNode-6-430557200390553601]
  6 [2022/01/17 18:50:26.839 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:52] ["datanode begin to seek"] ["Channel Name"=by-dev-rootcoord-dml_54]
...
 10 [2022/01/17 18:59:26.607 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:57] ["datanode Seek successfully"] ["Channel Name"=by-dev-rootcoord-dml_54] [elapse=8m59.768128063s]

xiaofan-luan commented 2 years ago

8m59.768128063s

What does the log mean? retrieve data from pulsar takes 10 mins?

xiaofan-luan commented 2 years ago

[x] https://github.com/milvus-io/milvus/runs/4730662365?check_suite_focus=true @sunby

RoorCoord not available yet, when DN handles a flushed segment recovery, this etcd-put event failed.

DataCoord put a WatchDmChannel Etcd event

DataNode detected the put event and started to new data Sync Service

new data sync service failed because

DN restarted, there are flushed segment.

DN failed to get the collection schema of the flushed segment, because RootCoord is in Chaos

DataNode handleChannelInfo stopped with error and didn't put this event as Completed

DataCoordinator will never put again, even the event will never be set to Completed.

DataNode couldn't subscribe to the flushed segment vChannel forever, causing flush hanging.

Solutions: DataCoord needs to check repeatedly if a WatchDmChannel Etcd event status becomes Completed. If it's not completed for a long time, DataCoord needs to re-put this event.

Seems that under step 3, if datanode failed to handle, we should just mark the task as failed and wait for datacoord retry

sunby commented 2 years ago

[x] https://github.com/milvus-io/milvus/runs/4730662365?check_suite_focus=true @sunby

RoorCoord not available yet, when DN handles a flushed segment recovery, this etcd-put event failed.

DataCoord put a WatchDmChannel Etcd event

DataNode detected the put event and started to new data Sync Service

new data sync service failed because

DN restarted, there are flushed segment.

DN failed to get the collection schema of the flushed segment, because RootCoord is in Chaos

DataNode handleChannelInfo stopped with error and didn't put this event as Completed

DataCoordinator will never put again, even the event will never be set to Completed.

DataNode couldn't subscribe to the flushed segment vChannel forever, causing flush hanging.

Solutions: DataCoord needs to check repeatedly if a WatchDmChannel Etcd event status becomes Completed. If it's not completed for a long time, DataCoord needs to re-put this event.

Seems that under step 3, if datanode failed to handle, we should just mark the task as failed and wait for datacoord retry

Yes, for now the DataNode will retry to wantch the channel if the step3 failed and it can handle this situation. DataCoord have also implemented the retry logic and I want to enable it after GA release.

sunby commented 2 years ago

  1 [2022/01/17 18:50:26.760 +00:00] [INFO] [data_node.go:310] ["DataNode is handling watchInfo put event"] [key=by-dev/meta/channelwatch/6/by-dev-rootcoord-dml_54_430557200390553601v0]
...
  5 [2022/01/17 18:50:26.839 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:47] ["datanode AsConsumer"] ["physical channel"=by-dev-rootcoord-dml_54] [subName=by-dev-dataNode-6-430557200390553601]
  6 [2022/01/17 18:50:26.839 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:52] ["datanode begin to seek"] ["Channel Name"=by-dev-rootcoord-dml_54]
...
 10 [2022/01/17 18:59:26.607 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:57] ["datanode Seek successfully"] ["Channel Name"=by-dev-rootcoord-dml_54] [elapse=8m59.768128063s]

@xiaofan-luan It seems that DataNode spent too much time to seek after pulsar restarted.

XuanYang-cn commented 2 years ago

Seek cost 9mins

sunby commented 2 years ago

Seek cost 9mins

@XuanYang-cn Would you help to figure out the reason why it cost so long? 👍

XuanYang-cn commented 2 years ago

Here's a reproduce: https://github.com/XuanYang-cn/milvus/actions/runs/1716709139

 07:36:10.307 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-61dd82f746] Subscribing to topic on cnx [id: 0x2c59f11e, L:/10.244.0.62:55598 - R:10.244.0.62/10.24      4.0.62:6650], consumerId 49

According to the broker logs, pulsar will try to recover every consumer's healthcheck channel in order from consumerID=1 to consumerID=48.

And all 48 consumers takes 9 minutes finishing recover to this channel persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck

Util consumerID=48 recover to healthcheck channel, Seek of consumerID=48 stucked.

XuanYang-cn commented 2 years ago

Now investigating the logs of broker of passing tests.

XuanYang-cn commented 2 years ago

A complete log for one consumer recovering to healthcheck channel: 30s

 9229 07:33:40.318 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Subscribing to topic on cnx [id: 0x2c59f11e, L:/10.244.0.62:55598 - R:10.244.0.62/10.24      4.0.62:6650], consumerId 39
 9230 07:33:40.319 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Subscribing on topic persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck / reader-a34c3b0d25
 9231 07:33:40.323 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Created new producer: Producer{topic=PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck}, clien      t=/10.244.0.62:55598, producerName=test-pulsar-pod-kill-2-49, producerId=39}
 9232 07:33:40.324 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] [test-pulsar-pod-kill-2-49] Created producer on cnx [id: 0x2c59f11e, L:/10.244.0.62:55598 - R:10.244.0.62/      10.244.0.62:6650]
 9233 07:33:40.341 [Thread-256] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Creating non-durable subscription at msg id 9223372036854775807:92233720368547758      07:-1:-1
 9234 07:33:40.341 [Thread-256] INFO  org.apache.bookkeeper.mledger.impl.NonDurableCursorImpl - [pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck] Created non-durable cursor read-position=178:39 mark-delete-position=178:38
 9235 07:33:40.341 [Thread-256] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck] Opened new cursor: NonDurableCursorImpl{ledger=pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/health      check, ackPos=178:38, readPos=178:39}
 9236 07:33:40.342 [Thread-256] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck-reader-a34c3b0d25] Rewind from 178:39 to 178:39
 9237 07:33:40.342 [Thread-256] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] There are no replicated subscriptions on the topic
 9238 07:33:40.342 [Thread-256] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Created new subscription for 39
 9239 07:33:40.342 [Thread-256] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Created subscription on topic persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck / reader-a34c3b0d25
 9240 07:33:40.343 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Subscribed to topic on 10.244.0.62/10.244.0.62:6650 -- consumer: 39
 9241 07:33:40.361 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck}][test-pulsar-pod-kill-2-49] Closing producer on cnx /10.244.0.62:55598. producerId=39
 9242 07:33:40.361 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck}][test-pulsar-pod-kill-2-49] Closed producer on cnx /10.244.0.62:55598. producerId=39
 9243 07:33:40.362 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Closing consumer: consumerId=39
 9244 07:33:40.366 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] [test-pulsar-pod-kill-2-49] Closed Producer
 9245 07:33:40.367 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.AbstractDispatcherSingleActiveConsumer - Removing consumer Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck, name=reader-a34c      3b0d25}, consumerId=39, consumerName=de00b, address=/10.244.0.62:55598}
 9246 07:33:40.367 [pulsar-external-listener-44-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [19/Jan/2022:07:33:40 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "curl/7.64.0" 88
 9247 07:33:40.367 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Successfully closed subscription [NonDurableCursorImpl{ledger=pulsar/t      est-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck, ackPos=178:38, readPos=178:40}]
 9248 07:33:40.368 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Successfully closed dispatcher for reader
 9249 07:33:40.368 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Closed consumer, consumerId=39
 9250 07:33:40.368 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] [reader-a34c3b0d25] Closed consumer
 9251 07:34:04.074 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 broker available: no load shedding will be performed
 9252 07:34:10.242 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Starting Pulsar consumer status recorder with config: {
 9253   "topicNames" : [ "persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck" ],                 
 9254   "topicsPattern" : null,                                                                                     
 9255   "subscriptionName" : "reader-5196f39992",                                                                   
 9256   "subscriptionType" : "Exclusive",                                                                           
 9257   "subscriptionMode" : "NonDurable",                                                                          
 9258   "receiverQueueSize" : 1000,                                                                                 
 9259   "acknowledgementsGroupTimeMicros" : 100000,                                                                 
 9260   "negativeAckRedeliveryDelayMicros" : 60000000,                                                              
 9261   "maxTotalReceiverQueueSizeAcrossPartitions" : 50000,                                                        
 9262   "consumerName" : null,                                                                                      
 9263   "ackTimeoutMillis" : 0,                                                                                     
 9264   "tickDurationMillis" : 1000,                                                                                
 9265   "priorityLevel" : 0,                                                                                        
 9266   "maxPendingChuckedMessage" : 10,                                                                            
 9267   "autoAckOldestChunkedMessageOnQueueFull" : false,                                                           
 9268   "expireTimeOfIncompleteChunkedMessageMillis" : 60000,                                                       
 9269   "cryptoFailureAction" : "FAIL",                                                                             
 9270   "properties" : { },                                                                                         
 9271   "readCompacted" : false,                                                                                    
 9272   "subscriptionInitialPosition" : "Latest",                                                                   
 9273   "patternAutoDiscoveryPeriod" : 60,                                                                          
 9274   "regexSubscriptionMode" : "PersistentOnly",                                                                 
 9275   "deadLetterPolicy" : null,                                                                                  
 9276   "retryEnable" : false,                                                                                      
 9277   "autoUpdatePartitions" : true,                                                                              
 9278   "autoUpdatePartitionsIntervalSeconds" : 60,                                                                 
 9279   "replicateSubscriptionState" : false,                                                                       
 9280   "resetIncludeHead" : false,   
   9281   "keySharedPolicy" : null,                                                                                   
 9282   "batchIndexAckEnabled" : false                                                                              
 9283 }                                                                                                             
 9284 07:34:10.243 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Pulsar client config: {
 9285   "serviceUrl" : "pulsar://10.244.0.62:6650",                                                                 
 9286   "authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationDisabled",                        
 9287   "operationTimeoutMs" : 30000,                                                                               
 9288   "statsIntervalSeconds" : 60,                                                                                
 9289   "numIoThreads" : 1,                                                                                         
 9290   "numListenerThreads" : 1,                                                                                   
 9291   "connectionsPerBroker" : 1,                                                                                 
 9292   "useTcpNoDelay" : true,                                                                                     
 9293   "useTls" : false,                                                                                           
 9294   "tlsTrustCertsFilePath" : null,                                                                             
 9295   "tlsAllowInsecureConnection" : false,                                                                       
 9296   "tlsHostnameVerificationEnable" : false,                                                                    
 9297   "concurrentLookupRequest" : 5000,                                                                           
 9298   "maxLookupRequest" : 50000,                                                                                 
 9299   "maxLookupRedirects" : 20,                                                                                  
 9300   "maxNumberOfRejectedRequestPerConnection" : 50,                                                             
 9301   "keepAliveIntervalSeconds" : 30,                                                                            
 9302   "connectionTimeoutMs" : 10000,                                                                              
 9303   "requestTimeoutMs" : 60000,                                                                                 
 9304   "initialBackoffIntervalNanos" : 100000000,                                                                  
 9305   "maxBackoffIntervalNanos" : 60000000000,                                                                    
 9306   "listenerName" : null,                                                                                      
 9307   "useKeyStoreTls" : false,                                                                                   
 9308   "sslProvider" : null,                                                                                       
 9309   "tlsTrustStoreType" : "JKS",                                                                                
 9310   "tlsTrustStorePath" : null,                                                                                 
 9311   "tlsTrustStorePassword" : null,                                                                             
 9312   "tlsCiphers" : [ ],                                                                                         
 9313   "tlsProtocols" : [ ],                                                                                       
 9314   "proxyServiceUrl" : null,                                                                                   
 9315   "proxyProtocol" : null,                                                                                     
 9316   "enableTransaction" : false                                                                                 
 9317 }                                                                                                             
 9318 07:34:10.249 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Starting Pulsar producer perf with config: {
 9319   "topicName" : "persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck",                      
 9320   "producerName" : null,                                                                                      
 9321   "sendTimeoutMs" : 30000,                                                                                    
 9322   "blockIfQueueFull" : false,                                                                                 
 9323   "maxPendingMessages" : 1000,                                                                                
 9324   "maxPendingMessagesAcrossPartitions" : 50000,                                                               
 9325   "messageRoutingMode" : "RoundRobinPartition",                                                               
 9326   "hashingScheme" : "JavaStringHash",                                                                         
 9327   "cryptoFailureAction" : "FAIL",                                                                             
 9328   "batchingMaxPublishDelayMicros" : 1000,                                                                     
 9329   "batchingPartitionSwitchFrequencyByPublishDelay" : 10,                                                      
 9330   "batchingMaxMessages" : 1000,                                                                               
 9331   "batchingMaxBytes" : 131072,                                                                                
 9332   "batchingEnabled" : true,                                                                                   
 9333   "chunkingEnabled" : false,                                                                                  
 9334   "compressionType" : "NONE",                                                                                 
 9335   "initialSequenceId" : null,                                                                                 
 9336   "autoUpdatePartitions" : true,                                                                              
 9337   "autoUpdatePartitionsIntervalSeconds" : 60,                                                                 
 9338   "multiSchema" : true,                                                                                       
 9339   "properties" : { }                                                                                          
 9340 }                                                                         
      9341 07:34:10.250 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Pulsar client config: {
 9342   "serviceUrl" : "pulsar://10.244.0.62:6650",                                                                 
 9343   "authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationDisabled",                        
 9344   "operationTimeoutMs" : 30000,                                                                               
 9345   "statsIntervalSeconds" : 60,                                                                                
 9346   "numIoThreads" : 1,                                                                                         
 9347   "numListenerThreads" : 1,                                                                                   
 9348   "connectionsPerBroker" : 1,                                                                                 
 9349   "useTcpNoDelay" : true,                                                                                     
 9350   "useTls" : false,                                                                                           
 9351   "tlsTrustCertsFilePath" : null,                                                                             
 9352   "tlsAllowInsecureConnection" : false,                                                                       
 9353   "tlsHostnameVerificationEnable" : false,                                                                    
 9354   "concurrentLookupRequest" : 5000,                                                                           
 9355   "maxLookupRequest" : 50000,                                                                                 
 9356   "maxLookupRedirects" : 20,                                                                                  
 9357   "maxNumberOfRejectedRequestPerConnection" : 50,                                                             
 9358   "keepAliveIntervalSeconds" : 30,                                                                            
 9359   "connectionTimeoutMs" : 10000,                                                                              
 9360   "requestTimeoutMs" : 60000,                                                                                 
 9361   "initialBackoffIntervalNanos" : 100000000,                                                                  
 9362   "maxBackoffIntervalNanos" : 60000000000,                                                                    
 9363   "listenerName" : null,                                                                                      
 9364   "useKeyStoreTls" : false,                                                                                   
 9365   "sslProvider" : null,                                                                                       
 9366   "tlsTrustStoreType" : "JKS",                                                                                
 9367   "tlsTrustStorePath" : null,                                                                                 
 9368   "tlsTrustStorePassword" : null,                                                                             
 9369   "tlsCiphers" : [ ],                                                                                         
 9370   "tlsProtocols" : [ ],                                                                                       
 9371   "proxyServiceUrl" : null,                                                                                   
 9372   "proxyProtocol" : null,                                                                                     
 9373   "enableTransaction" : false                                                                                 
 9374 }

XuanYang-cn commented 2 years ago

9231 07:33:40.323 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Created new producer: Producer{topic=PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck}, clien      t=/10.244.0.62:55598, producerName=test-pulsar-pod-kill-2-49, producerId=39}

XuanYang-cn commented 2 years ago

With zk loadbanlancing

15502 10:20:55.632 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306] Closed consumer, consumerId=15
15503 10:20:55.633 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] [reader-227a782993] Closed consumer                                                                       
15504 10:20:55.639 [pulsar-external-listener-44-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [19/Jan/2022:10:20:55 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "curl/7.64.0" 111
15505                                                                                                                                                                                                                                                                           
15506 10:21:08.907 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 11.049842834472656% exceeded threshold 10%; time since last report written is 19.458 seconds       
15507 10:21:08.920 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO  org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181       lastZxid:14389 xid:3201 sent:3201 recv:3506 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080
15508 10:21:13.907 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 90.1116830929489% exceeded threshold 10%; time since last report written is 5.0 seconds            
15509 10:21:13.911 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO  org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181       lastZxid:14390 xid:3205 sent:3205 recv:3511 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080                                                  
15510 10:21:18.514 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 broker available: no load shedding will be performed
15511 10:21:25.464 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Starting Pulsar consumer status recorder with config: {
15512   "topicNames" : [ "persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck" ],

Without zk load balancing

15334 10:20:55.486 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck][reader-3bd1c12c85] Subscribing to topic on cnx [id: 0x1e9b14ef, L:/10.244.0.57:41306 - R:10.244.0.57/10.24      4.0.57:6650], consumerId 14
15335 10:20:55.487 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306] Subscribing on topic persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck / reader-3bd1c12c85
15336 10:20:55.489 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] [null] Creating producer on cnx [id: 0x1e9b14ef, L:/10.244.0.57:41306 - R:10.244.0.57/10.244.0.57:6650]
15337 10:20:55.489 [pulsar-io-23-2] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306][persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] Creating producer. producerId=14
15338 10:20:55.521 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306] Created new producer: Producer{topic=PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck}, clien      t=/10.244.0.57:41306, producerName=test-pulsar-pod-kill-1-570, producerId=14}
15339 10:20:55.522 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] [test-pulsar-pod-kill-1-570] Created producer on cnx [id: 0x1e9b14ef, L:/10.244.0.57:41306 - R:10.244.0.57      /10.244.0.57:6650]
15340 10:20:55.540 [pulsar-client-io-43-1] INFO  org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Starting Pulsar consumer status recorder with config: {
15341   "topicNames" : [ "persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck" ],                 
15342   "topicsPattern" : null,                                                                                     
15343   "subscriptionName" : "reader-227a782993",

XuanYang-cn commented 2 years ago

Seems like the ModularLoadManagerImpl is expensive

15504 10:20:55.639 [pulsar-external-listener-44-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [19/Jan/2022:10:20:55 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "curl/7.64.0" 111
15505
15506 10:21:08.907 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 11.049842834472656% exceeded threshold 10%; time since last report written is 19.458 seconds       
15507 10:21:08.920 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO  org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181       lastZxid:14389 xid:3201 sent:3201 recv:3506 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080
15508 10:21:13.907 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 90.1116830929489% exceeded threshold 10%; time since last report written is 5.0 seconds            
15509 10:21:13.911 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO  org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181       lastZxid:14390 xid:3205 sent:3205 recv:3511 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080

XuanYang-cn commented 2 years ago

The above is wrong, health check channel won't block new channels from creating and seeking. The healthcheck channel just keep going.

XuanYang-cn commented 2 years ago

https://github.com/XuanYang-cn/milvus/actions/runs/1726757451

zhuwenxing commented 2 years ago

https://github.com/milvus-io/milvus/runs/4914182804?check_suite_focus=true

sunby commented 2 years ago

@zhuwenxing @XuanYang-cn This issue will be fixed after #15343

xiaofan-luan commented 2 years ago

any progress on this issue? fixed?

zhuwenxing commented 2 years ago

Not reproduced yet, so remove the urgent label.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

milvus-io / milvus