zhuwenxing closed this issue 2 years ago.
/assign @sunby
Duplicate of #13256?
Not the same! In #13256, it raises errors and fails directly, but in this issue it hangs at the flush step.
/assign @zhuwenxing please verify it.
@sunby
In https://github.com/milvus-io/milvus/actions/runs/1627935493, datacoord, indexcoord, and querycoord work well.
Rootcoord failed, but it failed at search when running hello_milvus.py.
Please check out this job: https://github.com/milvus-io/milvus/runs/4644257749?check_suite_focus=true
It has the same behavior (hangs at flush) as this issue, but it was caused by killing the pulsar pod.
/assign @sunby
/assign /unassign @sunby
stuck at Seek
hangs at flush when running test_e2e.py
hangs at flush when running hello_milvus.py
hangs at flush when running test_e2e.py caused by minio pod kill https://github.com/milvus-io/milvus/runs/4661088464?check_suite_focus=true
But we can also see that rootcoord pod restarted once.
hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/milvus-io/milvus/runs/4669003081?check_suite_focus=true
hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/zhuwenxing/milvus/runs/4671387919?check_suite_focus=true
hangs at flush when running test_e2e.py caused by querynode pod kill https://github.com/zhuwenxing/milvus/runs/4671387853?check_suite_focus=true
hangs at flush when running test_e2e.py caused by indexcoord pod kill https://github.com/milvus-io/milvus/runs/4685511374?check_suite_focus=true
hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/milvus-io/milvus/runs/4685511462?check_suite_focus=true
hangs at flush when running test_e2e.py caused by pulsar pod kill https://github.com/milvus-io/milvus/runs/4718761096?check_suite_focus=true
For pulsar pod kill:
Why does flush hang: the timetick proceeds very slowly, about 30s per tt.
Why is the tt slow: rootcoord broadcasts a tt only once every 30s.
Why is the rootcoord tt broadcast slow: two dml_channels seem to never recover from pulsar, so producing on them always times out, and the default produce timeout is 30s.
4457 [2022/01/05 19:01:44.361 +00:00] [DEBUG] [timeticksync.go:330] ["SendTimeTickToChannel fail"] [error="message send timeout: TimeoutError"]
4458 [2022/01/05 19:01:44.361 +00:00] [ERROR] [dml_channels.go:109] ["Broadcast failed"] [chanName=by-dev-rootcoord-dml_10] [stack="github.com/milvus-io/milvus/internal/rootcoord.(*dmlChannels).broadcast\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/dml_channels.go:109\ngithub.com/milvus-io/milvus/internal/rootcoord.(*timetickSync).sendTimeTickToChannel\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/timeticksync.go:368\ngithub.com/milvus-io/milvus/internal/rootcoord.(*timetickSync).startWatch.func1\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/timeticksync.go:329"]
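To make the failure mode concrete, here is a minimal sketch (assuming pulsar-client-go; the topic names and the broadcast loop are illustrative, not the actual rootcoord code) of how one unrecovered channel turns every timetick broadcast into a ~30s wait:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

// Illustration of the failure mode above: if one of the dml channels never
// recovers after the pulsar pod kill, every synchronous Send on it blocks
// until SendTimeout (30s by default), so each timetick broadcast takes ~30s.
func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Hypothetical topic names, not the real rootcoord dml channel list.
	topics := []string{"by-dev-rootcoord-dml_0", "by-dev-rootcoord-dml_10"}
	var producers []pulsar.Producer
	for _, t := range topics {
		p, err := client.CreateProducer(pulsar.ProducerOptions{
			Topic:       t,
			SendTimeout: 30 * time.Second, // matches the default produce timeout in the analysis
		})
		if err != nil {
			log.Fatal(err)
		}
		producers = append(producers, p)
	}

	// Broadcast a few ticks to every channel sequentially: one stuck channel
	// makes every iteration wait the full 30s before reporting a send timeout.
	for i := 0; i < 3; i++ {
		start := time.Now()
		for _, p := range producers {
			if _, err := p.Send(context.Background(), &pulsar.ProducerMessage{Payload: []byte("tt")}); err != nil {
				log.Printf("broadcast failed: %v", err) // e.g. "message send timeout"
			}
		}
		log.Printf("timetick broadcast %d took %s", i, time.Since(start))
	}
}
```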
2 roads:
https://github.com/milvus-io/milvus/runs/4730662241?check_suite_focus=true
This is also caused by a produce timeout: producing on hello_milvus's time tick channel and delta channel times out.
This timeout blocks the flowgraph; 22 tts cost 11 minutes, roughly 30s per tick (matching the 30s produce timeout), so the DN appears stuck.
RootCoord was not available yet; when the DN handled a flushed segment recovery, this etcd-put event failed: DataCoord put a WatchDmChannel etcd event, DataNode detected the put event and started a new data sync service, but the event was never marked Completed.
Solution: DataCoord needs to check repeatedly whether a WatchDmChannel etcd event's status becomes Completed. If it is not completed for a long time, DataCoord needs to re-put this event.
https://github.com/milvus-io/milvus/runs/4779791136?check_suite_focus=true Flush hang caused by rootcoord pod kill. Note that the datanode restarted once for some reason.
This will be fixed by #15102
The last three pod kill chaos tests have all passed, so remove the critical label.
The passing actions were caused by a chaos injection failure, not by the fix. After fixing the chaos injection, the issue still exists. See: https://github.com/milvus-io/milvus/runs/4844812108?check_suite_focus=true
1 [2022/01/17 18:50:26.760 +00:00] [INFO] [data_node.go:310] ["DataNode is handling watchInfo put event"] [key=by-dev/meta/channelwatch/6/by-dev-rootcoord-dml_54_430557200390553601v0]
...
5 [2022/01/17 18:50:26.839 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:47] ["datanode AsConsumer"] ["physical channel"=by-dev-rootcoord-dml_54] [subName=by-dev-dataNode-6-430557200390553601]
6 [2022/01/17 18:50:26.839 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:52] ["datanode begin to seek"] ["Channel Name"=by-dev-rootcoord-dml_54]
...
10 [2022/01/17 18:59:26.607 +00:00] [DEBUG] [flow_graph_dmstream_input_node.go:57] ["datanode Seek successfully"] ["Channel Name"=by-dev-rootcoord-dml_54] [elapse=8m59.768128063s]
8m59.768128063s
What does this log mean? Does retrieving data from pulsar really take nearly 9 minutes?
RootCoord was not available yet; when the DN handled a flushed segment recovery, this etcd-put event failed:
- DataCoord put a WatchDmChannel etcd event.
- DataNode detected the put event and started a new data sync service.
- The new data sync service failed because:
  - the DN restarted and there are flushed segments;
  - the DN failed to get the collection schema of the flushed segments, because RootCoord is in chaos.
- DataNode handleChannelInfo stopped with an error and did not mark this event as Completed.
- DataCoord will never put the event again, so the event will never be set to Completed.
- DataNode can never subscribe to the flushed segment vChannel, so flush hangs forever.
Solution: DataCoord needs to check repeatedly whether a WatchDmChannel etcd event's status becomes Completed. If it is not completed for a long time, DataCoord needs to re-put this event.
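A rough sketch of what that re-put check could look like on the DataCoord side, using etcd clientv3; the key prefix matches the channelwatch keys seen in the DataNode logs in this thread, but the state check and timing are assumptions, not the real channelwatch schema:

```go
package main

import (
	"context"
	"log"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchEventChecker periodically lists the channel-watch keys and re-puts any
// event that has not reached "Completed" within the given timeout, so a
// DataNode that failed (e.g. because RootCoord was down) gets another chance.
// The state check is a placeholder: the real channelwatch value is a
// serialized proto, not plain text.
func watchEventChecker(ctx context.Context, cli *clientv3.Client, prefix string, timeout time.Duration) {
	deadlines := map[string]time.Time{}
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
			if err != nil {
				log.Printf("list watch events failed: %v", err)
				continue
			}
			for _, kv := range resp.Kvs {
				key, val := string(kv.Key), string(kv.Value)
				if strings.Contains(val, "Completed") { // placeholder state check
					delete(deadlines, key)
					continue
				}
				deadline, seen := deadlines[key]
				if !seen {
					deadlines[key] = time.Now().Add(timeout)
					continue
				}
				if time.Now().After(deadline) {
					// Re-put the same event so the DataNode handles it again.
					if _, err := cli.Put(ctx, key, val); err != nil {
						log.Printf("re-put %s failed: %v", key, err)
						continue
					}
					deadlines[key] = time.Now().Add(timeout)
					log.Printf("re-put watch event %s", key)
				}
			}
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	// Prefix follows the channelwatch keys seen in the DataNode logs above.
	watchEventChecker(context.Background(), cli, "by-dev/meta/channelwatch/", 2*time.Minute)
}
```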
It seems that in step 3, if the datanode fails to handle the event, we should just mark the task as failed and wait for DataCoord to retry.
(quoting the analysis and the step-3 suggestion above)
Yes, for now the DataNode will retry to watch the channel if step 3 fails, so it can handle this situation. DataCoord has also implemented the retry logic, and I want to enable it after the GA release.
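For reference, the DataNode-side retry could be as simple as a bounded retry with backoff around the schema fetch; this is only a generic sketch with hypothetical call sites, not the actual Milvus code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls fn up to maxAttempts times with exponential backoff.
// In the scenario above, fn would stand in for the DataNode's "get collection
// schema from RootCoord" call; the name and wiring here are hypothetical.
func retryWithBackoff(ctx context.Context, maxAttempts int, base time.Duration, fn func(context.Context) error) error {
	var lastErr error
	delay := base
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := fn(ctx)
		if err == nil {
			return nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	err := retryWithBackoff(context.Background(), 5, time.Second, func(ctx context.Context) error {
		// Placeholder for the schema fetch; keeps failing while RootCoord is down.
		return errors.New("rootcoord not available")
	})
	fmt.Println(err)
}
```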
(quoting the DataNode seek log above: Seek succeeded with elapse=8m59.768128063s)
@xiaofan-luan It seems that the DataNode spent too much time seeking after pulsar restarted.
Seek cost 9mins
@XuanYang-cn Would you help figure out why it took so long? 👍
Here's a reproduce: https://github.com/XuanYang-cn/milvus/actions/runs/1716709139
07:36:10.307 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-61dd82f746] Subscribing to topic on cnx [id: 0x2c59f11e, L:/10.244.0.62:55598 - R:10.244.0.62/10.244.0.62:6650], consumerId 49
According to the broker logs, pulsar tries to recover every consumer's healthcheck channel in order, from consumerID=1 to consumerID=48.
All 48 consumers take about 9 minutes to finish recovering to this channel: persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck
Until consumerID=48 recovers to the healthcheck channel, the Seek of consumerID=48 is stuck.
Now investigating the broker logs of the passing tests.
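Since a single blocking Seek can stall the flowgraph for minutes while the broker recovers consumers one by one, one defensive option (just an illustration, not what Milvus currently does) is to bound the seek with a timeout, assuming pulsar-client-go:

```go
package seekguard

import (
	"fmt"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

// SeekWithTimeout runs Consumer.Seek in a goroutine so the caller can give up
// and surface an error instead of blocking the flowgraph for minutes while the
// broker recovers its consumers one by one.
func SeekWithTimeout(c pulsar.Consumer, id pulsar.MessageID, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- c.Seek(id) }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("seek did not complete within %s", timeout)
	}
}
```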
A complete log for one consumer recovering to the healthcheck channel (it takes about 30s):
9229 07:33:40.318 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Subscribing to topic on cnx [id: 0x2c59f11e, L:/10.244.0.62:55598 - R:10.244.0.62/10.244.0.62:6650], consumerId 39
9230 07:33:40.319 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Subscribing on topic persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck / reader-a34c3b0d25
9231 07:33:40.323 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Created new producer: Producer{topic=PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck}, client=/10.244.0.62:55598, producerName=test-pulsar-pod-kill-2-49, producerId=39}
9232 07:33:40.324 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] [test-pulsar-pod-kill-2-49] Created producer on cnx [id: 0x2c59f11e, L:/10.244.0.62:55598 - R:10.244.0.62/10.244.0.62:6650]
9233 07:33:40.341 [Thread-256] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Creating non-durable subscription at msg id 9223372036854775807:9223372036854775807:-1:-1
9234 07:33:40.341 [Thread-256] INFO org.apache.bookkeeper.mledger.impl.NonDurableCursorImpl - [pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck] Created non-durable cursor read-position=178:39 mark-delete-position=178:38
9235 07:33:40.341 [Thread-256] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck] Opened new cursor: NonDurableCursorImpl{ledger=pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck, ackPos=178:38, readPos=178:39}
9236 07:33:40.342 [Thread-256] INFO org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck-reader-a34c3b0d25] Rewind from 178:39 to 178:39
9237 07:33:40.342 [Thread-256] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] There are no replicated subscriptions on the topic
9238 07:33:40.342 [Thread-256] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Created new subscription for 39
9239 07:33:40.342 [Thread-256] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Created subscription on topic persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck / reader-a34c3b0d25
9240 07:33:40.343 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Subscribed to topic on 10.244.0.62/10.244.0.62:6650 -- consumer: 39
9241 07:33:40.361 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck}][test-pulsar-pod-kill-2-49] Closing producer on cnx /10.244.0.62:55598. producerId=39
9242 07:33:40.361 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck}][test-pulsar-pod-kill-2-49] Closed producer on cnx /10.244.0.62:55598. producerId=39
9243 07:33:40.362 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Closing consumer: consumerId=39
9244 07:33:40.366 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] [test-pulsar-pod-kill-2-49] Closed Producer
9245 07:33:40.367 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.AbstractDispatcherSingleActiveConsumer - Removing consumer Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck, name=reader-a34c3b0d25}, consumerId=39, consumerName=de00b, address=/10.244.0.62:55598}
9246 07:33:40.367 [pulsar-external-listener-44-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [19/Jan/2022:07:33:40 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "curl/7.64.0" 88
9247 07:33:40.367 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Successfully closed subscription [NonDurableCursorImpl{ledger=pulsar/test-pulsar-pod-kill/10.244.0.62:8080/persistent/healthcheck, ackPos=178:38, readPos=178:40}]
9248 07:33:40.368 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck][reader-a34c3b0d25] Successfully closed dispatcher for reader
9249 07:33:40.368 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.62:55598] Closed consumer, consumerId=39
9250 07:33:40.368 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck] [reader-a34c3b0d25] Closed consumer
9251 07:34:04.074 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 broker available: no load shedding will be performed
9252 07:34:10.242 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Starting Pulsar consumer status recorder with config: {
9253 "topicNames" : [ "persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck" ],
9254 "topicsPattern" : null,
9255 "subscriptionName" : "reader-5196f39992",
9256 "subscriptionType" : "Exclusive",
9257 "subscriptionMode" : "NonDurable",
9258 "receiverQueueSize" : 1000,
9259 "acknowledgementsGroupTimeMicros" : 100000,
9260 "negativeAckRedeliveryDelayMicros" : 60000000,
9261 "maxTotalReceiverQueueSizeAcrossPartitions" : 50000,
9262 "consumerName" : null,
9263 "ackTimeoutMillis" : 0,
9264 "tickDurationMillis" : 1000,
9265 "priorityLevel" : 0,
9266 "maxPendingChuckedMessage" : 10,
9267 "autoAckOldestChunkedMessageOnQueueFull" : false,
9268 "expireTimeOfIncompleteChunkedMessageMillis" : 60000,
9269 "cryptoFailureAction" : "FAIL",
9270 "properties" : { },
9271 "readCompacted" : false,
9272 "subscriptionInitialPosition" : "Latest",
9273 "patternAutoDiscoveryPeriod" : 60,
9274 "regexSubscriptionMode" : "PersistentOnly",
9275 "deadLetterPolicy" : null,
9276 "retryEnable" : false,
9277 "autoUpdatePartitions" : true,
9278 "autoUpdatePartitionsIntervalSeconds" : 60,
9279 "replicateSubscriptionState" : false,
9280 "resetIncludeHead" : false,
9281 "keySharedPolicy" : null,
9282 "batchIndexAckEnabled" : false
9283 }
9284 07:34:10.243 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Pulsar client config: {
9285 "serviceUrl" : "pulsar://10.244.0.62:6650",
9286 "authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationDisabled",
9287 "operationTimeoutMs" : 30000,
9288 "statsIntervalSeconds" : 60,
9289 "numIoThreads" : 1,
9290 "numListenerThreads" : 1,
9291 "connectionsPerBroker" : 1,
9292 "useTcpNoDelay" : true,
9293 "useTls" : false,
9294 "tlsTrustCertsFilePath" : null,
9295 "tlsAllowInsecureConnection" : false,
9296 "tlsHostnameVerificationEnable" : false,
9297 "concurrentLookupRequest" : 5000,
9298 "maxLookupRequest" : 50000,
9299 "maxLookupRedirects" : 20,
9300 "maxNumberOfRejectedRequestPerConnection" : 50,
9301 "keepAliveIntervalSeconds" : 30,
9302 "connectionTimeoutMs" : 10000,
9303 "requestTimeoutMs" : 60000,
9304 "initialBackoffIntervalNanos" : 100000000,
9305 "maxBackoffIntervalNanos" : 60000000000,
9306 "listenerName" : null,
9307 "useKeyStoreTls" : false,
9308 "sslProvider" : null,
9309 "tlsTrustStoreType" : "JKS",
9310 "tlsTrustStorePath" : null,
9311 "tlsTrustStorePassword" : null,
9312 "tlsCiphers" : [ ],
9313 "tlsProtocols" : [ ],
9314 "proxyServiceUrl" : null,
9315 "proxyProtocol" : null,
9316 "enableTransaction" : false
9317 }
9318 07:34:10.249 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Starting Pulsar producer perf with config: {
9319 "topicName" : "persistent://pulsar/test-pulsar-pod-kill/10.244.0.62:8080/healthcheck",
9320 "producerName" : null,
9321 "sendTimeoutMs" : 30000,
9322 "blockIfQueueFull" : false,
9323 "maxPendingMessages" : 1000,
9324 "maxPendingMessagesAcrossPartitions" : 50000,
9325 "messageRoutingMode" : "RoundRobinPartition",
9326 "hashingScheme" : "JavaStringHash",
9327 "cryptoFailureAction" : "FAIL",
9328 "batchingMaxPublishDelayMicros" : 1000,
9329 "batchingPartitionSwitchFrequencyByPublishDelay" : 10,
9330 "batchingMaxMessages" : 1000,
9331 "batchingMaxBytes" : 131072,
9332 "batchingEnabled" : true,
9333 "chunkingEnabled" : false,
9334 "compressionType" : "NONE",
9335 "initialSequenceId" : null,
9336 "autoUpdatePartitions" : true,
9337 "autoUpdatePartitionsIntervalSeconds" : 60,
9338 "multiSchema" : true,
9339 "properties" : { }
9340 }
9341 07:34:10.250 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Pulsar client config: {
9342 "serviceUrl" : "pulsar://10.244.0.62:6650",
9343 "authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationDisabled",
9344 "operationTimeoutMs" : 30000,
9345 "statsIntervalSeconds" : 60,
9346 "numIoThreads" : 1,
9347 "numListenerThreads" : 1,
9348 "connectionsPerBroker" : 1,
9349 "useTcpNoDelay" : true,
9350 "useTls" : false,
9351 "tlsTrustCertsFilePath" : null,
9352 "tlsAllowInsecureConnection" : false,
9353 "tlsHostnameVerificationEnable" : false,
9354 "concurrentLookupRequest" : 5000,
9355 "maxLookupRequest" : 50000,
9356 "maxLookupRedirects" : 20,
9357 "maxNumberOfRejectedRequestPerConnection" : 50,
9358 "keepAliveIntervalSeconds" : 30,
9359 "connectionTimeoutMs" : 10000,
9360 "requestTimeoutMs" : 60000,
9361 "initialBackoffIntervalNanos" : 100000000,
9362 "maxBackoffIntervalNanos" : 60000000000,
9363 "listenerName" : null,
9364 "useKeyStoreTls" : false,
9365 "sslProvider" : null,
9366 "tlsTrustStoreType" : "JKS",
9367 "tlsTrustStorePath" : null,
9368 "tlsTrustStorePassword" : null,
9369 "tlsCiphers" : [ ],
9370 "tlsProtocols" : [ ],
9371 "proxyServiceUrl" : null,
9372 "proxyProtocol" : null,
9373 "enableTransaction" : false
9374 }
With ZK load balancing:
15502 10:20:55.632 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306] Closed consumer, consumerId=15
15503 10:20:55.633 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] [reader-227a782993] Closed consumer
15504 10:20:55.639 [pulsar-external-listener-44-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [19/Jan/2022:10:20:55 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "curl/7.64.0" 111
15505
15506 10:21:08.907 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 11.049842834472656% exceeded threshold 10%; time since last report written is 19.458 seconds
15507 10:21:08.920 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181 lastZxid:14389 xid:3201 sent:3201 recv:3506 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080
15508 10:21:13.907 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 90.1116830929489% exceeded threshold 10%; time since last report written is 5.0 seconds
15509 10:21:13.911 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181 lastZxid:14390 xid:3205 sent:3205 recv:3511 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080
15510 10:21:18.514 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Only 1 broker available: no load shedding will be performed
15511 10:21:25.464 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Starting Pulsar consumer status recorder with config: {
15512 "topicNames" : [ "persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck" ],
Without ZK load balancing:
15334 10:20:55.486 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck][reader-3bd1c12c85] Subscribing to topic on cnx [id: 0x1e9b14ef, L:/10.244.0.57:41306 - R:10.244.0.57/10.244.0.57:6650], consumerId 14
15335 10:20:55.487 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306] Subscribing on topic persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck / reader-3bd1c12c85
15336 10:20:55.489 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] [null] Creating producer on cnx [id: 0x1e9b14ef, L:/10.244.0.57:41306 - R:10.244.0.57/10.244.0.57:6650]
15337 10:20:55.489 [pulsar-io-23-2] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306][persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] Creating producer. producerId=14
15338 10:20:55.521 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.244.0.57:41306] Created new producer: Producer{topic=PersistentTopic{topic=persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck}, client=/10.244.0.57:41306, producerName=test-pulsar-pod-kill-1-570, producerId=14}
15339 10:20:55.522 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck] [test-pulsar-pod-kill-1-570] Created producer on cnx [id: 0x1e9b14ef, L:/10.244.0.57:41306 - R:10.244.0.57/10.244.0.57:6650]
15340 10:20:55.540 [pulsar-client-io-43-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - Starting Pulsar consumer status recorder with config: {
15341 "topicNames" : [ "persistent://pulsar/test-pulsar-pod-kill/10.244.0.57:8080/healthcheck" ],
15342 "topicsPattern" : null,
15343 "subscriptionName" : "reader-227a782993",
Seems like the ModularLoadManagerImpl is expensive:
15504 10:20:55.639 [pulsar-external-listener-44-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [19/Jan/2022:10:20:55 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "curl/7.64.0" 111
15505
15506 10:21:08.907 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 11.049842834472656% exceeded threshold 10%; time since last report written is 19.458 seconds
15507 10:21:08.920 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181 lastZxid:14389 xid:3201 sent:3201 recv:3506 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080
15508 10:21:13.907 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 90.1116830929489% exceeded threshold 10%; time since last report written is 5.0 seconds
15509 10:21:13.911 [pulsar-ordered-OrderedExecutor-6-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x1000009c374000b local:/10.244.0.57:55094 remoteserver:test-pulsar-pod-kill-zookeeper-ca/10.96.178.164:2181 lastZxid:14390 xid:3205 sent:3205 recv:3511 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.244.0.57:8080
The above is wrong: the healthcheck channel won't block new channels from being created or from seeking. The healthcheck channel just keeps going.
@zhuwenxing @XuanYang-cn This issue will be fixed after #15343
Any progress on this issue? Is it fixed?
Not reproduced yet, so remove the urgent label.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Is there an existing issue for this?
Environment
Current Behavior
Flush hangs in the e2e test after deleting the datacoord or rootcoord pod-kill chaos.
Expected Behavior
All operations work well after the chaos is deleted.
Steps To Reproduce
Anything else?
datacoord log: https://github.com/milvus-io/milvus/suites/4764821710/artifacts/132476087
rootcoord log: https://github.com/milvus-io/milvus/suites/4764821710/artifacts/132476097
Log path is k8s_logs/chaos_test