strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0
4.86k stars 1.3k forks source link

strimzi JMX configuration issue #3121

Closed pashafirdous closed 4 years ago

pashafirdous commented 4 years ago

Hello!

we are observing a weird issue since we have enabled a few settings for montoring on JMX port.

the cluster continuously reconciles and gives errors. we also see the kafka-brokers get into unavailable state when monitored and the operator gives below error. however we dont see any restarts happenning at pod level.

need help to understand the issue and address it.

2020-05-29 16:26:00 INFO  AbstractOperator:262 - Reconciliation #0(watch) Kafka(kafka-1/cluster1): reconciled
2020-05-29 16:26:00 INFO  OperatorWatcher:40 - Reconciliation #1(watch) Kafka(kafka-1/cluster1): Kafka cluster1 in namespace kafka-1 was MODIFIED
2020-05-29 16:26:00 INFO  AbstractOperator:122 - Reconciliation #1(watch) Kafka(kafka-1/cluster1): Kafka cluster1 should be created or updated
2020-05-29T16:26:01.511+0000: 112.950: [GC (Allocation Failure) 2020-05-29T16:26:01.511+0000: 112.950: [DefNew: 12468K->1194K(12608K), 0.0071226 secs] 50701K->40106K(56384K), 0.0072056 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-29T16:26:02.235+0000: 113.673: [GC (Allocation Failure) 2020-05-29T16:26:02.235+0000: 113.673: [DefNew: 12458K->117K(12608K), 0.0039529 secs] 51370K->39442K(56384K), 0.0040352 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-29T16:26:02.366+0000: 113.805: [GC (Allocation Failure) 2020-05-29T16:26:02.366+0000: 113.805: [DefNew: 11381K->148K(12608K), 0.0029042 secs] 50706K->39473K(56384K), 0.0029832 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2020-05-29T16:26:03.837+0000: 115.275: [GC (Allocation Failure) 2020-05-29T16:26:03.837+0000: 115.275: [DefNew: 11398K->993K(12608K), 0.0034647 secs] 50723K->40318K(56384K), 0.0035515 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2020-05-29 16:26:04 INFO  AbstractOperator:262 - Reconciliation #1(watch) Kafka(kafka-1/cluster1): reconciled
2020-05-29 16:26:12 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace kafka-1...
2020-05-29 16:26:12 WARN  VersionUsageUtils:60 - The client is using resource type 'kafkamirrormakers' with unstable version 'v1beta1'
2020-05-29T16:26:12.614+0000: 124.053: [GC (Allocation Failure) 2020-05-29T16:26:12.614+0000: 124.053: [DefNew: 12257K->288K(12608K), 0.0046683 secs] 51582K->39967K(56384K), 0.0047541 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-29 16:26:12 INFO  AbstractOperator:122 - Reconciliation #2(timer) Kafka(kafka-1/cluster1): Kafka cluster1 should be created or updated
2020-05-29 16:26:12 WARN  VersionUsageUtils:60 - The client is using resource type 'kafkaconnects' with unstable version 'v1beta1'
2020-05-29 16:26:12 WARN  VersionUsageUtils:60 - The client is using resource type 'kafkamirrormaker2s' with unstable version 'v1alpha1'
2020-05-29 16:26:12 WARN  VersionUsageUtils:60 - The client is using resource type 'kafkabridges' with unstable version 'v1alpha1'
2020-05-29T16:26:12.834+0000: 124.273: [GC (Allocation Failure) 2020-05-29T16:26:12.834+0000: 124.273: [DefNew: 11552K->1241K(12608K), 0.0055264 secs] 51231K->40921K(56384K), 0.0056178 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2020-05-29T16:26:13.221+0000: 124.660: [GC (Allocation Failure) 2020-05-29T16:26:13.221+0000: 124.660: [DefNew: 12505K->112K(12608K), 0.0054510 secs] 52185K->40826K(56384K), 0.0055442 secs] [Times: user=0.01 sys=0.01, real=0.00 secs]
2020-05-29T16:26:13.355+0000: 124.793: [GC (Allocation Failure) 2020-05-29T16:26:13.355+0000: 124.793: [DefNew: 11376K->140K(12608K), 0.0030191 secs] 52090K->40853K(56384K), 0.0031008 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-29T16:26:15.420+0000: 126.859: [GC (Allocation Failure) 2020-05-29T16:26:15.420+0000: 126.859: [DefNew: 11404K->751K(12608K), 0.0035910 secs] 52117K->41465K(56384K), 0.0036841 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2020-05-29 16:26:16 INFO  AbstractOperator:262 - Reconciliation #2(timer) Kafka(kafka-1/cluster1): reconciled
2020-05-29 16:28:12 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace kafka-1...
2020-05-29T16:28:12.622+0000: 244.060: [GC (Allocation Failure) 2020-05-29T16:28:12.622+0000: 244.060: [DefNew: 12015K->434K(12608K), 0.0040678 secs] 52728K->41206K(56384K), 0.0041753 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2020-05-29 16:28:12 INFO  AbstractOperator:122 - Reconciliation #3(timer) Kafka(kafka-1/cluster1): Kafka cluster1 should be created or updated
2020-05-29T16:28:13.046+0000: 244.485: [GC (Allocation Failure) 2020-05-29T16:28:13.047+0000: 244.485: [DefNew: 11698K->539K(12608K), 0.0038317 secs] 52469K->41311K(56384K), 0.0039435 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-29T16:28:13.173+0000: 244.612: [GC (Allocation Failure) 2020-05-29T16:28:13.173+0000: 244.612: [DefNew: 11803K->563K(12608K), 0.0034205 secs] 52575K->41335K(56384K), 0.0035032 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-29T16:28:14.416+0000: 245.855: [GC (Allocation Failure) 2020-05-29T16:28:14.416+0000: 245.855: [DefNew: 11827K->1087K(12608K), 0.0040359 secs] 52599K->41859K(56384K), 0.0041212 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-29T16:28:15.139+0000: 246.577: [GC (Allocation Failure) 2020-05-29T16:28:15.139+0000: 246.577: [DefNew: 12346K->486K(12608K), 0.0044050 secs] 53118K->41715K(56384K), 0.0044801 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2020-05-29 16:28:15 INFO  AbstractOperator:262 - Reconciliation #3(timer) Kafka(kafka-1/cluster1): reconciled
scholzj commented 4 years ago

Wait, the log does not contain any errors - that just looks like a regular behaviour. The operator by default every two minutes does the periodical reconciliation to double check that everything is ok.

pashafirdous commented 4 years ago

i got this error with the JMX monitoring agent.

5/29/20 05:56:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:56:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:56:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:56:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@cc4e121
5/29/20 05:56:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:56:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:57:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
5/29/20 05:57:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:57:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:57:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:57:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
5/29/20 05:57:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@42841

Could you advice what is the reason the the sampler is shutting down and also the JMX is closing connections regularly.

2020-05-29_2336

scholzj commented 4 years ago

To be honest, I have no idea. Never used the JMX stuff, always only Prometheus. Maybe @julianGoh17 who contributed this might have an idea?

julianGoh17 commented 4 years ago

@scholzj It seems like the Jmx monitoring agent can not connect to the Kafka Broker's JMX pod. Think it could be something with how the JMX monitoring agent is trying to connect to Kafka.

@pashafirdous If you could give me some more detail about how you are running the JMX monitoring agent and how Kafka JMX has been set up (i.e. does it have auth or not) that would help us to debug this issue.

pashafirdous commented 4 years ago

image

we have just enabled the option jmxOptions: {} the tools is listening on port 9999

the pic above shows that all the broker nodes are having equal availability time. so when we see a per node log, they are down alternatively.

Strimzi cluster logs are only for reconciliation as in the question.

the logs for the agent are below.

6/01/20 03:38:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:38:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:39:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:39:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:39:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:39:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:39:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:39:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@3f15241a
6/01/20 03:39:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:39:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:40:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:40:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:40:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:40:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:40:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:40:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@5ae95f89
6/01/20 03:40:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:40:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:41:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:41:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:41:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:41:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:41:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:41:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@40e5bbc4
6/01/20 03:41:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:41:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:42:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:42:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:42:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:42:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:42:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:42:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@49424900
6/01/20 03:42:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:42:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:43:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:43:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:43:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:43:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:43:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:43:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@22126d4e
6/01/20 03:43:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:43:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:44:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:44:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:44:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:44:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:44:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:44:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@111c8528
6/01/20 03:44:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:44:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:45:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:45:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:45:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:45:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:45:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:45:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@4be5d2e4
6/01/20 03:45:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:45:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:46:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:46:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:46:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:46:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:46:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:46:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@674ef2b4
6/01/20 03:46:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:46:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:47:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:47:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:47:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:47:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:47:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:47:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@1957c1d6
6/01/20 03:47:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:47:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:48:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:48:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:48:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:48:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:48:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:48:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@eb38180
6/01/20 03:48:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:48:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-2.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:49:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:50:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:51:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:52:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:52:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:52:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:52:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:52:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:52:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@7e1c1cf3
6/01/20 03:52:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:52:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:53:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:54:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:54:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Created new Sampler for cluster1-kafka-1.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:54:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:54:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:54:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Removed Stale Scheduled Sampler  -  cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:54:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] Cannot connect to MBeanServer - javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection@78ee02c9
6/01/20 03:54:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] shutting down sampler for server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:54:58 PM UTC [INFO] [IntroscopeAgent.KafkaExtension] Closing jmx connection to server: cluster1-kafka-0.cluster1-kafka-brokers.kafka-1.svc.cluster.local_0
6/01/20 03:55:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
6/01/20 03:56:58 PM UTC [ERROR] [IntroscopeAgent.KafkaExtension] No Kafka brokers online to monitor
k8soptr@jumpbox:~$
julianGoh17 commented 4 years ago

@pashafirdous I don't think that there are any problems with the Cluster Operator logs. The operator is set to reconcile every 2 minutes which is why the Reconciliation happens every two minutes. This means that the operator will check the state of the kubernetes environment and will adjust it as it sees fit. The logs you have provided suggest that there are no changes in between each reconcile.

How are you running the JMX exporter? Is it in a pod in the same namespace or ran externally? If it is not a pod deployed in the namespace, then you need to use https://strimzi.io/docs/operators/master/using.html#assembly-jmxtrans-deployment-configuration-kafka to obtain JMX. This is because of exposing the JMX port via a route/nodeport is not achievable.

pashafirdous commented 4 years ago

Hello Julian,

we are just enabling the JMX port in the kafka CR jmxOptions: {}

there is an agent installed in the same namespace with will poll the cluster on the port 9999

the concern we are observing is the kafka brokers (in case of 3 node) every time only one broker stays available as per the logs which is in a roundrobin fashion, the brokers are coming up, making it 33.08% uptime individually on the JMX port.

i am not understanding this behavior.

as you said, i understand we need to use JMXtrans in case we have an external polling agent. but we are not having that scenario.

scholzj commented 4 years ago

How do you configure the agent to connect? Which addresses do you use etc.? Is the agent actually ever connected? It loosk to me more like it is just looping around the brokers without really connecting anywhere.

scholzj commented 4 years ago

Also, does the cluster actually have constant rolling updates as the name of the issue suggests? Or is the JMX agent the only thing which has connection issues? We should change the issue accordingly.

pashafirdous commented 4 years ago

we are configuring a single monitoring agent, installed in the namespace, with JMX on service/my-cluster-kafka-brokers headless service on port 9999

service/cluster1-kafka-brokers              ClusterIP      9091/TCP,9092/TCP,9093/TCP,9999/TCP   

there is no rolling updates, the logs are right after configuration.

need some guidance to find, what is actually causing the problem.

scholzj commented 4 years ago

I do not know much about JMX. But would expect that you need to connect to each broker and just to one of them, or? Connecting to the cluster1-kafka-brokers service (or the cluster1-kafka-bootstrap service as well), you would just alternate between the brokers. Don't you need to open 3 different connections to cluster1-kafka-0.cluster1-kafka-brokers.svc:9999, cluster1-kafka-1.cluster1-kafka-brokers.svc:9999 etc.?

pashafirdous commented 4 years ago

just so you know, when i added the single service, it is polling all the 3 nodes, and when i monitor the agent uptime, i see only one broker is up at one time, as seen in the logs.

the image in https://github.com/strimzi/strimzi-kafka-operator/issues/3121#issuecomment-637006893 shows 3 brokers uptime.

i shall try with adding brokers individually and let you know.

scholzj commented 4 years ago

The headless service is basically a DNS construct which resolves to the 3 Kafka pods. I would assume that it is trying to connect to all 3 brokers because it is just alternating between the 3 addresses backing the DNS name ... first the first one, than the second one, then the third one etc.

pashafirdous commented 4 years ago

Hi Jakub,

thanks for assistance.

we tested this with couple of deployments and are satisfied with comments. closing the case.