strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Question/Help] Why is the status of Kafka NotReady? #4015

Closed. AllenZMC closed this issue 3 years ago.

AllenZMC commented 3 years ago

See the following output. Why is the status of the Kafka resource NotReady? All of the Kafka cluster pods are running.

This is on OCP 4.x:

oc describe Kafka testkafka -n openshift-operators
Name:         testkafka
Namespace:    openshift-operators
Labels:       <none>
Annotations:  <none>
API Version:  kafka.strimzi.io/v1beta1
Kind:         Kafka
Metadata:
  Creation Timestamp:  2020-11-24T05:38:04Z
  Generation:          6
  Managed Fields:
    API Version:  kafka.strimzi.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:entityOperator:
          f:topicOperator:
            f:reconciliationIntervalSeconds:
            f:topicMetadataMaxAttempts:
            f:zookeeperSessionTimeoutSeconds:
          f:userOperator:
            f:reconciliationIntervalSeconds:
            f:zookeeperSessionTimeoutSeconds:
      f:status:
        .:
        f:conditions:
        f:observedGeneration:
    Manager:      okhttp
    Operation:    Update
    Time:         2020-11-29T16:00:18Z
    API Version:  kafka.strimzi.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:cruiseControl:
        f:kafka:
          f:listeners:
          f:replicas:
        f:zookeeper:
          f:replicas:
    Manager:         Mozilla
    Operation:       Update
    Time:            2020-11-30T06:35:41Z
  Resource Version:  21845214
  Self Link:         /apis/kafka.strimzi.io/v1beta1/namespaces/openshift-operators/kafkas/testkafka
  UID:               a825accc-cfcb-4fc2-b4d4-2db9055d76cd
Spec:
  Cruise Control:
  Entity Operator:
    Topic Operator:
    User Operator:
  Kafka:
    Config:
      offsets.topic.replication.factor:          1
      transaction.state.log.min.isr:             1
      transaction.state.log.replication.factor:  1
    Listeners:
      Name:  plain
      Port:  9092
      Tls:   false
      Type:  internal
      Authentication:
        Type:  tls
      Name:    tls
      Port:    9093
      Tls:     true
      Type:    internal
    Replicas:  3
    Storage:
      Type:  jbod
      Volumes:
        Delete Claim:  false
        Id:            0
        Size:          100Gi
        Type:          persistent-claim
  Zookeeper:
    Replicas:  3
    Storage:
      Delete Claim:  false
      Size:          100Gi
      Type:          persistent-claim
Status:
  Conditions:
    Last Transition Time:  2020-11-29T16:00:11+0000
    Message:               Operation: [get]  for kind: [Kafka]  with name: [testkafka]  in namespace: [openshift-operators]  failed.
    Reason:                KubernetesClientException
    Status:                True
    Type:                  NotReady
  Observed Generation:     3
Events:                    <none>
k get po -n openshift-operators |grep testkafka
testkafka-cruise-control-65bfc4dc98-pbt7x             2/2     Running             0          5d
testkafka-entity-operator-6d59f97d46-kv76g            3/3     Running             0          6d4h
testkafka-kafka-0                                     1/1     Running             0          5d
testkafka-kafka-1                                     1/1     Running             0          5d
testkafka-kafka-2                                     1/1     Running             0          5d
testkafka-zookeeper-0                                 1/1     Running             0          6d4h
testkafka-zookeeper-1                                 1/1     Running             0          5d
testkafka-zookeeper-2                                 1/1     Running             0          5d
scholzj commented 3 years ago

You should check the Cluster Operator log for details. I do not think I have seen this particular error before. Could it maybe be some RBAC issue?
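
For example, something along these lines should show the operator log and sanity-check RBAC. The Deployment and ServiceAccount names below (strimzi-cluster-operator) are assumptions and depend on how the operator was installed:

# Assumed Deployment name; adjust to your installation
oc logs deployment/strimzi-cluster-operator -n openshift-operators

# Check whether the operator's ServiceAccount (name assumed) is allowed to "get" Kafka resources
oc auth can-i get kafkas.kafka.strimzi.io -n openshift-operators \
  --as=system:serviceaccount:openshift-operators:strimzi-cluster-operator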

AllenZMC commented 3 years ago

See this: it is a KubernetesClientException. Have you encountered this situation before?

Status:
  Conditions:
    Last Transition Time:  2020-11-29T16:00:11+0000
    Message:               Operation: [get]  for kind: [Kafka]  with name: [testkafka]  in namespace: [openshift-operators]  failed.
    Reason:                KubernetesClientException
    Status:                True
    Type:                  NotReady
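
The same condition can also be read directly from the resource, for example with a jsonpath query matching the condition type shown above:

oc get kafka testkafka -n openshift-operators \
  -o jsonpath='{.status.conditions[?(@.type=="NotReady")].message}{"\n"}'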
scholzj commented 3 years ago

I saw that, but as I said, you have to check the Cluster Operator log for more details.

AllenZMC commented 3 years ago

Cluster Operator log:

2020-12-01 06:04:59 WARN  AbstractOperator:377 - Reconciliation #38462(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:06:59 WARN  AbstractOperator:377 - Reconciliation #38465(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:08:59 WARN  AbstractOperator:377 - Reconciliation #38468(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:10:59 WARN  AbstractOperator:377 - Reconciliation #38471(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:12:59 WARN  AbstractOperator:377 - Reconciliation #38474(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:14:59 WARN  AbstractOperator:377 - Reconciliation #38477(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:16:59 WARN  AbstractOperator:377 - Reconciliation #38480(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:18:59 WARN  AbstractOperator:377 - Reconciliation #38483(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:20:59 WARN  AbstractOperator:377 - Reconciliation #38486(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
2020-12-01 06:22:59 WARN  AbstractOperator:377 - Reconciliation #38489(timer) Kafka(openshift-operators/testkafka): Failed to acquire lock lock::openshift-operators::Kafka::testkafka within 10000ms.
scholzj commented 3 years ago

Please share the whole log from the beginning (when the container starts) till the end.
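
For example, something like this captures it in one go (Deployment name assumed; if the operator container has restarted in the meantime, add --previous against the pod name to get the earlier log):

oc logs deployment/strimzi-cluster-operator -n openshift-operators > cluster-operator.log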

AllenZMC commented 3 years ago

Sorry, the whole log is too long. Is there any other way?

Should I restart the operator pod?

For reference, this is the withLock method that emits the "Failed to acquire lock" warning:

 protected final <T> Future<T> withLock(Reconciliation reconciliation, long lockTimeoutMs, Callable<Future<T>> callable) {
        Promise<T> handler = Promise.promise();
        String namespace = reconciliation.namespace();
        String name = reconciliation.name();
        final String lockName = getLockName(namespace, name);
        log.debug("{}: Try to acquire lock {}", reconciliation, lockName);
        vertx.sharedData().getLockWithTimeout(lockName, lockTimeoutMs, res -> {
            if (res.succeeded()) {
                log.debug("{}: Lock {} acquired", reconciliation, lockName);
                Lock lock = res.result();
                try {
                    callable.call().onComplete(callableRes -> {
                        if (callableRes.succeeded()) {
                            handler.complete(callableRes.result());
                        } else {
                            handler.fail(callableRes.cause());
                        }

                        lock.release();
                        log.debug("{}: Lock {} released", reconciliation, lockName);
                    });
                } catch (Throwable ex) {
                    lock.release();
                    log.debug("{}: Lock {} released", reconciliation, lockName);
                    log.error("{}: Reconciliation failed", reconciliation, ex);
                    handler.fail(ex);
                }
            } else {
                log.warn("{}: Failed to acquire lock {} within {}ms.", reconciliation, lockName, lockTimeoutMs);
                handler.fail(new UnableToAcquireLockException());
            }
        });
        return handler.future();
    }
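
The repeated warnings in the operator log above come from the failing branch at the end of this method: getLockWithTimeout gives up after lockTimeoutMs (the 10000ms seen in the log), the reconciliation fails with UnableToAcquireLockException, and the callable is never invoked for that run.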
scholzj commented 3 years ago

You can upload it, for example, to pastebin or another service. But at least share as much as possible. 10 lines are not enough, especially when they all contain the same message ;-). (FYI: you can read in the docs what the exact message means: https://strimzi.io/docs/operators/latest/full/using.html#what_do_the_failed_to_acquire_lock_warnings_in_the_log_mean)

scholzj commented 3 years ago

Well, my-cluster is supposed to be in the openshift-marketplace namespace, not in the openshift-operators namespace.

AllenZMC commented 3 years ago

Thanks! I restarted the operator pod, which solved the problem.
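
For anyone else hitting this, the restart can be done for example like this (the Deployment name and pod label are assumptions for a default install):

oc rollout restart deployment/strimzi-cluster-operator -n openshift-operators

# or delete the operator pod and let the Deployment recreate it
oc delete pod -n openshift-operators -l name=strimzi-cluster-operator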

AllenZMC commented 3 years ago

Is there a plan to fix the problem of lock acquisition failure? @scholzj

scholzj commented 3 years ago

There is no problem with the lock acquisition. It is expected as described in the FAQ. If you think your case is different from what is described in the FAQ and it never gets released without deleting the pod, we can investigate it. But we need the full log (at DEBUG level) for it.
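
How to raise the log level depends on the Strimzi version: as an assumption, older releases expose a STRIMZI_LOG_LEVEL environment variable on the operator Deployment, while newer ones configure logging through a log4j2 ConfigMap, so check the docs for your version. With the environment variable it would look roughly like:

# Assumes the Deployment still honours STRIMZI_LOG_LEVEL (older releases)
oc set env deployment/strimzi-cluster-operator -n openshift-operators STRIMZI_LOG_LEVEL=DEBUG

# then collect the full log once the problem reoccurs
oc logs -f deployment/strimzi-cluster-operator -n openshift-operators > cluster-operator-debug.log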

AllenZMC commented 3 years ago

I will provide a complete log next time I encounter this situation.