strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: KafkaConnect cluster looking for previously configured configMap setting for configuration setting #8502

Open jslusher opened 1 year ago

jslusher commented 1 year ago

Bug Description

I'm using a configMap to supply some configuration for a Debezium connector. At first I called it inventory-table-includes, and then I renamed it to connector-config in case I wanted to supply other configuration settings. This change was made a few days ago, and I thought it was working fine. Today I restarted the connect cluster, and now the connector has failed because it's trying to get configuration from the previous configMap.

Steps to reproduce

  1. Create a Kafka connector using a configMap for configuration
  2. Rename the configMap and point the connector configuration at the new name
  3. restart the connector?

Expected behavior

The connector should not be looking for or know anything about the previously configured configMap.

Strimzi version

0.34.0

Kubernetes version

1.24

Installation method

kustomize

Infrastructure

EKS

Configuration files and logs

This is the current state of the connector. You can see here that it's configured to use connector-config: https://gist.github.com/jslusher/cc3ae98c88e8e86af8eaa462b4e67c9f#file-kafka-connect-yaml-L54

and then it's complaining about the configMap inventory-table-includes here: https://gist.github.com/jslusher/cc3ae98c88e8e86af8eaa462b4e67c9f#file-kafka-connect-yaml-L78

In both cases, the configuration within the configMap is inventory-table-includes.txt

I can fix this by supplying the previous configMap and adding permissions to it in the role, but obviously it shouldn't be looking for it in the first place. Even after adding the previous configMap and giving the connector permissions, it seems to stop logging errors, but the connector itself seems to be stuck in a RESTARTING state.

https://gist.github.com/jslusher/ce52b37ffada286835fe9036db676ffb

Additional context

This started after I upgraded a couple of plugins but I reverted the upgrade and it's still having trouble. I suspect the builder might have something to do with it?

scholzj commented 1 year ago

I'm not sure this is a Strimzi bug. The Config Provider runs inside Kafka Connect and not inside the Strimzi operator, so it is up to Kafka how and when it evaluates the configuration. When I update the connector to use a different ConfigMap, I can see this in the Connect logs (in this case, both ConfigMaps existed and the RBAC was set up for both of them):

2023-05-11 00:05:42,540 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [pool-16-thread-3]
2023-05-11 00:05:42,546 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [StartAndStopExecutor-connect-1-3]
2023-05-11 00:05:42,549 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [DistributedHerder-connect-1-1]
2023-05-11 00:05:42,551 INFO Retrieving configuration from ConfigMap connector-configuration in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [DistributedHerder-connect-1-1]
2023-05-11 00:05:42,569 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [StartAndStopExecutor-connect-1-5]
2023-05-11 00:05:42,569 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [StartAndStopExecutor-connect-1-2]
2023-05-11 00:05:42,569 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [StartAndStopExecutor-connect-1-7]
2023-05-11 00:05:42,571 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [StartAndStopExecutor-connect-1-7]
2023-05-11 00:05:42,571 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [StartAndStopExecutor-connect-1-5]
2023-05-11 00:05:42,571 INFO Retrieving configuration from ConfigMap connector-configuration2 in namespace myproject (io.strimzi.kafka.AbstractKubernetesConfigProvider) [StartAndStopExecutor-connect-1-2]

In this case, the old ConfigMap was connector-configuration and the new one was connector-configuration2. You can see that for some reason it evaluated the old one as well. I'm not sure why Kafka does this - maybe it needs to diff the actual values? But I don't think Strimzi can do anything about this, and you would get the same result if you made the same change through the REST API directly.
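
To see at a glance which ConfigMaps are being read and from which Connect thread, log lines like the ones quoted above can be tallied. The following is a minimal Python sketch, not part of Strimzi or Kafka; the function name `tally_configmap_reads` and the regular expression are illustrative assumptions derived only from the log format shown in this thread.

```python
import re
from collections import Counter

# Assumption: the provider logs reads in the format quoted in this thread:
# "... Retrieving configuration from ConfigMap <name> in namespace <ns> (<logger>) [<thread>]"
LINE_RE = re.compile(
    r"Retrieving configuration from ConfigMap (?P<name>\S+) "
    r"in namespace (?P<namespace>\S+) .*\[(?P<thread>[^\]]+)\]\s*$"
)


def tally_configmap_reads(log_lines):
    """Count ConfigMap reads per (namespace, ConfigMap name, thread)."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:  # lines that are not provider reads are skipped
            counts[(match["namespace"], match["name"], match["thread"])] += 1
    return counts
```

Feeding it the Connect pod logs (with or without a pod-name prefix on each line) gives one counter entry per ConfigMap and thread, which makes a lingering read of the old ConfigMap easy to spot.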


Another situation in which the configuration is evaluated is when the tasks are restarted during a rolling update. So what could also have happened (and you touch on it in the additional context) is this:

  1. You updated the Connect configuration (e.g. changed the container image to one with new connector plugins), which causes a rolling update.
  2. At the same time, you updated the Connector configuration to use the new ConfigMap, updated the RBAC, and deleted the old ConfigMap.

But these changes are applied in the following order:

  a. First, the rolling update of the Connect cluster is done.
  b. Only then are the connectors reconfigured.

That means the connector is first restarted as part of the rolling update with the old configuration, which fails because you have already taken away the RBAC rights and deleted the old ConfigMap. And you will get a similar error.


I do not think this has an easy solution at the Strimzi level. I think the only way to deal with this is to work around it in this way:


If you think there is a better solution, please share it. But I don't think there is much more we can do.

jslusher commented 1 year ago

I would have sworn that the connector and connect cluster pods got restarted when I had applied the configMap change in the first place, but I can't say for sure, so I'll operate on your assumption that they need to be restarted after the config changes are set. That should mean that once everything is set, it should start using the new configMap and ignore the previous one. If it's a race condition, that's easy enough to consider and prevent.

I can't seem to get it to stop looking for the previous configMap in my situation. I've restarted the connector and the connect pods several times, but it continues to look for both the new configMap and the old one. I even deleted and recreated the connector and it's still looking for that previous configMap. In this scenario, I have to maintain the old configMap and permissions to it so that the connector will start properly, even if it's using values from the new one.

ksqldb-test-connect-7476fd6577-lt7rs ksqldb-test-connect 2023-05-11 16:33:39,474 INFO Retrieving configuration from ConfigMap connector-config in namespace ksqldb-test (io.strimzi.kafka.AbstractKubernetesConfigProvider) [DistributedHerder-connect-1-1]
ksqldb-test-connect-7476fd6577-lt7rs ksqldb-test-connect 2023-05-11 16:33:39,488 INFO Retrieving configuration from ConfigMap inventory-table-includes in namespace ksqldb-test (io.strimzi.kafka.AbstractKubernetesConfigProvider) [DistributedHerder-connect-1-1]

scholzj commented 1 year ago

TBH, I do not really know at what point it is failing in your cluster, what state it is in, or what exactly happened before, so it is hard for me to comment. The Connector custom resources are the source of truth for the operator but not for Kafka. So the question is what the actual configuration and state in Kafka are - was it really modified? Was it really deleted and recreated? If you broke it through the missing RBAC / ConfigMap first, it might be in a broken state, and the operator might not be able to continue until that is fixed.

jslusher commented 1 year ago

If I recall correctly, I verified using the REST API that the connector was indeed deleted when I deleted and recreated it.

Using the REST API to query the connector it shows the proper configuration:

kx ksqldb-test-connect-7476fd6577-7l9wq -- curl http://localhost:8083/connectors/debezium.inventory | jq | grep configmap
    "table.include.list": "${configmaps:ksqldb-test/connector-config:discogs-table-includes.txt}",
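
Rather than grepping, the same REST response can be checked for every ConfigMap placeholder it contains. The following is a minimal Python sketch under stated assumptions: it expects the JSON body returned by `GET /connectors/<name>` (an object with a `config` map), and the helper name `configmap_refs` and its regular expression are hypothetical, modelled on the `${configmaps:namespace/name:key}` placeholder shown above.

```python
import json
import re

# Assumption: placeholders follow the form seen in this thread,
# ${configmaps:<namespace>/<configmap name>:<key>}
REF_RE = re.compile(
    r"\$\{configmaps:(?P<namespace>[^/}:]+)/(?P<name>[^:}]+):(?P<key>[^}]+)\}"
)


def configmap_refs(connector_json):
    """Map each config option to the (namespace, name, key) placeholders it uses."""
    config = json.loads(connector_json).get("config", {})
    refs = {}
    for option, value in config.items():
        found = [
            (m["namespace"], m["name"], m["key"])
            for m in REF_RE.finditer(str(value))
        ]
        if found:  # only report options that actually use a placeholder
            refs[option] = found
    return refs
```

This only shows what the REST API reports for the connector; it does not reveal what Connect has stored for the connector's tasks.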

The operator doesn't seem to be complaining about it, unless these repeated reconciliations are an indication something is amiss:

2023-05-11 21:39:23 INFO  AbstractOperator:239 - Reconciliation #14245(timer) KafkaConnect(ksqldb-test/ksqldb-test): KafkaConnect ksqldb-test will be checked for creation or modification
2023-05-11 21:39:23 INFO  AbstractConnectOperator:502 - Reconciliation #14245(timer) KafkaConnect(ksqldb-test/ksqldb-test): creating/updating connector: debezium.inventory
2023-05-11 21:39:23 INFO  AbstractOperator:510 - Reconciliation #14245(timer) KafkaConnect(ksqldb-test/ksqldb-test): reconciled
2023-05-11 21:41:23 INFO  AbstractOperator:239 - Reconciliation #14248(timer) KafkaConnect(ksqldb-test/ksqldb-test): KafkaConnect ksqldb-test will be checked for creation or modification
2023-05-11 21:41:23 INFO  AbstractConnectOperator:502 - Reconciliation #14248(timer) KafkaConnect(ksqldb-test/ksqldb-test): creating/updating connector: debezium.inventory
2023-05-11 21:41:23 INFO  AbstractOperator:510 - Reconciliation #14248(timer) KafkaConnect(ksqldb-test/ksqldb-test): reconciled

scholzj commented 1 year ago

Then I'm afraid I'm not sure why the old ConfigMap would still be queried by Kafka :-/.

jslusher commented 1 year ago

So it seems that Strimzi is doing its thing, but something about the configMap configuration plugin seems to be amiss. Is there a more appropriate place to report this bug?

scholzj commented 1 year ago

But I think we first need to understand what the bug is. Kafka Connect calls the plugin with the given options. The config provider does not do anything on its own without being called by Connect.

jslusher commented 1 year ago

Other than using the connect REST API to get the settings for the config provider, is there any other place I should look that might still have the old settings referenced? If the KafkaConnector Strimzi resource and the connector itself seem to have the proper value, could it be in a topic somewhere? I looked in the connect.inventory.configs Debezium topic and there are events in there referencing configuration, but they all show the new configMap setting there. I'm not sure where else to look.

scholzj commented 1 year ago

The REST API is the interface -> that is what you use to access it (and what the operator uses as well). The actual configs will be stored in the Connect topics. Those are the topics configured in the Connect custom resource:

    offset.storage.topic: connect-cluster-offsets
    config.storage.topic: connect-cluster-configs
    status.storage.topic: connect-cluster-status

But they should be used by Connect itself only and nothing else.
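
When inspecting the config storage topic by hand, it can help to look at the latest value for every key rather than only the connector record. The following is a minimal Python sketch under stated assumptions: the records have already been read from the topic as `(key, value)` string pairs in offset order (for example from a consumer that prints keys), and the topic is treated as compacted, so the last value per key wins. The function name `records_referencing` is hypothetical, and the record keys used with it below are illustrative, not confirmed by this thread.

```python
def records_referencing(records, namespace, configmap):
    """Return the keys whose latest value still mentions the given ConfigMap.

    records: iterable of (key, value) pairs in offset order; value may be
    None for a tombstone (deleted key).
    """
    needle = "${configmaps:" + namespace + "/" + configmap + ":"
    latest = {}
    for key, value in records:
        latest[key] = value  # a later record replaces an earlier one
    return sorted(
        key
        for key, value in latest.items()
        if value is not None and needle in value
    )
```

For example, `records_referencing(records, "ksqldb-test", "inventory-table-includes")` would list any key, whatever kind of record it is, whose current value still points at the old ConfigMap. An empty result would suggest the stale reference lives somewhere other than this topic.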

If you run multiple Connect clusters, you should read this: https://strimzi.io/docs/operators/latest/full/deploying.html#con-kafka-connect-multiple-instances-str => that causes all kinds of weird issues if not configured correctly. But if that were the issue, I would normally expect it to fail much earlier than the point where it fails for you.

jslusher commented 1 year ago

We've run multiple connect clusters in the past, and plan on adding more connect clusters to this Kafka cluster, but there's just the one at the moment. We've renamed our storage topics in the format I listed above, name-spacing them, so connect-cluster-configs up there is connect.inventory.configs for this connect cluster. I suspect it has something to do with one of these connect cluster topics, but like I was saying in my last reply, I looked in connect.inventory.configs and the connector config I found in there all refers to the new config provider configuration and there is no mention of the previous configuration. As a sanity check, I just tried restarting the connector again and it's still looking for the previous configuration. Very strange.

tombentley commented 1 year ago

Triaged on community call 18/5/2023: It would be good to see if we can reproduce this and figure out exactly where the problem lies.