strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Enhancement]: Configurable supported kafka versions #10494

Open MichaelMorrisEst opened 2 weeks ago

MichaelMorrisEst commented 2 weeks ago

Related problem

The Kafka versions supported by a particular version of Strimzi are controlled by kafka-versions.yaml. During an upgrade of Strimzi and/or Kafka, a check is performed that the version of Kafka to be used is supported by the Strimzi version.

In production environments there can be limitations on performing Kafka upgrades, which in turn limits the ability to upgrade Strimzi (as newer versions of Strimzi may no longer support the Kafka version in use). There are good reasons why Strimzi limits the Kafka versions it supports (as detailed here: https://stackoverflow.com/questions/73898820/strimzi-kafka-operator-have-supported-kafka-versions/73906333#73906333), however it would be very useful for users to be able to override the default supported versions where they are prepared to accept the burden of ensuring their desired Strimzi/Kafka version combination does not have any compatibility issues.

Suggested solution

A possible way to implement this could be to read the supported kafka versions from a config map.
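A minimal sketch of what such a config map might look like, purely for illustration; the ConfigMap name, key, and fields below are hypothetical and not an existing Strimzi API:

```yaml
# Hypothetical ConfigMap the cluster operator could read additional supported
# versions from; the name, key and fields are illustrative only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: strimzi-supported-kafka-versions
  namespace: myproject
data:
  kafka-versions.yaml: |
    # versions the user accepts responsibility for, on top of the versions
    # built into the operator
    - version: 3.6.2
      supported: true
```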

I would like to contribute a PR for this enhancement if the community is open to the idea.

Alternatives

No response

Additional context

No response

scholzj commented 2 weeks ago

The support for a particular Kafka version includes many different things:

So it is not some simple validation that you can disable and everything works fine. The validation just gives you a nice error message up front about all the missing things needed to support a Kafka version.

So you have only three options:

I can understand the demand for supporting older Kafka versions. But the reality is that right now, the Consumer and Producer APIs in Kafka are very stable, and getting regular applications based on them to work with new Kafka versions is much easier (usually they work out of the box). But the Admin API, KRaft features, etc. are not. So it is much easier to upgrade your Kafka version than to support too many Kafka versions in something like Strimzi.

Maybe that changes after Kafka 4.0 when KRaft is finished, who knows. If it does, I'm sure we can reevaluate the current policy.

MichaelMorrisEst commented 2 weeks ago

We fully understand that it is not just a matter of removing the check, and that there is effort required to establish whether issues will arise for particular combinations of versions.

Allowing the versions to be set by configuration will not mean every combination can be used (as there will be compatibility issues), but it will allow users to add extra versions on a case-by-case basis where they have done due diligence to ensure the combination will work for their needs. Where, for example, the Strimzi code has changed in a way that no longer works with a particular older version of Kafka, this proposed solution will not be of any benefit (which is fine), but it would be very beneficial where that is not the case. It also does not lessen the default level of protection from incompatibilities, as there is no proposal to add extra versions to the default supported list.

There can also be scenarios where a version combination will not work for certain configurations or features, and is therefore not suitable to support generally, but works perfectly fine for other configurations (for example where ZooKeeper is used instead of KRaft). Allowing the versions to be set by configuration gives the ability to use combinations that are fine for your usage but cannot be supported generally.

On the three options you mention:

  1. We are already doing this to overcome the issue, but we would like to move away from this approach to one that is supported in the source code.

  2. I agree, but I think it is also worth noting that not all features and configurations are necessarily of interest to everyone. Where only a subset is used, this can increase the number of versions that are possible to support without code changes (for example if ZooKeeper is used instead of KRaft).

  3. Yes, I agree, but there is a compromise between the current policy and "just support more versions", which is to support specific versions on a case-by-case basis where that combination is of interest. This could be done by submitting a PR to add a version to kafka-versions.yaml, but (even if the community were open to relaxing the current policy on a case-by-case basis) that imposes the burden/risk on the maintainers/community. Allowing it to be done through configuration places that burden/risk on those who want to use that combination.

I think allowing this proposed change provides flexibility to those who want it without imposing any extra burden/risk on the community or complexity in the code. There will undoubtedly be times when this is less useful than at others (for example when APIs are in flux), but I think it would be great to at least have the option available so it can be used when suitable. Documentation can stress the importance of making sure you know what you are doing if you attempt to use a version not supported by the community.

While I appreciate your point that it can be much easier to upgrade the Kafka version than to support more versions, there can, in certain settings, be obligations which restrict the ability to do so.

scholzj commented 2 weeks ago

I'm sorry, but this is not how it works. As I tried to explain, the supported versions have several prerequisites such as

So as I said, it is not that you skip the validation and "hope it works". We know it does not work when you skip the validation because it will be missing all the things that are part of the build process.

Or to put it differently, there is already the configuration you are referring to:

And then you hope that it will work because there were no changes in the code, in the APIs etc.


I think allowing this proposed change provides flexibility to those who want it without imposing any extra burden/risk on the community or complexity in the code. There will undoubtedly be times when this is less useful than at others (for example when APIs are in flux), but I think it would be great to at least have the option available so it can be used when suitable. Documentation can stress the importance of making sure you know what you are doing if you attempt to use a version not supported by the community.

But the problem is that the proposed change does not work. The kafka-versions.yaml file is not there just to generate validation rules at build time about which Kafka versions are supported. Its main purpose is to drive the build to generate all the required pieces that enable support for a particular Kafka version. So you cannot simply move it to a ConfigMap and configure it at runtime.
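For context, an entry in kafka-versions.yaml carries build metadata rather than just a supported flag. The sketch below is only an approximation of its shape (field names and values are illustrative, not copied from the actual file):

```yaml
# Approximate shape of a kafka-versions.yaml entry; the build uses this
# metadata to produce the Kafka images, config models and third-party libs.
- version: 3.7.0
  format: 3.7               # log message format level
  protocol: 3.7             # inter-broker protocol version
  metadata: 3.7-IV4         # KRaft metadata version
  zookeeper: 3.8.4          # bundled ZooKeeper version
  third-party-libs: 3.7.x
  checksum: "<sha512 of the Kafka binary release>"
  supported: true
  default: false
```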

MichaelMorrisEst commented 1 week ago

During the build, kafka-versions.yaml dictates which versions of the Kafka image are built and also the versions of Kafka for which a config model is generated. This need not change under what I am proposing: the Kafka images and config models would still be generated for the supported versions only.

At runtime, I see four issues that block an unsupported Kafka version from being used:

  1. Incompatibilities between the Strimzi version and the Kafka version. What I am proposing is irrelevant in this case, and it will remain impossible to use such a combination.

  2. The check in KafkaVersion.java, which verifies that the Kafka version is known and supported based on the kafka-versions.yaml in the cluster operator jar. This can be overcome by using a config map to override the kafka-versions.yaml in the jar (in a similar way to what is already done for the log4j properties in /opt/kafka/custom/config).

  3. The Kafka images need to be available. While the Strimzi build for any version only builds the Kafka versions it supports, older Kafka versions are still available from previous version builds. The main use case for this proposal is to upgrade Strimzi while leaving Kafka on a version not supported by the new Strimzi version, in which case the image is already deployed.

  4. Validation of the Kafka config. Config models generated during the build are used to validate the values the user specifies for the Kafka config params and also to determine when dynamic config updates are possible. The config model will be missing for any unsupported versions. I can see a few options here:

    • Allow additional config model files to be provided for the extra versions to be supported. This could also be done via a config map, using a config model generated in a previous version's build. This would result in the same level of functionality as today.
    • Don't validate or perform dynamic reconfiguration where no config model is found for the version in use, and issue a warning.
    • Use the closest version of the config model where an exact match cannot be found, and issue a warning.

As a test of the concept I did the following:

  1. Deploy Strimzi 0.42.0
  2. Deploy Kafka 3.6.2
  3. Upgrade Strimzi to 0.43.0, but with the following changes to allow the unsupported Kafka 3.6.2 to remain in use (a rough sketch of these changes follows the list):
    • Add kafka-versions.yaml (same as the 0.43.0 version except with 3.6.2 also marked as supported) and kafka-3.6.2-config-model.json to the existing strimzi-cluster-operator ConfigMap (050-ConfigMap-strimzi-cluster-operator.yaml)
    • In KafkaVersion.java, read the kafka-versions.yaml from /opt/strimzi/custom_config. If not present, default to reading from the classpath, as today
    • In KafkaConfiguration.java, read the config model from /opt/strimzi/custom_config. If not present, default to reading from the classpath, as today
    • Add the 3.6.2 Kafka image to STRIMZI_KAFKA_IMAGES etc. in the Strimzi deployment (060-Deployment-strimzi-cluster-operator.yaml)
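A rough sketch of those changes, under the assumptions above (file contents abbreviated; the pre-existing 3.7.0/3.8.0 image lines are illustrative, and the extra kafka-versions.yaml/config-model keys are part of the proposed, not the current, behaviour):

```yaml
# Proof-of-concept sketch only; not an officially supported configuration.
---
# 050-ConfigMap-strimzi-cluster-operator.yaml (fragment): extra keys next to
# the existing log4j2.properties, mounted where the patched operator code
# looks for them (/opt/strimzi/custom_config in this PoC)
apiVersion: v1
kind: ConfigMap
metadata:
  name: strimzi-cluster-operator
data:
  kafka-versions.yaml: |
    # copy of the 0.43.0 file, with 3.6.2 additionally marked as supported
  kafka-3.6.2-config-model.json: |
    # config model taken from the 0.42.0 build
---
# 060-Deployment-strimzi-cluster-operator.yaml (fragment): extra
# version-to-image mapping pointing at the image from the previous release
env:
  - name: STRIMZI_KAFKA_IMAGES
    value: |
      3.7.0=quay.io/strimzi/kafka:0.43.0-kafka-3.7.0
      3.8.0=quay.io/strimzi/kafka:0.43.0-kafka-3.8.0
      3.6.2=quay.io/strimzi/kafka:0.42.0-kafka-3.6.2
```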

No errors were encountered. This is clearly far from sufficient to prove that Strimzi 0.43.0 can work successfully with Kafka 3.6.2, but I think it does show that it is possible to enable support for additional Kafka versions through configuration where no incompatibilities exist in the version combination. Extensive testing and careful reading of release notes/changelogs etc. would be necessary for anyone who wants to go down this road.

scholzj commented 1 week ago

Discussed on community call on 5.9.2024: @MichaelMorrisEst will prepare a proposal to detail the changes and discuss them.

ppatierno commented 1 week ago

I am 100% supportive of what Jakub said. While I can understand that companies can have restrictions on upgrading Kafka, at the same time those same companies should understand the benefits of upgrading Kafka and do so: it is about fixed CVEs, bug fixes, and new features (even if you might not be interested in them). Allowing users to run a very old Kafka with a shiny new Strimzi operator can lead to a lot of issues, which I guess were already discussed here.

We would like to maintain the good stability that Strimzi has shown for years now, without falling into a situation where a user can blame us because they are using Kafka 3.4.0 with Strimzi 0.43.0 and it doesn't work. From my pov, if you support "something", you need to support it well, and not just say "they are prepared to accept the burden of ensuring their desired Strimzi/Kafka version combination does not have any compatibility issues" ... because I am pretty sure that is not going to happen in reality.

The only viable option I see is the current one: rebuilding the project yourself and tweaking the kafka-versions.yaml. In that case, you are really accepting that things might not work, because you are doing something not officially supported by the project. Supporting something like this officially would mean too much code complexity for backward compatibility, as well as a bigger test matrix (if you support it, as I said, you need to support it well ... so testing every possible combination). So from my perspective, we should not go this way.

im-konge commented 1 week ago

I was thinking about this and I'm with Jakub and Paolo on this. As was already mentioned in their comments, this could be useful for many users, but I'm afraid of the maintainability and testability. With this, you are saying that users can pass, for example, Kafka 2.8.0 into the "supported matrix", which doesn't have features that are available in Strimzi 0.43 (and later). Also, how will it work with the move to KRaft and the ZK removal? What if support for ZK is removed from Strimzi and you add a Kafka version that still supports ZK?

I guess that for a few Strimzi versions this can work (in some way). Maybe it would work for later versions of Strimzi, when there will be just KRaft. But if you don't limit the versions, it can be really problematic to make Strimzi work with certain combinations of Kafka versions and Strimzi code.

And how will it be tested? Because of the resources we have in our pipelines, we cannot test various versions of Kafka for a particular Strimzi version. It would mean a really huge testing matrix, which would block us from implementing new features and testing them. As mentioned, if we support something, we need to test it. And in terms of maintainability, I think it would be really complicated and not worth it, because, as I mentioned, such support would mean we would not be able to implement something new.

Additionally, if such a thing were implemented, I can see how many users would run into issues when they add their own Kafka versions to the supported matrix, not to mention the number of messages from users needing help with the issues they encountered while adding support for such Kafka versions.

So FMPOV, I think we should not go this way.

scholzj commented 1 week ago

@MichaelMorrisEst Just to respond to your last comment in case you decide to work on the proposal ...

  3. The Kafka images need to be available. While the Strimzi build for any version only builds the Kafka versions it supports, older Kafka versions are still available from previous version builds. The main use case for this proposal is to upgrade Strimzi while leaving Kafka on a version not supported by the new Strimzi version, in which case the image is already deployed.

While you added an example where this works, it is also easy to find an example where it does not. The Kafka images are tightly coupled with the operator version, and there is currently no intention to keep them compatible across releases. While it might sometimes work, at other times it will not.

Do you as a user really want to rely on this? It is a trap into which one can easily fall. One day you will need to upgrade the operator because of a bug or a CVE, but it will suddenly not work anymore because the images changed. So you will either need to go through a lengthy process to upgrade the Kafka version ASAP, or you will need to rebuild from source to fix it. I have strong doubts about the sustainability of this, as you are either at huge risk, or you need the skills to rebuild Strimzi from source anyway to be safe.

  4. Validation of the Kafka config. Config models generated during the build are used to validate the values the user specifies for the Kafka config params and also to determine when dynamic config updates are possible. The config model will be missing for any unsupported versions. I can see a few options here:
    • Allow additional config model files to be provided for the extra versions to be supported. This could also be done via a config map, using a config model generated in a previous version's build. This would result in the same level of functionality as today.
    • Don't validate or perform dynamic reconfiguration where no config model is found for the version in use, and issue a warning.
    • Use the closest version of the config model where an exact match cannot be found, and issue a warning.

The proposal would surely need to detail how this would be achieved.


Aside from that:


Out of curiosity ... have you tried to simply configure a different Kafka image? E.g. run Strimzi 0.43.0, let the operator think that it is running Kafka 3.7.0, but configure it to use the container image quay.io/strimzi/kafka:0.42.0-kafka-3.6.2 with 3.6.2? I think this is pretty much the same as what you are suggesting (and will work, or not work, equally randomly) without the need to change anything.
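For reference, a trimmed sketch of that idea in a Kafka custom resource: the operator reconciles a supported version while the pods run the image from the previous Strimzi release. This is an unsupported, untested combination, and required fields such as listeners and storage are omitted here:

```yaml
# Trimmed Kafka CR sketch: declared as Kafka 3.7.0 but using the 3.6.2 image
# shipped with Strimzi 0.42.0. Unsupported combination; fields omitted.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    version: 3.7.0
    image: quay.io/strimzi/kafka:0.42.0-kafka-3.6.2
    replicas: 3
    # listeners, storage, config, ... omitted for brevity
```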