Closed tombentley closed 6 years ago
One requirement is that we can abstract across the evolution of the k8s APIs. For example, apps/v1beta2 StatefulSet on Kubernetes 1.8, apps/v1beta1 StatefulSet on 1.7, and PetSet on earlier versions.
Regarding K8S API evolution, and maybe even OpenShift API evolution: it makes more sense if the "on premise" installation scenario is just a "raw" Kafka deployment, or even Kafka on OpenShift (on premise). In the latter case we should support more K8S/OpenShift versions.
The controller can follow the operator concept. It will watch for two kinds of ConfigMap / CustomResourceDefinition resources: one for Zookeeper deployments and one for Kafka deployments. The config maps will be identified by labels:
```
app=barnabas
type=deployment
kind=zookeeper
```

or

```
app=barnabas
type=deployment
kind=kafka
```
Inside the ConfigMap a detailed configuration of the requested deployment will be held. For example:
```
name=kafka
labels= |
  label1=value1
  label2=value2
nodes=5
zookeeper= |
  labelX=valueY
  labelU=valueV
config: |
  default.replication.factor=1
  offsets.topic.replication.factor=3
  transaction.state.log.replication.factor=3
```
When the ConfigMap is created, the controller will create the cluster. The separation of the Zookeeper and Kafka clusters into two ConfigMaps should make it possible to use multitenant Zookeeper clusters in the future. Not deploying the clusters directly also has the advantage that we can generate the resources dynamically. That will allow us to support a much wider range of OpenShift / Kubernetes versions. It will also give us a lot more flexibility in terms of deployment variability compared to the OpenShift template / Kubernetes YAML files. The dynamically generated deployments can be easily extended to include nodeAffinity (e.g. running broker nodes on dedicated hosts with local storage), rack IDs, or exposing Kafka to the outside by automatically generating loadbalancers / routes.
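A minimal sketch of what "generating the resources dynamically" could look like. This is illustrative only: the map-based representation and field choices are assumptions, and a real controller would use a Kubernetes client's typed builders instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: build a StatefulSet-like resource as plain maps from
// values found in the deployment ConfigMap. A real controller would use a
// Kubernetes client's typed builders instead of raw maps.
public class KafkaSpecGenerator {

    public static Map<String, Object> statefulSet(String name, int nodes, String image) {
        Map<String, Object> container = new LinkedHashMap<>();
        container.put("name", name);
        container.put("image", image);

        Map<String, Object> spec = new LinkedHashMap<>();
        spec.put("replicas", nodes);
        spec.put("template", Map.of("spec", Map.of("containers", List.of(container))));

        Map<String, Object> resource = new LinkedHashMap<>();
        resource.put("apiVersion", "apps/v1beta2"); // chosen per the detected k8s version
        resource.put("kind", "StatefulSet");
        resource.put("metadata", Map.of("name", name));
        resource.put("spec", spec);
        return resource;
    }
}
```

Because the resource is assembled in code rather than copied from a static template, the `apiVersion`, nodeAffinity, rack IDs, etc. can be varied per cluster version and per ConfigMap.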
Changing the configuration in the ConfigMap should also reconfigure the cluster. That might be non-trivial, since many changes require a broker restart, so the controller will need to do a rolling update of the cluster. Zookeeper would need a rolling update even in case of scale up/down (this seems to be a non-critical feature, so we might ignore it). Similarly, we might need to rearrange the topic distribution in Kafka when the cluster is scaled down or up.
When the ConfigMap is deleted, the operator will delete the cluster. This should be a fairly simple operation. But if we decide to split the Zookeeper and Kafka deployments because of the multitenancy, we should think about some protection against deleting a Zookeeper cluster used by a running Kafka cluster (this might be tricky - admission controllers can probably do this, but I'm not sure about OpenShift support).
The operator concept can also be used for managing topics. A ConfigMap / CustomResourceDefinition resource might be created in OpenShift / Kubernetes for each topic. It needs to be thought through how to synchronize the topics between Kafka and the operator:
The topics can be identified with labels such as:
```
app=barnabas
type=runtime
kind=topic
```
Inside the ConfigMap the topic details can be configured:
```
name=mytopic
kafka= |
  label1=value1
  label2=value2
partitions=5
replication.factor=3
config: |
  cleanup.policy=compact
  retention.ms=3600000
```
Many tools such as Kafka Streams or Kafka Connect tend to create their own topics. Do we want to force the users to first create the ConfigMap before starting Kafka Connect? Here the automated reconciliation between the operator / ConfigMaps and Kafka topics seems to be the best solution.
There might also be some niche situations which cannot be implemented easily (such as "renaming a ConfigMap will delete the topic and create a new one", etc.)
The operator model should work fine when your Kafka ops are also your OpenShift / Kubernetes ops. This is because all you need to manage the clusters and topics is the appropriate access to OpenShift / Kubernetes. It also has big advantages since you can easily integrate deployment of the Kafka cluster / Kafka topics with the deployment of your application into OpenShift / Kubernetes.
But this might not suit everyone. Some people might prefer to have a more traditional management console with a separate authentication and authorization model than the OpenShift / Kubernetes cluster. It should be possible to work around this by creating a standalone management console with separate AuthN/AuthZ which will be integrated with the operator. Either in the form:
or the other way around
The first option seems to be easier for the reconciliation of configuration between the ConfigMaps and OpenShift / Kubernetes. The second option might be easier for integrating Kafka management with monitoring etc.
The cluster might be deployed in a similar way as today. OpenShift template / Kubernetes YAML files will be prepared which will load the controller(s) as well as ConfigMap(s) for deploying Zookeeper / Kafka. So apart from more processes running in the cluster, this should not be a major change for current users. Another template with only the controller might be prepared for more advanced users who want to configure the ConfigMap(s) on their own.
(This seems to be risky as we might not delete some topics in case the controller is not running when the topic is deleted)
I don't understand this sentence.
Many tools such as Kafka Streams or Kafka Connect tend to create their own topics. Do we want to force the users to first create the ConfigMap before starting Kafka Connect?
We really shouldn't require users to create such topics. There can be many such topics and it's pointless work. A "single button import" facility to create them would be better, but I still prefer going the whole hog and just automatically creating ConfigMaps for topics created outside of the controller.
@tombentley Basically, imagine the following scenario:
If the controller does reconciliation, it will at least recreate the ConfigMap again and make this more visible to the user.
So in general - also with the Kafka Streams / Connect topics in mind - I agree with your point that to make this work, the full reconciliation and creation of ConfigMaps by the controller seems like the best way forward.
Many tools such as Kafka Streams or Kafka Connect tend to create their own topics. Do we want to force the users to first create the ConfigMap before starting Kafka Connect?
There are some topics that the user doesn't know about. For example the "state store" topics for stateful transformations (to re-build a RocksDB database) and even the "repartitioning" topics (i.e. when group by key is used) in Kafka Streams.
The same for the `__consumer_offsets` topic as well.
I think it is best to leave the `__` topics out of the management. They have specific names, so that should be easy.
@scholzj thanks, I understand the scenario now. This makes me think about the controller being HA.
I think you would need some active-passive locking mechanism (using Zookeeper to deploy Zookeeper??? :-o) so that you don't have two or more instances trying to do the same thing.
But if we implement it properly and really also create the ConfigMaps for the topics created directly in Kafka we should not need it.
Edit: trying to make my ramblings a bit clearer.
@tombentley @scholzj Some comments on the requirements. The notion of a topic controller is very similar to the concept of the address controller in enmasse. Basically, you can think of enmasse configuration involving 3 logical components (at least this is the goal):
API server - Can be used to create/delete address spaces and addresses within an address space, either through service broker or config map/rest api (want to use CRD here)
address space controller - instantiates infrastructure to handle an address space. Of type 'standard' (dispatch router ++) or 'brokered' (just a single Artemis broker). This is 'global' in a multitenant environment.
address controller - specific to each address space type (i.e. one for 'standard', one for 'brokered' etc), and lives within each address space as part of the infrastructure. It configures the rest of the infrastructure to deal with addresses the user has configured.
When reading the description of the topic-controller, it performs the same functions as an address controller. Basically, the address config in enmasse would map to the topic configmap. If this topic-controller understands the address config format, it can be plugged into enmasse as I describe in the kafka address space proposal[0]. I'm also interested in your thoughts on exposing management and console as an independent thing of barnabas and/or as part of EnMasse. Deploying enmasse with a 'kafka' address space would be very similar to what you are trying to achieve, at the cost of using a more generalized address model. The benefit would be that the user can relate to the same address model irrespective of which messaging technology is used, and it has the potential of reusing lots of the console. You'd also get a somewhat multitenant model of barnabas.
Regarding having 1 or more controllers... Having HA of the controller itself makes the system way more complex, as you need to establish consensus among them for deciding what to do. Besides, I would consider a controller crash a transient error, as it should be restarted without a big delay in the operations it will perform (as I assume you don't have any strict timing requirements for it).
Thanks @lulf
If this topic-controller understands the address config format, it can be plugged into enmasse as I describe in the below document.
Is there any documentation/description of that format? Is the enmasse address controller using a CRD, or plain configmaps?
For Kafka we have the problem that clients (such as Kafka Streams) can create topics outside of the control of any operator, and it seems likely that we're going to need to reflect these as k8s resources (ConfigMap or CRD), so there's a need for bidirectional reconciliation. Is this a problem the address controller has had to solve?
How does the enmasse address controller handle errors, for example an invalid k8s resource which it cannot process? I was thinking perhaps we should create k8s events in such cases.
@tombentley Just updated my previous comment :rofl: with more details.
See http://enmasse.io/documentation/master/tenant/#creating_addresses for the format of addresses. Though somewhat simplified in this example, it shows the different types we support for the 'standard' address space. The proposal for kafka would be to have a 'kafka' type, along with a 'properties' entry in the spec for describing semantics.
The problem of clients creating topics outside of control is one we also have in the standard address space, where the mqtt-gateway creates 'custom' addresses that are not configured explicitly. This leads to some issues currently where we cannot instruct the router to reject unknown addresses. I think we could want to do something similar there, though it needs some more thinking.
Creating k8s events is probably the idiomatic approach and the one we should pursue as well. Currently we log an error decoding and move on.
@lulf With Kafka, ideally we want to provide multiple tiers.
I think that if we do this right it should be possible to achieve both. I will have to look more closely how the controllers in EnMasse work.
The addresses look a lot like CRDs, but if I understood it correctly the restapi is not the OpenShift API. It is your controller API. How do you store / represent these on the OpenShift / Kubernetes level?
@scholzj I think those tiers make sense, and agreed that running just Kafka on OpenShift is an important use case. Do you see the topic-controller as something that would be used in a 'standalone' kafka deployment as well? I'd like to understand more the scope of that tier and what 'just Kafka' means :)
The restapi is not the OpenShift API, but I would hope for it to evolve into a proper kubernetes API server using something like https://kubernetes.io/docs/concepts/api-extension/apiserver-aggregation/ . The downside of using pure CRDs is that there is no schema validation if I've understood it correctly.
Right now the address 'CRD' is stored as a JSON object inside a configmap (which makes editing the configmaps a bit hard).
@lulf I haven't looked into the details, but CRD validation should be a new feature in Kubernetes 1.8 via OpenAPI schema ... so over time it will also get into OpenShift etc.
We discussed more or less three tiers:

1) Just Kafka on OpenShift: This should include only the Kafka deployment itself (which will probably still need the cluster controller to deploy and configure the cluster properly), but not the topic operator or management console. The user will manage Kafka using the default Kafka tooling.
2) Kafka on OpenShift with integration: This should include the above plus the topic controller and management console, to provide better and more native OpenShift integration. So it is not just Kafka anymore.
3) Kafka in EnMasse
The question is whether we can manage to make 2) and 3) the same.
Right now the address 'CRD' is stored as a JSON object inside a configmap (which makes editing the configmaps a bit hard).
Does the EnMasse controller expect the user to create all the addresses only through the REST API? Or does it also allow the users to create the ConfigMaps directly? I think it is quite a nice feature when you can deploy the topic / queue as part of your application by simply creating the ConfigMap (this idea doesn't come from my head, but I really like it :-)).
@scholzj Thank you for clarifying!
@lulf I haven't looked into the details, but CRD validation should be a new feature in Kubernetes 1.8 via OpenAPI schema ... so over time it will also get into OpenShift etc.
This has been added since I looked last time, that is great news! This means that we could rely on CRD instead of config maps eventually. The downside AFAIK is that you need cluster-admin role to be able to create the CRD definition.
The question is whether we can manage to make 2) and 3) the same.
I think this needs a wider discussion, but personally I think it would be good if we discussed the pros and cons of making them the same. Maybe the concepts of address spaces and addresses are a bit abstract for some use cases? Another question is what the next step for 2) will be. Multitenancy, integration with authentication services, console? Then 2) starts to look like the same thing as 3), but using the more specific 'topic' instead of 'address' as the model.
I don't mean to say that barnabas shouldn't be multitenant or provide authentication services, but that it would be nice if whatever is part of 2) can be reused and fits into 3). As an example, if two enmasse address spaces A and B have type 'kafka' they might both run on the same barnabas cluster, but the console would show them as different address spaces potentially belonging to different tenants.
Yet another question is how many OpenShift concepts you want to expose in the service. For instance, in enmasse we try to hide the concept of namespaces even though they are used for isolation of address spaces. An address space may conceptually run in the same namespace as another, or in an isolated one. This makes it harder to integrate with CRDs directly because we can't allow the user to create CRDs directly inside the namespace.
Does the EnMasse controller expect the user to create all the addresses only through the REST API? Or does it also allow the users to create the ConfigMaps directly? I think it is quite a nice feature when you can deploy the topic / queue as part of your application by simply creating the ConfigMap (this idea doesn't come from my head, but I really like it :-)).
Yes, the REST API is currently the way we use in our documentation. It is, however, currently possible to create addresses using a configmap. The format of the configmap would look something like below. The challenge, though, is that we don't want the user to care about which namespace his address space is running in, so we need to do some thinking on how to do this in multitenant mode, or whether we need to roll our own api server, since we might want to authenticate against keycloak instead for that.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: address-config-myqueue
  labels:
    type: address-config
data:
  config.json: |
    {
      "apiVersion": "enmasse.io/v1",
      "kind": "Address",
      "metadata": {
        "name": "myqueue",
        "addressSpace": "myspace"
      },
      "spec": {
        "type": "queue"
      }
    }
```
I've spent some time experimenting with a ConfigMap <-> topic operator, using fabric8, the Kafka AdminClient and ZooKeeper. It:

- Uses a watch on k8s ConfigMaps with appropriate labels, creating/deleting topics when CMs are created/deleted
- Uses a watch on the `/brokers/topics` znode to know about topics created outside of the operator. In this case it creates/deletes ConfigMaps to keep things in sync

While it seems that CM creations while the operator is down result in the watch receiving ADDED events when the operator is brought back up, the same doesn't seem to be true of CM deletions. This means that on start up of the operator we need to reconcile all existing configmaps and topics. When a topic exists and a CM doesn't, it's ambiguous whether the CM was deleted while the operator was offline (so the topic should be deleted) or whether the topic was created while the operator was offline (so a CM should be created). We could extract the `ctime` from the `/brokers/topics/mystery_topic` znode to determine whether the topic was created while we were offline, but that's worthless without knowing the time the operator went down. Doing that naïvely, we end up with a stateful operator, which further complicates things.
CM names have to be a valid DNS-1123 subdomain. Topic names can consist of characters from the character class `[a-zA-Z0-9._-]`. This means some topic names – such as those including `_` – are not legal CM names. So we can't simply use the CM name as the topic name.
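For illustration, a hypothetical sanitizer (not anything the prototype actually contains) that maps a topic name into the DNS-1123 subdomain character set could look like this; note it immediately exhibits the collision problem, since e.g. `my_topic` and `my-topic` map to the same name:

```java
// Hypothetical sanitizer mapping a Kafka topic name to something usable as
// a ConfigMap name (a DNS-1123 subdomain: lowercase alphanumerics, '-' and
// '.', starting and ending with an alphanumeric). Distinct topic names can
// collide after sanitization, so it cannot be a reliable reverse mapping.
public class TopicNames {
    public static String toConfigMapName(String topic) {
        String s = topic.toLowerCase().replaceAll("[^a-z0-9.-]", "-");
        s = s.replaceAll("^[^a-z0-9]+", "");  // trim illegal leading chars
        s = s.replaceAll("[^a-z0-9]+$", "");  // trim illegal trailing chars
        return s;
    }
}
```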
As sketched above, we could configure the name of a topic via the `name=mytopic` entry in the CM data. This opens up further problems, however:

- What do we do with CMs which lack a `name` key in their data?
- What happens if two CMs have the same `name`?

Because we're watching the znode, we can know about new topics before the Controller has finished creating them. But we need to know the topic config and description in order to create the CM. So we have to retry fetching this information if we fetch it too quickly. (This would be avoided if, instead of the znode watch, we polled `AdminClient.listTopics()`, but that would be higher latency, and wasteful of CPU.) But some 'interesting' corner cases exist:
@tombentley
Uses a watch on k8s ConfigMaps with appropriate labels, creating/deleting topics when CMs are created/deleted
Be careful relying on k8s watches only. They are not reliable in the same way as ZK watches are, and should be combined with relistings, even though you can watch from a given resource version. The address controller does a periodic poll of the CMs in addition to watching. See https://github.com/EnMasseProject/enmasse/blob/master/k8s-api/src/main/java/io/enmasse/k8s/api/ResourceController.java
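The watch-plus-relist pattern above could be sketched roughly like this (illustrative names only, no real client API; the point is that both paths funnel into the same reconcile step, so a dropped watch event is repaired on the next relist):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative pattern: since k8s watches can drop events, trigger the same
// reconcile logic from both watch callbacks and a periodic full relist.
public class ResyncLoop {
    private final AtomicLong reconciles = new AtomicLong();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void onWatchEvent() {             // invoked by the watch callback
        reconcile();
    }

    public void start(long periodSeconds) {  // the periodic safety net
        timer.scheduleAtFixedRate(this::reconcile, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    void reconcile() {
        // list ConfigMaps, list topics, diff, and act; counted here for illustration
        reconciles.incrementAndGet();
    }

    public long reconcileCount() { return reconciles.get(); }

    public void stop() { timer.shutdownNow(); }
}
```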
@tombentley
CM names have to be a valid DNS-1123 subdomain. Topic names can consist of characters from the character class `[a-zA-Z0-9._-]`. This means some topic names – such as those including `_` – are not legal CM names. So we can't simply use the CM name as the topic name.
FWIW, the enmasse config map names are independent of the 'address' name (as you can see in my previous comment) for this exact reason. The REST API sanitizes the address to make sure it is a legal kube id: https://github.com/EnMasseProject/enmasse/blob/master/k8s-api/src/main/java/io/enmasse/k8s/api/KubeUtil.java#L22
When a topic exists and a CM doesn't, it's ambiguous whether the CM was deleted while the operator was offline (so topic should be deleted) or whether the topic was created while the operator was offline (so a CM should be created).
This can ultimately always happen. Even with HA controllers you can just minimize the probability of this occurring, but you never eliminate it completely. I think we should prefer keeping the topics instead of deleting them by mistake (i.e. we just recreate the config map). It should be IMHO quite rare.
What would be more interesting is what to do when the configuration between ZK and the ConfigMap diverges. If the controller was down, there is no way to tell which of these changed.
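The "prefer keeping topics" rule for start-up reconciliation could be sketched as a pure decision function (a sketch only; names and action strings are illustrative, and it deliberately ignores the config-divergence case, which has no safe automatic answer):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative start-up reconciliation: diff the topics found in Kafka
// against the topics declared by ConfigMaps. A topic without a CM gets its
// CM recreated (preferred over deleting the topic, since deleting data by
// mistake is the worse failure); a CM without a topic gets the topic created.
public class StartupReconciler {
    public static Map<String, String> plan(Set<String> topicsInKafka, Set<String> topicsFromCms) {
        Map<String, String> actions = new TreeMap<>();
        for (String topic : topicsInKafka) {
            if (!topicsFromCms.contains(topic)) actions.put(topic, "recreate-configmap");
        }
        for (String topic : topicsFromCms) {
            if (!topicsInKafka.contains(topic)) actions.put(topic, "create-topic");
        }
        return actions;
    }
}
```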
CM names have to be a valid DNS-1123 subdomain. Topic names can consist of characters from the character class `[a-zA-Z0-9._-]`. This means some topic names – such as those including `_` – are not legal CM names. So we can't simply use the CM name as the topic name.
This is a bit unfortunate. It would be quite hard to encode the topic name into the config map name. Even doing something such as using only the first config map with a given name might be complicated when the controller initializes, and it could again create some race conditions.
Oh and another thing...
I've only really experimented with creation/deletion, and haven't looked at modification yet. There is no AdminClient API for partition reassignment yet (see KIP-179), so if partition assignment is configurable via a topic operator then:
But we should think about that "if". One desirable feature for barnabas is automatic balancing and scaling. Partition reassignment is really a cluster-level concern; it doesn't change the semantics of the topic or its partitions. So it might make more sense not to expose the partition assignment via the topic operator, at least initially.
I think we should prefer keeping the topics instead of deleting them by mistake (i.e. we just recreate the config map).
Definitely.
I was just wondering if it makes sense having the ConfigMap mechanism for Barnabas.
I mean, afaik (but @lulf can be more precise than me), EnMasse uses the CM files as storage for addresses because there is no other storage to put their information in. So if you use the REST API, a corresponding CM is created; otherwise you can create the CM directly.
With Kafka we have Zookeeper for that, under the `/brokers/topics` znode which contains all the metadata about a topic.
Maybe handling both mechanisms provides more problems (reconciliation) than advantages?
The controller could be the only "entry" point for handling topics, acting as a "facade" to avoid exposing the cluster outside (something that happens in the "just Kafka" on OpenShift scenario). Of course it could expose a CRUD API (i.e. HTTP REST, and maybe an AMQP management one as well? :-)).
Of course, in this way we lose some of the "integration" with OpenShift/Kubernetes by not relying on resources like CMs.
Maybe I have forgotten the rationale to use CMs.
@lulf having properties for the address space (other than for addresses) could be a way for bringing Kafka broker configuration during the cluster deployment. It can be used for configuring broker in the "brokered" space as primary purpose but the application to the Kafka address space makes more sense.
@lulf having properties for the address space (other than for addresses) could be a way for bringing Kafka broker configuration during the cluster deployment. It can be used for configuring broker in the "brokered" space as primary purpose but the application to the Kafka address space makes more sense
@ppatierno After thinking more about this, I think the address space plan is the appropriate place to configure this. A plan is basically a template with configuration parameters. The broker configuration would therefore be controlled by the service admin (who configures the available plans), and not the tenant, which I think is how it should be. If you want full control, you can be your own service admin running your own OpenShift cluster, but if you're using someone elses service, they control the broker config.
wdyt?
@ppatierno so you're suggesting something like an aggregated apiserver where we provide an API façade of resources backed by the Kafka/ZooKeeper topic metadata?
@tombentley in some way ... yes. As I said, I guess that the main reason for using CM files in EnMasse was to choose a way of storing address information in a centralized source (so using CM files). In our case we already have such a source (Zookeeper), so why should we have duplicates?
Kubernetes uses etcd for backing such resources, right? Our API facade will use Zookeeper :-)
The broker configuration would therefore be controlled by the service admin (who configures the available plans), and not the tenant, which I think is how it should be
If the service admin is from the customer's team then yes. What is the difference from the tenant at this point?
If the service admin is from the customer's team then yes. What is the difference from the tenant at this point?
If the service admin is OpenShift Online, then the available plans will be selected by the team that runs that service. If someone starts a 'freemessaging.com' service, they decide what plans etc. Any tenant using those services will only be able to select the plans available. If a customer runs their own openshift cluster/enmasse instance, then they can decide what plans they have.
Now, when I say this is possible I'm lying. Currently the plans are hardcoded, but we are thinking that they should be configurable when deploying enmasse.
Well I'm thinking that even using OpenShift Online, the customer should be able to change the configuration parameters on the brokers. For the Kafka address space it has to be possible.
Well I'm thinking that even using OpenShift Online, the customer should be able to change the configuration parameters on the brokers. For the Kafka address space it has to be possible. @ppatierno
I always thought that use case would be solved by 'Just kafka on OpenShift'. Also, an address space does not necessarily map to an isolated cluster. However, there is some level of flexibility in a plan if we allow it to have parameters (we don't currently, but it would be a simple extension to the current model). For plans that provide an 'isolated kafka' address space, one could expose configuration options there.
Well with "just Kafka on OpenShift" I expect to have brokers configurable in one of the possible ways described by @scholzj in issue https://github.com/EnMasseProject/barnabas/issues/27.
With Kafka address space I think that having address space parameter could be the solution.
Thinking more about @ppatierno's suggestion around an aggregated apiserver. I find it quite attractive because conceptually it's quite simple.

Cons:

- Fitting the k8s API conventions implies doing some things (such as supporting the `PATCH` method) we probably wouldn't do if we weren't trying to fit with those conventions.
- It's not clear how we could satisfy the requirement that resources have a `uid` without maintaining our own persistent mapping of znode `czxid` to RFC 4122 UUID.

Pros:

References:

- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/aggregated-api-servers.md
- https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/identifiers.md
Installing an aggregated apiserver isn't trivial. This hinders experimentation with this approach.
We need to be sure that users can try this easily. If the aggregated server is enabled by default it should be fine. If they would need to reconfigure the API server, nobody will bother to give it a try.
Here is what seems to be a nice example (however in Golang). Not sure whether the Java client from Fabric8 offers some boilerplate for the API servers.
It doesn't solve the problem about K8s names v.s. topic names.
Does the name field still have the same limitations even though you run it through your own server and not through CRDs? Even then it might make some things easier - you can for example keep the topic names inside the object and do your own duplicate detection to reject duplicates. You cannot do this with CRDs.
An aggregated apiserver seems to be a cluster-wide thing, requiring various RBAC permissions (I've not read up on those yet), and perhaps they would limit the applicability of this approach. The nice thing about ConfigMaps is that they're supported OOTB on every cluster.
I normally wouldn't be worried too much about this. But I wonder a bit how these things work on OpenShift Online or some other places where AFAIK you have a multitenant OpenShift cluster. In the past our assumption was that in the current state of things a multitenant Kafka cluster is probably something to avoid. So will the API server be multitenant and distribute the things to different "singletenant" Kafka clusters?
Aside from your points I also guess that you get full RBAC support from the API server even for your custom ZK backed resources. So you offload authorization to Kubernetes. That might be advantage as well as disadvantage depending on the point of view I guess.
Not sure whether the Java client from Fabric8 offers some boilerplate for the API servers.
I don't think it does. AFAIK, they only provide a client for an apiserver. And it wouldn't even help with interacting with our apiserver, because their typesafe API is generated at compile time, so it wouldn't know about our API.
Enmasse seems to be using a vertx/resteasy/jackson/swagger stack, which seems reasonable enough.
Does the name field still have the same limitations even though you run it through your own server and not through CRDs?
My reading of https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#idempotency (and the transitively linked https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/identifiers.md) is that, because it's a `metadata/name` in the object, we need to follow those conventions. But you're right, we could have our own name elsewhere in our `Topic` object and reject requests which would create dupes. We'd need a way to sanitize the names of topics with underscores (such as `__consumer_offsets`), or upper case, so they could be represented as `Topic` objects; and that means there's the possibility of collision between sanitized names.
So will the API server be multitenant and distribute the things to different "singletenant" Kafka clusters?
I guess it would have to be. Because (if my understanding of https://kubernetes.io/docs/tasks/access-kubernetes-api/setup-extension-api-server/ steps 9–11 is correct, and please correct me if it's not!) you put the apiserver in its own namespace, and let it do whatever it needs to in other namespaces.
We need to think about how Kafka clusters relate to k8s namespaces: Does a namespace need to support > 1 Kafka cluster?
But you're right, we could have our own name elsewhere in our Topic object and reject requests which would create dupes. We'd need a way to sanitize names of topics with underscores (such as __consumer_offsets), or upper case, so they could be represented as Topic objects; and that means there's the possibility of collision between sanitized names.
This might still be kind of complicated because you will need to store some additional mapping somewhere. When I create a resource with name `jakub` containing topic `myTopic1`, you will need to remember this relationship, and I'm not sure you can inject it into Kafka's ZK nodes.

Another problem with such a resource is when I create a resource with name `jakub` containing topic `myTopic1`, and afterwards someone creates an actual topic named `jakub` - you will need to give this topic some other name.

So some parts of this will still be a mess.
We need to think about how Kafka clusters relate to k8s namespaces: Does a namespace need to support > 1 Kafka cluster?
I don't see much sense in running two Kafka clusters in one namespace. But users can always surprise and come up with some reason :-o.
you will need to remember this relationship and I'm not sure you can inject it into Kafka's ZK nodes.
Kafka wouldn't care if you added extra keys in the JSON in the `/brokers/topics/myTopic1` znode, but they'd get removed when it updated the znode. What you could do, though, is to store the mapping in the same zookeeper ensemble. A benefit of doing that is you're having to worry about keeping ZK healthy/backed up anyway.
Another problem is around discovery: you list the topics and you get `jakub`, but you've no way to know what topic that corresponds to without describing it. Finding the object corresponding to the topic `myTopic1` becomes rather intensive.
An alternative approach would be to use a Kafka topic create policy to prevent people from creating topics with underscores or uppercase, with some systematic mapping for internal topic names with underscores (e.g. `__consumer_offsets` → `internal.consumer-offsets`). That's easy enough for people to reason about that discovery isn't really a problem.
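As a sketch, such a systematic mapping might look like the following. This is a hypothetical helper (the function name and the `internal.` prefix scheme follow the suggestion above; none of this is from a real codebase), and it deliberately shows why a mapping alone isn't enough:

```python
def topic_to_resource_name(topic: str) -> str:
    """Map a Kafka topic name onto a legal k8s resource name.

    Kafka topic names may contain [a-zA-Z0-9._-]; k8s resource names
    must be lower case. Leading underscores mark internal topics, which
    get an 'internal.' prefix as suggested above.
    """
    if topic.startswith("_"):
        topic = "internal." + topic.lstrip("_")
    # NB: trailing underscores still yield a name ending in '-', and two
    # distinct topics (my_topic vs my-topic) can collide after mapping --
    # exactly the corner cases this thread worries about.
    return topic.lower().replace("_", "-")

print(topic_to_resource_name("__consumer_offsets"))  # internal.consumer-offsets
print(topic_to_resource_name("myTopic1"))            # mytopic1
```

The collision case (`my_topic` and `my-topic` both mapping to `my-topic`) is what forces either a restrictive create policy or a persistent mapping somewhere such as ZooKeeper.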
> But users can always surprise and come up with some reason :-o.

Perhaps one use case is what Netflix does, where they fail over to a whole standby Kafka cluster. They can do this because they can tolerate losing messages. One reason for doing this is that it allows them to upgrade to a new Kafka version without doing a rolling update.
> What you could do, though, is to store the mapping in the same zookeeper ensemble.

Good point: since we already have to run ZooKeeper we can also use it for storing other things.
> An alternative approach would be to use a Kafka topic create policy to prevent people from creating topics with underscores or uppercase
This might cause many problems for anyone who already has some apps using such names.
- It's not clear how we could satisfy the requirement that resources have a uid without maintaining our own persistent mapping of znode czxid to RFC 4122 UUID.
Thinking more about that, we could use name-based UUIDs (version 3 or 5), with a name that's based on the czxid of the `/brokers/topics/${topic}` znode, possibly combined with the data of the `/cluster/id` znode.
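A minimal sketch of that version-5 UUID idea, assuming the czxid and the cluster id have already been read from ZooKeeper (the namespace constant and function name here are made up for illustration):

```python
import uuid

# Hypothetical, fixed namespace for this operator's topic UUIDs.
TOPIC_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "barnabas/topic")

def topic_uid(cluster_id: str, czxid: int) -> uuid.UUID:
    """Derive a stable, name-based (version 5) UUID for a topic from the
    czxid of its /brokers/topics/<topic> znode combined with the data of
    the /cluster/id znode. The same inputs always produce the same UUID,
    so no persistent czxid -> UUID mapping needs to be maintained."""
    return uuid.uuid5(TOPIC_NAMESPACE, f"{cluster_id}/{czxid}")

# Recomputing is deterministic; a different czxid gives a different uid.
print(topic_uid("q3bmVYd...", 4102))
```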
> Does a namespace need to support > 1 Kafka cluster?
Thinking more about that, if I have two clusters in a namespace they have distinct sets of topics, but we only have one space of object names. So supporting > 1 Kafka cluster/namespace would seem to require a way to disambiguate, either in space (an extra level within the object name to indicate the cluster), or in time (only one cluster live at a time).
> This might cause many problems for anyone who already has some apps using such names.
OK, maybe that particular scheme was a bit restrictive, but given the space of legal topic names is larger than the space of legal resource names it's fundamentally impossible to truly solve this problem. It's simply a matter of who gets the problems and when.
We could relax that restriction somewhat (perhaps using a persistent mapping in ZooKeeper), allowing mixed-case topic names and underscores (downcasing, and mapping underscores to `-`) so long as there were no collisions. There's still the problem of leading and trailing underscores (which is where my `internal.` prefix came from, if that wasn't obvious).
> What you could do, though, is to store the mapping in the same zookeeper ensemble.
We could use this idea for the `ConfigMap` operator too. When reconciling we would fetch all three versions of the topic, `k8s`, `kafka` and `ours`, and deal with the possible combinations as follows:

- `ours` doesn't exist:
  - `k8s` doesn't exist: we reason the topic's been created in Kafka, and we create `k8s` (and `ours`) from `kafka`.
  - `kafka` doesn't exist: we reason it's been created in k8s, and we create `kafka` (and `ours`) from `k8s`.
- `ours` does exist:
  - `k8s` doesn't exist: we reason it was deleted, and we delete `kafka` and `ours`.
  - `kafka` doesn't exist: we reason it was deleted, and we delete `k8s` and `ours`.
  - `k8s` and `kafka` both exist, so all three exist, and we need to reconcile: we compute the diffs `ours`→`k8s` and `ours`→`kafka`, merge the two diffs into `ours` (where the diffs conflict, the more recently modified side wins*), and update all three of `ours`, `k8s` and `kafka` with that updated `ours`.

\*We can get the k8s modification time from the metadata. The Kafka topic's mtime is a little more tricky, since the topic info is scattered around several znodes, but we basically take the most recent mtime of all those znodes.
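As a sketch, the decision table above could be expressed like this (the `Action` names and the `reconcile` helper are illustrative, not from any actual implementation; combinations the table doesn't cover are flagged rather than guessed at):

```python
from enum import Enum, auto

class Action(Enum):
    NOTHING = auto()                          # no copy exists anywhere
    CREATE_K8S_AND_OURS_FROM_KAFKA = auto()   # topic was created in Kafka
    CREATE_KAFKA_AND_OURS_FROM_K8S = auto()   # topic was created in k8s
    DELETE_KAFKA_AND_OURS = auto()            # k8s copy was deleted
    DELETE_K8S_AND_OURS = auto()              # Kafka copy was deleted
    MERGE_AND_UPDATE_ALL = auto()             # 3-way merge; mtime breaks ties
    ERROR = auto()                            # combination not covered above

def reconcile(ours, k8s, kafka):
    """Decide what to do for one topic, given its three copies
    (None means that copy doesn't exist)."""
    if ours is None:
        if k8s is None and kafka is not None:
            return Action.CREATE_K8S_AND_OURS_FROM_KAFKA
        if kafka is None and k8s is not None:
            return Action.CREATE_KAFKA_AND_OURS_FROM_K8S
        # Both absent: nothing to do. Both present without a private
        # copy: the table above doesn't cover this, so report it.
        return Action.NOTHING if k8s is None else Action.ERROR
    if k8s is None:
        return Action.DELETE_KAFKA_AND_OURS
    if kafka is None:
        return Action.DELETE_K8S_AND_OURS
    return Action.MERGE_AND_UPDATE_ALL
```

A real operator would then apply the chosen action, doing the modification-time comparison only in the `MERGE_AND_UPDATE_ALL` branch.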
@tombentley This sounds reasonable.
@tombentley There is an implicit assumption about synchronized clocks on zk and k8s nodes here, could that pose a problem or do you think it would be ok in practice?
@lulf You're referring to the "most recent modification time" part, right? Yes, I realise this assumes both zk and k8s nodes have mythical synchronized clocks. I don't think it's a problem in practice, because I think it unlikely that both the k8s ConfigMap and the Kafka topic would be modified at around the same time and in different ways. (Obviously our own operator could modify both ends at around the same time, but in that case we'd have updated `ours`, and so there would be no need to compare mtimes.)
That said, I don't think it would make the logic above significantly worse to change that tie-breaking condition to "Kafka wins", or to "give up and report an error". It just seems like a slightly nicer way to do it assuming the clocks were at least approximately synchronized.
TBH ... I don't have a solution right now and it will be a great discussion, but ... we would go from 2 copies (`k8s` and `kafka`) to 3 copies (`k8s`, `kafka` and `ours`) of topic information.

When we speak about topics created directly in Kafka (so not by the operator or ConfigMap files), are you speaking about "internal" Kafka topics (i.e. consumer offsets and the Streams API)? What's the value of having them in a ConfigMap? Just for deletion?

In any case, if we want better integration with EnMasse we need ConfigMaps (but only for "regular" topics?).
The prototypes of the Topic Controller and Cluster Controller are already done. This issue should be closed and the follow-ups should be done in separate issues.
We expect, eventually, to need some kind of controller which will use the k8s REST API to manage the zookeeper and kafka deployments. This issue is about trying to understand what the requirements for a controller would be (while related to #31 that would be a build-time process, whereas this is about doing something similar at runtime).