tombentley commented 6 years ago

We expect, eventually, to need some kind of controller which will use the k8s REST API to manage the zookeeper and kafka deployments. This issue is about trying to understand what the requirements for a controller would be (while related to #31 that would be a build-time process, whereas this is about doing something similar at runtime).

tombentley commented 6 years ago

One requirement is that we can abstract across evolution of k8s APIs. For example apps/v1beta2 StatefulSet on Kubernetes 1.8, apps/v1beta1 StatefulSet on 1.7 and earlier and PetSet.

ppatierno commented 6 years ago

About K8S API evolution and maybe even related to OpenShift API evolution, it makes more sense if the "on premise" installation scenario is just "raw" Kafka deployment or even Kafka on OpenShift (on premise). In the latter case we should support more K8S/OpenShift versions.

scholzj commented 6 years ago

Deployment controller / operator

The controller can follow the operator concept. It will watch for two kinds of ConfigMap / CustomResourceDefinition resources. One for Zookeeper deployments and one for Kafka deployments. The config maps will identified by labels:

app=barnabas
type=deployment
kind=zookeeper

or

app=barnabas
type=deployment
kind=kafka

Inside the ConfigMap a detailed configuration of the requested deployment will be hidden. For example:

name=kafka
labels= |
    label1=value1
    label2=value2
nodes=5
zookeeper= |
    labelX=valueY
    labelU=valueV
config: |
    default.replication.factor=1
    offsets.topic.replication.factor=3
    transaction.state.log.replication.factor=3

When the ConfigMap is created, the controller will create the cluster. The separation of the Zookeeper and Kafka clusters into two ConfigMaps should make it possible to use multitenant Zookeeper clusters in the future. Not deploying the clusters directly has also the advantage that we can generate the resources dynamicaly. That will allow us to support much wider range of Openshift / Kubernetes versions. It will also give us a lot more flexibility in terms of deployment variablity compared to the openShift template / Kubernetes YAML files. The dynamically generated deplyoments can be easily extended to include nodeAffinity (e.g. running broker nodes on dedicated hosts with local hosts), rack IDs or exposing Kafka to the outside by atomatically generating loadbalancers / routes.

Changing the configuration in the ConfigMap should also reconfigure the cluster. That might be non trivial, since many changes require broker restart. So the controller will need to do a rolling update of the cluster. Zookeeper would need a rolling update even in case of scale dup/down (this seems to be non-critical feature, so we might ignore it). Similarly, we might need to rearange the topic distribution in Kafka when the cluster is scaled down or up.

When the ConfigMap is deleted, the operator will delete the cluster. This should be fairly simple as operation. But if we decide to split the Zookeeper and Kafka deployments because of the multitenancy, we should think about some protection for not deleting the Zookeeper cluster used by a running Kafka cluster (this might be tricky - admission controllers can probably do this, but I'm not sure about OpenShift support).

Topic controller / operator

The operator concept can be also used for managing topics. ConfigMap / CustomResourceDefinition resource might be created in OpenShift / Kubernetes for each topic. It needs to be thought through how to synchronize the topics between Kafka and the operator:

Should the operator delete the Kafka topics not managed by ConfigMap?
Should the operator automatically create ConfigMaps for topics which are created in Kafka? (this seems to be my preference)
Should the operator ignore the topics which don't have ConfigMap? (This seems to be risky as we might not delete some topics in case the cotnroller is not running when the topic is deleted)

The topics can be identified with labels such as:

app=barnabas
type=runtime
kind=topic

Inside the ConfigMap the topic details can be configured:

name=mytopic
kafka= |
    label1=value1
    label2=value2
partitions=5
replication.factor=3
config: |
    cleanup.policy=compact
    retention.ms=3600000

Many tools such as Kafka Streams or Kafka Connect tend to create their own topics. do we want to force the users to first create the ConfigMap before starting Kafka Connect? Here the automated reconciliation between the operator / ConfigMaps and Kafka topics seem to be the best solution.

There might be also some niche situations which cannot be implemented easily (such as "renaming ConfigMap will delete topic and create a new one" etc.)

Security concerns

The operator model should work fine when your Kafka ops are also your OpenShift / Kubernetes ops. This is because to manage the clusters and topics all you need is the appropriatte access to Openshift / Kubernetes. It has also big advantages since you can easily integrate deployment of the Kafka cluster / Kafka topics with the deployment of your application into OpenShift / Kubernetes.

But this might not suite everyone. Some people might prefer to have more traditional management console with separate authentication and authorization model then the OpenShift / Kubernetes cluster. it should be possible to work around this by creating a standalone management console with separate AuthN/AuthZ which will be integrated with the operator. Either in the form:

Management Console creates ConfigMaps based on user requests => Operator picks up the ConfigMaps

or the other way around

When ConfigMap created, operator calls the management console API to actually create the resource

The first options seems to be easier for the reconciliation of configuration between the ConfigMaps and OpenShift / Kubernetes. The second option might be easier to integrate Kafka management with monitoring etc.

How to deploy the cluster

The cluster might be deployed in a similar way as today. OpenShift template / Kubernetes YAML files will be prepared and will load the controller(s) as well as ConfigMap(s) for deploying Zookeeper / Kafka. So apart from more processes, running in the cluster this should not be a major change for current users. Another template with only controller might be prepared for more advanced users who want to configure the ConfigMap(s) on their own.

tombentley commented 6 years ago

(This seems to be risky as we might not delete some topics in case the cotnroller is not running when the topic is deleted)

I don't understand this sentence.

tombentley commented 6 years ago

Many tools such as Kafka Streams or Kafka Connect tend to create their own topics. do we want to force the users to first create the ConfigMap before starting Kafka Connect?

We really shouldn't require users to create such topics. There can be many such topics and it's pointless work. A "single button import" facility to create them would be better, but I still prefer going the whole hog and just automatically creating ConfigMaps for ropics created outside of the controller.

scholzj commented 6 years ago

@tombentley Basically, imagine following scenario:

The controller crashes
User deletes the ConfigMap
Controller is started again => The controller will not find out that the ConfigMap was deleted and will consider the topic as unmanaged and let it exist in Kafka.

If the controller does reconciliation, it will at least recreate the ConfigMap again and make this more visible to the user.

So in general - also with the Kafka Stream / Connect topics in mind - I agree with your point that to make this work, the full reconciliation and creation of configMaps by the controller sseems like the best way forward.

ppatierno commented 6 years ago

Many tools such as Kafka Streams or Kafka Connect tend to create their own topics. do we want to force the users to first create the ConfigMap before starting Kafka Connect?

There are some topics that the user doesn't know about them. For example the "state store" topics for stateful transformation (to re-build a RockDB database) and even "repartitioning" topic (i.e. when group by key is used) in Kafka Stream.

The same for the ___consumer_offsets topic as well.

scholzj commented 6 years ago

I think it is best to leave the __ topics out of the management. They have specific names, so that should be easy.

tombentley commented 6 years ago

@scholzj thanks, I understand the scenario now. This makes me think about the controller being HA.

scholzj commented 6 years ago

I think you would need some active - passive locking mechanism (using Zookeeper to deploy Zookeeper??? :-o) so that you don't have two or more instances trying to do the same.

But if we implement it properly and really also create the ConfigMaps for the topics created directly in Kafka we should not need it.

lulf commented 6 years ago

Edit: trying to make my ramblings a bit clearer.

@tombentley @scholzj Some comments on the requirements. The notion of a topic controller is very similar to the concept of the address controller in enmasse. Basically, you can think of enmasse configuration involving 3 logical components (at least this is the goal):

API server - Can be used to create/delete address spaces and addresses within an address space, either through service broker or config map/rest api (want to use CRD here)
address space controller - instantiates infrastructure to handle an address space. Of type 'standard' (dispatch router ++) or 'brokered' (just a single Artemis broker). This is 'global' in a multitenant environment.
address controller - specific to each address space type (i.e. one for 'standard', one for 'brokered' etc), and lives within each address space as part of the infrastructure. It configures the rest of the infrastructure to deal with addresses the user has configured.

When reading the description of the topic-controller, it performs the same functions as an address controller. Basically, the address config in enmasse would map to the topic configmap. If this topic-controller understands the address config format, it can be plugged into enmasse as I describe in the kafka address space proposal[0]. I'm also interested in your thoughts on exposing management and console as an independent thing of barnabas and/or as part of EnMasse. Deploying enmasse with a 'kafka' address space would be very similar to what you are trying to achieve, at the cost of using a more generalized address model. The benefit would be that the user can relate to the same address model irrespective of which messaging technology used, and it has the potential of reusing lots of the console. You'd also get a somewhat multitenant model of barnabas.

[0] https://github.com/EnMasseProject/enmasse/blob/kafka-design-proposal/documentation/design_docs/design/kafka-addressspace.adoc T

Regarding having 1 or more controllers... Having HA of the controller it self makes the system way more complex as you need to establish consensus among them for deciding what to do. Besides, I would consider a controller crash a transient error as it should be restarted without a big delay in the operations it will perform (as I assume you don't have any strict timing requirements for it).

tombentley commented 6 years ago

Thanks @lulf

If this topic-controller understands the address config format, it can be plugged into enmasse as I describe in the below document.

Is there any documentation/description of that format? Is the enmasse address controller using a CRD, or plain configmaps?

For Kafka we have the problem that clients (such as Kafka Streams) can create topics outside of the control of any operator, and it seems likely that we're going to need to reflect these as k8s resources (ConfigMap or CRD), so there's a need for bidirectional reconciliation. Is this a problem the address controller has had to solve?

How does the enmasse address controller handle errors, for example an invalid k8s resource which it cannot process? I was thinking perhaps we should create k8s events in such cases.

lulf commented 6 years ago

@tombentley Just updated my previous comment :rofl: with more details.

See http://enmasse.io/documentation/master/tenant/#creating_addresses for the format of addresses. Though somewhat simplified in this example, it shows the different types we support for the 'standard' address space. The proposal for kafka would be to have a 'kafka' type, along with a 'properties' entry in the spec for describing semantics.

The problem of clients creating topics outside of control is one we also have in the standard address space, where the mqtt-gateway creates 'custom' addresses that are not configured explicitly. This leads to some issues currently where we cannot instruct the router to reject unknown addresses. I think we'd could want to do something similar there, though it needs some more thinking.

Creating k8s events is probably the idiomatic approach and the one we should pursue as well. Currently we log an error decoding and move on.

scholzj commented 6 years ago

@lulf With Kafka, ideally we want to provide multiple tiers.

Integration of Kafka within EnMasse (not just via the AMQP Kafka bridge but really in the way you suggested).
But we also want to make it possible to run just Kafka on OpenShift. There are many users who are on the Kafka train and want to simply run Kafka on OpenShift with a different amount of additional management tools.

I think that if we do this right it should be possible to achieve both. I will have to look more closely how the controllers in EnMasse work.

The addresses look a lot like CRDs, but if I understood it correctly the restapi is not the OpenShift API. It is your controller API. How do you store / represent these on the OpenShift / Kubernetes level?

lulf commented 6 years ago

@scholzj I think those tiers make sense, and agreed that running just Kafka on OpenShift is an important use case. Do you see the topic-controller as something that would be used in a 'standalone' kafka deployment as well? I'd like to understand more the scope of that tier and what 'just Kafka' means :)

The restapi is not the OpenShift API, but I would hope for it to evolve into a proper kubernetes API server using something like https://kubernetes.io/docs/concepts/api-extension/apiserver-aggregation/ . The downside of using pure CRDs is that there is no schema validation if I've understood it correctly.

Right now the address 'CRD' is stored as a JSON object inside a configmap (which makes editing the configmaps a bit hard).

scholzj commented 6 years ago

@lulf I haven't looked into the details, but CRD validation should be a new feature in Kubernetes 1.8 via OpenAPI schema ... so over the time it will get also into OpenShift etc.

We discussed more or less three tiers: 1) Just Kafka on Openshift: This should include only the Kafka deployment it self (that will probably still need the cluster controller to deploy and configure the cluster properly). But not the topic operator or management console. The user will manage Kafka using the default Kafka tooling. 2) Kafka on OpenShift with integration: This is something what should include the above + the topic controller and management console to provide better and more native OpenShift integration. So it is not just Kafka anymore. 3) Kafka in EnMasse

The question is whether we can manage to make the 2. and 3. the same.

Right now the address 'CRD' is stored as a JSON object inside a configmap (which makes editing the configmaps a bit hard).

Does the EnMasse controller expect the user to create all the addresses only through the REST API? Or does it also allow the users to create directly the ConfigMaps? I think it is quite nice feature when you can deploy the topic / queue as part of your application by simply creating the ConfigMap (this idea doesn't come from my head, but I really like it :-)).

lulf commented 6 years ago

@scholzj Thank you for clarifying!

@lulf I haven't looked into the details, but CRD validation should be a new feature in Kubernetes 1.8 via OpenAPI schema ... so over the time it will get also into OpenShift etc.

This has been added since I looked last time, that is great news! This means that we could rely on CRD instead of config maps eventually. The downside AFAIK is that you need cluster-admin role to be able to create the CRD definition.

The question is whether we can manage to make the 2. and 3. the same.

I think this needs a wider discussion, but personally I think it would be good if we discussed the pros and cons of making them the same. Maybe the concept of address spaces and addresses are a bit abstract for some use cases? Another question is what will the next step for 2. be? Multitenancy, integration with authentication services, console? Then 2. starting too look like the same thing as 3 but using the more specific 'topic' instead of 'address' as the model.

I don't mean to say that barnabus shouldn't be multitenant or not provide authentication services, but that it would be nice that whatever is part of 2. can be reused and fits into 3. As an example, if two enmasse address spaces A and B has type 'kafka' they might both run on the same barnabas cluster, but the console would show them as different address spaces potentially belonging to different tenants.

Yet another question is how much of OpenShift concepts do you want to expose in the service. For instance, in enmasse we try to hide the concepts of namespaces even though they are used for isolation of address space. An address space may conceptually run in the same namespace as another or in an isolated one. This makes it harder to integrate with CRDs directly because we can't allow the user to create CRDs directly inside the namespace.

Does the EnMasse controller expect the user to create all the addresses only through the REST API? Or does it also allow the users to create directly the ConfigMaps? I think it is quite nice feature when you can deploy the topic / queue as part of your application by simply creating the ConfigMap (this idea doesn't come from my head, but I really like it :-)).

Yes, the REST API is currently the way we use in our documentation. It is, however, currently possible to create addresses using configmap. The format of the configmap would look something like below. The challenge though is that we don't want the user to care about which namespace his address space is running, so we need to do some thinking on how to do this in multitenant mode or if we need to roll our own api server since we might want to authenticate against keycloak instead for that.

apiVersion: v1
kind: ConfigMap
metadata:
  name: address-config-myqueue
  labels:
    type: address-config
data:
  config.json: |
    {
      "apiVersion": "enmasse.io/v1",
      "kind": "Address",
      "metadata": {
        "name": "myqueue",
        "addressSpace": "myspace"
      },
      "spec": {
        "type": "queue"
      }
    }

tombentley commented 6 years ago

I've spent some time experimenting with an ConfigMap <-> topic operator, using fabric8, the Kafka AdminClient and ZooKeeper. It:

Uses a watch on k8s ConfigMaps with appropriate labels, creating/deleting topics when CMs are created/deleted
Uses a watch on the /brokers/topics znode to know about topics created outside of the operator. In this case it creates/deletes ConfigMaps to keep things in sync

Reconciliation

While it seems that CM creations while the operator is down result in the watch receiving ADDED events when the operator is brought back up, the same doesn't seem to be true of CM deletions. This means that on start up of the operator we need to reconcile all existing configmaps and topics. When a topic exists and a CM doesn't, it's ambiguous whether the CM was deleted while the operator was offline (so topic should be deleted) or whether the topic was created while the operator was offline (so a CM should be created). We could extract the ctime from the /brokers/topics/mystery_topic znode, to determine if the topic was created while we were offline, but that's worthless without knowing the time the operator went down. Doing that naïvely, we end up with a stateful operator, which further complicates things.

Topic names vs CM names

CM names have to be a valid DNS-1123 subdomain. Topic names can consist of characters from the character class [a-zA-Z0-9._-]. This means some topic names – such as those including _ – are not legal CM names. So we can't simply use the CM name as the topic name.

As sketched above, we could configure the name of a topic via the name=mytopic entry in the CM data. This opens up further problems, however:

Two CMs with the same name key in their data.
What name should be given to a CM created in response to a topic being created? Obviously we should use an encoding, but what if a such named CM already exists with a different name?

Races

Because we're watching the znode, we can know about new topics before the Controller has finished creating them. But we need to know the topic config and description in order to create the CM. So we have to retry fetching this information if we fetch it too quickly. (This would be avoided if, instead of the znode watch, we polled AdminClient.listTopics(), but that would be higher latency, and wasteful of CPU). But some 'interesting' corner cases exist:

Topic created and operator notified
Topic deleted, (notified in a separate event)
Operator left polling, while processing first event, waiting for topic metadata that will never arrive.

lulf commented 6 years ago

@tombentley

Uses a watch on k8s ConfigMaps with appropriate labels, creating/deleting topics when CMs are created/deleted

Be careful relying on k8s watches only. They are not reliable in the same way as ZK watches are, and should be combined with relistings even though you can watch from a given resource version. The address controller do a periodic poll of the CMs in addition to watching . See https://github.com/EnMasseProject/enmasse/blob/master/k8s-api/src/main/java/io/enmasse/k8s/api/ResourceController.java

and

https://github.com/EnMasseProject/enmasse/blob/master/k8s-api/src/main/java/io/enmasse/k8s/api/ConfigMapAddressApi.java

lulf commented 6 years ago

@tombentley

CM names have to be a valid DNS-1123 subdomain. Topic names can consist of characters from the character class [a-zA-Z0-9.-]. This means some topic names – such as those including – are not legal CM names. So we can't simply use the CM name as the topic name.

FWIW, the enmasse config map names are independent of the 'address' name (as you can see in my previous comment) for this exact reason. The REST API sanitizes the address to make sure it is a legal kube id: https://github.com/EnMasseProject/enmasse/blob/master/k8s-api/src/main/java/io/enmasse/k8s/api/KubeUtil.java#L22

scholzj commented 6 years ago

When a topic exists and a CM doesn't, it's ambiguous whether the CM was deleted while the operator was offline (so topic should be deleted) or whether the topic was created while the operator was offline (so a CM should be created).

This can utlimately always happen. Even with HA controllers you can just minimize the probability of this occuring but you never eliminate it completely. I think we should prefer keeping the topics instead of deleting them by mistake (i.e. we just recreate the config map). It should be IMHO quite rare.

What would be more interesting is what to do when the configuration between ZK and the ConfigMap diverges. If the controller was down, there is no way how to tell whhich of these changed.

CM names have to be a valid DNS-1123 subdomain. Topic names can consist of characters from the character class [a-zA-Z0-9.-]. This means some topic names – such as those including – are not legal CM names. So we can't simply use the CM name as the topic name.

This is a bit unfortunate. It would be quite hard to encode the topic name into config map name. Even doing something such as using only the first config map with the same name might be complicated when the controller initializes and it could again create some race conditions.

tombentley commented 6 years ago

Oh and another thing...

Partition reassignment

I've only really experimented with creation/deletion, not looking at modification yet. There is no AdminClient API for partition reassignment yet (see KIP-179), so if partition assignment is configurable via a topic operator then:

Barnabas would need to be using a future version of Kafka, or
The operator would need to be using the CLI tools (via exec, or by invoking the scala API directly) from the node running the operator, or
The operator would need to be poking at znodes directly.

But we should think about that if. One desirable feature for barnabas is automatic balancing and scaling. Partition reassignment is really a cluster-level concern; it doesn't change the semantics of the topic or its partitions. So it might make more sense to not expose the partition assignment via the topic operator, at least initially.

tombentley commented 6 years ago

I think we should prefer keeping the topics instead of deleting them by mistake (i.e. we just recreate the config map).

Definitely.

ppatierno commented 6 years ago

I was just wondering if it makes sense having the ConfigMap mechanism for Barnabas.

I mean, afaik (but @lulf can be more precise than me), EnMasse uses the CM files as storage for addresses because there is no other storage to put their information. So if you use the REST API, a corresponding CM file is created; otherwise you can create the CM directly.

With Kafka we have Zookeeper for that under the broker/topics znode which contains all metadata about a topic. Maybe does handling both mechanisms provide more problems (reconciliation) than advantages ?

The controller could be the only "entry" point for handling topics as a "facade" for avoiding exposing the cluster outside (something that happens in the "just Kafka" on OpenShift scenario). Of course it could expose CRUD API (i.e. HTTP REST and maybe an AMQP management one as well ? :-)).

Of course in this way, we lose something like more "integration" with OpenShift/Kubernetes not relying on resources like CMs.

Maybe I have forgotten the rationale to use CMs.

ppatierno commented 6 years ago

@lulf having properties for the address space (other than for addresses) could be a way for bringing Kafka broker configuration during the cluster deployment. It can be used for configuring broker in the "brokered" space as primary purpose but the application to the Kafka address space makes more sense.

lulf commented 6 years ago

@lulf having properties for the address space (other than for addresses) could be a way for bringing Kafka broker configuration during the cluster deployment. It can be used for configuring broker in the "brokered" space as primary purpose but the application to the Kafka address space makes more sense

@ppatierno After thinking more about this, I think the address space plan is the appropriate place to configure this. A plan is basically a template with configuration parameters. The broker configuration would therefore be controlled by the service admin (who configures the available plans), and not the tenant, which I think is how it should be. If you want full control, you can be your own service admin running your own OpenShift cluster, but if you're using someone elses service, they control the broker config.

wdyt?

tombentley commented 6 years ago

@ppatierno so you're suggesting something like an aggregated apiserver where we provide an API façade of resources backed by the Kafka/ZooKeeper topic metadata?

ppatierno commented 6 years ago

@tombentley in some way ... yes. As I said I guess that the main reason for using CM files in EnMasse was to chose a way for storing address information in a centralized source (so using CM files). In our case we already have such source (Zookeeper) why we should have duplicates ?

Kubernetes uses etcd for backing such resources right ? Our API facade will use Zookeeper :-)

ppatierno commented 6 years ago

The broker configuration would therefore be controlled by the service admin (who configures the available plans), and not the tenant, which I think is how it should be

If the service admin is from the customers team then yes. What is the difference with the tenant at this point ?

lulf commented 6 years ago

If the service admin is from the customers team then yes. What is the difference with the tenant at this point

If the service admin is OpenShift Online, then the available plans will be selected by the team that runs that service. If someone starts a 'freemessaging.com' service, they decide what plans etc. Any tenant using those services will only be able to select the plans availabe. If a customer runs their own openshift cluster/enmasse instance, then they can decide what plans they have.

Now, when I say this is possible I'm lying. Currently the plans are hardcoded, but we are thinking that they should be configurable when deploying enmasse.

ppatierno commented 6 years ago

Well I'm thinking that even using OpenShift Online, the customer should be able to change the configuration parameters on the brokers. For the Kafka address space it has to be possible.

lulf commented 6 years ago

Well I'm thinking that even using OpenShift Online, the customer should be able to change the configuration parameters on the brokers. For the Kafka address space it has to be possible. @ppatierno

I always thought that use case would be solved by 'Just kafka on OpenShift'. Also, an address space does not necessarily map to an isolated cluster. However, there is some level of flexibility in a plan if we allow it to have parameters (we don't currently, but it would be a simple extension to the current model). For plans that provide an 'isolated kafka' address space, one could expose configuration options there.

ppatierno commented 6 years ago

Well with "just Kafka on OpenShift" I expect to have brokers configurable in one of the possible ways described by @scholzj in issue https://github.com/EnMasseProject/barnabas/issues/27.

With Kafka address space I think that having address space parameter could be the solution.

tombentley commented 6 years ago

Thinking more about @ppatierno's suggestion around an aggregated server. I find it quite attractive because conceptually it's quite simple.

Cons:

Installing an aggregated apiserver isn't trivial. This hinders experimentation with this approach.
An aggrgated apiserver seems to be a cluster-wide thing, requiring various RBAC permissions (I've not read up about those yet), but perhaps they would limit the applicability of this approach. The nice thing about ConfigMaps is they're supported OOTB on every cluster.
There's a lengthy "API conventions" document we'd need to stick to. Some of it (such as supporting a PATCH method) we probably wouldn't do if we weren't trying to fit with those conventions.
It doesn't solve the problem about K8s names v.s. topic names.
It's not clear how we could satisfy the requirement that resources have a uid without maintaining our own persistent mapping of znode czxid to RFC 4122 UUID.

Pros:

The fact that we're not trying to keep in sync two things which can both change independently.
It's worth noting that an HTTP API onto what is, in effect, the Kafka AdminClient could be used standalone (not aggregated), and would still be useful.

References:

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/aggregated-api-servers.md https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/identifiers.md

scholzj commented 6 years ago

Installing an aggregated apiserver isn't trivial. This hinders experimentation with this approach.

We need to be sure that users can try this easily. If the aggregated server is enabled by default it should be fine. If they would need to reconfigure the API server nobody will bother give it a try.

Here is what seems to be a nice example (however in Golang). Not sure whether the Java client from Fabric8 offers some builerplate for the API servers.

It doesn't solve the problem about K8s names v.s. topic names.

Does the name field still have the same limitations eventrhough you run it through your own server and not through CRDs? Even then it might make some thigns easier - you can for example keep the topic names inside the object and do your own duplicate detection to reject duplicates. You cannot do this with CRDs.

An aggrgated apiserver seems to be a cluster-wide thing, requiring various RBAC permissions (I've not read up about those yet), but perhaps they would limit the applicability of this approach. The nice thing about ConfigMaps is they're supported OOTB on every cluster.

I normally wouldn't be worried too much about this. But I wonder a bit how these things work on OpenShift Online or some other places where AFAIK you have multitenant OpenShift cluster. In the past our assumption was that in the current state of things multitentn Kafka cluster is probably something to avoid. So will the API server be multitenant and destribute the things to different "singletenant" Kafka clusters?

Aside from your points I also guess that you get full RBAC support from the API server even for your custom ZK backed resources. So you offload authorization to Kubernetes. That might be advantage as well as disadvantage depending on the point of view I guess.

tombentley commented 6 years ago

Not sure whether the Java client from Fabric8 offers some builerplate for the API servers.

I don't think it does. AFAIK, they only provide a client for an apiserver. And it wouldn't even help with interacting with our apiserver, because their typesafe API is generated at compile time, so it wouldn't know about our API.

Enmasse seems to be using a vertx/resteasy/jackson/swagger stack, which seems reasonable enough.

Does the name field still have the same limitations eventrhough you run it through your own server and not through CRDs?

My reading of https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#idempotency (and the transitively linked https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/identifiers.md) is that, because it's a metadata/name in the object, we need to follow those conventions. But you're right, we could have our own name elsewhere in our Topic object and reject requests which would create dupes. We'd need a way to sanitize names of topics with underscores (such as __consumer_offsets), or upper case, so they could be represented as Topic objects; and that means there's the possibility of collision between sanitized names.

So will the API server be multitenant and destribute the things to different "singletenant" Kafka clusters?

I guess it would have to be. Because (if my understanding of https://kubernetes.io/docs/tasks/access-kubernetes-api/setup-extension-api-server/ steps 9–11 is correct, and please correct me if it's not!) you put the apiserver in its own namespace, and let it do whatever it needs to in other namespaces.

We need to think about how Kafka clusters relate to k8s namespaces: Does a namespace need to support > 1 Kafka cluster?

scholzj commented 6 years ago

But you're right, we could have our own name elsewhere in our Topic object and reject requests which would create dupes. We'd need a way to sanitize names of topics with underscores (such as __consumer_offsets), or upper case, so they could be represented as Topic objects; and that means there's the possibility of collision between sanitized names.

This might be still kind of complicated because you will need to store somewhere some addiiotnal mapping. Because when I create resource with name jakub containing topic myTopic1 you will need to remember this relationship and I'm not sure you can inject it into Kafka's ZK nodes.

Another problem with such resource is when I create resource with name jakub containing topic myTopic1 and afterwards someone creates actuall topic names jakub - you will need to generate this topic some other name.

So some parts of this will be still mess.

We need to think about how Kafka clusters relate to k8s namespaces: Does a namespace need to support > 1 Kafka cluster?

I don't see much sense in running two Kafka clusters in one namespace. But users can always surprise and come up with some reason :-o.

tombentley commented 6 years ago

you will need to remember this relationship and I'm not sure you can inject it into Kafka's ZK nodes.

Kafka wouldn't care if you added extra keys in the JSON in the /brokers/topics/myTopic1 znode, but they'd get removed when it updated the znode. What you could do, though, is to store the mapping in the same zookeeper ensemble. A benefit of doing that is you're having to worry about keeping ZK healthy/backedup anyway.

Another problem is around discovery: You list the topics and you get jakub, but you've no way to know what topic that corresponds to without describing it. To find the object corresponding to the topic myTopic1 becomes rather intensive.

An alternative approach would be to use a Kafka topic create policy to prevent people from creating topics with underscores or uppercase, with some systematic mapping for internal topic names with underscores (e.g. __consumer_offsets → internal.consumer-offsets). That's easy enough for people to reason about that discovery isn't really a problem.

But users can always surprise and come up with some reason :-o.

Perhaps one use case is what netflix does where they failover to a whole standby Kafka cluster. They can do this because the can tolerate losing messages. One use case for them doing this is it allows them to upgrade to a new kafka version without doing a rolling update.

scholzj commented 6 years ago

What you could do, though, is to store the mapping in the same zookeeper ensemble.

Good point, since we already have to run Zookeeper we can also use ti for storing other things.

An alternative approach would be to use a Kafka topic create policy to prevent people from creating topics with underscores or uppercase

This might cause many problems for anyone who already has some apps using such names.

tombentley commented 6 years ago

It's not clear how we could satisfy the requirement that resources have a uid without maintaining our own persistent mapping of znode czxid to RFC 4122 UUID.

Thinking more about that, we could use name-based UUIDs (version 3 or 5), and use a name that's based on the czxid of the /brokers/topics/${topic} znode possibly combined with the data of the /cluster/id znode.

Does a namespace need to support > 1 Kafka cluster?

Thinking more about that, if I have two clusters in a namespace they have distinct sets of topics, but we only have one space of object names. So supporting > 1 Kafka cluster/namespace would seem to require a way to disambiguate, either in space (an extra level within the object name to indicate the cluster), or in time (only one cluster live at a time).

This might cause many problems for anyone who already has some apps using such names.

OK, maybe that particular scheme was a bit restrictive, but given the space of legal topic names is larger than the space of legal resource names it's fundamentally impossible to truly solve this problem. It's simply a matter of who gets the problems and when.

We could relax that restriction somewhat (perhaps using a persistent mapping in zookeeper), allowing mixed case topic names and underscore (downcasing to -) so long as there were no collisions. There's still the problem of leading and trailing underscores (which is where my internal. prefix came from, if that wasn't obvious).

tombentley commented 6 years ago

What you could do, though, is to store the mapping in the same zookeeper ensemble.

We could use this idea for the ConfigMap operator too. When reconciling we would fetch all three versions of the topic, k8s, kafka and ours, and deal with the possible combinations as follows:

If ours doesn't exist:
- If k8s doesn't exist, we reason the topic's been created in Kafka and we create it k8s from kafka
- If kafka doesn't exist, we reason it's been created in k8s, and we create it in kafka from k8s
- If both exist, and are the same: That's fine
- If both exist, and are different: We use whichever has the most recent modification time*.
- In all above cases we create ours
If ours does exist:
- If k8s doesn't exist, we reason it was deleted, and delete kafka and ours
- If kafka doesn't exist, we reason it was delete and we delete k8s and ours
- If neither exists, we delete ours.
- If both k8s and kafka exist then all three exist, and we need to reconcile:
  - We compute diff ours→k8s and ours→kafka, we merge the two diffs:
    - If there are conflicts we give up with an error.
    - Otherwise we apply the apply the merged diff to ours, and update all three ours, k8s and kafka with that updated ours.

*We can get the k8s modification time from the metadata. The Kafka topic's mtime is a little more tricky, since the topic info is scattered around several znodes, but we basically take the most recent mtime of all those znodes.

scholzj commented 6 years ago

@tombentley This sounds reasonable.

lulf commented 6 years ago

@tombentley There is an implicit assumption about synchronized clocks on zk and k8s nodes here, could that pose a problem or do you think it would be ok in practice?

tombentley commented 6 years ago

@lulf You're referring to the "most recent modification time" part, right? Yes I realise this assumes both zk and k8s nodes have mythical synchronized clocks. I don't think it's a problem in practice because I think it unlikely that both the k8s ConfigMap and the Kafka topic would both be modified at around the same time and in different ways. (Obviously our own operator could modify both ends at around the same time, but in that case we'd have ours and so there would be no need to compare mtimes).

That said, I don't think it would make the logic above significantly worse to change that tie-breaking condition to "Kafka wins", or to "give up and report an error". It just seems like a slightly nicer way to do it assuming the clocks were at least approximately synchronized.

ppatierno commented 6 years ago

TBH ... I haven't a solution right now and It will be a great discussion but ...

in order to support CMs we are moving from 2 (k8s and kafka) to 3 copies (k8s, kafka and ours) of topics information
we rely on "timing" on not sync components .... and sometimes it's a "dark hole" :-)

When we speak about Kafka created topics so not by operator or CM files are you speaking about "internal" Kafka topic (i.e. consumer offsets and Stream API). What's the value to have them in a CM ? Just for deletion ?

In any case if we want better integration in EnMasse we need CMs (but only for "regular" topics ?)

scholzj commented 6 years ago

The prototypes of the Topic Controller and Cluster Controller are already done. This issue should be closed and the followups should be done in separate issues.

strimzi / strimzi-kafka-operator

Requirements for Controller #40

Deployment controller / operator

Topic controller / operator

Security concerns

How to deploy the cluster

Reconciliation

Topic names vs CM names

Races

Partition reassignment