nats-io / nats-operator

NATS Operator
https://nats-io.github.io/k8s/
Apache License 2.0

Allow setting the namespace in which to run the operator #126

Open prune998 opened 5 years ago

prune998 commented 5 years ago

I don't understand why we force the use of the nats-io namespace here. It's a huge limitation whose only purpose is to foolproof people who try to deploy multiple nats-operators in different namespaces!

At the very least, add an option, like a feature gate, to allow specifying the namespace to use for leader election.

My view is that people starting multiple nats-operators in different namespaces will run into trouble, which is fine (and documented).

Simply remove the code at: https://github.com/nats-io/nats-operator/blob/33a51993af2ad3a85338ab752961d5f0a4ac45b3/cmd/operator/main.go#L108-L111

Or maybe we should find a better foolproofing solution to manage the leader election? If so, we can start the discussion and I'll be happy to implement and PR it.
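For illustration, the kind of option I have in mind could look like the sketch below. The flag name and wiring are hypothetical, not actual nats-operator code; defaulting to nats-io would keep today's behavior unless an administrator explicitly opts out:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag: the default preserves the current hardcoded
	// namespace, so existing deployments would be unaffected.
	leaderElectionNS := flag.String(
		"leader-election-namespace",
		"nats-io",
		"namespace in which the leader-election lock is created",
	)
	flag.Parse()

	fmt.Printf("leader-election lock will live in namespace %q\n", *leaderElectionNS)
}
```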

wallyqs commented 5 years ago

Thanks for the feedback. I think an option to specify the namespace for leader election would be a good addition.

pires commented 5 years ago

I don't understand why we force the use of the nats-io namespace here.

Using a fixed, well-known namespace - as opposed to allowing for it to be specified using a flag, as you suggest - is about the only way we have to guarantee that there is a single instance of nats-operator operating cluster-wide. Should we allow for the namespace to be specified using a flag or some other method, there would be nothing preventing users from shooting themselves in the foot because they didn't read the warning sign (documentation).

It's a huge limitation whose only purpose is to foolproof people who try to deploy multiple nats-operators in different namespaces!

What makes you think this is a huge limitation?

Or maybe we should find a better foolproofing solution to manage the leader election?

If you find a (better) way then, by all means, please present your point of view for us to discuss it.

That being said, if documentation is enough for you, then there's no need for the code to be smart in many other places either, such as:

prune998 commented 5 years ago

Actually, the foolproofing functions you are mentioning could already do the trick and should be able to detect if either:

You could also merge the two into a single check that detects both cases and refuses to start only if:

  • you're starting a namespace-scoped operator AND (there is already an operator in the namespace OR there is already a cluster-scoped operator)
  • you're starting a cluster-scoped operator AND (there is already an operator in the same namespace OR there is already a cluster-scoped operator)

So you don't really need to pin the cluster-scoped operator to the nats-io namespace, do you?

When I say a huge limitation, it's not that huge... it's just annoying to be forced into something so restrictive only to protect people who don't know what they're doing. While I could understand that sort of limitation for driving a car on a road, where someone could get killed, I don't see the gain in foolproofing an operator deployment :)

Along the same lines, if you can enforce that NO cluster-scoped operators can be started in different namespaces, then you can handle leader election with a lock held in a Secret or ConfigMap plus a lease. Many operators are already doing this (I never checked how it's done exactly, sorry).

Many thanks for your time and consideration for the previous answer.

pires commented 5 years ago

Many operators are already doing this (I never checked how it's done exactly, sorry).

Can you provide those many examples (links to the code, please)? I'd like to analyze them before I give you a more developed answer.

prune998 commented 5 years ago

In cert-manager, for example, they are using:

"k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"

These are instantiated at https://github.com/jetstack/cert-manager/blob/5f96b378e6a8bc99e63d011cd5d33d4999d1e21c/cmd/controller/app/controller.go#L104

and https://github.com/jetstack/cert-manager/blob/5f96b378e6a8bc99e63d011cd5d33d4999d1e21c/cmd/controller/app/controller.go#L210

I'll check other operators (and the Operator SDK, but I think it uses the same code as cert-manager).
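For reference, this is the general shape of that client-go leader-election wiring, with the lock namespace coming from configuration rather than being hardcoded. This is a minimal sketch using the ConfigMapLock type client-go shipped at the time (newer client-go versions favor LeaseLock); the names, environment variables, and durations are illustrative, not cert-manager's actual values:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The namespace is an ordinary parameter here; nothing forces "nats-io".
	namespace := os.Getenv("LEADER_ELECTION_NAMESPACE") // illustrative wiring

	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{Namespace: namespace, Name: "operator-leader"},
		Client:        client.CoreV1(),
		LockConfig:    resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start the actual controllers only once we hold the lock.
			},
			OnStoppedLeading: func() {
				os.Exit(1) // lost the lock; exit and let the pod restart
			},
		},
	})
}
```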

pires commented 5 years ago

That's exactly what this operator uses (client-go leaderelection)! We simply use an EndpointsLock, while cert-manager uses a ConfigMapLock. However, the cert-manager implementation does allow configuring the namespace used for leader election.

This, however, doesn't prove it's the right thing to do. I just wanted examples so I could gather more info. In this case, since I'm in contact with the cert-manager authors, I'll discuss with them why they do this.

You could also merge the two into a single check that detects both cases and refuses to start only if

  • you're starting a namespace-scoped operator AND (there is already an operator in the namespace OR there is already a cluster-scoped operator)

How do you suggest we achieve the latter? Our solution is to have the lock live in a specific namespace, which we hardcode to nats-io.

  • you're starting a cluster-scoped operator AND (there is already an operator in the same namespace OR there is already a cluster-scoped operator)

How do you suggest we achieve the latter? Our solution is to have the lock live in a specific namespace, which we hardcode to nats-io.
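To make that concrete: with client-go, the lock's namespace is just a field on the lock object, so the cluster-wide guarantee comes entirely from every instance agreeing on the same (namespace, name) pair. A minimal sketch of the Endpoints-based lock we use, with an illustrative lock name, not the actual nats-operator code:

```go
package election

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// buildLock returns an Endpoints-based lock. Nothing below enforces a
// particular namespace; hardcoding "nats-io" at the call site is what
// guarantees all cluster-scoped instances contend on the same object.
func buildLock(client kubernetes.Interface, namespace, identity string) resourcelock.Interface {
	return &resourcelock.EndpointsLock{
		EndpointsMeta: metav1.ObjectMeta{
			Namespace: namespace,              // e.g. the hardcoded "nats-io"
			Name:      "nats-operator-leader", // illustrative lock name
		},
		Client: client.CoreV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: identity, // typically the pod name
		},
	}
}
```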

prune998 commented 5 years ago

Well, my suggestion would be to take the lock in the namespace where the first operator is started. But most of the time, the operator would die anyway if there were two namespace-scoped operators in the same namespace or more than one cluster-scoped operator across all namespaces.

So:

  • operator get_pods(all_namespace)
  • check if an operator is already running

Say hello to James (from Cert-Manager) for me :)

pires commented 5 years ago

  • operator get_pods(all_namespace)
  • check if an operator is already running

You lost me here. This provides no guarantees whatsoever, and looking into pods is not enough: you'd have to parse their spec, including args (flags), to figure out whether each one is cluster-scoped or namespace-scoped, which namespace is configured, and so on. This is very error-prone; e.g., what happens if a flag was renamed in a different version? What happens if the pod (deployment) is being deleted while you run the check?
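To illustrate, a sketch of what such a pod-scanning check would have to look like (written against the pre-0.18 client-go API of that era; the label selector and flag name are guesses about another deployment's internals, which is exactly the problem):

```go
package foolproof

import (
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// anotherOperatorRunning scans all namespaces for what looks like a running
// nats-operator pod. It parses free-form container args, and it races against
// pods created or deleted between the List call and our own startup.
func anotherOperatorRunning(client kubernetes.Interface) (bool, error) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
		LabelSelector: "app=nats-operator", // breaks if the label ever changes
	})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			for _, arg := range c.Args {
				// Breaks if the flag is renamed in a future release, and says
				// nothing about a pod scheduled right after this loop finishes.
				if strings.Contains(arg, "--cluster-scoped") {
					return true, nil
				}
			}
		}
	}
	return false, nil
}
```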

pires commented 5 years ago

Going back to your argument that the current behavior is a huge limitation:

When I say a huge limitation, it's not that huge... it's just annoying to be forced into something so restrictive only to protect people who don't know what they're doing.

What is very limiting about hardcoding the nats-io namespace when (and only when) a cluster-scoped operator is installed?

prune998 commented 5 years ago

The limitation is that you don't have a choice about where you put the operator. All my cluster-scoped operators and services are in a tools namespace (kafka-operator, prometheus-operator, etcd-operator, elasticsearch-operator, and some internal tools). I don't want to have one namespace per tool/operator/application.

If the limitation is not strongly needed and everything can work without it, why not remove it?

The "default" Helm Chart or deployment manifests that are part of the project can still create a nats-io namespace and put everything inside it. The documentation can enforce this usage and warn, in RED, that there should be only one cluster-scope operator running at the same time (which can be enforced at the Operator startup), but there is absolutely no reason to die if a cluster-scoped operator is not running in a special namespace.

pires commented 5 years ago

As a maintainer, I want to make sure this software is easy enough for anyone to use successfully. Given the description of your requirement, which seems irrelevant to the success of your usage of this software, I am still convinced we should not change the existing behavior. I also fundamentally disagree that this implementation detail is not needed, and I have already described why it is needed: we must make sure no more than one instance of nats-operator is managing cluster-scoped resources, and for that we store a resource (Endpoints) lock in a well-known namespace, nats-io.

Since you brought up cert-manager, I talked to James Munnelly (AFAIK, the author and still the most active contributor of said project), and we agree on the above.

slamdev commented 5 years ago

As a user of this operator, I want to deploy it to my own namespace, where I maintain my other infrastructure tools. I don't want to create a namespace per tool, since I apply global policies to everything deployed in this namespace, and maintaining 20+ namespaces is also not nice.

By the way, the Istio operator can be deployed to any namespace, and you can use annotations to tell it which namespaces to monitor. The same goes for Prometheus: no hardcoded namespace restrictions.

So it would be cool to have this feature :)

Quentin-M commented 5 years ago

Hi,

I would like to back @pires & the other users here. I understand where you are coming from: you want to prevent people from deploying this multiple times and thus having different cluster-scoped operators conflict. It is noble to protect users from dumb behavior, but locking down knowledgeable Kubernetes administrators, who like to keep all operators in kube-system, whether because their Kubernetes alerts live there (separation of concerns), to keep things tidy, or because it is the only reserved namespace unmanaged by, say, LDAP, is a regrettable choice.

When it comes to us, we've simply started building the operator like so to work around the issue:

go build -ldflags "-X github.com/nats-io/nats-operator/pkg/constants.KubernetesNamespaceNatsIO=kube-system"

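For context, Go's -X link flag can only override a package-level string variable (not a constant) at link time, so this workaround relies on the constants package declaring the namespace roughly like the following sketch (illustrative shape only; the real file may differ):

```go
// Illustrative shape of pkg/constants: for "-X" to take effect at link
// time, the value must be a string var, not a const.
package constants

// KubernetesNamespaceNatsIO is the namespace the operator expects to run in.
var KubernetesNamespaceNatsIO = "nats-io"
```
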
pires commented 5 years ago

@Quentin-M thank you for putting things in perspective and for sharing a way to achieve this, which is pretty neat. However, it requires building and pushing the operator, and doing so for every release, which may become overwhelming, so I'll just go ahead and add a flag or feature gate for this.

Thank you all for your patience and feedback.

Quentin-M commented 5 years ago

@pires Agreed, thanks for your involvement & help. It looks like Supriya, one of our Kubernetes engineers, submitted a workaround PR and will work with you to make it a fully-fledged one.

SidGrundfos commented 4 years ago

Hello, I am new to NATS and I have a question about something that confuses me:

  1. NATS Operator installed in the nats-io namespace (namespace-scoped) in k8s
  2. NATS Streaming Operator installed in the same namespace in k8s
  3. A serverless function running in the openfaas namespace which tries to publish messages to NATS. I am getting the error: Invalid connection. ---> NATS.Client.NATSNoServersException: Unable to connect to a server. Is this not allowed in the case of a namespace-scoped installation?

prune998 commented 4 years ago

@SidGrundfos the operator is here to create NATS clusters; it is not a NATS server/cluster on its own. Once installed, you can create a cluster in whatever namespace you want, depending on the CR you use. So you can deploy the NATS cluster either in the same namespace as your app or in another one; it does not have to be nats-io. Does that make sense?

SidGrundfos commented 4 years ago

@prune998 Makes sense. But my scenario is as below: a function (OpenFaaS) talking to a namespace-scoped installation, per the screenshots attached to the original issue.

I am getting the error shown in the last screenshot.

prune998 commented 4 years ago

Well, I have no experience with the NATS Streaming operator. I see one of the pods is crash-looping... why?

Ensure your NATS clusters (NATS + streaming) are working; test locally from the same namespace... Then, if your application can connect from inside the namespace but not from outside, you will have to dig into the RBAC and network setup... I can't help much here... This is not related to this issue at all, so if you think you've found a bug, please open a new issue. Otherwise, you may find help on a NATS or Kubernetes Slack channel, forum, or mailing list.

SidGrundfos commented 4 years ago

Yes, true, but it's one of the replica pods. I created 3 replicas each for the operator and streaming; only one of the replicas is crashing, the others are running fine.

surajnarwade commented 2 years ago

Any updates on this? I would like to install a cluster-scoped deployment in a namespace other than nats-io.