projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Remove hard-coded namespace from Helm Chart 'projectcalico/tigera-operator' #4812

Closed davivcgarcia closed 2 years ago

davivcgarcia commented 3 years ago

Expected Behavior

When installing the Helm Chart, the user expects the namespace to be configurable via the --namespace flag. However, the current version hard-codes the namespace and doesn't respect this convention.

https://helm.sh/docs/faq/changes_since_helm2/#release-names-are-now-scoped-to-the-namespace

Current Behavior

If you render the current chart version, you'll see that many resources are created with the hard-coded namespace tigera-operator.

helm template calico projectcalico/tigera-operator --namespace dummy

Possible Solution

Update the Helm Chart to respect the namespace provided by the user via --namespace, and recommend using tigera-operator in the docs. Also, drop the creation of the namespace from the chart and recommend using the --create-namespace flag instead.
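As a sketch of the first part (a hypothetical template snippet, not the actual chart file), the templates could reference Helm's built-in release namespace instead of a literal string:

```yaml
# templates/tigera-operator-deployment.yaml (illustrative sketch only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tigera-operator
  # Render into whatever namespace the user passed via --namespace,
  # instead of a hard-coded "tigera-operator".
  namespace: {{ .Release.Namespace }}
```

With a change like this, `helm template calico projectcalico/tigera-operator --namespace dummy` would render the resources into `dummy`.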

Context

The way the chart is written today breaks GitOps workflows that use FluxCD.

davivcgarcia commented 3 years ago

This issue was initially raised at tigera/operator#1434 and tigera/operator#1427.

stevehipwell commented 3 years ago

I've got an alternative chart that supports running the operator in a custom namespace.

ArchiFleKs commented 3 years ago

Right now it is not possible to install the Helm release inside the namespace tigera-operator, which is annoying because the chart itself tries to create that namespace.

caseydavenport commented 3 years ago

@davivcgarcia marked this as a bug. I am curious though why the tigera-operator namespace isn't suitable / why you need to change it?

caseydavenport commented 3 years ago

Actually swapping this to enhancement - I think it's working as intended right now, but that doesn't mean we can't change it. I do need to understand the motivation for wanting to install in a separate namespace though.

ArchiFleKs commented 3 years ago

@caseydavenport the namespace is suitable, but the Helm release cannot be installed in the tigera-operator namespace because it already exists (either created by Helm or by another tool beforehand, in my case with Terraform, for example). The current state of things is:

I think the namespace should just be removed and left to the user, or at least not created via Helm.

Hope that is clearer. A Helm v3 release needs to be installed in a namespace, which is specified when running the helm install/upgrade command. This namespace needs to exist already, or it can be created by Helm with --create-namespace, but this chart also creates the namespace itself.

TLDR: helm install tigera-operator -n tigera-operator does not work.

Edit: If you need people to use the tigera-operator namespace, that's fine as long as it is specified in the docs or elsewhere, but people need to be able to decide how the namespace gets created, instead of having no choice but to have it created and managed by the Helm release.

caseydavenport commented 3 years ago

the namespace is suitable, but the Helm release cannot be installed in the tigera-operator namespace because it already exists (either created by Helm or by another tool beforehand, in my case with Terraform, for example)

If I understand correctly, you mean that there is already a deployment of the tigera-operator on the cluster?

If that's the case, you definitely shouldn't be installing a second one in a new namespace. That's a surefire way to break your cluster. Sorry if I'm still misunderstanding, though!

ArchiFleKs commented 3 years ago

No, what I mean is that Helm needs a namespace to store the release, in the sense of storing the Helm release in a Secret. With this chart, there are three use cases:

ArchiFleKs commented 3 years ago

the namespace is suitable, but the Helm release cannot be installed in the tigera-operator namespace because it already exists (either created by Helm or by another tool beforehand, in my case with Terraform, for example).

By that I mean namespaces are somewhat privileged resources, and people might want to manage them in a different way. Almost no chart embeds a Namespace object, but Helm offers the option to create it or not.

For example, when I manage clusters with Terraform, I create the namespace with Terraform instead of letting Helm create it automatically, so we can add annotations or labels, etc. Then I install the release into the Terraform-created namespace.
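To illustrate that workflow (the labels and annotations below are hypothetical), the namespace is created out-of-band with whatever metadata is needed, and the release is then installed into it without `--create-namespace`:

```yaml
# namespace.yaml — managed by Terraform/kubectl, not by the chart
apiVersion: v1
kind: Namespace
metadata:
  name: tigera-operator
  labels:
    team: platform                # hypothetical label
  annotations:
    managed-by: terraform         # hypothetical annotation
```

Then: `kubectl apply -f namespace.yaml && helm install -n tigera-operator calico projectcalico/tigera-operator`.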

ArchiFleKs commented 3 years ago

Here is the output on a fresh minikube cluster:

 helm install -n tigera-operator calico projectcalico/tigera-operator --version v3.20.0 
Error: create: failed to create: namespaces "tigera-operator" not found

Then, if I create the namespace with Helm:

helm install -n tigera-operator calico projectcalico/tigera-operator --version v3.20.0 --create-namespace      
Error: namespaces "tigera-operator" already exists

The official docs install into the default namespace:

helm install calico projectcalico/tigera-operator --version v3.20.0
W0817 01:44:48.216026 1781290 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0817 01:44:48.275522 1781290 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: calico
LAST DEPLOYED: Tue Aug 17 01:44:48 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

The Helm release is installed in the default namespace:

helm ls -n default
NAME    NAMESPACE       REVISION        UPDATED                                         STATUS          CHART                   APP VERSION
calico  default         1               2021-08-17 01:44:48.051033191 +0200 CEST        deployed        tigera-operator-v3.20.0 v3.20.0

But the pods are in tigera-operator, while the Helm release is not:

k -n tigera-operator get pods 
NAME                               READY   STATUS    RESTARTS   AGE
tigera-operator-698876cbb5-nt8bz   1/1     Running   0          68s

caseydavenport commented 3 years ago

@ArchiFleKs thanks for the detailed explanation, I think that makes more sense to me now!

Here's what I'm thinking - we need to allow specification of the -n argument to helm so that it can properly put its releases in the tigera-operator namespace. However, we really don't want two versions of the operator installed on the same cluster, so we probably want to make sure that the Deployment, etc. are hard-coded to tigera-operator, independent of the --namespace argument passed to helm.
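A hedged sketch of that split (hypothetical template, not the real chart): the release metadata lives wherever -n points, while the workload namespace stays pinned:

```yaml
# Sketch: pin the workload namespace regardless of the release namespace.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tigera-operator
  namespace: tigera-operator   # fixed; deliberately ignores .Release.Namespace
```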

I think that means we should:

One thing I don't fully understand is the implication for upgrades for anyone who is using the existing chart. Would helm delete the namespace on upgrade?

CC @ozdanborne @lwr20

stevehipwell commented 3 years ago

@caseydavenport I don't think you should be using hard-coded namespaces to block a second operator being installed in the cluster; that's not what Helm is designed for. An idiomatic Helm chart should be installable in any namespace. I get the need for the calico-system namespace, but there is no reason why the operator needs to be in the tigera-operator namespace. If you really need to guarantee there is only ever one operator, it would need to be a collaborative solution, such as a fixed-name Secret in the installation namespace (created by the chart to stop multiple installations into the same namespace), combined with the operator maintaining a fixed-name Secret in calico-system to identify the active operator. However, after this you're into the 80/20 realm of diminishing returns, and you would need to consider whether the effort is worth it for an edge case that is actively documented against.

RE the Helm chart I'd suggest a v2 release to fix the current shortcomings. I have my own version of the chart which can be installed into any namespace that I'd be happy to offer as a PR.

caseydavenport commented 3 years ago

However after this you're into the 80/20 realm of diminishing returns, and you would need to consider whether the effort is worth it for an edge case that is actively documented against.

Yeah, I wish it were true, but actively documenting against something doesn't mean it won't happen. I guarantee that we'll end up seeing users hitting issues due to accidental misuse of the chart.

That's not to say we shouldn't do it - I completely agree that we're going against the grain here a bit - just understand that it's additional time investment to enable a workflow that doesn't have a clear-cut use-case other than aligning with an idiom.

I'm completely on board with calling the current behavior a bug (the Helm release being installed into default). However, I don't see the lack of ability to configure an arbitrary namespace for the entire chart as a bug, because that is working as designed. But it seems like we might need to allow for that in order to fix the Helm release issue, so the point is probably moot.

@stevehipwell a few more questions for you that popped into my head. These might be obvious to a helm expert but not to me!

stevehipwell commented 3 years ago

@caseydavenport have you taken a look at my version of the chart I linked? I'd happily open a PR to change the chart with the MVP for single install hack, which is a fixed name secret in kube-system. This could act as a discussion point. In an ideal world I could deprecate my chart if the official one supported my use case.

In response to your questions: global resources are usually named to align with the chart name; see my chart above. See my chart for how I deal with the Installation resource. I'm also going to add templating for the Installation so that any chart variable can be passed in, as an alternative to the current chart's use of pull secrets. RE v2, I was referring to a breaking API change; the chart above uses the Helm v2 API (I'm not sure whether the current chart does or not).

tmjd commented 3 years ago

We're expecting to include this issue as a topic of discussion next week (Sept 8th, 2021) in the Calico community meeting, if anyone here wants to join and be part of that discussion. You can find the details to join here: https://github.com/projectcalico/community

tmjd commented 3 years ago

@stevehipwell I took a look at your chart, specifically the Installation resource, and I don't see any specific handling of that global resource. AFAICT, it looks like if someone installed multiple versions of the chart with installation.enabled, whichever was applied second would take precedence. Or, because the Installation resource already existed, would the second one with installation.enabled be blocked?

I was looking at https://github.com/stevehipwell/helm-charts/blob/master/charts/tigera-operator/templates/installation.yaml (mentioning in case there is something else I should be looking at too).

stevehipwell commented 3 years ago

@tmjd it's on my backlog to test the behaviour, but my expectation is that a second attempted installation would fail due to the presence of a fixed-name global resource. This is where Helm beats kubectl (even server-side apply with field ownership), thanks to its concept of resource ownership.

After working through the code as part of the priority class PR, it is my opinion that the Helm chart source should live in the operator repo, as that is where the related logic lives. I'd also suggest that the chart be published from the operator repo in full form, as the Calico version appears to be missing CRDs and thus functionality. If Calico needs a chart as a release artifact, the source could be copied (AWS does this for eks-charts) and CRDs could be excluded as desired.

FYI the cluster role in the current Calico chart is incorrect, resulting in the operator being unable to watch felix configurations. My chart's latest release solves this by allowing the operator to watch and get all the Calico CRD resources, but it could also be fixed just for that specific one.

tmjd commented 3 years ago

FYI the cluster role in the current Calico chart is incorrect, resulting in the operator being unable to watch felix configurations.

You're talking about using the Calico cluster role with a master build of the operator correct? I believe this has been fixed now.

BTW, there has been some talk recently about making the operator repo the source of the manifests for installing the operator.

stevehipwell commented 3 years ago

@tmjd I'm talking about the cluster role in the last Calico release chart artifact.

stevehipwell commented 3 years ago

@tmjd I've just added a lock secret to my chart (https://github.com/stevehipwell/helm-charts/pull/318), since the Installation resource isn't always created, to make it possible to install the chart only once. This could be adopted by the operator code and used to validate against the OPERATOR_NAME & OPERATOR_NAMESPACE env variables, which would stop someone with custom YAML from getting two operators into a cluster, as only the one with the correct configuration would process the CRD. Another possibility to support this would be to capture these details directly in the CRD, so the operator could validate itself against them.
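A sketch of what such a lock Secret might look like (the name and placement here are hypothetical, not necessarily what the linked PR uses): a fixed-name Secret owned by the release, so a second install fails because Helm refuses to adopt a resource owned by another release:

```yaml
apiVersion: v1
kind: Secret
metadata:
  # Fixed name: a second `helm install` of this chart fails because
  # this Secret already exists and is owned by the first release.
  name: tigera-operator-lock
  namespace: kube-system   # placement assumed, per the earlier kube-system suggestion
type: Opaque
```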

tmjd commented 3 years ago

@stevehipwell could you create a separate issue about the Helm chart? From what I can tell, the chart should be fine as long as it is used with the v1.20.0 version of the operator that the Helm chart includes.

stevehipwell commented 3 years ago

@tmjd that might explain it as I'm using the v1.21.1 operator in my chart.

caseydavenport commented 2 years ago

FYI I've started some incremental work on improving our helm chart to be more idiomatic. The first one I want to tackle is this issue - removing the hardcoded namespace from the chart.

I've got a PR here that includes chart changes, manifest changes, and also includes some instructions for how to upgrade from previous chart releases gracefully: https://github.com/projectcalico/calico/pull/5649

Would appreciate if interested parties could take a look and give me some feedback on it.

The upgrade steps in particular are a little bit of a hack, but I think probably a necessary evil in order to ensure a graceful upgrade with no downtime.