kubernetes-sigs / node-feature-discovery-operator

Operator for managing Node Feature Discovery deployment
Apache License 2.0
66 stars 29 forks source link

Move to Operator-Framework Helm based #209

Closed ArangoGutierrez closed 9 months ago

ArangoGutierrez commented 11 months ago

Based on https://sdk.operatorframework.io/docs/building-operators/helm/

k8s-ci-robot commented 11 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ArangoGutierrez

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubernetes-sigs/node-feature-discovery-operator/blob/master/OWNERS)~~ [ArangoGutierrez] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
ArangoGutierrez commented 11 months ago

/assign @marquiz /cc @zvonkok

ArangoGutierrez commented 11 months ago

@ArangoGutierrez: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command pull-node-feature-discovery-operator-verify 1bdcedc link true /test pull-node-feature-discovery-operator-verify pull-node-feature-discovery-operator-verify-docs 1bdcedc link true /test pull-node-feature-discovery-operator-verify-docs pull-node-feature-discovery-operator-build-image 1bdcedc link true /test pull-node-feature-discovery-operator-build-image Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

/hold /needs-ok-to-test

ArangoGutierrez commented 11 months ago

This PR proposes moving from a GO based operator to a HELM based operator NFD has become very good at being deployed by HELM, and from an operator point of view we only need a controller to keep objects healthy. The operator has been lagging behind NFD development, with this move, NFD Helm chart becomes the CRD it self, making it easier to maintain the Operator, and also makes the operator more flexible to control NFD deployment.

yevgeny-shnaidman commented 11 months ago

This PR proposes moving from a GO based operator to a HELM based operator NFD has become very good at being deployed by HELM, and from an operator point of view we only need a controller to keep objects healthy. The operator has been lagging behind NFD development, with this move, NFD Helm chart becomes the CRD it self, making it easier to maintain the Operator, and also makes the operator more flexible to control NFD deployment.

I am not very familiar with help operator, but i have a couple of questions: 1) in the current operator we have a finalization logic, would we need it in helm-operator? 2) how is it going to handle prune option deployment during NFD CR deletion? 3) moving to helm operator removes a lot of code, but on the other hand it limits our flexibility, in case we want to add a new feature to NFD operator

ArangoGutierrez commented 11 months ago

This PR proposes moving from a GO based operator to a HELM based operator NFD has become very good at being deployed by HELM, and from an operator point of view we only need a controller to keep objects healthy. The operator has been lagging behind NFD development, with this move, NFD Helm chart becomes the CRD it self, making it easier to maintain the Operator, and also makes the operator more flexible to control NFD deployment.

I am not very familiar with help operator, but i have a couple of questions:

  1. in the current operator we have a finalization logic, would we need it in helm-operator?
  2. how is it going to handle prune option deployment during NFD CR deletion?
  3. moving to helm operator removes a lot of code, but on the other hand it limits our flexibility, in case we want to add a new feature to NFD operator
  1. It will have finalizers, just test it and see.
  2. Prune is a NFD option, not an Operator option. so You can pass that via Helm values on the CR.
  3. I don't see how. I think this moves the operator to a more healthy state. it has been un touched for a long time, with this, Maintaining the NFD-Operator becomes part of maintaining NFD Helm Charts. IF we need new features they should go into NFD, the operator is just to control the lifecycle of the Operand.
yevgeny-shnaidman commented 11 months ago

This PR proposes moving from a GO based operator to a HELM based operator NFD has become very good at being deployed by HELM, and from an operator point of view we only need a controller to keep objects healthy. The operator has been lagging behind NFD development, with this move, NFD Helm chart becomes the CRD it self, making it easier to maintain the Operator, and also makes the operator more flexible to control NFD deployment.

I am not very familiar with help operator, but i have a couple of questions:

  1. in the current operator we have a finalization logic, would we need it in helm-operator?
  2. how is it going to handle prune option deployment during NFD CR deletion?
  3. moving to helm operator removes a lot of code, but on the other hand it limits our flexibility, in case we want to add a new feature to NFD operator
  1. It will have finalizers, just test it and see.
  2. Prune is a NFD option, not an Operator option. so You can pass that via Helm values on the CR.
  3. I don't see how. I think this moves the operator to a more healthy state. it has been un touched for a long time, with this, Maintaining the NFD-Operator becomes part of maintaining NFD Helm Charts. IF we need new features they should go into NFD, the operator is just to control the lifecycle of the Operand.

Regarding 3: new features like tolerations and node selectors are operator features. So they either go into the Go code, or we need to set them in the Helm charts based on NFD CR.

ArangoGutierrez commented 11 months ago

This PR proposes moving from a GO based operator to a HELM based operator NFD has become very good at being deployed by HELM, and from an operator point of view we only need a controller to keep objects healthy. The operator has been lagging behind NFD development, with this move, NFD Helm chart becomes the CRD it self, making it easier to maintain the Operator, and also makes the operator more flexible to control NFD deployment.

I am not very familiar with help operator, but i have a couple of questions:

  1. in the current operator we have a finalization logic, would we need it in helm-operator?
  2. how is it going to handle prune option deployment during NFD CR deletion?
  3. moving to helm operator removes a lot of code, but on the other hand it limits our flexibility, in case we want to add a new feature to NFD operator
  1. It will have finalizers, just test it and see.
  2. Prune is a NFD option, not an Operator option. so You can pass that via Helm values on the CR.
  3. I don't see how. I think this moves the operator to a more healthy state. it has been un touched for a long time, with this, Maintaining the NFD-Operator becomes part of maintaining NFD Helm Charts. IF we need new features they should go into NFD, the operator is just to control the lifecycle of the Operand.

Regarding 3: new features like tolerations and node selectors are operator features. So they either go into the Go code, or we need to set them in the Helm charts based on NFD CR.

Agree, we are currently working to get those options into NFD deployment(Helm) I have an open PR with Automated Taints under review, so the idea is every functionality goes into NFD, and the operator is to manage it's life cycle, not to add new features

ArangoGutierrez commented 10 months ago

Gently reminder PING @marquiz @yevgeny-shnaidman

ArangoGutierrez commented 10 months ago

@marquiz @yevgeny-shnaidman @zvonkok Hope we can take a look this week at this proposal

ArangoGutierrez commented 10 months ago

Thanks @ArangoGutierrez. I think this would make sense.

With this – the operator basically going hand-in-hand-lockstep with the Helm chart - the immediate next question is that should we move the operator to the NFD main repo? 😊

sounds reasonable, but let's do 1 step at a time

yevgeny-shnaidman commented 10 months ago

Thanks @ArangoGutierrez. I think this would make sense.

With this – the operator basically going hand-in-hand-lockstep with the Helm chart - the immediate next question is that should we move the operator to the NFD main repo? 😊

@marquiz @ArangoGutierrez what is different in this new implementation that requires moving operator into the main repo? If i understand correctly , the new implementation just replaces the build/assets with the helm chats stored in the operator image.

ArangoGutierrez commented 10 months ago

Thanks @ArangoGutierrez. I think this would make sense. With this – the operator basically going hand-in-hand-lockstep with the Helm chart - the immediate next question is that should we move the operator to the NFD main repo? 😊

@marquiz @ArangoGutierrez what is different in this new implementation that requires moving operator into the main repo? If i understand correctly , the new implementation just replaces the build/assets with the helm chats stored in the operator image.

Is not a requirement, We have discussed in the past, having the 2 projects in a single git-repo. but is not the goal of this PR. What we have discussed in the past, is that the Operator should be the to-go way of deploying NFD, and having it on the same repo makes it more clear for users, and easier to maintain for us. Also Having the Operator be a Helm based operator, means we could potentially have a unified release cycle. instead of the operator playing catch with NFD releases.

ArangoGutierrez commented 10 months ago

Few random'ish comments/questions below.

But I think this is good, if @yevgeny-shnaidman is fine with this change.

Have you verified that it works? 😅

Thanks for the review @marquiz !! I have addressed your comments. Please take it for a spin, it works on my local Minikube, but a double check never hurts PTAL

marquiz commented 10 months ago

Not without issues, did a quick test:

$ helm -n nfd-o install -f values-local.yaml nfd-o deploy/helm/nfd-operator/
W1103 15:53:54.344149 2245049 warnings.go:70] unknown field "spec.enableNodeFeatureApi"
W1103 15:53:54.344161 2245049 warnings.go:70] unknown field "spec.gc"
...
Error: INSTALLATION FAILED: 1 error occurred:
    * NodeFeatureDiscovery.nfd.k8s-sigs.io "nfd-operator-instance" is invalid: spec.topologyUpdater: Invalid value: "object": spec.topologyUpdater in body must be of type boolean: "object"

Also creation of the CR failed:

$ k apply -f nfd.yaml
Error from server (BadRequest): error when creating "nfd.yaml": NodeFeatureDiscovery in version "v1" cannot be handled as a NodeFeatureDiscovery: strict decoding error: unknown field "spec.image"
ArangoGutierrez commented 10 months ago

Not without issues, did a quick test:

$ helm -n nfd-o install -f values-local.yaml nfd-o deploy/helm/nfd-operator/
W1103 15:53:54.344149 2245049 warnings.go:70] unknown field "spec.enableNodeFeatureApi"
W1103 15:53:54.344161 2245049 warnings.go:70] unknown field "spec.gc"
...
Error: INSTALLATION FAILED: 1 error occurred:
  * NodeFeatureDiscovery.nfd.k8s-sigs.io "nfd-operator-instance" is invalid: spec.topologyUpdater: Invalid value: "object": spec.topologyUpdater in body must be of type boolean: "object"

Also creation of the CR failed:

$ k apply -f nfd.yaml
Error from server (BadRequest): error when creating "nfd.yaml": NodeFeatureDiscovery in version "v1" cannot be handled as a NodeFeatureDiscovery: strict decoding error: unknown field "spec.image"

Thanks for reviewing, I have pushed an amend, I got it working with both make deploy and helm install

image I used for testing ghcr.io/arangogutierrez/node-feature-discovery-operator:v0.2.0-226-gb006fa0a-dirty

ArangoGutierrez commented 10 months ago

Few notes about the deployment. Is it desired that deploying the operator also always automatically deploys the operand in the same namespace? This doesn't feel right to me...

@yevgeny-shnaidman any comments from you to this PR.

Umm It was an idea I had, to help users get a running NFD right after helm install but can be removed.

marquiz commented 10 months ago

Umm It was an idea I had, to help users get a running NFD right after helm install but can be removed.

Kinda nice, I guess, but if we do this we should have a clear understanding of the implications. In this scenario the operator and operand become part of the same deployment (also the operand will be managed through the same helm deployment) 🧐 At least we should have a parameter for the user to opt-out from default-installing the operand. Plus make sure that helm uninstall works as expected in all cases (which is hard, I think, as the Helm deployment is not managing the nfd operand itself...)

yevgeny-shnaidman commented 10 months ago

Few notes about the deployment. Is it desired that deploying the operator also always automatically deploys the operand in the same namespace? This doesn't feel right to me...

@yevgeny-shnaidman any comments from you to this PR.

@marquiz if i understand correctly than the operand will only be deployed once the NFD CR is deployed. or does it bother you that it will be deployed in the same namespace?

marquiz commented 10 months ago

@marquiz if i understand correctly than the operand will only be deployed once the NFD CR is deployed. or does it bother you that it will be deployed in the same namespace?

Probably the biggest concern were the dependencies in the deployment and who manages what.

I.e. NFD CR was deployed by Helm -> the operator deployed NFD operand -> Helm uninstall will remove NFD CR and the operator and the NFD remains running un-managed

I had hard time wrapping my head around this scheme. At least there should be a parameter to disable automatic deployment of NFD CR (and thus the operand). Also it feels a bit overloaded/unorthodox to manage both the operator and the operator with the same Helm chart (what is the purpose of the operator in this scheme?)

yevgeny-shnaidman commented 10 months ago

@marquiz if i understand correctly than the operand will only be deployed once the NFD CR is deployed. or does it bother you that it will be deployed in the same namespace?

Probably the biggest concern were the dependencies in the deployment and who manages what.

I.e. NFD CR was deployed by Helm -> the operator deployed NFD operand -> Helm uninstall will remove NFD CR and the operator and the NFD remains running un-managed

I had hard time wrapping my head around this scheme. At least there should be a parameter to disable automatic deployment of NFD CR (and thus the operand). Also it feels a bit overloaded/unorthodox to manage both the operator and the operator with the same Helm chart (what is the purpose of the operator in this scheme?)

I am probably missing something, what do you mean by automatic deployment of NFD CR? NFD CR is deployed by cluster admin/user/other operator, no? If i understand the sequence correctly:

  1. operator is deployed
  2. admin/user/other_operator deploys NFD CR ( either via Helm or by some other means)
  3. operator deploys operand In case operator is un-deployed at the same time as NFD CR is un-deployed, the it is a problem, but maybe we can fix it with Owner/Controller references
marquiz commented 10 months ago

NFD CR is deployed by cluster admin/user/other operator, no? If i understand the sequence correctly:

  1. operator is deployed
  2. admin/user/other_operator deploys NFD CR ( either via Helm or by some other means)
  3. operator deploys operand In case operator is un-deployed at the same time as NFD CR is un-deployed, the it is a problem, but maybe we can fix it with Owner/Controller references

Exactly my thinking. I was commenting a previous version of this PR where step 2. was missing (or steps 1. and 2. were combined). But now that's gone from the PR

ArangoGutierrez commented 10 months ago

NFD CR is deployed by cluster admin/user/other operator, no? If i understand the sequence correctly:

  1. operator is deployed
  2. admin/user/other_operator deploys NFD CR ( either via Helm or by some other means)
  3. operator deploys operand In case operator is un-deployed at the same time as NFD CR is un-deployed, the it is a problem, but maybe we can fix it with Owner/Controller references

Exactly my thinking. I was commenting a previous version of this PR where step 2. was missing (or steps 1. and 2. were combined). But now that's gone from the PR

I think we can add the automated CR deployment later on, just moving to helm is a big change, we don't want to change today's way of deploying NFD, people are used to deploying the CR after deploying via Helm/Kustomize, let's not surprise them with a pre-created CR for now

ArangoGutierrez commented 10 months ago

/cc yevgeny-shnaidman

ArangoGutierrez commented 10 months ago

@yevgeny-shnaidman have you had time to take this Patch for a spin?

yevgeny-shnaidman commented 10 months ago

@yevgeny-shnaidman have you had time to take this Patch for a spin?

i am trying to port into OCP as we speak...

yevgeny-shnaidman commented 10 months ago

@yevgeny-shnaidman have you had time to take this Patch for a spin?

i am trying to port into OCP as we speak...

BTW, is there a helm-operator image for ARM?

ArangoGutierrez commented 10 months ago

@yevgeny-shnaidman have you had time to take this Patch for a spin?

i am trying to port into OCP as we speak...

BTW, is there a helm-operator image for ARM?

I built and tested this on my M1 Mac, just run make image and you will get an ARM image.

yevgeny-shnaidman commented 10 months ago

@yevgeny-shnaidman have you had time to take this Patch for a spin?

i am trying to port into OCP as we speak...

BTW, is there a helm-operator image for ARM?

I built and tested this on my M1 Mac, just run make image and you will get an ARM image.

It' s a multi-arch image?

ArangoGutierrez commented 10 months ago

@yevgeny-shnaidman have you had time to take this Patch for a spin?

i am trying to port into OCP as we speak...

BTW, is there a helm-operator image for ARM?

I built and tested this on my M1 Mac, just run make image and you will get an ARM image.

It' s a multi-arch image?

you can build multiarch if you want, look at the Makefile rules, I think you need docker with buildx enabled in your host

ArangoGutierrez commented 9 months ago

@yevgeny-shnaidman Everything under the nfd folder has to be the same Helm chart as in the NFD Helm chart. we should not modify that folder.

The helm-based-operator will use the information on the CR to then deploy the chart under nfd folder and modify the variables as needed.

yevgeny-shnaidman commented 9 months ago

@yevgeny-shnaidman Everything under the nfd folder has to be the same Helm chart as in the NFD Helm chart. we should not modify that folder.

The helm-based-operator will use the information on the CR to then deploy the chart under nfd folder and modify the variables as needed.

@ArangoGutierrez i understand that, i am just saying that with current code, the image values ( for example) are not taken from the NFD CR

ArangoGutierrez commented 9 months ago

@yevgeny-shnaidman Everything under the nfd folder has to be the same Helm chart as in the NFD Helm chart. we should not modify that folder. The helm-based-operator will use the information on the CR to then deploy the chart under nfd folder and modify the variables as needed.

@ArangoGutierrez i understand that, i am just saying that with current code, the image values ( for example) are not taken from the NFD CR

Why you say that? the NFD folder got created by the operator-framework tool, and it ties the Helm Values.yaml file to the CRD. remeber this PR creates a new CRD that is basically NFD Values.yaml file

yevgeny-shnaidman commented 9 months ago

@yevgeny-shnaidman Everything under the nfd folder has to be the same Helm chart as in the NFD Helm chart. we should not modify that folder. The helm-based-operator will use the information on the CR to then deploy the chart under nfd folder and modify the variables as needed.

@ArangoGutierrez i understand that, i am just saying that with current code, the image values ( for example) are not taken from the NFD CR

Why you say that? the NFD folder got created by the operator-framework tool, and it ties the Helm Values.yaml file to the CRD. remeber this PR creates a new CRD that is basically NFD Values.yaml file

Right, i have missed that part ( new CRD). I have been using the existing CR in my test. But do we want to change the CRD? Can't we use the old one? New CRD is a breaking change..

ArangoGutierrez commented 9 months ago

@yevgeny-shnaidman Everything under the nfd folder has to be the same Helm chart as in the NFD Helm chart. we should not modify that folder.

The helm-based-operator will use the information on the CR to then deploy the chart under nfd folder and modify the variables as needed.

@ArangoGutierrez i understand that, i am just saying that with current code, the image values ( for example) are not taken from the NFD CR

Why you say that? the NFD folder got created by the operator-framework tool, and it ties the Helm Values.yaml file to the CRD. remeber this PR creates a new CRD that is basically NFD Values.yaml file

Right, i have missed that part ( new CRD). I have been using the existing CR in my test. But do we want to change the CRD? Can't we use the old one? New CRD is a breaking change..

Yes, we know is a hard change, but is an improvement for the better of the project, and allows better integration with NFD deployment

ArangoGutierrez commented 9 months ago

Hi @yevgeny-shnaidman , I know this is a big change but we will get a lot from it. Migrating won't be that hard, we could help with documentation and examples. Hope we can work together to build a better Operator that goes more tied coupled with the Operand it manages

yevgeny-shnaidman commented 9 months ago

Hi @yevgeny-shnaidman , I know this is a big change but we will get a lot from it. Migrating won't be that hard, we could help with documentation and examples. Hope we can work together to build a better Operator that goes more tied coupled with the Operand it manages

@ArangoGutierrez i think think that if we keep the NFD CRD as is we can integrate your PR pretty easily. All we need to do is change the values.yaml to be conformant with the NFD CRD, and to change templates' conditions accordingly. Any new configuration (GC, Topology, etc') can be added as new fields to the CRD, so that most customer can continue using their operators without any change.

ArangoGutierrez commented 9 months ago

Hi @yevgeny-shnaidman , I know this is a big change but we will get a lot from it. Migrating won't be that hard, we could help with documentation and examples. Hope we can work together to build a better Operator that goes more tied coupled with the Operand it manages

@ArangoGutierrez i think think that if we keep the NFD CRD as is we can integrate your PR pretty easily. All we need to do is change the values.yaml to be conformant with the NFD CRD, and to change templates' conditions accordingly. Any new configuration (GC, Topology, etc') can be added as new fields to the CRD, so that most customer can continue using their operators without any change.

Changing the values.yaml is not an option, the idea is to use the one straight from NFD Charts, if not we end up in the same situation where the operator has a "special" condition that we have to maintain, and is extra work. The whole idea behind this PR, and other comments like moving the Operator to the NFD repo, is to reduce the maintenance effort, so we can focus on features.

We understand this might be a big change, but moving from the current CRD to this new one is not a lot of effort. and right now the operator is so outdated, that customers can not use new features from NFD, and that is something we want to remediate.