Re-factor operator code to be compliant with k8s operator guidelines

yevgeny-shnaidman commented 7 months ago

Currently the NFD operator is implemented using the outdated operator format, relaying on the hard-coded yamls files, configured in the build/assets directory. Those directories include, besides Deployments and Daemonsets definitions also service-accounts/roles/roles/bindings definitions.Reconciliation loop is reading all those files, updates the data with input from NFD CR and creates/updates the resources in the k8s cluster.

We want to move to a more compliant operator, and those are the changes that we may apply: 1.Remove build/assets directory. All the k8s objects will be created on-the-fly by the appropriate reconciller/s with the default values of the structures updated with values taken from NFD CR

All the RBAC/SA object will not be created/updated during reconciliation, but will be created during the deployment process
Reconcile the following 3 components that can be deployed by the operator:
- worker daemonset/configmap
- master deployment
- topology daemonset/configmap

We have 2 option how to reconcile those objects:

Option 1

We define a controller per component (worker, master, topology). Each controller will watch for NFD CR and the appropriate k8s object (daemonset, deployment, configmap) and will run a reconciliation loop whenever it detects a change.

Option 2

We define one controller that will watch for NFD CR, daemonsets/deployments and configmaps. It will run a reconciliation on any change. It will call a dedicated package to reconcile each one of the components.

CR Status

We can remove removing reporting conditions on any of the RBAC features/SA, since those will be created during operator deployment. We will only report on Daemonsets/Deployment creation/patching. No need to report on configmaps

yevgeny-shnaidman commented 7 months ago

/cc @marquiz @ArangoGutierrez

yevgeny-shnaidman commented 7 months ago

/cc @chr15p

marquiz commented 7 months ago

Thanks @yevgeny-shnaidman for this. The plan sounds good to me 👍 I would've probably resorted to e.g. ansible-based operator had I been doing this overhaul myself 😅 But you know operators and e.g. the OpenShift requirements so much better than me (still an operator n00b).

Regarding option 1 vs option 2: are you still talking about one NFD CRD (not one CRD per daemon)? I don't think I have much opinion on which would be preferred. Whatever makes sense from the implementation pov, I'd choose...

Thoughts @ArangoGutierrez?

yevgeny-shnaidman commented 7 months ago

@marquiz we are not changing/adding any APIs (CRDs) , just refactoring the code. a controller or controllers will watch for NFD CR and will extract from it the parts that are relevant for their specific components

ArangoGutierrez commented 7 months ago

/lgtm

yevgeny-shnaidman commented 7 months ago

@marquiz @ArangoGutierrez we are gonna starting coding next week, we are going to work on the master branch. Do you have any reservations regarding it? It means that for a number of week the master branch will not be deployable

marquiz commented 7 months ago

Cool @yevgeny-shnaidman. I don't think there are any reservations on that.

But what do you mean by not deployable? The master branch should remain buildable (and staging images available) all the time.

yevgeny-shnaidman commented 7 months ago

@marquiz i thought of working by submitting small PRs each time: controller/reconciler loop, then worker handling, deployment handling, topologyupdate etc'. Then taking care of deployment and bundling. So, there is going to be a period where the operator is not operational

marquiz commented 7 months ago

I think that's fine. I don 't think anyone's really using the staging/devel version of the operator

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

yevgeny-shnaidman commented 4 months ago

/remove-lifecycle stale

marquiz commented 4 months ago

@yevgeny-shnaidman any news on this issue, new PRs to expect in the near future?

yevgeny-shnaidman commented 4 months ago

@marquiz , me and Chris were on PTO, we will be back next week. Most of the PRs are in, we will start testing in, and the subsequent PRs will include the necessary fixes

yevgeny-shnaidman commented 2 months ago

@marquiz, @ArangoGutierrez this task can be considered completed. But in order to keep the operator in sync with the operand development, i think we need to start talking about unifying the operator and operand repos

ArangoGutierrez commented 2 months ago

/lgmt /close

k8s-ci-robot commented 2 months ago

@ArangoGutierrez: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/node-feature-discovery-operator/issues/210#issuecomment-2203409334): >/lgmt >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

kubernetes-sigs / node-feature-discovery-operator