mercedes-benz / garm-operator

a k8s operator to run garm
MIT License

`garm` server configuration should be possible via `garm-operator` #143

Open bavarianbidi opened 1 month ago

bavarianbidi commented 1 month ago

What is the feature you would like to have?

With the recent changes in garm (and also future ideas in the backlog) it will be possible to "configure" a garm-server instance in a declarative way. The current version of garm-operator already detects whether a garm instance is initialized and takes care of the initialization if needed.

As we already check on every API request whether the garm server is initialized (func `EnsureAuth`), we should at least decouple this implementation and use existing Kubernetes primitives to make the code easier to read.

I would like to see a dedicated controller which is responsible for the generic garm-server management, and controllers for the scoped resources (separation of concerns).

Current known resources

| name | scope | depends_on |
| --- | --- | --- |
| github endpoint | shared across github credentials | |
| github credentials | shared across orgs/repos/enterprises | github endpoint |
| webhooks | | |
| repo/org | | github credentials |
| pool | | repo/org, enterprise |
| enterprise | | github credentials |
| image | shared across pools | |

Anything else you would like to add?

As the configuration of the github backend has already moved out of the config file and into the API (with https://github.com/cloudbase/garm/pull/243) and we already have an issue for this (https://github.com/mercedes-benz/garm-operator/issues/127), we should agree on the separation level first.

What could the new CRs look like? Maybe we can "design" the CRs before we start implementing :pray:
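To kick off that design discussion, here is a purely illustrative sketch of what two of the scoped CRs from the table above could look like. The kinds, the API group and every field name are assumptions, not an agreed design:

```yaml
# hypothetical CRs - kinds, group and fields are placeholders for discussion only
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GitHubEndpoint
metadata:
  name: github-enterprise
spec:
  description: "internal GitHub Enterprise instance"
  baseUrl: https://github.company.io
  apiBaseUrl: https://github.company.io/api/v3
  uploadBaseUrl: https://github.company.io/api/uploads
  caCertBundleSecretRef:
    name: github-enterprise-ca
    key: ca.crt
---
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GitHubCredentials
metadata:
  name: org-pat
spec:
  description: "PAT shared across orgs/repos/enterprises"
  # depends_on: the shared endpoint defined above
  endpointRef:
    name: github-enterprise
  authType: pat
  secretRef:
    name: org-pat-secret
    key: token
```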

bavarianbidi commented 1 month ago

cc @gabriel-samfira - as you already have similar thoughts about managing garm itself via the operator: would be nice if you could drop some of your thoughts about this in here :pray:

gabriel-samfira commented 1 month ago

I've actually started playing around with this here: https://github.com/gabriel-samfira/garm-operator/tree/add-garm-server-controller

However I've been sidetracked a bit. Hopefully I can resume soon.

The basic idea is as follows:

This is just a rough idea so far. What do you think?

bavarianbidi commented 1 month ago

i'm not sure if i got the entire idea :see_no_evil:

The first few bullet-points in your list are awesome - especially as you probably know best how it should be :sweat_smile:

Would you even go one step further and manage the garm-server Deployment object (incl. Service, ...), or is it more like "generating and creating/updating the ConfigMap in k8s"? If so, the "state" of the garm-server Deployment would be managed by two dedicated controllers/components:

gabriel-samfira commented 1 month ago

The controller would manage a StatefulSet which is the garm-server itself. Not sure if it's worth having separate controllers for config map management. I think we could fit that into the reconciliation loop of the garm server controller.

The reconciliation loop would basically use the CRD settings to compose the config map, save the config map, and update the stateful set. If we add CRs for credentials and providers, we just add those two types to the loop as well: list all creds/providers, get the CRD settings, compose the config map, update the stateful set.

The controller can watch for changes to credentials, providers and the garm server CRD (which it owns) and react accordingly.
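To make that loop a bit more tangible, here is a rough, hypothetical sketch of a garm-server CR and the objects the controller would own. The kind `GarmServer` and all field names are made up for illustration; nothing here is settled:

```yaml
# hypothetical GarmServer CR - the controller described above would render spec
# fields like these into the garm config, store it as a ConfigMap/Secret and
# roll the owned StatefulSet whenever the rendered config changes
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GarmServer
metadata:
  name: garm
spec:
  image: ghcr.io/cloudbase/garm:v0.1.4   # image/tag are placeholders
  callbackUrl: https://garm.company.io/api/v1/callbacks
  metadataUrl: https://garm.company.io/api/v1/metadata
  webhookUrl: https://garm.company.io/webhooks
  database:
    passphraseSecretRef:
      name: garm-db-passphrase
      key: passphrase
  # the controller would also watch these referenced objects and re-render
  # the config whenever credentials or providers change
  providersFrom:
    - configMapRef:
        name: kubernetes-provider-config
status:
  configMapRef: garm-config   # owned, rendered from spec
  statefulSetRef: garm        # owned, restarted on config change
```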

For credentials and providers we need separate controllers/webhooks.

I think the best way to go about it would be to create a WIP branch and discuss it once it's ready. It doesn't seem too difficult to write a controller using kubebuilder. An initial draft would be without tests and not necessarily correct, but it would give everyone a view of the rough idea.

bavarianbidi commented 1 month ago

hey gabriel, first of all - thanks for the detailed explanation and all the work you already put into garm :heart:

we can also jump into a dedicated call and take some time to talk about garm, garm-operator and other related stuff to get a better feeling for what makes sense from different points of view.

just some thoughts which came to our mind: your desired approach feels very similar to the prometheus-operator implementation. The entire configuration of the prometheus stack (different parts of the application configuration + parts of the pod.spec) is done in multiple CRs. This makes offering Prometheus-as-a-Service super handy.

Taking this similarity and reflecting it onto your approach, it would be possible to offer Garm-as-a-Service very easily (which is awesome).
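For comparison, here is a shortened example of the prometheus-operator pattern mentioned above, where application configuration and parts of the pod spec live together in a single CR (fields taken from the upstream `monitoring.coreos.com/v1` API, values invented):

```yaml
# shortened prometheus-operator example: the Prometheus CR mixes application
# configuration (retention, serviceMonitorSelector) with pod.spec details
# (replicas, resources) in one custom resource
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  replicas: 2
  retention: 24h
  serviceMonitorSelector:
    matchLabels:
      team: platform
  resources:
    requests:
      memory: 400Mi
```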

Pros:

Cons:

Our preferred approach is still the following:

deployment manifest, e.g. we have to switch the pod's `dnsPolicy` and `dnsConfig` if we have to interact with azure:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "garm.fullname" . }}
  labels:
    {{- include "garm.labels" . | nindent 4 }}
spec:
  replicas: 1
  strategy:
    {{- .Values.deployment.strategy | toYaml | nindent 4 }}
  selector:
    matchLabels:
      {{- include "garm.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "garm.selectorLabels" . | nindent 8 }}
      annotations:
        {{- if .Values.rx.enabled }}
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "{{ .Values.service.port }}"
        prometheus.io/rx-scrape-every-1m: "true" # default
        {{- end }}
    spec:
      serviceAccountName: {{ .Values.serviceAccount.name }}
      securityContext:
        runAsUser: {{ .Values.securityContext.runAsUser }}
        fsGroup: {{ .Values.securityContext.fsGroup }}
      {{- if .Values.garm.provider.azure.enabled }}
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 8.8.8.8
      {{- end }}
      containers:
        - name: garm
          image: "{{ .Values.image.registry }}/{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          command: ["/opt/garm"]
          imagePullPolicy: {{ .Values.image.imagePullPolicy }}
          ports:
            - name: garm
              containerPort: 9997
              protocol: TCP
          livenessProbe:
            tcpSocket:
              port: 9997
            failureThreshold: 4
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2
          readinessProbe:
            tcpSocket:
              port: 9997
            failureThreshold: 4
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: {{ include "garm.fullname" . }}-garm-config-volume
              mountPath: /etc/garm
              readOnly: true
            {{- if .Values.garm.provider.openstack.enabled }}
            - name: {{ include "garm.fullname" . }}-openstack-provider-config-volume
              mountPath: /etc/garm/provider-config/openstack
              readOnly: true
            {{- end }}
            {{- if .Values.garm.provider.azure.enabled }}
            - name: {{ include "garm.fullname" . }}-azure-provider-config-volume
              mountPath: /etc/garm/provider-config/azure
              readOnly: true
            {{- end }}
            {{- if .Values.garm.provider.k8s.enabled }}
            - name: {{ include "garm.fullname" . }}-kubernetes-provider-config-volume
              mountPath: /etc/garm/provider-config/kubernetes
              readOnly: true
            {{- end }}
            - name: {{ include "garm.fullname" . }}-database
              mountPath: "/etc/garm/database"
            - name: serviceaccount-volume
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        {{- if .Values.garm.provider.azure.enabled }}
        - name: transparent-proxy
          image: internal-transparent-proxy:v0.1.12-nonroot
          command: ["/bin/bash", "-c"]
          args:
            - |
              cat << EOF > /tmp/transparent-proxy.yaml
              port: 3128
              disable_sudo: true
              debug: true
              proxies:
                - addresses:
                    - proxy-server.company.io:3128
              EOF
              /app/cmd/transparent-proxy.binary -config-file /tmp/transparent-proxy.yaml 2>&1 | tee /tmp/logs/proxy-log.txt &
              CHILD_PID=$!
              trap "kill -SIGTERM $CHILD_PID" SIGTERM
              wait $CHILD_PID
          env:
            - name: USER
              value: "{{ .Values.securityContext.runAsUser }}"
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
          resources:
            requests:
              cpu: 10m
              memory: 10Mi
            limits:
              memory: 50Mi
        {{- end }}
        {{- if .Values.queryExporter.enabled }}
        - name: query-exporter
          image: "{{ .Values.queryExporter.image.registry }}/{{ .Values.queryExporter.image.repository }}:{{ .Values.queryExporter.image.tag }}"
          command: ["query-exporter"]
          args:
            - "/etc/query-exporter/config.yaml"
            - "-H"
            - "0.0.0.0"
          imagePullPolicy: {{ .Values.queryExporter.image.imagePullPolicy }}
          ports:
            - name: query-exporter
              containerPort: {{ .Values.queryExporter.service.port }}
              protocol: TCP
          resources:
            {{- toYaml .Values.queryExporter.resources | nindent 12 }}
          volumeMounts:
            - name: query-exporter-config-volume
              mountPath: /etc/query-exporter
            - name: {{ include "garm.fullname" . }}-database
              mountPath: "/etc/garm/database"
        {{- end }}
      volumes:
        {{- if .Values.queryExporter.enabled }}
        - name: query-exporter-config-volume
          projected:
            sources:
              - configMap:
                  name: {{ include "garm.fullname" . }}-query-exporter-config
        {{- end}}
        - name: {{ include "garm.fullname" . }}-garm-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-garm-config
        {{- if .Values.garm.provider.openstack.enabled }}
        - name: {{ include "garm.fullname" . }}-openstack-provider-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-openstack-provider-cloud-config
              - secret:
                  name: {{ include "garm.fullname" . }}-openstack-provider-config
        {{- end }}
        {{- if .Values.garm.provider.azure.enabled }}
        - name: {{ include "garm.fullname" . }}-azure-provider-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-azure-provider-config
        {{- end }}
        {{- if .Values.garm.provider.k8s.enabled }}
        - name: {{ include "garm.fullname" . }}-kubernetes-provider-config-volume
          projected:
            sources:
              {{- range $runner_cluster := .Values.garm.provider.k8s.runnerClusters }}
              {{- if ne $runner_cluster.name "local" }}
              - secret:
                  name: {{ include "garm.fullname" $ }}-kubernetes-provider-kubeconfig
              {{- end }}
              - configMap:
                  name: {{ include "garm.fullname" $ }}-kubernetes-provider-config
              {{- end }}
        {{- end }}
        - name: {{ include "garm.fullname" . }}-database
          persistentVolumeClaim:
            claimName: {{ include "garm.fullname" . }}
        # use a custom token with specified expiration time - this implementation behaves like the default token mechanism
        # see: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
        - name: serviceaccount-volume
          projected:
            sources:
              - serviceAccountToken:
                  path: token
                  expirationSeconds: 600 # don't set exactly to 3607, otherwise the token will be valid for 1 year
              - configMap:
                  items:
                    - key: ca.crt
                      path: ca.crt
                  name: kube-root-ca.crt
              - downwardAPI:
                  items:
                    - fieldRef:
                        apiVersion: v1
                        fieldPath: metadata.namespace
                      path: namespace
```
gabriel-samfira commented 1 month ago

You're right. That makes a lot of sense. A proper helm chart is definitely preferable to reinventing the wheel. In that case, we should be fine with just implementing the resources that were moved to the DB (gh endpoints and credentials - potentially providers in the future), and offering a helm chart that folks can use.
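If such a chart lives next to the operator, the values it consumes could stay close to what the deployment template above already references; a hypothetical `values.yaml` excerpt (key names derived from the template, defaults invented):

```yaml
# hypothetical values.yaml excerpt for a garm helm chart, mirroring the
# .Values keys referenced in the deployment template above
image:
  registry: ghcr.io
  repository: cloudbase/garm
  tag: ""              # falls back to .Chart.AppVersion
  imagePullPolicy: IfNotPresent

garm:
  provider:
    openstack:
      enabled: false
    azure:
      enabled: false   # when true, the pod gets the dnsPolicy/dnsConfig overrides
    k8s:
      enabled: true
      runnerClusters:
        - name: local

queryExporter:
  enabled: false
```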

It's so easy to overcomplicate things. Thanks for the reality check!

Edit: It makes sense to have any helm chart live in the operator project. GARM should just do what it does. At most it should worry about packaging (deb, rpm, snap, flatpak, plain binary releases, etc.). But automation should be separate.