mercedes-benz / garm-operator

a k8s operator to run garm
MIT License

`garm` server configuration should be possible via `garm-operator` #143

Open bavarianbidi opened 1 month ago

bavarianbidi commented 1 month ago

What is the feature you would like to have?

With the recent changes in garm (and also future ideas in the backlog) it will be possible to "configure" a garm-server instance in a declarative way. The current version of garm-operator already detects whether a garm instance is initialized and takes care of the initialization if needed.

As we already check on every API request whether the garm server is initialized (func `EnsureAuth`), we should at least decouple this implementation and use existing Kubernetes primitives to make the code easier to read.

I would like to see a dedicated controller which is responsible for the generic garm-server management, and controllers for the scoped resources (separation of concerns).

Current known resources

| name | scope | depends_on |
| --- | --- | --- |
| github endpoint | shared across github credentials | |
| github credentials | shared across orgs/repos/enterprises | github endpoint |
| webhooks | | |
| repo/org | | github credentials |
| pool | | repo/org, enterprise |
| enterprise | | github credentials |
| image | shared across pools | |

Anything else you would like to add?

As the configuration of the github backend has already moved out of the config file and into the API (with https://github.com/cloudbase/garm/pull/243) and we already have an issue for this (https://github.com/mercedes-benz/garm-operator/issues/127), we should agree on the separation level first.

What could the new CRs look like? Maybe we can "design" the CRs before we start implementing :pray:
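To kick off that design discussion, here is a purely illustrative sketch of what two of the scoped CRs from the table above could look like. The kinds, the API group and every field name are assumptions, not an agreed design:

```yaml
# hypothetical CRs - kinds, group and fields are placeholders for discussion only
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GitHubEndpoint
metadata:
  name: github-enterprise
spec:
  description: "internal GitHub Enterprise instance"
  baseUrl: https://github.company.io
  apiBaseUrl: https://github.company.io/api/v3
  uploadBaseUrl: https://github.company.io/api/uploads
  caCertBundleSecretRef:
    name: github-enterprise-ca
    key: ca.crt
---
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GitHubCredentials
metadata:
  name: org-pat
spec:
  description: "PAT shared across orgs/repos/enterprises"
  # depends_on: the shared endpoint defined above
  endpointRef:
    name: github-enterprise
  authType: pat
  secretRef:
    name: org-pat-secret
    key: token
```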

bavarianbidi commented 1 month ago

cc @gabriel-samfira - as you already have similar thoughts about managing garm itself via the operator: would be nice if you could drop some of your thoughts about this in here :pray:

gabriel-samfira commented 1 month ago

I've actually started playing around with this here: https://github.com/gabriel-samfira/garm-operator/tree/add-garm-server-controller

However I've been sidetracked a bit. Hopefully I can resume soon.

The basic idea is as follows:

This is just a rough idea so far. What do you think?

bavarianbidi commented 1 month ago

i'm not sure if i got the entire idea :see_no_evil:

The first few bullet-points in your list are awesome - especially as you probably know best how it should be :sweat_smile:

Would you even go one step further and manage the garm-server Deployment object (incl. Service, ...), or is it more like "generating and creating/updating the ConfigMap in k8s"? If so, the "state" of the garm-server Deployment would be managed by two dedicated controllers/components:

gabriel-samfira commented 1 month ago

The controller would manage a StatefulSet which is the garm-server itself. Not sure if it's worth having separate controllers for config map management. I think we could fit that into the reconciliation loop of the garm server controller.

The reconciliation loop would basically use the CRD settings to compose the config map, save the config map, and update the stateful set. If we add CRs for credentials and providers, we just add those two types to the loop as well: list all creds/providers, get the CRD settings, compose the config map, update the stateful set.

The controller can watch for changes to credentials, providers and the garm server CRD (which it owns) and react accordingly.
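To make that loop a bit more tangible, here is a rough, hypothetical sketch of a garm-server CR and the objects the controller would own. The kind `GarmServer` and all field names are made up for illustration; nothing here is settled:

```yaml
# hypothetical GarmServer CR - the controller described above would render spec
# fields like these into the garm config, store it as a ConfigMap/Secret and
# roll the owned StatefulSet whenever the rendered config changes
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GarmServer
metadata:
  name: garm
spec:
  image: ghcr.io/cloudbase/garm:v0.1.4   # image/tag are placeholders
  callbackUrl: https://garm.company.io/api/v1/callbacks
  metadataUrl: https://garm.company.io/api/v1/metadata
  webhookUrl: https://garm.company.io/webhooks
  database:
    passphraseSecretRef:
      name: garm-db-passphrase
      key: passphrase
  # the controller would also watch these referenced objects and re-render
  # the config whenever credentials or providers change
  providersFrom:
    - configMapRef:
        name: kubernetes-provider-config
status:
  configMapRef: garm-config   # owned, rendered from spec
  statefulSetRef: garm        # owned, restarted on config change
```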

For credentials and providers we need separate controllers/webhooks.

I think the best way to go about it would be to create a WIP branch and discuss it once it's ready. It doesn't seem too difficult to write a controller using kubebuilder. An initial draft would be without tests and not necessarily correct, but it would give everyone a view of the rough idea.

bavarianbidi commented 1 month ago

hey gabriel, first of all - thanks for the detailed explanation and all the work you already put into garm :heart:

we can also jump into a dedicated call and take some time to talk about garm, garm-operator and other related stuff to get a better feeling for what makes sense from different points of view.

just some thoughts which came to our mind: your desired approach feels very similar to the prometheus-operator implementation. The entire configuration of the prometheus stack (different parts of the application configuration + parts of the pod.spec) is done in multiple CRs. This makes offering Prometheus-as-a-Service super handy.

Taking this similarity and reflecting it onto your approach, it would be possible to offer Garm-as-a-Service very easily (which is awesome).
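For comparison, here is a shortened example of the prometheus-operator pattern mentioned above, where application configuration and parts of the pod spec live together in a single CR (fields taken from the upstream `monitoring.coreos.com/v1` API, values invented):

```yaml
# shortened prometheus-operator example: the Prometheus CR mixes application
# configuration (retention, serviceMonitorSelector) with pod.spec details
# (replicas, resources) in one custom resource
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  replicas: 2
  retention: 24h
  serviceMonitorSelector:
    matchLabels:
      team: platform
  resources:
    requests:
      memory: 400Mi
```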

Pros:

Cons:

Our preferred approach is still the following:

deployment manifest, e.g. we have to switch the pod's `dnsPolicy` and `dnsConfig` if we have to interact with azure:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "garm.fullname" . }}
  labels:
    {{- include "garm.labels" . | nindent 4 }}
spec:
  replicas: 1
  strategy:
    {{- .Values.deployment.strategy | toYaml | nindent 4 }}
  selector:
    matchLabels:
      {{- include "garm.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "garm.selectorLabels" . | nindent 8 }}
      annotations:
        {{- if .Values.rx.enabled }}
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "{{ .Values.service.port }}"
        prometheus.io/rx-scrape-every-1m: "true" # default
        {{- end }}
    spec:
      serviceAccountName: {{ .Values.serviceAccount.name }}
      securityContext:
        runAsUser: {{ .Values.securityContext.runAsUser }}
        fsGroup: {{ .Values.securityContext.fsGroup }}
      {{- if .Values.garm.provider.azure.enabled }}
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 8.8.8.8
      {{- end }}
      containers:
        - name: garm
          image: "{{ .Values.image.registry }}/{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          command: ["/opt/garm"]
          imagePullPolicy: {{ .Values.image.imagePullPolicy }}
          ports:
            - name: garm
              containerPort: 9997
              protocol: TCP
          livenessProbe:
            tcpSocket:
              port: 9997
            failureThreshold: 4
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2
          readinessProbe:
            tcpSocket:
              port: 9997
            failureThreshold: 4
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: {{ include "garm.fullname" . }}-garm-config-volume
              mountPath: /etc/garm
              readOnly: true
            {{- if .Values.garm.provider.openstack.enabled }}
            - name: {{ include "garm.fullname" . }}-openstack-provider-config-volume
              mountPath: /etc/garm/provider-config/openstack
              readOnly: true
            {{- end }}
            {{- if .Values.garm.provider.azure.enabled }}
            - name: {{ include "garm.fullname" . }}-azure-provider-config-volume
              mountPath: /etc/garm/provider-config/azure
              readOnly: true
            {{- end }}
            {{- if .Values.garm.provider.k8s.enabled }}
            - name: {{ include "garm.fullname" . }}-kubernetes-provider-config-volume
              mountPath: /etc/garm/provider-config/kubernetes
              readOnly: true
            {{- end }}
            - name: {{ include "garm.fullname" . }}-database
              mountPath: "/etc/garm/database"
            - name: serviceaccount-volume
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        {{- if .Values.garm.provider.azure.enabled }}
        - name: transparent-proxy
          image: internal-transparent-proxy:v0.1.12-nonroot
          command: ["/bin/bash", "-c"]
          args:
            - |
              cat << EOF > /tmp/transparent-proxy.yaml
              port: 3128
              disable_sudo: true
              debug: true
              proxies:
                - addresses:
                    - proxy-server.company.io:3128
              EOF
              /app/cmd/transparent-proxy.binary -config-file /tmp/transparent-proxy.yaml 2>&1 | tee /tmp/logs/proxy-log.txt &
              CHILD_PID=$!
              trap "kill -SIGTERM $CHILD_PID" SIGTERM
              wait $CHILD_PID
          env:
            - name: USER
              value: "{{ .Values.securityContext.runAsUser }}"
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
          resources:
            requests:
              cpu: 10m
              memory: 10Mi
            limits:
              memory: 50Mi
        {{- end }}
        {{- if .Values.queryExporter.enabled }}
        - name: query-exporter
          image: "{{ .Values.queryExporter.image.registry }}/{{ .Values.queryExporter.image.repository }}:{{ .Values.queryExporter.image.tag }}"
          command: ["query-exporter"]
          args:
            - "/etc/query-exporter/config.yaml"
            - "-H"
            - "0.0.0.0"
          imagePullPolicy: {{ .Values.queryExporter.image.imagePullPolicy }}
          ports:
            - name: query-exporter
              containerPort: {{ .Values.queryExporter.service.port }}
              protocol: TCP
          resources:
            {{- toYaml .Values.queryExporter.resources | nindent 12 }}
          volumeMounts:
            - name: query-exporter-config-volume
              mountPath: /etc/query-exporter
            - name: {{ include "garm.fullname" . }}-database
              mountPath: "/etc/garm/database"
        {{- end }}
      volumes:
        {{- if .Values.queryExporter.enabled }}
        - name: query-exporter-config-volume
          projected:
            sources:
              - configMap:
                  name: {{ include "garm.fullname" . }}-query-exporter-config
        {{- end}}
        - name: {{ include "garm.fullname" . }}-garm-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-garm-config
        {{- if .Values.garm.provider.openstack.enabled }}
        - name: {{ include "garm.fullname" . }}-openstack-provider-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-openstack-provider-cloud-config
              - secret:
                  name: {{ include "garm.fullname" . }}-openstack-provider-config
        {{- end }}
        {{- if .Values.garm.provider.azure.enabled }}
        - name: {{ include "garm.fullname" . }}-azure-provider-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-azure-provider-config
        {{- end }}
        {{- if .Values.garm.provider.k8s.enabled }}
        - name: {{ include "garm.fullname" . }}-kubernetes-provider-config-volume
          projected:
            sources:
              {{- range $runner_cluster := .Values.garm.provider.k8s.runnerClusters }}
              {{- if ne $runner_cluster.name "local" }}
              - secret:
                  name: {{ include "garm.fullname" $ }}-kubernetes-provider-kubeconfig
              {{- end }}
              - configMap:
                  name: {{ include "garm.fullname" $ }}-kubernetes-provider-config
              {{- end }}
        {{- end }}
        - name: {{ include "garm.fullname" . }}-database
          persistentVolumeClaim:
            claimName: {{ include "garm.fullname" . }}
        # use a custom token with specified expiration time - this implementation behaves like the default token mechanism
        # see: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
        - name: serviceaccount-volume
          projected:
            sources:
              - serviceAccountToken:
                  path: token
                  expirationSeconds: 600 # don't set exactly to 3607, otherwise the token will be valid for 1 year
              - configMap:
                  items:
                    - key: ca.crt
                      path: ca.crt
                  name: kube-root-ca.crt
              - downwardAPI:
                  items:
                    - fieldRef:
                        apiVersion: v1
                        fieldPath: metadata.namespace
                      path: namespace
```
gabriel-samfira commented 1 month ago

You're right. That makes a lot of sense. A proper helm chart is definitely preferable to reinventing the wheel. In that case, we should be fine with just implementing the resources that were moved to the DB (gh endpoints and credentials - potentially providers in the future), and offering a helm chart that folks can use.
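If such a chart lives next to the operator, the values it consumes could stay close to what the deployment template above already references; a hypothetical `values.yaml` excerpt (key names derived from the template, defaults invented):

```yaml
# hypothetical values.yaml excerpt for a garm helm chart, mirroring the
# .Values keys referenced in the deployment template above
image:
  registry: ghcr.io
  repository: cloudbase/garm
  tag: ""              # falls back to .Chart.AppVersion
  imagePullPolicy: IfNotPresent

garm:
  provider:
    openstack:
      enabled: false
    azure:
      enabled: false   # when true, the pod gets the dnsPolicy/dnsConfig overrides
    k8s:
      enabled: true
      runnerClusters:
        - name: local

queryExporter:
  enabled: false
```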

It's so easy to overcomplicate things. Thanks for the reality check!

Edit: It makes sense to have any helm chart live in the operator project. GARM should just do what it does. At most it should worry about packaging (deb, rpm, snap, flatpak, plain binary releases, etc.). But automation should be separate.