Actions-Runner-Controller support for Gitea Actions

omniproc commented 8 months ago

Feature Description

The Gitea Actions release was a great first step. But currently it's missing many features of a more mature solution based on K8s runners rather then single nodes. While it's possible to have runners on K8s this currently requires DinD which has it's hole set of own problems, security issues (privileged exec required as of today) and feature limitations (can't use DinD to start another container to build a container image (DinDinD)). I know with buildx workarounds exist, but those are just that: workarounds.

I think the next step could be something like what actions-runner-controller is doing for GitHub actions. Basically a operator that is deployed on K8s and registers as runner. Every job it starts is then started in it's own pod rather then the runner itself. The runner coordinates the pods.

Related docs:

Screenshots

No response

ChristopherHX commented 8 months ago

k8s hooks are technically (means there is no documentation, the docker compose examples use dind + docker hooks) already usable with Gitea Actions see this third party runner adapter https://gitea.com/gitea/awesome-gitea/pulls/149

Actions-Runner-Controller would require emulation of a bigger set of internal githib actions api

I actually find this interesting to reverse engineer that product too, but I never dealt with k8s myself.

_act_runner with it's act backend doesn't support container hooks or k8s for the time beeing_

omniproc commented 8 months ago

Interesting. I wasn't aware you could change the runner implementation just like that. Def will look into it. However given what you said about DinD still being a requirement I don't think it will change much (we already have our runners on K8s with DinD using a adopted version of gitea/act-runner for k8s but as mentioned, this comes with many headaches).

The goal IMHO would be to be able to start workflows on k8s directly. Possible implementations:

Every job is it's own pod. Challenge: data sharing between jobs would require PVs and complicated mount/unmount logic to support the more common RWO PVs. I'm aware that currently github's approach to data sharing between jobs is "yo dawg, just upload it to our artifact store" but in on-prem scenarios that's not what you normally want so some sort of common local cache between jobs is a relevant feature at least I would be very interested in.
Every workflow is a pod. Jobs start as containers. Benefit: all containers can have access to the same data easily using e.g. a EmptyDir volume. Challenge: pods are immutable so:
- either all jobs (==containers) need to be present when the pod starts, requiring some kind of wait logic when we need job dependencies which pbl. comes with it's own set of problems.
- possibly ephemeral container could be used to add containers (==jobs) to a pod at runtime when a dependent job is ready. However ephemeral containers come with a set of limitations and are ment for a different use case, so I'm not sure if that would be a good fit.

Option one (every job is it's own pod) seems like the most promissing option in my opinion.

ChristopherHX commented 8 months ago

However given what you said about DinD still being a requirement I don't think it will change much

I meant, I didn't create any k8s mode examples / actually tried it yet. Sorry for confusion here.

The docker container hooks only allow dind for k8s. While the k8s hooks should use kubernetes api for container management, I still need to look into creating a test setup running.

I can imagine

(controller) actions_runner is started with maxparallel 100 (yes it's possible to use any value >= 1)
(job controller) a worker script (spawned when a job request is received) forwards stdin and the network to the adapter to spawn the actions/runner
(actual job) k8s hooks spawn a job container using k8s apis

_Well not using actrunner has limitations when you try to use Gitea Actions Extensions (using features not present in GitHub Actions)

I think option 1 is more likly to happen than option 2. Job scheduling is based on jobs not on workflows.

ChristopherHX commented 8 months ago

k8shooks works for me using these files on minikube (arm64)

actions-runner-k8s-gitea-sample-files.zip

Missing usage of secrets, need to learn kubernetes
No autoscaling
No persistence of runner credentials

With clever sharing of the runner credentials volume, you could start a lot of replicas for more parallel runners

This works without dind

Test workflow

on: push
jobs:
  _:
    runs-on: k8s # <-- Used runner label
    container: ubuntu:latest # <-- Required, maybe the Gitea Actions adapter could insert a default
    steps:
    # Git is needed for actions/checkout to work for Gitea, rest api is not compatible
    - run: apt update && apt install -y git
    - uses: https://github.com/actions/checkout@v3 # <-- The almost only Gitea Extension supported
    - run: ls -la
    - run: ls -la .github/workflows

The runner-pod-workflow is the job container pod, running directly via k8s.

omniproc commented 8 months ago

Looks promising. I'll give it a shot and share my findings.

omniproc commented 8 months ago

Okay, so... there seems to be some issues with the current setup. Let me share my findings:

You've been asking how to provide secrets in K8s, it's as simple as that:
```
- name: GITEA_RUNNER_REGISTRATION_TOKEN
valueFrom:
  secretKeyRef:
    name: secret_name
    key: secret_key
```
and creating your secret with (take care: K8s is case sensitive):

apiVersion: v1
kind: Secret
metadata:
  name: secret_name
type: Opaque
stringData:
  secret_key: "s3cr3t"

You shouldn't start pods in K8s directly but rather wrap them into a higher level resource such as a deployment which will make it benefit from the (deployment) controller logic when updating or self-healing the pod. I did that so the result looks something like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: runner
  name: runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: runner
  template:
    metadata:
      labels:
        app: runner
    spec:
      strategy:
        type: Recreate
      restartPolicy: Always
      serviceAccountName: ci-builder
      #securityContext:
      #  runAsNonRoot: true
      #  runAsUser: 1000
      #  runAsGroup: 1000
      #  seccompProfile:
      #    type: RuntimeDefault
      volumes:
        - name: workspace
          emptyDir:
            sizeLimit: 5Gi
      containers:
      - name: runner
        image: ghcr.io/christopherhx/gitea-actions-runner:v0.0.11
        #securityContext:
        #  readOnlyRootFilesystem: true
        #  allowPrivilegeEscalation: false
        #  capabilities:
        #    drop:
        #      - ALL
        volumeMounts:
          - mountPath: /home/runner/_work
            name: workspace
        env:
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: GITEA_INSTANCE_URL
            value: https://foo.bar
          - name: GITEA_RUNNER_REGISTRATION_TOKEN
            valueFrom:
              secretKeyRef:
                name: gitea
                key: token
          - name: GITEA_RUNNER_LABELS
            value: k8s
        resources:
          requests:
            cpu: 500m
            memory: 2Gi
          limits:
            cpu: 1000m
            memory: 8Gi

Few changes I made here:

For the volume if no persistence across new pods started by the runner is needed a volume of type emptyDir can act as a temporary volume to share data between containers of a pod and write data to a well known location.
I added a resources section to follow best practice. numbers pbl. need to be adopted to something that makes more sense.
I added securityContext but needed to disable it for now for trouble shooting since it currently can't work as needed because of some issues with the current runner setup:
- The Dockerfile switches to the runner user using it's name in USER runner. K8s doesn't like that if runAsNonRoot is specified but no runAsUser is given in the security context and the image is using a "non-numeric" user. I'd opt in for using USER 1000 in the Dockerfile instead, which should make this easier in the future.
- allowPrivilegeEscalation: false can't currently be used because start.sh makes use of sudo to create the folder layout: sudo chown -R runner:docker /home/runner/_work and sudo chown -R runner:docker /data. I think a better approach would be to just create those folders within the mounted EmptyDir volume. The running user should already have all permissions there to create the folders so no sudo would be needed but I'm not sure what those folders are currently used for and how hardcoded those paths are.
- readOnlyRootFilesystem will pbl. also cause issues in the future when other paths then the mounted volume is used and again, I think the easiest way to allow for max. container security in k8s would be to simply not use the root fs at all but simply do everything on the mounted volume.

So, those are simply improvement suggestions for the future. For now as you can see I've been trying to keep it as simple as possible, but I still run into a issue. The runner starts and registers, but when using the job you provided I run into the following error returned by the job:


[WORKER 2024-03-12 15:59:08Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2024-03-12 15:59:08Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2024-03-12 15:59:08Z INFO HostContext] Well known config file 'Credentials': '/home/runner/.credentials'
[WORKER 2024-03-12 15:59:08Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2024-03-12 15:59:08Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2024-03-12 15:59:08Z INFO HostContext] Well known config file 'Runner': '/home/runner/.runner'
[WORKER 2024-03-12 15:59:08Z INFO Worker] Version: 2.314.0
[WORKER 2024-03-12 15:59:08Z INFO Worker] Commit: bc79e859d7b66e8018716bc94160656f6c6948fc
[WORKER 2024-03-12 15:59:08Z INFO Worker] Culture: 
[WORKER 2024-03-12 15:59:08Z INFO Worker] UI Culture: 
[WORKER 2024-03-12 15:59:08Z INFO Worker] Waiting to receive the job message from the channel.
[WORKER 2024-03-12 15:59:08Z INFO ProcessChannel] Receiving message of length 6322, with hash '30564f1b4d3e28c3d9cc39d17eca1132cc026a2abeb6ab1be6736d80cf019ea9'
[WORKER 2024-03-12 15:59:08Z INFO Worker] Message received.
Newtonsoft.Json.JsonReaderException: Invalid character after parsing property name. Expected ':' but got:  . Path 'ContextData.github.d[20].v.d[5].v.d[14].v.d[11].v', line 1, position 6322.
   at Newtonsoft.Json.JsonTextReader.ParseProperty()
   at Newtonsoft.Json.JsonTextReader.ParseObject()
   at Newtonsoft.Json.Linq.JContainer.ReadContentFrom(JsonReader r, JsonLoadSettings settings)
   at Newtonsoft.Json.Linq.JContainer.ReadTokenFrom(JsonReader reader, JsonLoadSettings options)
   at Newtonsoft.Json.Linq.JObject.Load(JsonReader reader, JsonLoadSettings settings)
   at Newtonsoft.Json.Linq.JObject.Load(JsonReader reader)
   at GitHub.DistributedTask.Pipelines.ContextData.PipelineContextDataJsonConverter.ReadJson(JsonReader reader, Type objectType, Object existingValue, JsonSerializer serializer)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.DeserializeConvertable(JsonConverter converter, JsonReader reader, Type objectType, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateDictionary(IDictionary dictionary, JsonReader reader, JsonDictionaryContract contract, JsonProperty containerProperty, String id)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.SetPropertyValue(JsonProperty property, JsonConverter propertyConverter, JsonContainerContract containerContract, JsonProperty containerProperty, JsonReader reader, Object target)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateObject(Object newObject, JsonReader reader, JsonObjectContract contract, JsonProperty member, String id)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
   at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonConvert.DeserializeObject(String value, Type type, JsonSerializerSettings settings)
   at Newtonsoft.Json.JsonConvert.DeserializeObject[T](String value, JsonSerializerSettings settings)
   at GitHub.Runner.Sdk.StringUtil.ConvertFromJson[T](String value)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
[WORKER 2024-03-12 15:59:09Z ERR  Worker] Newtonsoft.Json.JsonReaderException: Invalid character after parsing property name. Expected ':' but got:  . Path 'ContextData.github.d[20].v.d[5].v.d[14].v.d[11].v', line 1, position 6322.
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.JsonTextReader.ParseProperty()
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.JsonTextReader.ParseObject()
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Linq.JContainer.ReadContentFrom(JsonReader r, JsonLoadSettings settings)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Linq.JContainer.ReadTokenFrom(JsonReader reader, JsonLoadSettings options)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Linq.JObject.Load(JsonReader reader, JsonLoadSettings settings)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Linq.JObject.Load(JsonReader reader)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at GitHub.DistributedTask.Pipelines.ContextData.PipelineContextDataJsonConverter.ReadJson(JsonReader reader, Type objectType, Object existingValue, JsonSerializer serializer)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.DeserializeConvertable(JsonConverter converter, JsonReader reader, Type objectType, Object existingValue)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateDictionary(IDictionary dictionary, JsonReader reader, JsonDictionaryContract contract, JsonProperty containerProperty, String id)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.SetPropertyValue(JsonProperty property, JsonConverter propertyConverter, JsonContainerContract containerContract, JsonProperty containerProperty, JsonReader reader, Object target)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.PopulateObject(Object newObject, JsonReader reader, JsonObjectContract contract, JsonProperty member, String id)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateObject(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.CreateValueInternal(JsonReader reader, Type objectType, JsonContract contract, JsonProperty member, JsonContainerContract containerContract, JsonProperty containerMember, Object existingValue)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.JsonConvert.DeserializeObject(String value, Type type, JsonSerializerSettings settings)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at Newtonsoft.Json.JsonConvert.DeserializeObject[T](String value, JsonSerializerSettings settings)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at GitHub.Runner.Sdk.StringUtil.ConvertFromJson[T](String value)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
[WORKER 2024-03-12 15:59:09Z ERR  Worker]    at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)

##[Error]failed to execute worker exitcode: 1

/Edit so the root cause seems to be somewhere here: https://github.com/actions/runner/blob/v2.314.0/src/Runner.Worker/Program.cs#L20

In addition I found that providing a runner config by mounting one and setting the CONFIG_FILE env var doesn't seem to work, you'll get a Error: unknown flag: --config if you try. Root cause seems to be this.

ChristopherHX commented 8 months ago

I didn't got this this kind of error before (at least for a year)

Receiving message of length 6322, with hash '30564f1b4d3e28c3d9cc39d17eca1132cc026a2abeb6ab1be6736d80cf019ea9' [WORKER 2024-03-12 15:59:08Z INFO Worker] Message received. Newtonsoft.Json.JsonReaderException: Invalid character after parsing property name. Expected ':' but got: . Path 'ContextData.github.d[20].v.d[5].v.d[14].v.d[11].v', line 1, position 6322.

Sounds like the message inside the container got trimmed before it reached the actions/runner.

Based on the error the begin was sent to the actions/runner successfully

Maybe some data specfic to your test setup might cause this. (even parts not in the repo are stored in the message)

I would need to add more debug logging to diagnose this

omniproc commented 8 months ago

If you add the logging I can reproduce the issue if you like. My guess is that's it's maybe proxy related. But can't tell from the error logs.

ChristopherHX commented 8 months ago

@omniproc you made changes via the deployment file that are not compatible with actions/runner k8s container hooks and I have no idea if using a deployment is possible. Actions-Runner-Controller might use helm charts + kubernetes api, not shure how they do that.

Unable to attach or mount volumes: unmounted volumes=[work], unattached volumes=[], failed to process volumes=[work]: error processing PVC default/runner-785778b969-v88f8-work: failed to fetch PVC from API server: persistentvolumeclaims "runner-785778b969-v88f8-work" not found

the workspace cannot be an empty dir volume, like in my example files it is required to be a persistentvolumeclaim

You can technically change the name of the pvc via ACTIONS_RUNNER_CLAIM_NAME env, but I don't know how to get a dynamically generated name of a volume. See https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/README.md, if that doesn't match it will error out.

allowPrivilegeEscalation: false can't currently be used because start.sh makes use of sudo to create the folder layout: sudo chown -R runner:docker /home/runner/_work and sudo chown -R runner:docker /data. I think a better approach would be to just create those folders within the mounted EmptyDir volume. The running user should already have all permissions there to create the folders so no sudo would be needed but I'm not sure what those folders are currently used for and how hardcoded those paths are.

This led mkdir /data fail and you get an error about a .runner file.

Would require an empty dir mount

          - mountPath: /data
            name: data

Maybe if I create that dir in the Dockerfile it would work without that as long your fs is read write

The nightly doesn't have sudo anymore in the start.sh file, but it can still certainly break existing non k8s setups as of now.

If you add the logging I can reproduce the issue if you like. My guess is that's it's maybe proxy related. But can't tell from the error logs.

I found a mistake in the python wrapper file, probably due to resource constaints to RAM has os.read read less than expected and shorten the message.

I also added some asserts about return values of pipe communication + env ACTIONS_RUNNER_WORKER_DEBUG would print the job message from python side.

Please try to use that nightly image https://github.com/ChristopherHX/gitea-actions-runner/pkgs/container/gitea-actions-runner/190660665?tag=nightly important change to the os/arch tab and copy full tag + sha variant, I had problems with old cached nightly images.

it should get you to the point that you omited the persistentvolumeclaims of my example and kubernetes cannot start the job pod (also make shure to create an empty dir mount at /data/)

xyziven commented 6 months ago

I'm now able to start the runner in k8s namespace with DinD mode. How can I scale up the runners by setting replica=2 or 3?

motoki317 commented 5 months ago

@ChristopherHX Hi, an interesting project there!

Just a little advice here:

You can technically change the name of the pvc via ACTIONS_RUNNER_CLAIM_NAME env, but I don't know how to get a dynamically generated name of a volume.

I used StatefulSet and its volumeClaimTemplates functionality to dynamically provision PVCs and get its PVC names into the container as env var. You can use https://kubernetes.io/docs/tasks/inject-data-application/define-interdependent-environment-variables/ to define env var that's dependent on another.

Like the following:

  volumeClaimTemplates:
    - metadata:
        name: work
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi

and refer as

          env:
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_CLAIM_NAME
              value: work-$(ACTIONS_RUNNER_POD_NAME)

A full working example that I tested is also available from https://github.com/traPtitech/manifest/blob/3ff7e8e6dfa3e0e4fed9a9e8ca1ad09f9b132ff1/gitea/act-runner/gitea-act-runner.yaml.

ChristopherHX commented 5 months ago

Thanks for your example, it makes manual scaling pretty straightforward and works in minikube for testing purposes even with 4 replicas.

The first time I read your response I thought work-$(ACTIONS_RUNNER_POD_NAME) looks like dark magic as a kubernetes newby, since it looks like an indirect resource naming assumption

omniproc commented 3 months ago

@ChristopherHX

is it possible that the runner doesn't yet support no-proxy? With the http_proxy / https_proxy and no_proxy env vars set, I see the runner using the proxy:

[WORKER 2024-07-23 12:09:08Z INFO HostContext] Configuring anonymous proxy http://my.proxy/ for all HTTP requests.
[WORKER 2024-07-23 12:09:08Z INFO HostContext] Configuring anonymous proxy http://my.proxy/ for all HTTPS requests.

but it doesn't mention the no_proxy setting and later on errors when trying to connect to itself using it's pod IP (which is in the no_proxy list)

[WORKER 2024-07-23 12:09:09Z ERR  GitHubActionsService] GET request to http://172.27.1.66:42791/_apis/connectionData?connectOptions=1&lastChangeId=-1&lastChangeId64=-1 failed. HTTP Status: Forbidden

ChristopherHX commented 3 months ago

I didn't go myself through the limitations of actions/runner proxy support

https://github.com/actions/runner/blob/41bc0da6fe09466d23fd37d691feeb68dd3b4338/docs/adrs/0263-proxy-support.md?plain=1#L51

They seem to ignore ip exclusions

We will not support IP addresses for no_proxy, only hostnames.

https://github.com/actions/runner/blob/41bc0da6fe09466d23fd37d691feeb68dd3b4338/src/Runner.Sdk/RunnerWebProxy.cs#L171

Not shure how my gitea runner can switch to hostnames, maybe try to reverse dns the ip and automatically add it to NO_PROXY?

omniproc commented 3 months ago

You can simply use something like this to add it to no-proxy:

- name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: no_proxy
              value: localhost,127.0.0.1,.local,$(POD_IP)

But that won't work if they ignore IP addresses for no_proxy (as in fact, I tested it and it doesn't work). So, why does the runner try to contact itself via it's external interface anyway? Why not use localhost?

Besides you could always use the DNS service built in k8s, but that would only work if that DNS name is used by the runner instead of the IP, see https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pods

ChristopherHX commented 3 months ago

Why not use localhost?

I can send a single hostname/ip + port to the actions/runner

If I send localhost

artifacts are broken
cache is broken Those are requests from the job container that use the same endpoint

If my gitea runner adapter would be part of the Gitea backend, we would have a real hostname that forwards into nested containers, like the ARC has it

omniproc commented 3 months ago

well, then let's use k8s DNS service. That would work.

I can send a single hostname/ip + port to the actions/runner

Can you point me to where this is done? Can this be configured?

ChristopherHX commented 3 months ago

Can you point me to where this is done? Can this be configured?

Here I set the connection address, but there is some url entries, maybe one of them is still pointing to the gitea instance without redirection https://github.com/ChristopherHX/gitea-actions-runner/blob/main/runtime/task.go#L734-L745

omniproc commented 3 months ago

So, if I understand that correct nektos artifactcache allows you to set outboundIP as endpoint in StartHandler. Although the name implies an IP address it is actually just a string and there seems to be no validation of it, besides a fallback to use the interface IP if none was provided. But it seems like in gitea-actions-runner always the interface IP is provided to the handler.

Because gitea-act-runner always uses an IP and actions/runner doesn't support no_proxy for IPs as you pointed out, this means currently there is no functional proxy support.

So, what do you think about making the IP configurable via an environment variable? GITEA_CACHE_HOST or something similar?

TomTucka commented 3 months ago

Is it currently viable to run gitea actions on k8s or is this still very much a work in progress?

ChristopherHX commented 3 months ago

So, what do you think about making the IP configurable via an environment variable? GITEA_CACHE_HOST or something similar?

Yes I agree mostly about this. However I wouldn't put CACHE into the env. More something like GITEA_ACTIONS_RUNNER_HOST, because a fake actions runtime is also implemented in the runner (used more than one port for tcp listeners).

Eventually if unset prefer hostnames over ips, but that needs testing on my side. (probably behind a feature env var)

I would queue this tomorrow into my todo list, working on multiple projects...

ChristopherHX commented 3 months ago

@omniproc Proxy should now work in tag v0.0.13

Use the following env

            - name: GITEA_ACTIONS_RUNNER_RUNTIME_USE_DNS_NAME
              value: '1'
            - name: GITEA_ACTIONS_RUNNER_RUNTIME_APPEND_NO_PROXY # this appends the dns name of the pod to no_proxy
              value: '1'
            - name: http_proxy
              value: http://localhost:2939 # some random proxy address for testing, use the real one
            - name: no_proxy
              value: .fritz.box,10.96.0.1 # first exclusion for gitea, second for kubernetes adjust as needed

ChristopherHX commented 3 months ago

@TomTucka

Is it currently viable to run gitea actions on k8s or is this still very much a work in progress?

This pretty much depends on your requirements, so more would try to use it so more issues can be found & fixed

Actions-Runner-Controller itself won't work (yet)
- only the k8s mode from the runner provided by GitHub (not Gitea) is working
- proxy support is work in progress (not confirmed to work in real world k8s proxy environment)
- updating the runner fleet is experimental, seems like it kills jobs on changes as of now
- review behavior differences carefully and this is not supported by Gitea Maintainer / Company etc.
- currently you can find in this issue some examples for stateful replica sets
- Missing autoscaling, only static scaling as of now
if you heavily use Gitea Actions features or want Gitea Support then only the dind option remains as of now, this is discussed somewhere else e.g. on

omniproc commented 3 months ago

@ChristopherHX testing v0.0.13... getting closer... now TLS errors. I'm not sure from the log output if this happens due to the mentioned RFC 6066 issue (I was under the impression that now DNS names will be used so not sure why this is logged anyway) or because the CA of the proxy is missing. I'll try to mount the CA to the runner and see what happens. First have to find out what location it's looking up for trusted CAs.

Current runner version: '2.317.0'
Secret source: Actions
Runner is running behind proxy server 'http://myproxy:8080/' for all HTTP requests.
Runner is running behind proxy server 'http://myproxy:8080/' for all HTTPS requests.
Prepare workflow directory
Prepare all required actions
Getting action download info
Download action repository 'https~//github.com/actions/checkout@v4' (SHA:N/A)
Complete job name: test
##[group]Run '/home/runner/k8s/index.js'
shell: /home/runner/externals/node16/bin/node {0}
##[endgroup]
(node:50) [DEP0123] DeprecationWarning: Setting the TLS ServerName to an IP address is not permitted by RFC 6066. This will be ignored in a future version.
(Use `node --trace-deprecation ...` to show where the warning was created)
##[error]Error: unable to get local issuer certificate
##[error]Process completed with exit code 1.
##[error]Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[group]Run '/home/runner/k8s/index.js'
shell: /home/runner/externals/node16/bin/node {0}
##[endgroup]
(node:61) [DEP0123] DeprecationWarning: Setting the TLS ServerName to an IP address is not permitted by RFC 6066. This will be ignored in a future version.
##[error]Error: unable to get local issuer certificate
(Use `node --trace-deprecation ...` to show where the warning was created)
##[error]Process completed with exit code 1.
##[error]Executing the custom container implementation failed. Please contact your self hosted runner administrator.
Cleaning up orphan processes
Finished

ChristopherHX commented 3 months ago

@omniproc nodejs ignores ca certs from common locations on linux and does it's own thing, point env NODE_EXTRA_CA_CERTS to your certbundle file including your kubernetes api cert chain

that cert bundle needs to be mounted to the runner container.

I assume this undescriptive very short error comes from kubernetes api access via https from node Got somthing similar short if I didn't add it to no_proxy and my proxy didn't even exist.

your kubernetes api controller is defined to be accessed by an ip addr like mine
your kubernetes api controller uses https unlike mine
it's tls cert has it's name set to an ip or it would be never valid, I didn't know node deprecated that

For the dind backend I wrote an provisions script for my self signed certs for all containers run by actions/runner, I could look into creating containers using modified k8s hooks for cert provisioning.

By default every container you use is assumed to have env NODE_EXTRA_CA_CERTS set and the ca folders populated if you use selfsigned certs, not really practicable...

EDIT Is your kubernetes api accessed by your proxy?

omniproc commented 3 months ago

@ChristopherHX I can confirm it's working. It was two issues (as you expected):

I did not configure no_proxy for the K8S API.
The cert of the git server was not trusted, setting NODE_EXTRA_CA_CERTS worked as you mentioned worked
I tested the checkout plugin and basic container commands, DinD build is the next thing i have to test

A few UX improvement suggestions here from my side. As a user when I configure no_proxy I usually only have the URLs in my mind that i know should not be proxied but have to be reached by the runner. I know them because I usually configure them explicitly in my pipeline (e.g. Git repo). What I don't know is what other stuff the runner has to reach. Of course, on second thought, it's obvious why Node tries to reach the K8s API. But since it's the runner who wants to reach it I think it should be the runner's responsibility to setup everything it can to make this happen. So my suggestion would be:

if http_proxy was set, auto add the K8S API exposed via the EnvVars as used by Node to the no_proxy list (e.g. KUBERNETES_SERVICE_HOST). A optional EnvVar flag to disable this default behavior would prbl be nice for edge cases where the K8s you try to reach is not the K8s the runner is executed on and has to be reached via a proxy (but I'm not sure if that's a use-case for the runner ATM).
if http_proxy was set, auto add the K8s API certificate to the NODE_EXTRA_CA_CERTS by default (/var/run/secrets/kubernetes.io/serviceaccount/ca.crt)
You can read everything about the exposed EnvVars and mounted certificates at https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/

omniproc commented 3 months ago

So now that the runner starts a new pod for the workflow I was trying to get DinD to work in it using catthehacker/ubuntu:act-22.04 as the job container image, which doesn't work since the docker socket is not available. I know that in theory it's possible because gitea/act_runner:nightly-dind-rootless can run DinD but that image is of course missing all the Act components.

So before I start fiddling around building a hybrid of catthehacker and dind-rootless: how did you get DinD to run @ChristopherHX ?

ChristopherHX commented 3 months ago

@omniproc I might have caused confusion here, I have not set up dind in the job container yet.

Did this only for the runner (that is by default a docker cli client) outside of kubernetes

I would expect using a custom fork of https://github.com/actions/runner-container-hooks could configure a dind installation installed on the external tools (the folder that has node20 etc. for the runner) on any job container

in theory it's possible because gitea/act_runner:nightly-dind-rootless can run DinD

This is similar like I did it e.g. in docker compose (docker hook mode), https://github.com/ChristopherHX/gitea-actions-runner/blob/main/examples/docker-compose-dind-rootless/docker-compose.yml

This only works if you don't use the k8s container hooks, but I'm not shure if the docker.sock bind mount works in that setup as I didn't make use of it

This approuch has flaws if you try to run the following you get strange bugs

docker run -v $PWD:$PWD ....

omniproc commented 3 months ago

@ChristopherHX so I got a working prototype of this. Instead of using DinD, which arguably is a security nightmare more often then not (or comes with many limitations as of today when running unprivileged), I switched to buildkit, which doesn't require any privileges and can be executed in a non-root pod. So the process currently looks like so:

Span a stateful set (gitea-actions-runner), as given by the example @motoki317 - this could be a deployment pbl. but for now i'm rolling with this
The gitea-actions-runner registers with Gitea and can now accept workflows. For every workflow a new pod is started by the gitea-actions-runner. I'm using a image that is based on catthehacker/ubuntu:act extended by a installation of buildkit (buildkitctl actually) to maximize compatibility with many Github actions.
The workflow pod uses buildkitctl to perform the actions against a buildkit container moby/buildkit:master-rootless. Currently I'm running the buildkit container as a sidecar container of the gitea-actions-runner stateful set, so this would scale along with the stateful set. Other szenarios could be done as long as the builtkit container is reachable from the workflow pod.
result is a container build without special privileges with the workflow being executed in a dedicated, deterministic pod environment without DinD and it's flaws.

omniproc commented 3 months ago

@ChristopherHX is it possible that currently we can not pass environment variables using the env parameter? E.g.:

on:
  push

jobs:
  test:
    runs-on: myrunner
    container:
      image: ghcr.io/catthehacker/ubuntu:act-22.04 
    env:
      FOO: BAR

In this case FOO will not be set

ChristopherHX commented 3 months ago

@omniproc

@ChristopherHX is it possible that currently we can not pass environment variables using the env parameter? E.g.:

No Idea, I tried the following and it passes my test. (both ways to do that in a container job)

name: Gitea Actions Demo
on: [push]
jobs:
  build-docker:
    runs-on: trap-cluster
    container:
      image: buildpack-deps:noble
      env:
        MY_IMAGE_VAR: foo
    env:
      MY_GLOBAL_VAR: foo
    steps:
      - name: Checkout
        run: |
          echo "MY_GLOBAL_VAR: $MY_GLOBAL_VAR"
          echo "MY_IMAGE_VAR: $MY_IMAGE_VAR"

Current runner version: '2.317.0'
Secret source: Actions
Runner is running behind proxy server 'http://localhost:2939' for all HTTP requests.
Prepare workflow directory
Prepare all required actions
Complete job name: build-docker
##[group]Run '/home/runner/k8s/index.js'
shell: /home/runner/externals/node16/bin/node {0}
##[endgroup]
Checkout
3s
##[group]Run echo "MY_GLOBAL_VAR: $MY_GLOBAL_VAR"
echo "MY_GLOBAL_VAR: $MY_GLOBAL_VAR"
echo "MY_IMAGE_VAR: $MY_IMAGE_VAR"
shell: sh -e {0}
env:
  MY_GLOBAL_VAR: foo
##[endgroup]
##[group]Run '/home/runner/k8s/index.js'
shell: /home/runner/externals/node16/bin/node {0}
##[endgroup]
MY_GLOBAL_VAR: foo
MY_IMAGE_VAR: foo
Complete job
5s
##[group]Run '/home/runner/k8s/index.js'
shell: /home/runner/externals/node16/bin/node {0}
##[endgroup]
Cleaning up orphan processes
Finished

MY_GLOBAL_VAR is expected to be echoed for every step like in my log while MY_IMAGE_VAR is absent in ${{ env.* }}

omniproc commented 3 months ago

@ChristopherHX you're right. I expected MY_IMAGE_VAR to be available in the env scope. It is not. However it is available in the steps when running a shell. Also it is visible in the env of the pod definition. The MY_GLOBAL_VAR on the other hand is available in the env scope within the action and can be access from the shell but is not visible in the pod spec. Interesting. I didn't know that difference between those two envs before. Thanks for clarification.

querplis commented 2 months ago

from my experience with different tools that work with kuberentes:

controller is not really needed , it is possible to communicate directly with kubernetes api and create pods.
since all of the jobs are one shoot and we don't need to ensure any uptime for job, we can judge success or failure from exit codes of whatever we run in containers, there is no need for anything more then pod with containers in it, and containers in pod can communicate with each other, which gives opportunity to run , for example, disposable postgres for tests, with very simple setup.
we might need an abilty to define what containers we need in pod, resources, commands, args, node selectors and affinities, some tools just allow user to write its own manifest and inject its own container in it.
if secrets are stored in kubernetes other pods, that run in same namespace, have a chance to access them, which might be security issue.

Mo0rBy commented 2 months ago

I've been following this for some time now as I'd really love to switch over to Gitea actions and move away from Jenkins as a CI/CD tool, the main big thing that is preventing from creating a case for it is exactly this, a way to dynamically create temporary pods that only live for as long as the 'Jenkins job/Gitea workflow' so that resourcing can be controlled in a native Kubernetes way.

I agree with everything @querplis has said above, and can state that the Jenkins Kubernetes plugin has the same functionality.

No separate controller needed (I don't know if this is because Jenkins has a single Master and many Workers architecture or not)
Jenkins Master just initiates a job + pod, waits for the pod the be ready and then just starts the job script and waits for an exit code (this does sometimes mean that jobs need to be manually cleared out if something goes wrong but this is abnormal behaviour and is only required when something goes wrong)
You can write and load podTemplate yaml files to be used for different job scripts so that a Go service build pipeline has a Go container or a Java service pipeline has a Maven or Gradle container, etc
Secrets are managed in Jenkins and we typically use the CredentialsBinding plugin to load secret values like credentials or SSH keys. Not sure how this would be managed for this as I don't know how secrets are managed with Github Actions either which is the most similar tool I think. I know that secret management can be done externally from Jenkins though, for example with HashiCorp Vault.

omniproc commented 2 months ago

@Mo0rBy

big thing that is preventing from creating a case for it is exactly this, a way to dynamically create temporary pods that only live for as long as the 'Jenkins job/Gitea workflow' so that resourcing can be controlled in a native Kubernetes way.

This already works, as documented in this GH issue.

@querplis

controller is not really needed , it is possible to communicate directly with kubernetes api and create pods.

It's not needed but it's good practices to separate concerns ( and argueably this design also leads to single responsibility and possibly open-closed). The job of the controller is to talk to and monitor the K8s API and based on that do what it must on the backend system, and/or - vice versa - wait for instructions from the external system (Gitea) to get instructions and perform the required tasks in K8s.

since all of the jobs are one shoot and we don't need to ensure any uptime for job, we can judge success or failure from exit codes of whatever we run in containers, there is no need for anything more then pod with containers in it, and containers in pod can communicate with each other, which gives opportunity to run , for example, disposable postgres for tests, with very simple setup.

True, simple pod with multiple containers might do for most cases. More complex scenarios however might require pre or post steps. Take a look at the ecosystems of FluxCD and ArgoCD and how they involved from basicly what you argue for to something much bigger and more complex. But I agree that for an initial implementation having granular control over a pod - started as k8s job or simple pod - is good enough. However at that point, since it's just as easy as applying a manifest against the K8s API, why limit it? Just leave it up to the user what he wants to define in the manifest and have it applied as the Gitea action workflow starts. A simple label system could signal Gitea what pod to consider important for the workflow to fail (e.g. apply a label gitea-action-must-succeed to all pods that Gitea will consider relevant for the workflow to succeed).

we might need an abilty to define what containers we need in pod, resources, commands, args, node selectors and affinities, some tools just allow user to write its own manifest and inject its own container in it.

Same here. Simply allow the user to supply a K8s manifest with the workflow and the controller would apply it to k8s.

if secrets are stored in kubernetes other pods, that run in same namespace, have a chance to access them, which might be security issue.

Secrets in Kubernetes are not designed to be "secret". They are simply configmaps that can be protected using RBAC. Within a namespace, all pods can access them, just like any configmap. Don't want that? Create a separate namespace, use RBAC. Don't try to re-invent the wheel. It will only break 90% of k8s tools you might need since they pretty much all expect you to use secrets the way they're ment to be used.

djeinstine commented 2 months ago

Hi @omniproc , @ChristopherHX ,

I've set up my runner according to all of the examples you've provided here. However, I've come to a standstill, where I cannot clone a repository.

My runner is based on @ChristopherHX 's image: ghcr.io/christopherhx/gitea-actions-runner:v0.0.13

I've tried with every container in my workflow: ubuntu:latest, ghcr.io/omniproc/act-buildkit-runner:latest, ghcr/catthehacker/ubuntu:act-22.04

At this step it always fails

- name: checkout repository 
- uses: https://github.com/actions/checkout@v4

With this error: env: '/__e/node20/bin/node': No such file or directory

How did you fix it? We're there any additional environmental variables added to the workflow or the runner itself?

Thanks for everything you guys have done so far by the way!

omniproc commented 2 months ago

Seems like you either did not install node or your nodejs env var is pointing to an empty dir. The image will not just have any binary you need. Either you build your own image that bundles nodejs or you use one of the many install nodejs actions available as a pre step in your workflow.

I don't have any node dependencies so I never tested node builds.

djeinstine commented 2 months ago

@omniproc

Yeah my problem is that I don't have anything that uses node. I'm just trying to check out my repository. Basically with your example I cannot check out my repository.

This is my workflow

name: Gitea Actions Demo
run-name: Testing out Gitea Actions 🚀
on: [push]

jobs:
  Explore-Gitea-Actions:
    runs-on: beowulf-cluster
    container:
      image: ghcr.io/catthehacker/ubuntu:act-22.04 #ghcr.io/omniproc/act-buildkit-runner:latest #
    steps:
      - name: Check out repository code.
        uses: https://github.com/actions/checkout@v4

omniproc commented 2 months ago

@omniproc

Yeah my problem is that I don't have anything that uses node. I'm just trying to check out my repository. Basically with your example I cannot check out my repository.

This is my workflow
name: Gitea Actions Demo
run-name: Testing out Gitea Actions 🚀
on: [push]

jobs:
  Explore-Gitea-Actions:
    runs-on: beowulf-cluster
    container:
      image: ghcr.io/catthehacker/ubuntu:act-22.04 #ghcr.io/omniproc/act-buildkit-runner:latest #
    steps:
      - name: Check out repository code.
        uses: https://github.com/actions/checkout@v4

You are using the github checkout action, which is a javascript action executed by nodejs. Can't tell why it is failing from the little information you provided.

ChristopherHX commented 2 months ago

@djeinstine Do - run step work fine?

Can't tell why it is failing from the little information you provided.

Yes exactly. Something like parts of your kubernetes config could be helpful.

Node normally doesn't need to be installed, this node folder should be copied to the persisted volume claim by https://github.com/actions/runner-container-hooks k8s edition during the Setup Job Step.

djeinstine commented 2 months ago

@omniproc @ChristopherHX

Yes I didn't post my config. I took inspiration from this post, and the full working example from @motoki317 here https://github.com/traPtitech/manifest/blob/3ff7e8e6dfa3e0e4fed9a9e8ca1ad09f9b132ff1/gitea/act-runner/gitea-act-runner.yaml

and came up with the following:

relevant part of gitea-act-runner.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gitea-act-runner
spec:
  serviceName: gitea-act-runner
  replicas: 1
  revisionHistoryLimit: 0
  volumeClaimTemplates:
    - metadata:
        name: work
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: "local-path"
        resources:
          requests:
            storage: 1Gi
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete
    whenDeleted: Delete
  selector:
    matchLabels:
      app: gitea-act-runner
  template:
    metadata:
      labels:
        app: gitea-act-runner
    spec:
      serviceAccountName: gitea-act-runner
      containers:
        - name: runner
          image: ghcr.io/christopherhx/gitea-actions-runner:v0.0.13
          imagePullPolicy: Always
          env:
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_CLAIM_NAME
              value: work-$(ACTIONS_RUNNER_POD_NAME)
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "true"
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/k8s/index.js
            - name: GITEA_INSTANCE_URL
              value: https://gitea.nas.homespace.ovh/
            - name: GITEA_RUNNER_REGISTRATION_TOKEN
              valueFrom:
                secretKeyRef:
                  name: act-runner
                  key: registration-token
            - name: GITEA_RUNNER_LABELS
              value: beowulf-cluster
            - name: GITEA_RUNNER_NAME
              value: beowulf-act-runner
          volumeMounts:
            - mountPath: /home/runner/_work
              name: work
          resources:
            requests:
              cpu: "100m"
              memory: "500Mi"
            limits:
              cpu: "1"
              memory: "2Gi"

demo workflow demo.yaml

name: Gitea Actions Demo
run-name: Testing out Gitea Actions 🚀
on: [push]

jobs:
  Explore-Gitea-Actions:
    runs-on: beowulf-cluster
    container:
      image: ghcr.io/catthehacker/ubuntu:act-22.04 #ghcr.io/omniproc/act-buildkit-runner:latest #
    steps:
      - name: Check out repository code.
        uses: https://github.com/actions/checkout@v4

output in Actions

ChristopherHX commented 2 months ago

Something is odd in your kubernetes cluster..., maybe try a different storage provider or change size limits?

I have no clue why the external files are not there, for me they are always copied back if I delete it manually.

The externals folder is intact in the image as well, otherwise the k8s hooks couldn't run as well.

storageClassName: "local-path"

This difference is not an issue for me, works with both default and this one.

Had to enable this provider in my minikube.

ubuntu@ubuntu:~$ minikube addons enable storage-provisioner-rancher
❗  storage-provisioner-rancher is a 3rd party addon and is not maintained or verified by minikube maintainers, enable at your own risk.
❗  storage-provisioner-rancher does not currently have an associated maintainer.
    ▪ Using image docker.io/rancher/local-path-provisioner:v0.0.22
    ▪ Using image docker.io/busybox:stable
🌟  The 'storage-provisioner-rancher' addon is enabled

Please add a run step before checkout that checks that the /__e contains the node tool by recusively enumerating the folder.

Maybe add a sleep 1000 run step then inspect the spawned pod mounts for more data of this issue.

e.g. if I look at my kubernetes, check that the externals folder has the node program inside of it and you can execute it. The path is from by kubernetes dashboard and I sshed onto minikube

ls -l /opt/local-path-provisioner/pvc-b32ef2bd-682f-42d0-a8c6-d79075247505_default_work-gitea-act-runner-x1-0
total 24
drwxr-xr-x 3 docker 1000 4096 Sep  4 19:32 _PipelineMapping
drwxr-xr-x 3 docker 1000 4096 Sep  4 19:32 _actions
drwxr-xr-x 2 docker 1000 4096 Sep  4 19:33 _temp
drwxr-xr-x 2 docker 1000 4096 Sep  4 19:32 _tool
drwxr-xr-x 4 docker 1000 4096 Sep  4 19:32 externals
drwxr-xr-x 3 docker 1000 4096 Sep  4 19:32 test-actions

The k8s container hooks you are using here are unchanged ones from https://github.com/actions/runner-container-hooks/releases/tag/v0.6.1 using unchanged actions/runner 2.317.0

actions/runner ok
k8s hooks / partially broken??

This is the function responsible to provide the externals that are not found: https://github.com/actions/runner-container-hooks/blob/73655d4639a62f6e4b3d70b5878bc4367c0a436e/packages/k8s/src/hooks/prepare-job.ts#L184-L193

djeinstine commented 2 months ago

@ChristopherHX Thank you for the tip. I had not thought about looking into mounts. I checked out my mounts and everything is correct on the node's side. I'm using Talos so I cannot ssh into the OS. I can, however, list the directories using the CLI.

Node/Storage Side Results:

I can see all of the relevant files listed in your response. The only difference being is the folder you have as test-actions, for me is named kubernetes. Listing all directories in the PV with recursive depth set to 2, I can find the relevant files that the runner claims are missing:

talosctl list -d 2 /var/local-path-provisioner/pvc-d1ef7f45-9eb9-44a0-a521-9cf084c829ba_gitea_work-gitea-act-runner-0 -n 192.168.2.99
NODE           NAME
192.168.2.99   .
192.168.2.99   _PipelineMapping
192.168.2.99   _PipelineMapping/lyons
192.168.2.99   _actions
192.168.2.99   _actions/https~
192.168.2.99   _temp
192.168.2.99   _tool
192.168.2.99   externals
192.168.2.99   externals/node16
192.168.2.99   externals/node16_alpine
192.168.2.99   externals/node20
192.168.2.99   externals/node20_alpine
192.168.2.99   kubernetes
192.168.2.99   kubernetes/kubernetes

Runner/Pod Side Results: So I checked the mounts and the correct mounts are indeed there. Including the '/__e' directory mapped to 'externals' that the runner claims is missing. Pod Mounts:

    Mounts:
      /__e from work (rw,path="externals")
      /__w from work (rw)
      /github/home from work (rw,path="_temp/_github_home")
      /github/workflow from work (rw,path="_temp/_github_workflow")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zkjg6 (ro)

So I did an ls at two folders above the entry point (ls .. and ls ../..) of the runner and these are my findings Runner:

So it seems to me the hooks are perfectly fine, but the mounts are not working properly. /__e from work (rw,path="externals") is somehow not being linked. That leads me to wonder how does the set up work at all? It doesn't seem to be a cluster problem but a simple pod/image mount issue. @ChristopherHX do you have an example deployment/statefulset that you use?

ChristopherHX commented 2 months ago

My example is pretty much the same as yours, but our cluster are the most different factor here. I'm using minikube on my arm64 based server. As omniproc and motoki317 have reported, this should actually work in other clusters as well.

example.yml (This is an export of my minikube with local-path povider as you have shown in your snipped, some generated fields have been removed again)

The /__e from work (rw,path="externals") volume always mount correctly on my end...

volume work (no sup path) mounts correctly in runner and job pod

What I didn't understand, does sub path mounts work on your cluster if you create a pod yourself with a subpath mount of a volume like done by the k8s hooks?

For me it looks like a k8s runner container hooks bug or missing functionality in your kubernetes cluster.

I read a while back that older versions of dockerd didn't support mounting a sub folder within a volume to a container, but idk if that is correct.

_I assume modifying k8s hooks (link to the code in earlier comments) so they don't mount /__e and instead execute an ln -s command in the setup job could workaround cluster incompatibility?_ This code managing the job pod is maintained by GitHub Employees and not by me.

djeinstine commented 2 months ago

@ChristopherHX I finally got it. I switched my storage class to longhorn and it works now. I have no idea why the local path provisioner doesn't work in my cluster.

omniproc commented 2 months ago

@ChristopherHX I finally got it. I switched my storage class to longhorn and it works now. I have no idea why the local path provisioner doesn't work in my cluster.

You mentioned you use Talos, which comes with some special requirements for local path provisioner. Maybe that was the issue? https://www.talos.dev/v1.7/kubernetes-guides/configuration/local-storage/

djeinstine commented 2 months ago

I just looked at my extra mounts section and I didn't mount the hostPath mounts. Looks like I skipped that section of the docs.

querplis commented 2 months ago

It's not needed but it's good practices to separate concerns ( and argueably this design also leads to single responsibility and possibly open-closed). The job of the controller is to talk to and monitor the K8s API and based on that do what it must on the backend system, and/or - vice versa - wait for instructions from the external system (Gitea) to get instructions and perform the required tasks in K8s.

starting from direct access and then moving to controller as/if neeeded , is a way to faster get there, since controller will add extra complexity.

True, simple pod with multiple containers might do for most cases. More complex scenarios however might require pre or post steps. Take a look at the ecosystems of FluxCD and ArgoCD and how they involved from basicly what you argue for to something much bigger and more complex. But I agree that for an initial implementation having granular control over a pod - started as k8s job or simple pod - is good enough. However at that point, since it's just as easy as applying a manifest against the K8s API, why limit it? Just leave it up to the user what he wants to define in the manifest and have it applied as the Gitea action workflow starts. A simple label system could signal Gitea what pod to consider important for the workflow to fail (e.g. apply a label gitea-action-must-succeed to all pods that Gitea will consider relevant for the workflow to succeed).

its not fluxcd and argocd that we should look in this case, since they do completely different things, but jenkns, gitlab and drone

Same here. Simply allow the user to supply a K8s manifest with the workflow and the controller would apply it to k8s.

exactly!

Secrets in Kubernetes are not designed to be "secret". They are simply configmaps that can be protected using RBAC. Within a namespace, all pods can access them, just like any configmap. Don't want that? Create a separate namespace, use RBAC. Don't try to re-invent the wheel. It will only break 90% of k8s tools you might need since they pretty much all expect you to use secrets the way they're ment to be used.

what i was trying to say that if there is an option to not use k8s secrets for storing job secrets, but instead inject them from somewhere else, directly into job, then that might be a better option, which lets people drastically reduce amount of namespaces they need just to isolate, in some cases, single secret value.

omniproc commented 1 month ago

what i was trying to say that if there is an option to not use k8s secrets for storing job secrets, but instead inject them from somewhere else, directly into job, then that might be a better option, which lets people drastically reduce amount of namespaces they need just to isolate, in some cases, single secret value.

I'd argue if you end up with 1 namespace per secret you either have only one secret per concern or your architecture should be refactored. "injecting them from somewhere else" however is always easy if the user has full access to the manifests that should be applied. I personally view the sideloading of secrets as an anti-pattern, but you might have a different opinion on the matter.

go-gitea / gitea

Actions-Runner-Controller support for Gitea Actions #29567

Feature Description

Screenshots