tinkerbell / rufio

Kubernetes Controller for BMC Interactions
Apache License 2.0

intermittent job failures #235

Closed ibrokethecloud closed 1 month ago

ibrokethecloud commented 1 month ago

When creating jobs with multiple tasks, such as rebooting a host, it is common to see intermittent job failures because the rufio controller is unable to get an up-to-date list of submitted tasks from the informer cache.

For example, the following is a sample Job that reboots a host:

apiVersion: bmc.tinkerbell.org/v1alpha1
kind: Job
metadata:
  name: arm-147-test
  namespace: tink-system
  resourceVersion: "135177051"
  uid: 151d60c2-5698-4069-9e8a-73cbe863ff28
spec:
  machineRef:
    name: arm-147
    namespace: tink-system
  tasks:
  - powerAction: "off"
  - powerAction: "on"
status:
  conditions:
  - status: "True"
    type: Running
  - message: 'failed to create Task tink-system/arm-147-test-task-0: tasks.bmc.tinkerbell.org
      "arm-147-test-task-0" already exists'
    status: "True"
    type: Failed
  startTime: "2024-07-10T12:18:58Z"

This seems to be caused by the task-listing logic at https://github.com/tinkerbell/rufio/blob/main/controller/job.go#L109-L138: the controller lists the tasks owned by the job, but the list comes back empty because the informer cache has not yet caught up with the task that was just created. As a result, the controller attempts to create task-0 again, which fails because that task already exists.

Expected Behaviour

Current Behaviour

Possible Solution

One possible fix, which I have available in my local fork, is to disable caching for Job objects when setting up the manager, with a minor change to the controller options as shown below. This ensures that these objects are read from the apiserver, which avoids the issue.

opts := ctrl.Options{
    Scheme: scheme,
    Metrics: metricsserver.Options{
        BindAddress: metricsAddr,
    },
    HealthProbeBindAddress: probeAddr,
    LeaderElection:         enableLeaderElection,
    LeaderElectionID:       "e74dec1a.tinkerbell.org",
    Client: client.Options{
        Cache: &client.CacheOptions{
            // Read Job objects directly from the apiserver instead of the informer cache.
            DisableFor: []client.Object{&v1alpha1.Job{}},
        },
    },
}

Steps to Reproduce (for bugs)

  1. Submit the Job from the example above.
  2. The Job intermittently fails before all of its tasks are created, with the "already exists" error shown in the status above.

Context

Your Environment