Job that cannot start pod -> orphaned

pchila commented 2 years ago

When Job Executor Service creates a k8s Job that cannot spawn a pod (for example, because the task specifies a wrong serviceAccount), the sequence fails with a timeout, but no logs can be retrieved because there are no pods to fetch them from. Furthermore, the created job is never collected by the k8s TTL controller because it never finishes. It keeps trying to spawn pods long after Job Executor Service has given up on it (possibly indefinitely if there is a configuration error), so the user has to remove it manually.

The Job Executor Service should detect that the job failed to start, report relevant information to the user extracted from the job status/events, and explicitly delete the job if it didn't spawn any pods.
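A rough sketch of the check proposed above, not the service's actual code (Job Executor Service itself is not written in Python). It assumes the caller has already fetched the Job's status and its events as plain dicts after the service's own timeout elapsed; `active`, `succeeded`, `failed`, `type`, `reason` and `message` mirror the Kubernetes field names, while `diagnose_after_timeout` and `StartupDiagnosis` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartupDiagnosis:
    never_started: bool    # True if the Job never produced a single pod
    reason: Optional[str]  # latest FailedCreate message, if any was recorded


def diagnose_after_timeout(status: dict, events: list) -> StartupDiagnosis:
    """Decide whether a timed-out Job never managed to create a pod.

    Intended to be called only after the timeout, so that "no pods yet"
    can be read as "no pods at all".
    """
    # Missing or null counters are treated as 0.
    pods_seen = sum(status.get(key) or 0 for key in ("active", "succeeded", "failed"))
    if pods_seen > 0:
        return StartupDiagnosis(never_started=False, reason=None)

    # Keep the most recent FailedCreate warning as the user-facing reason.
    failures = [
        event.get("message", "")
        for event in events
        if event.get("type") == "Warning" and event.get("reason") == "FailedCreate"
    ]
    return StartupDiagnosis(never_started=True, reason=failures[-1] if failures else None)
```

If `never_started` is true, the caller could attach `reason` to the event it sends back and then delete the Job explicitly, since the TTL controller will not clean up a Job that never finishes.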

How to reproduce:

Use a job config with a wrong service account:

apiVersion: v2
actions:
  - name: "Hello World e2e test"
    events:
      - name: "sh.keptn.event.deployment.triggered"
    tasks:
      - name: "Greet the world"
        image: "alpine"
        serviceAccount: "inexistentServiceAccount"
        cmd:
          - echo
        args:
          - "Hello World"

christian-kreuzberger-dtx commented 2 years ago

Nice catch. We will address this as soon as possible. Related to https://github.com/keptn-contrib/job-executor-service/issues/234

christian-kreuzberger-dtx commented 2 years ago

Example output of describe job:

Name:           job-executor-service-job-1067030a-bc7f-4e65-98b1-669e-1
Namespace:      keptn-jes
Selector:       controller-uid=1da7d309-aab9-4a91-9487-bc710dfea8a9
Labels:         controller-uid=1da7d309-aab9-4a91-9487-bc710dfea8a9
                job-name=job-executor-service-job-1067030a-bc7f-4e65-98b1-669e-1
Annotations:    <none>
Parallelism:    1
Completions:    1
Pods Statuses:  0 Active / 0 Succeeded / 0 Failed
...
Events:
  Type     Reason        Age                 From            Message
  ----     ------        ----                ----            -------
  Warning  FailedCreate  56s (x4 over 2m6s)  job-controller  Error creating: pods "job-executor-service-job-1067030a-bc7f-4e65-98b1-669e-1-" is forbidden: error looking up service account keptn-jes/inexistentServiceAccount: serviceaccount "inexistentServiceAccount" not found
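For illustration, the warning shown in the events table above could be turned into a log line for the user roughly as follows. This is a minimal sketch under the assumption that the events are available as dicts carrying the Kubernetes event fields `type`, `reason`, `count` and `message`; `summarize_warnings` is a hypothetical helper, not something that exists in the service.

```python
def summarize_warnings(events: list) -> list:
    """Render Warning events as short lines suitable for a task log."""
    lines = []
    for event in events:
        if event.get("type") != "Warning":
            continue  # Normal events are not interesting for a failed start
        count = event.get("count") or 1  # a missing count means it was seen once
        reason = event.get("reason", "Unknown")
        message = event.get("message", "")
        lines.append(f"{reason} (x{count}): {message}")
    return lines


if __name__ == "__main__":
    sample = [{
        "type": "Warning",
        "reason": "FailedCreate",
        "count": 4,
        "message": 'Error creating: serviceaccount "inexistentServiceAccount" not found',
    }]
    for line in summarize_warnings(sample):
        print(line)
```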

christian-kreuzberger-dtx commented 2 years ago

Re-opening this, as the issue still exists. We can probably solve this via refactoring (https://github.com/keptn-contrib/job-executor-service/issues/244).

keptn-contrib / job-executor-service

Job that cannot start pod -> orphaned #235
