lwolf / kube-cleanup-operator

Kubernetes Operator to automatically delete completed Jobs and their Pods
MIT License

Jobs are not getting deleted #43

Closed · ratnadeep007 closed this issue 4 years ago

ratnadeep007 commented 4 years ago

Brief

kube-cleanup-operator deletes the pods for jobs, but is unable to delete the job itself. The logs show:

[timestamp] Deleting job '<job_name>'
[timestamp] Deleting pod '<pod_name>'

Expected Behavior

Both the job and its pod should be deleted.

More context

Managed Kubernetes: Yes (EKS on AWS)
Kubernetes version: 1.15

lwolf commented 4 years ago

Need more context here: the deploy manifests for the operator, plus an example job and pod.

ratnadeep007 commented 4 years ago

Deploy manifests: I ran the given kubectl command to deploy the operator in the cluster; the default deploy manifests are used.

Job Manifest

apiVersion: batch/v1
kind: Job
metadata:
  name: django-migrate
  labels:
    app: django-migrate
spec:
  backoffLimit: 4
  template:
    metadata:
      labels:
        app: django-migrate
    spec:
      restartPolicy: Never
      containers:
      - name: django-migrate
        image: django-migrate-image # from private repo
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: 1800Mi
          limits:
            memory: 1800Mi
        args:
        - python
        - manage.py
        - migrate
        - --noinput
lwolf commented 4 years ago

Please attach the output of:

kubectl get job django-migrate -o yaml > job.yaml
kubectl get pod django-migrate-<POD_ID> -o yaml > pod.yaml

ratnadeep007 commented 4 years ago

kubectl get job django-migrate -o yaml > job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  annotations: <annotations>
  creationTimestamp: "2020-05-18T13:11:29Z"
  labels:
    app: django-migrate
    app.kubernetes.io/managed-by: skaffold-v1.7.0
    skaffold.dev/builder: local
    skaffold.dev/cleanup: "true"
    skaffold.dev/deployer: kubectl
    skaffold.dev/docker-api-version: "1.40"
    skaffold.dev/profile.0: dev
    skaffold.dev/run-id: 685c60f0-25a3-4ad4-8b12-2e149b82cc0d
    skaffold.dev/tag-policy: git-commit
    skaffold.dev/tail: "true"
  name: django-migrate
  namespace: default
  resourceVersion: "442017"
  selfLink: /apis/batch/v1/namespaces/default/jobs/django-migrate
  uid: 80da5af6-8b5f-44ba-b285-ed87a1f87139
spec:
  backoffLimit: 4
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 80da5af6-8b5f-44ba-b285-ed87a1f87139
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: django-migrate
        app.kubernetes.io/managed-by: skaffold-v1.7.0
        controller-uid: 80da5af6-8b5f-44ba-b285-ed87a1f87139
        job-name: django-migrate
        skaffold.dev/builder: local
        skaffold.dev/cleanup: "true"
        skaffold.dev/deployer: kubectl
        skaffold.dev/docker-api-version: "1.40"
        skaffold.dev/profile.0: dev
        skaffold.dev/run-id: 685c60f0-25a3-4ad4-8b12-2e149b82cc0d
        skaffold.dev/tag-policy: git-commit
        skaffold.dev/tail: "true"
    spec:
      containers:
      - args:
        - python
        - manage.py
        - migrate
        - --noinput
        env:
        - name: DATABASE_URL
          valueFrom:
            configMapKeyRef:
              key: database_url
              name: django-config
        image: <ecr_repo_link>/django_migrate:latest
        imagePullPolicy: IfNotPresent
        name: django
        ports:
        - containerPort: 3000
          protocol: TCP
        resources:
          limits:
            memory: 1000Mi
          requests:
            memory: 1000Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  completionTime: "2020-05-18T13:11:56Z"
  conditions:
  - lastProbeTime: "2020-05-18T13:11:56Z"
    lastTransitionTime: "2020-05-18T13:11:56Z"
    status: "True"
    type: Complete
  startTime: "2020-05-18T13:11:29Z"
  succeeded: 1

kubectl get pod django-migrate-<POD_ID> -o yaml > pod.yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2020-05-18T13:11:29Z"
  generateName: django-migrate-
  labels:
    app: django-migrate
    app.kubernetes.io/managed-by: skaffold-v1.7.0
    controller-uid: 80da5af6-8b5f-44ba-b285-ed87a1f87139
    job-name: django-migrate
    skaffold.dev/builder: local
    skaffold.dev/cleanup: "true"
    skaffold.dev/deployer: kubectl
    skaffold.dev/docker-api-version: "1.40"
    skaffold.dev/profile.0: dev
    skaffold.dev/run-id: 685c60f0-25a3-4ad4-8b12-2e149b82cc0d
    skaffold.dev/tag-policy: git-commit
    skaffold.dev/tail: "true"
  name: django-migrate-ppw78
  namespace: default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: django-migrate
    uid: 80da5af6-8b5f-44ba-b285-ed87a1f87139
  resourceVersion: "442016"
  selfLink: /api/v1/namespaces/default/pods/django-migrate-ppw78
  uid: 25ae3126-a16c-462a-aae8-4b47f88ec8ac
spec:
  containers:
  - args:
    - python
    - manage.py
    - migrate
    - --noinput
    env:
    - name: DATABASE_URL
      valueFrom:
        configMapKeyRef:
          key: database_url
          name: django-config
    image: <ecr_repo_link>/django_migrate:latest
    imagePullPolicy: IfNotPresent
    name: django
    ports:
    - containerPort: 3000
      protocol: TCP
    resources:
      limits:
        memory: 1000Mi
      requests:
        memory: 1000Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-sc5hv
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-192-168-17-65.ap-south-1.compute.internal
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-sc5hv
    secret:
      defaultMode: 420
      secretName: default-token-sc5hv
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-05-18T13:11:29Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-05-18T13:11:56Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-05-18T13:11:56Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-05-18T13:11:29Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://5123b39ea7484d3c9c648dfa84dde30bec092b595eaeabc1104ae16a0d245d6f
    image: <ecr_repo_link>/django_migrate:latest
    imageID: docker-pullable://<ecr_repo_link>/django_migrate:latest
    lastState: {}
    name: django
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://5123b39ea7484d3c9c648dfa84dde30bec092b595eaeabc1104ae16a0d245d6f
        exitCode: 0
        finishedAt: "2020-05-18T13:11:56Z"
        reason: Completed
        startedAt: "2020-05-18T13:11:50Z"
  hostIP: 192.168.17.65
  phase: Succeeded
  podIP: 192.168.15.73
  qosClass: Burstable
  startTime: "2020-05-18T13:11:29Z"
lwolf commented 4 years ago

Could you try the new 0.7 version and let me know if you still experience this issue? https://github.com/lwolf/kube-cleanup-operator/releases/tag/v0.7.0

It now has separate loops for jobs and pods to make sure that everything gets cleaned properly.

Make sure to run it with -legacy-mode=false to be able to use this feature.
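
For example, the relevant container args would look something like this (a minimal sketch; the namespace and retention values are assumptions, adjust them to your setup):

      args:
      - --namespace=default          # only watch the default namespace
      - --legacy-mode=false          # enable the separate cleanup loops for jobs and pods
      - --delete-successful-after=0s # delete succeeded jobs (and their pods) immediately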

ratnadeep007 commented 4 years ago

I tried it on an EKS cluster where I have to deploy the job, using the example job from the documentation. It is still not working.

rbac.yml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cleanup-operator
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cleanup-operator
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - delete
- apiGroups: ["batch", "extensions"]
  resources:
  - jobs
  verbs:
  - delete
  - get
  - list
  - watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cleanup-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cleanup-operator
subjects:
- kind: ServiceAccount
  name: cleanup-operator
  namespace: default
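
(As a sanity check, kubectl auth can-i delete jobs.batch --as=system:serviceaccount:default:cleanup-operator -n default can be used to verify that this ClusterRoleBinding actually grants job deletion to the operator's service account; this is a generic kubectl RBAC check, not something specific to the operator.)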

deployment.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: cleanup-operator
  name: cleanup-operator
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      run: cleanup-operator
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        run: cleanup-operator
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "7000"
    spec:
      serviceAccountName: cleanup-operator
      containers:
      - args:
        - --namespace=default
        - --legacy-mode=false
        - --delete-successful-after=0s
        image: quay.io/lwolf/kube-cleanup-operator
        imagePullPolicy: Always
        name: cleanup-operator
        ports:
          - containerPort: 7000
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 50m
            memory: 50Mi
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30

Logs from pod

2020/06/05 12:43:21 Starting the application.
2020/06/05 12:43:21 Provided options:
    namespace: default
    dry-run: false
    delete-successful-after: 0s
    delete-failed-after: 0s
    delete-pending-after: 0s
    delete-orphaned-after: 1h0m0s
    delete-evicted-after: 15m0s

    legacy-mode: false
    keep-successful: 0
    keep-failures: -1
    keep-pending: -1

W0605 12:43:21.975384       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020/06/05 12:43:21 Controller started...
2020/06/05 12:43:21 Listening at 0.0.0.0:7000
2020/06/05 12:43:21 Listening for changes...
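
For what it's worth, kubectl get jobs -n default is a quick way to confirm whether completed jobs are still hanging around once the operator has been running with these options.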