operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

Issues about curator running in smaug cluster #515

Closed · skanthed closed this issue 2 years ago

skanthed commented 2 years ago

Describe the bug

Cron jobs are not starting for the curator project.

The backup-to-bucket pod reports: "Unable to attach or mount volumes: unmounted volumes=[koku-metrics-operator-data]"

Link to the pod events: https://console-openshift-console.apps.smaug.na.operate-first.cloud/k8s/ns/koku-metrics-operator/pods/backup-to-bucket-27385920-crmjz/events

Screenshots

[screenshot: backup-to-bucket]
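
For reference, the same events can also be pulled from the CLI; a minimal sketch, assuming access to the koku-metrics-operator namespace (the pod name is taken from the console link above):

```
# Events for the failed backup pod (name taken from the console link above)
oc -n koku-metrics-operator describe pod backup-to-bucket-27385920-crmjz

# Or list recent events in the namespace, newest last
oc -n koku-metrics-operator get events --sort-by=.lastTimestamp
```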

larsks commented 2 years ago

It's likely this job failed because the pod was scheduled on a node other than the one where the volume was available. Despite this error, it looks like your backups are otherwise completing regularly; right now I see:

```
NAME                                              READY   STATUS      RESTARTS         AGE
backup-to-bucket-27306391--1-zvkcw                0/2     Completed   0                56d
backup-to-bucket-27387360--1-fd68p                0/2     Completed   0                15h
backup-to-bucket-27387720--1-n9c4m                0/2     Completed   0                9h
backup-to-bucket-27388080--1-98rqk                0/2     Completed   0                3h30m
```

We may be able to avoid this problem by providing some sort of scheduling hint that ensures the backup pod runs in the right place. I'm going to look into our options for that.
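
One way to see whether this is what happened (a sketch, not part of the original thread) is to compare the node the failed backup pod was assigned to with the node running the controller pod that already has the volume mounted:

```
# The NODE column shows where each pod was scheduled
oc -n koku-metrics-operator get pods -o wide

# Which pods currently use the claim (assuming the PVC shares the volume's name)
oc -n koku-metrics-operator describe pvc koku-metrics-operator-data
```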

HumairAK commented 2 years ago

Just to confirm, @larsks, that was indeed the issue. We pointed the cronjob at the same node as the controller pod, and the job ran successfully.
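
For the record, a minimal sketch of what such a manual pin could look like; the exact field used in the actual change isn't recorded here, and the node name below is only an example. In a CronJob the pod template sits under spec.jobTemplate:

```
# Hypothetical fragment of the backup-to-bucket CronJob spec
spec:
  jobTemplate:
    spec:
      template:
        spec:
          nodeName: example-worker-0   # example value: the node running the controller pod
```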

skanthed commented 2 years ago

But the pod that was created has nothing in its logs; it pushed nothing to the database, yet its status was Completed. I'm waiting for a few more pods to be created and will update here.
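
A quick way to pull those logs, assuming the pod names from the listing above (the 0/2 READY column means each job pod has two containers):

```
# Logs from all containers of the most recent backup pod
oc -n koku-metrics-operator logs backup-to-bucket-27388080--1-98rqk --all-containers
```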

HumairAK commented 2 years ago

@skanthed apologies, by "successfully" I meant that the pod was scheduled and ran to completion, not that the container process within it performed as it was supposed to.

That I have no idea about.

skanthed commented 2 years ago

@HumairAK I understand. No issues, I was just clarifying the details.

larsks commented 2 years ago

We should be able to solve this by configuring pod affinity constraints. That allows us to schedule one pod (e.g., the backup pod) by requiring that it runs on a node that is already running other pods with specific labels.

For this to work, we need to ensure the koku metrics pod has appropriate labels, which means modifying the relevant Deployment resource. How is koku metrics being deployed; is this via an operator or some other mechanism?
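
If no suitable label existed, adding one would look roughly like the sketch below. The Deployment name and the label key/value are assumptions here, and an operator-managed Deployment may reconcile manual edits away:

```
# Add an example label to the controller pod template (hypothetical label)
oc -n koku-metrics-operator patch deployment koku-metrics-controller-manager \
  --type merge -p '{"spec":{"template":{"metadata":{"labels":{"app":"koku-metrics"}}}}}'
```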

skanthed commented 2 years ago

Deployed via an operator, no other mechanism.

larsks commented 2 years ago

@skanthed where does the operator come from? It doesn't provide a very rich set of labels on the controller pod, but there is one label we can use:

```
$ oc -n koku-metrics-operator get pod koku-metrics-controller-manager-784bf87577-k4dhx -o jsonpath='{.metadata.labels}{"\n"}'
{"control-plane":"controller-manager","pod-template-hash":"784bf87577"}
```

So if we match on the control-plane label, we would add an affinity section something like this to the backup pod:

```
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: control-plane
          operator: In
          values:
          - controller-manager
      topologyKey: kubernetes.io/hostname
```

This should arrange for the backup-to-bucket pod to run on the same node as the koku-metrics-controller-manager pod.
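
For completeness, in a CronJob this stanza belongs under the pod template, i.e. spec.jobTemplate.spec.template.spec. A sketch of the surrounding structure, assuming the CronJob is named backup-to-bucket:

```
apiVersion: batch/v1              # batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: backup-to-bucket          # assumed name, matching the job/pod prefix
  namespace: koku-metrics-operator
spec:
  schedule: "..."                 # existing schedule, unchanged
  jobTemplate:
    spec:
      template:
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: control-plane
                    operator: In
                    values:
                    - controller-manager
                topologyKey: kubernetes.io/hostname
```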

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with `/remove-lifecycle stale`. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with `/close`.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with `/remove-lifecycle rotten`. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with `/close`.

/lifecycle rotten

sesheta commented 2 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with `/reopen`. Mark the issue as fresh with `/remove-lifecycle rotten`.

/close

sesheta commented 2 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/operate-first/support/issues/515#issuecomment-1172869900):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.