scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/

helm: backup and repairs are ignored #368

Closed. etix closed this issue 3 years ago

etix commented 3 years ago

Describe the bug

Backups and repairs declared in the YAML are ignored. I see no relevant log lines in the operator, the agent, or the manager after applying the update, and sctool does not report the newly added backup or repair task.

However, I can start a backup or repair manually with sctool (from within a scylla-manager pod) and it works as expected; I can see the backed-up files in the GCS bucket.

To Reproduce

Steps to reproduce the behavior:

  1. Install scylla-operator, scylla-manager and a scylla cluster on GKE using Helm
  2. Add a backup / repair rule in the values.yaml override of the scylla cluster
  3. Run helm upgrade to apply the change (a sketch of the command is shown below)
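
A rough example of that last step (the release name and chart reference are assumptions based on this setup; use whatever you installed with):

helm upgrade scylla scylla/scylla \
  --namespace database \
  --values values.yaml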

Expected behavior

  1. The backup / repair tasks should be created and run.
  2. I expect to see the relevant lines in sctool task list -a
  3. I also expect to see a line in the logs of the operator or the manager confirming that the configuration was accepted

Config Files

Cluster configuration (helm)

cpuset: false # if the node has four or fewer CPUs, don’t use this option.
hostNetworking: true
developerMode: false

sysctls:
- "fs.aio-max-nr=2097152"

scyllaImage:
  repository: scylladb/scylla
  # Overrides the image tag whose default is the chart appVersion.
  tag: 4.2.3 # 4.3.0 bug: https://github.com/scylladb/scylla/issues/8032

backups:
- name: "daily backup"
  location: ["gcs:scylla-backups"]
  interval: "1d"
  retention: 3
repairs:
- name: "cluster repair"
  interval: "2d"
  intensity: "2"

datacenter: us-central1
racks:
- name: us-central1-c
  members: 3
  storage:
    capacity: 40G
  resources:
    limits:
      cpu: 1
      memory: 5G
    requests:
      cpu: 1
      memory: 5G
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            # for 1.17+ : topology.kubernetes.io/zone
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
              - us-central1-c
    tolerations:
      - key: role
        operator: Equal
        value: scylla-clusters
        effect: NoSchedule
  agentVolumeMounts:
    - name: scylla-manager-service-account
      mountPath: /var/run/secret/scylla/
      readOnly: true
  volumes:
    - name: scylla-manager-service-account
      secret:
        secretName: scylla-manager-service-account

Output of kubectl -n database get scyllacluster scylla -o yaml after applying the update.

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  annotations:
    meta.helm.sh/release-name: scylla
    meta.helm.sh/release-namespace: database
  creationTimestamp: "2021-02-05T14:33:47Z"
  generation: 8
  labels:
    app.kubernetes.io/managed-by: Helm
  name: scylla
  namespace: database
  resourceVersion: "167337173"
  selfLink: /apis/scylla.scylladb.com/v1/namespaces/database/scyllaclusters/scylla
  uid: 7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6
spec:
  agentRepository: scylladb/scylla-manager-agent
  agentVersion: 2.2.1
  backups:
  - interval: 1d
    location:
    - gcs:scylla-backups
    name: daily backup
    numRetries: 3
    retention: 3
    startDate: now
  datacenter:
    name: us-central1
    racks:
    - agentResources: {}
      agentVolumeMounts:
      - mountPath: /var/run/secret/scylla/
        name: scylla-manager-service-account
        readOnly: true
      members: 3
      name: us-central1-c
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - us-central1-c
        tolerations:
        - effect: NoSchedule
          key: role
          operator: Equal
          value: scylla-clusters
      resources:
        limits:
          cpu: "1"
          memory: 5G
        requests:
          cpu: "1"
          memory: 5G
      scyllaAgentConfig: scylla-agent-config
      scyllaConfig: scylla-config
      storage:
        capacity: 40G
      volumes:
      - name: scylla-manager-service-account
        secret:
          secretName: scylla-manager-service-account
  genericUpgrade:
    failureStrategy: Retry
    pollInterval: 1s
  network:
    hostNetworking: true
  repairs:
  - intensity: "2"
    interval: 2d
    name: cluster repair
    numRetries: 3
    parallel: 0
    smallTableThreshold: 1GiB
    startDate: now
  repository: scylladb/scylla
  sysctls:
  - fs.aio-max-nr=2097152
  version: 4.2.3
status:
  racks:
    us-central1-c:
      members: 3
      readyMembers: 3
      version: 4.2.3

Output of sctool task list -a

Cluster:  (ce142f17-ac1a-4f9d-9720-e1a850d78be7)
+-------------------------------------------------------------+-----------+-------------------------------+--------+
| Task                                                        | Arguments | Next run                      | Status |
+-------------------------------------------------------------+-----------+-------------------------------+--------+
| healthcheck/b7d01d2c-99cc-49fa-8fef-ec1c68d7e76b            |           | 05 Feb 21 18:08:59 UTC (+15s) | DONE   |
| healthcheck_alternator/bc54a2c1-8c29-4c34-a174-688229bd43d6 |           | 05 Feb 21 18:08:59 UTC (+15s) | DONE   |
| healthcheck_rest/c20146aa-e6f2-4128-9c35-22121b7d89c3       |           | 05 Feb 21 18:09:44 UTC (+1m)  | DONE   |
| repair/b455dd0d-7058-4b4b-9311-465e6890079f                 |           | 06 Feb 21 00:00:00 UTC (+7d)  | NEW    |
+-------------------------------------------------------------+-----------+-------------------------------+--------+

Environment:

Additional context

I'm using auth_token, but since sctool works I don't think that's relevant to this issue.

zimnx commented 3 years ago

Please attach the logs of each pod in the Scylla Manager namespace.
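
For example (namespace and pod names are placeholders; use whatever your Scylla Manager deployment runs in):

kubectl -n <scylla-manager-namespace> get pods
kubectl -n <scylla-manager-namespace> logs <pod-name> > <pod-name>.txt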

etix commented 3 years ago

Logs of the manager pods:

scylla-manager-0.txt scylla-manager-7ddd6648c9-6xdz9.txt scylla-operator-0.txt

Log of an agent running on one of the ScyllaDB nodes: scylla-manager-agent.txt

Notes:

etix commented 3 years ago

I'm still debugging, trying to understand what could be the issue here.

I know that the operator is monitoring the ScyllaCluster since I can scale down a rack.

{"L":"INFO","T":"2021-02-05T22:54:36.822Z","N":"cluster-controller","M":"Next Action: Scale-Down rack","cluster":"database/scylla","resourceVersion":"167461087","name":"us-central1-c","_trace_id":"lf6c3hgkTZycvi-ifW5q5Q"}

But when I try to add/remove/edit a backup, nothing seems to happen. The only lines I can get from the operator are:

{"L":"DEBUG","T":"2021-02-05T22:57:52.907Z","N":"cluster-controller","M":"Reconcile request","request":"database/scylla","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}
{"L":"INFO","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"Starting reconciliation...","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"Cluster State","cluster":"database/scylla","resourceVersion":"167462316","object":{"metadata":{"name":"scylla","namespace":"database","selfLink":"/apis/scylla.scylladb.com/v1/namespaces/database/scyllaclusters/scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","resourceVersion":"167462316","generation":17,"creationTimestamp":"2021-02-05T14:33:47Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"scylla","meta.helm.sh/release-namespace":"database"}},"spec":{"version":"4.2.3","repository":"scylladb/scylla","agentVersion":"2.2.1","agentRepository":"scylladb/scylla-manager-agent","genericUpgrade":{"failureStrategy":"Retry","pollInterval":"1s"},"datacenter":{"name":"us-central1","racks":[{"name":"us-central1-c","members":2,"storage":{"capacity":"40G"},"placement":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["us-central1-c"]}]}]}},"tolerations":[{"key":"role","operator":"Equal","value":"scylla-clusters","effect":"NoSchedule"}]},"resources":{"limits":{"cpu":"1","memory":"5G"},"requests":{"cpu":"1","memory":"5G"}},"agentResources":{},"volumes":[{"name":"scylla-manager-service-account","secret":{"secretName":"scylla-manager-service-account"}}],"agentVolumeMounts":[{"name":"scylla-manager-service-account","readOnly":true,"mountPath":"/var/run/secret/scylla/"}],"scyllaConfig":"scylla-config","scyllaAgentConfig":"scylla-agent-config"}]},"sysctls":["fs.aio-max-nr=2097152"],"network":{"hostNetworking":true},"repairs":[{"name":"test cluster repair","startDate":"now","interval":"2d","numRetries":3,"intensity":"2","parallel":0,"smallTableThreshold":"1GiB"}],"backups":[{"name":"test42","startDate":"now","interval":"1d","numRetries":3,"location":["gcs:scylla-backups"],"retention":3}]},"status":{"racks":{"us-central1-c":{"version":"4.2.3","members":2,"readyMembers":2}}}},"_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"All StatefulSets are up-to-date!","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.912Z","N":"cluster-controller","M":"Cleanup: service list","cluster":"database/scylla","resourceVersion":"167462316","len":3,"items":[{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-0","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-0","uid":"ffc0a206-dcc1-48b1-a68d-9ed7337f4d97","resourceVersion":"167284599","creationTimestamp":"2021-02-05T14:33:48Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scylla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c","scylla/seed":""},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-0"},"clusterIP":"10.7.248.241","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}},{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-1","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-1","uid":"e63836a7-a4a5-41c0-b5d3-7f08128007d9","resourceVersion":"167285913","creationTimestamp":"2021-02-05T14:37:22Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scylla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c","scylla/seed":""},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-1"},"clusterIP":"10.7.247.196","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}},{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-2","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-2","uid":"077ef56c-bc7a-438c-9643-ffa49283fb64","resourceVersion":"167461550","creationTimestamp":"2021-02-05T22:55:53Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scy
lla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c"},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-2"},"clusterIP":"10.7.247.150","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}}],"_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"INFO","T":"2021-02-05T22:57:52.913Z","N":"cluster-controller","M":"Calculating cluster status...","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"INFO","T":"2021-02-05T22:57:52.913Z","N":"cluster-controller","M":"Writing cluster status.","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}
{"L":"INFO","T":"2021-02-05T22:57:52.920Z","N":"cluster-controller","M":"Reconciliation successful","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}

I'm not sure what the expected output is here (if any), but it seems that no action is scheduled at all.

zimnx commented 3 years ago

You won't spot anything in the Operator logs regarding backups/repairs. There is a pod called "Scylla Manager Controller" which watches these fields and synchronizes them with the Scylla Manager state. You can read about it in our documentation: https://operator.docs.scylladb.com/stable/manager.html
Your backup was registered and ran successfully; see the Scylla Manager logs:

{"L":"INFO","T":"2021-02-05T16:44:13.701Z","N":"scheduler","M":"Task started","cluster_id":"ce142f17-ac1a-4f9d-9720-e1a850d78be7","task_type":"backup","task_id":"3495b4e7-ab35-47ce-b4cc-1a79d03ffd94","run_id":"4f856cf8-67d1-11eb-bee8-468ba01b8bf3","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:14.734Z","N":"backup.snapshot","M":"Taking snapshots...","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:17.406Z","N":"backup.upload","M":"Uploading snapshot files...","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:25.306Z","N":"scheduler","M":"Task ended","cluster_id":"ce142f17-ac1a-4f9d-9720-e1a850d78be7","task_type":"backup","task_id":"3495b4e7-ab35-47ce-b4cc-1a79d03ffd94","run_id":"4f856cf8-67d1-11eb-bee8-468ba01b8bf3","status":"DONE","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}

But since the scylla-manager-0 logs are almost empty, it looks like the resources allocated to this pod were too small and the pod is getting killed. Check the events in the Scylla Manager namespace.
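
For example (the namespace is a placeholder for wherever Scylla Manager runs):

kubectl -n <scylla-manager-namespace> get events --sort-by=.lastTimestamp
kubectl -n <scylla-manager-namespace> describe pod scylla-manager-0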

etix commented 3 years ago

Those lines are the result of a manual backup task made using sctool directly.

During the lifetime of the pods I did a manual backup from sctool.

What doesn't work are the tasks declared directly in the YAML; these are ignored. In the logs I provided earlier I made at least four changes to the backups, and not a single task was created or started.

As for the scylla-manager-0 pod, it is still running since I started it a few days ago, and no new log line has appeared. It hasn't been killed either.

etix commented 3 years ago

I made some progress based on your comment. After focusing on the manager, I changed the logLevel of scylla-manager to debug, tried to add a new backup rule to the cluster, and this is what I got from scylla-manager-0:

{"L":"DEBUG","T":"2021-02-08T14:46:09.827Z","N":"scylla-manager-controller","M":"ignoring reconcile","cluster":"database/scylla"}

I guess that's not the expected behavior?
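
For reference, I raised the log level with something like the following (assuming the scylla-manager chart exposes a logLevel value; the release and chart names match my setup):

helm upgrade scylla-manager scylla/scylla-manager \
  --namespace database \
  --set logLevel=debug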

zimnx commented 3 years ago

Do you deploy Scylla Manager and your Scylla cluster in the same namespace?

etix commented 3 years ago

@zimnx yes I do, in a "database" namespace. Does the manager ignore events from its own namespace?

zimnx commented 3 years ago

So that's the reason. Because SM also uses Scylla as its internal database, Scylla Clusters deployed in the same namespace are ignored. This filter should either be made stricter or perhaps removed. As a workaround you can deploy SM in a different namespace than your Scylla Cluster.
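
A sketch of that workaround with the Helm charts (release and chart names are assumptions; adjust to how you installed things):

kubectl create namespace scylla-manager
helm install scylla-manager scylla/scylla-manager --namespace scylla-manager

Your ScyllaCluster stays where it is; only Scylla Manager (and its internal Scylla) moves to the new namespace.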

etix commented 3 years ago

Deploying scylla-manager in its own namespace did the trick. Thanks @zimnx.

Cluster: database/scylla (90651744-9f03-4c5e-9e8c-fbf6cb51349d)
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+
| Task                                                        | Arguments                                                  | Next run                      | Status |
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+
| healthcheck/ea1b2f3b-3b93-4e8b-851c-f46959b79e58            |                                                            | 08 Feb 21 15:21:08 UTC (+15s) | DONE   |
| healthcheck_alternator/149f84f6-294a-40b2-9561-9dd94ae89f34 |                                                            | 08 Feb 21 15:21:08 UTC (+15s) | DONE   |
| healthcheck_rest/268a3ea6-3a08-43bf-a84e-1c5221a229dc       |                                                            | 08 Feb 21 15:31:08 UTC (+1m)  | NEW    |
| repair/42b96cb1-cb93-4eeb-a5d3-d9d390872986                 | --intensity 2 --parallel 0 --small-table-threshold 1.00GiB | 10 Feb 21 15:19:39 UTC (+2d)  | DONE   |
| backup/4a8d2de5-5367-4a74-8f19-2723563beee1                 | -L gcs:scylla-backups --retention 3                        | 11 Feb 21 15:19:39 UTC (+3d)  | DONE   |
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+

To avoid other people wasting their time debugging the same issue, I think it would be a nice improvement to add an annotation to the manager's internal Scylla cluster and ignore it based on that, instead of ignoring events from the whole namespace.
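
Purely as an illustration of that idea (both the annotation key and the internal cluster name below are hypothetical, not something the operator supports today), the manager's own cluster could be marked and the controller could skip anything carrying the marker instead of filtering by namespace:

kubectl -n scylla-manager annotate scyllacluster scylla-manager-cluster \
  scylla-operator.scylladb.com/skip-management="true"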