scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/

helm: backup and repairs are ignored #368

Closed. etix closed this issue 3 years ago

etix commented 3 years ago

Describe the bug

Backups and repairs declared in the YAML are ignored. I see no relevant log lines in the operator, the agent, or the manager after applying the update, and sctool does not report the newly added backup or repair task.

However, I can start a backup or repair manually with sctool (from within a scylla-manager pod) and it works as expected; I can see the backed-up files in the GCS bucket.

To Reproduce

Steps to reproduce the behavior:

  1. Install scylla-operator, scylla-manager and a scylla cluster on GKE using Helm
  2. Add a backup / repair rule in the values.yaml override of the scylla cluster
  3. Run helm upgrade to apply the change (a sketch of the command is shown below)
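
A rough example of that last step (the release name and chart reference are assumptions based on this setup; use whatever you installed with):

helm upgrade scylla scylla/scylla \
  --namespace database \
  --values values.yaml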

Expected behavior

  1. The backup / repair tasks should be created and run.
  2. I expect to see the relevant lines in sctool task list -a
  3. I also expect to see a line in the logs of the operator or the manager confirming that the configuration was accepted

Config Files

Cluster configuration (helm)

cpuset: false # if the node has four or fewer CPUs, don’t use this option.
hostNetworking: true
developerMode: false

sysctls:
- "fs.aio-max-nr=2097152"

scyllaImage:
  repository: scylladb/scylla
  # Overrides the image tag whose default is the chart appVersion.
  tag: 4.2.3 # 4.3.0 bug: https://github.com/scylladb/scylla/issues/8032

backups:
- name: "daily backup"
  location: ["gcs:scylla-backups"]
  interval: "1d"
  retention: 3
repairs:
- name: "cluster repair"
  interval: "2d"
  intensity: "2"

datacenter: us-central1
racks:
- name: us-central1-c
  members: 3
  storage:
    capacity: 40G
  resources:
    limits:
      cpu: 1
      memory: 5G
    requests:
      cpu: 1
      memory: 5G
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            # for 1.17+ : topology.kubernetes.io/zone
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
              - us-central1-c
    tolerations:
      - key: role
        operator: Equal
        value: scylla-clusters
        effect: NoSchedule
  agentVolumeMounts:
    - name: scylla-manager-service-account
      mountPath: /var/run/secret/scylla/
      readOnly: true
  volumes:
    - name: scylla-manager-service-account
      secret:
        secretName: scylla-manager-service-account

Output of kubectl -n database get scyllacluster scylla -o yaml after applying the update.

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  annotations:
    meta.helm.sh/release-name: scylla
    meta.helm.sh/release-namespace: database
  creationTimestamp: "2021-02-05T14:33:47Z"
  generation: 8
  labels:
    app.kubernetes.io/managed-by: Helm
  name: scylla
  namespace: database
  resourceVersion: "167337173"
  selfLink: /apis/scylla.scylladb.com/v1/namespaces/database/scyllaclusters/scylla
  uid: 7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6
spec:
  agentRepository: scylladb/scylla-manager-agent
  agentVersion: 2.2.1
  backups:
  - interval: 1d
    location:
    - gcs:scylla-backups
    name: daily backup
    numRetries: 3
    retention: 3
    startDate: now
  datacenter:
    name: us-central1
    racks:
    - agentResources: {}
      agentVolumeMounts:
      - mountPath: /var/run/secret/scylla/
        name: scylla-manager-service-account
        readOnly: true
      members: 3
      name: us-central1-c
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - us-central1-c
        tolerations:
        - effect: NoSchedule
          key: role
          operator: Equal
          value: scylla-clusters
      resources:
        limits:
          cpu: "1"
          memory: 5G
        requests:
          cpu: "1"
          memory: 5G
      scyllaAgentConfig: scylla-agent-config
      scyllaConfig: scylla-config
      storage:
        capacity: 40G
      volumes:
      - name: scylla-manager-service-account
        secret:
          secretName: scylla-manager-service-account
  genericUpgrade:
    failureStrategy: Retry
    pollInterval: 1s
  network:
    hostNetworking: true
  repairs:
  - intensity: "2"
    interval: 2d
    name: cluster repair
    numRetries: 3
    parallel: 0
    smallTableThreshold: 1GiB
    startDate: now
  repository: scylladb/scylla
  sysctls:
  - fs.aio-max-nr=2097152
  version: 4.2.3
status:
  racks:
    us-central1-c:
      members: 3
      readyMembers: 3
      version: 4.2.3

Output of sctool task list -a

Cluster:  (ce142f17-ac1a-4f9d-9720-e1a850d78be7)
+-------------------------------------------------------------+-----------+-------------------------------+--------+
| Task                                                        | Arguments | Next run                      | Status |
+-------------------------------------------------------------+-----------+-------------------------------+--------+
| healthcheck/b7d01d2c-99cc-49fa-8fef-ec1c68d7e76b            |           | 05 Feb 21 18:08:59 UTC (+15s) | DONE   |
| healthcheck_alternator/bc54a2c1-8c29-4c34-a174-688229bd43d6 |           | 05 Feb 21 18:08:59 UTC (+15s) | DONE   |
| healthcheck_rest/c20146aa-e6f2-4128-9c35-22121b7d89c3       |           | 05 Feb 21 18:09:44 UTC (+1m)  | DONE   |
| repair/b455dd0d-7058-4b4b-9311-465e6890079f                 |           | 06 Feb 21 00:00:00 UTC (+7d)  | NEW    |
+-------------------------------------------------------------+-----------+-------------------------------+--------+

Environment:

Additional context

I'm using auth_token, but since sctool works I don't think that's relevant to this issue.

zimnx commented 3 years ago

Please attach the logs of each pod in the Scylla Manager namespace.
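
For example (namespace and pod names are placeholders; use whatever your Scylla Manager deployment runs in):

kubectl -n <scylla-manager-namespace> get pods
kubectl -n <scylla-manager-namespace> logs <pod-name> > <pod-name>.txt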

etix commented 3 years ago

Logs of the manager pods:

scylla-manager-0.txt scylla-manager-7ddd6648c9-6xdz9.txt scylla-operator-0.txt

Log of an agent running on one of the ScyllaDB nodes: scylla-manager-agent.txt

Notes:

etix commented 3 years ago

I'm still debugging, trying to understand what could be the issue here.

I know that the operator is monitoring the ScyllaCluster since I can scale down a rack.

{"L":"INFO","T":"2021-02-05T22:54:36.822Z","N":"cluster-controller","M":"Next Action: Scale-Down rack","cluster":"database/scylla","resourceVersion":"167461087","name":"us-central1-c","_trace_id":"lf6c3hgkTZycvi-ifW5q5Q"}

But when I try to add/remove/edit a backup, nothing seems to happen. The only lines I can get from the operator are:

{"L":"DEBUG","T":"2021-02-05T22:57:52.907Z","N":"cluster-controller","M":"Reconcile request","request":"database/scylla","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}
{"L":"INFO","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"Starting reconciliation...","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"Cluster State","cluster":"database/scylla","resourceVersion":"167462316","object":{"metadata":{"name":"scylla","namespace":"database","selfLink":"/apis/scylla.scylladb.com/v1/namespaces/database/scyllaclusters/scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","resourceVersion":"167462316","generation":17,"creationTimestamp":"2021-02-05T14:33:47Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"scylla","meta.helm.sh/release-namespace":"database"}},"spec":{"version":"4.2.3","repository":"scylladb/scylla","agentVersion":"2.2.1","agentRepository":"scylladb/scylla-manager-agent","genericUpgrade":{"failureStrategy":"Retry","pollInterval":"1s"},"datacenter":{"name":"us-central1","racks":[{"name":"us-central1-c","members":2,"storage":{"capacity":"40G"},"placement":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["us-central1-c"]}]}]}},"tolerations":[{"key":"role","operator":"Equal","value":"scylla-clusters","effect":"NoSchedule"}]},"resources":{"limits":{"cpu":"1","memory":"5G"},"requests":{"cpu":"1","memory":"5G"}},"agentResources":{},"volumes":[{"name":"scylla-manager-service-account","secret":{"secretName":"scylla-manager-service-account"}}],"agentVolumeMounts":[{"name":"scylla-manager-service-account","readOnly":true,"mountPath":"/var/run/secret/scylla/"}],"scyllaConfig":"scylla-config","scyllaAgentConfig":"scylla-agent-config"}]},"sysctls":["fs.aio-max-nr=2097152"],"network":{"hostNetworking":true},"repairs":[{"name":"test cluster repair","startDate":"now","interval":"2d","numRetries":3,"intensity":"2","parallel":0,"smallTableThreshold":"1GiB"}],"backups":[{"name":"test42","startDate":"now","interval":"1d","numRetries":3,"location":["gcs:scylla-backups"],"retention":3}]},"status":{"racks":{"us-central1-c":{"version":"4.2.3","members":2,"readyMembers":2}}}},"_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"All StatefulSets are up-to-date!","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.912Z","N":"cluster-controller","M":"Cleanup: service list","cluster":"database/scylla","resourceVersion":"167462316","len":3,"items":[{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-0","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-0","uid":"ffc0a206-dcc1-48b1-a68d-9ed7337f4d97","resourceVersion":"167284599","creationTimestamp":"2021-02-05T14:33:48Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scylla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c","scylla/seed":""},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-0"},"clusterIP":"10.7.248.241","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}},{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-1","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-1","uid":"e63836a7-a4a5-41c0-b5d3-7f08128007d9","resourceVersion":"167285913","creationTimestamp":"2021-02-05T14:37:22Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scylla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c","scylla/seed":""},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-1"},"clusterIP":"10.7.247.196","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}},{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-2","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-2","uid":"077ef56c-bc7a-438c-9643-ffa49283fb64","resourceVersion":"167461550","creationTimestamp":"2021-02-05T22:55:53Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scy
lla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c"},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-2"},"clusterIP":"10.7.247.150","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}}],"_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"INFO","T":"2021-02-05T22:57:52.913Z","N":"cluster-controller","M":"Calculating cluster status...","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"INFO","T":"2021-02-05T22:57:52.913Z","N":"cluster-controller","M":"Writing cluster status.","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}
{"L":"INFO","T":"2021-02-05T22:57:52.920Z","N":"cluster-controller","M":"Reconciliation successful","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}

I'm not sure what the expected output is here (if any), but it seems that no action is scheduled at all.

zimnx commented 3 years ago

You won't spot anything in the Operator logs regarding backups/repairs. There is a pod called "Scylla Manager Controller" which watches these fields and synchronizes them with the Scylla Manager state. You can read about it in our documentation: https://operator.docs.scylladb.com/stable/manager.html
Your backup was registered and ran successfully; see the Scylla Manager logs:

{"L":"INFO","T":"2021-02-05T16:44:13.701Z","N":"scheduler","M":"Task started","cluster_id":"ce142f17-ac1a-4f9d-9720-e1a850d78be7","task_type":"backup","task_id":"3495b4e7-ab35-47ce-b4cc-1a79d03ffd94","run_id":"4f856cf8-67d1-11eb-bee8-468ba01b8bf3","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:14.734Z","N":"backup.snapshot","M":"Taking snapshots...","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:17.406Z","N":"backup.upload","M":"Uploading snapshot files...","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:25.306Z","N":"scheduler","M":"Task ended","cluster_id":"ce142f17-ac1a-4f9d-9720-e1a850d78be7","task_type":"backup","task_id":"3495b4e7-ab35-47ce-b4cc-1a79d03ffd94","run_id":"4f856cf8-67d1-11eb-bee8-468ba01b8bf3","status":"DONE","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}

But since the scylla-manager-0 logs are almost empty, it looks like the resources allocated to this pod were too small and the pod is getting killed. Check the events in the Scylla Manager namespace.
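
For example (the namespace is a placeholder for wherever Scylla Manager runs):

kubectl -n <scylla-manager-namespace> get events --sort-by=.lastTimestamp
kubectl -n <scylla-manager-namespace> describe pod scylla-manager-0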

etix commented 3 years ago

Those lines are the result of a manual backup task made using sctool directly.

During the lifetime of the pods I did a manual backup from sctool.

What doesn't work are the tasks declared directly in the YAML; these are ignored. In the logs I provided earlier I made at least four changes to the backups, and not a single task was created or started.

As for the scylla-manager-0 pod, it is still running since I started it a few days ago, and no new log line has appeared. It hasn't been killed either.

etix commented 3 years ago

I made some progress based on your comment. After focusing on the manager, I changed the logLevel of scylla-manager to debug, tried to add a new backup rule to the cluster, and this is what I got from scylla-manager-0:

{"L":"DEBUG","T":"2021-02-08T14:46:09.827Z","N":"scylla-manager-controller","M":"ignoring reconcile","cluster":"database/scylla"}

I guess that's not the expected behavior?
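
For reference, I raised the log level with something like the following (assuming the scylla-manager chart exposes a logLevel value; the release and chart names match my setup):

helm upgrade scylla-manager scylla/scylla-manager \
  --namespace database \
  --set logLevel=debug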

zimnx commented 3 years ago

Do you deploy Scylla Manager and your Scylla cluster in the same namespace?

etix commented 3 years ago

@zimnx yes I do, in a "database" namespace. Does the manager ignore events from its own namespace?

zimnx commented 3 years ago

So that's the reason. Because SM also uses Scylla as its internal database, Scylla Clusters deployed in the same namespace are ignored. This filter should either be made stricter or perhaps removed. As a workaround you can deploy SM in a different namespace than your Scylla Cluster.
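
A sketch of that workaround with the Helm charts (release and chart names are assumptions; adjust to how you installed things):

kubectl create namespace scylla-manager
helm install scylla-manager scylla/scylla-manager --namespace scylla-manager

Your ScyllaCluster stays where it is; only Scylla Manager (and its internal Scylla) moves to the new namespace.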

etix commented 3 years ago

Deploying scylla-manager in its own namespace did the trick. Thanks @zimnx.

Cluster: database/scylla (90651744-9f03-4c5e-9e8c-fbf6cb51349d)
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+
| Task                                                        | Arguments                                                  | Next run                      | Status |
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+
| healthcheck/ea1b2f3b-3b93-4e8b-851c-f46959b79e58            |                                                            | 08 Feb 21 15:21:08 UTC (+15s) | DONE   |
| healthcheck_alternator/149f84f6-294a-40b2-9561-9dd94ae89f34 |                                                            | 08 Feb 21 15:21:08 UTC (+15s) | DONE   |
| healthcheck_rest/268a3ea6-3a08-43bf-a84e-1c5221a229dc       |                                                            | 08 Feb 21 15:31:08 UTC (+1m)  | NEW    |
| repair/42b96cb1-cb93-4eeb-a5d3-d9d390872986                 | --intensity 2 --parallel 0 --small-table-threshold 1.00GiB | 10 Feb 21 15:19:39 UTC (+2d)  | DONE   |
| backup/4a8d2de5-5367-4a74-8f19-2723563beee1                 | -L gcs:scylla-backups --retention 3                        | 11 Feb 21 15:19:39 UTC (+3d)  | DONE   |
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+

To avoid other people wasting their time debugging the same issue, I think it would be a nice improvement to add an annotation to the manager's internal Scylla cluster and ignore it based on that, instead of ignoring events from the whole namespace.
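
Purely as an illustration of that idea (both the annotation key and the internal cluster name below are hypothetical, not something the operator supports today), the manager's own cluster could be marked and the controller could skip anything carrying the marker instead of filtering by namespace:

kubectl -n scylla-manager annotate scyllacluster scylla-manager-cluster \
  scylla-operator.scylladb.com/skip-management="true"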