Please attach logs of each pod in the Scylla Manager namespace.
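For example, per pod (sketch only; the namespace is wherever Scylla Manager is deployed):
kubectl -n <scylla-manager-namespace> logs <pod-name>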
Logs of the manager pods:
scylla-manager-0.txt scylla-manager-7ddd6648c9-6xdz9.txt scylla-operator-0.txt
Log of an agent running on one of the ScyllaDB nodes: scylla-manager-agent.txt
Notes:
I'm still debugging, trying to understand what could be the issue here.
I know that the operator is monitoring the ScyllaCluster since I can scale down a rack.
{"L":"INFO","T":"2021-02-05T22:54:36.822Z","N":"cluster-controller","M":"Next Action: Scale-Down rack","cluster":"database/scylla","resourceVersion":"167461087","name":"us-central1-c","_trace_id":"lf6c3hgkTZycvi-ifW5q5Q"}
But when I try to add/remove/edit a backup, nothing seems to happen. The only lines I can get from the operator are:
{"L":"DEBUG","T":"2021-02-05T22:57:52.907Z","N":"cluster-controller","M":"Reconcile request","request":"database/scylla","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}
{"L":"INFO","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"Starting reconciliation...","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"Cluster State","cluster":"database/scylla","resourceVersion":"167462316","object":{"metadata":{"name":"scylla","namespace":"database","selfLink":"/apis/scylla.scylladb.com/v1/namespaces/database/scyllaclusters/scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","resourceVersion":"167462316","generation":17,"creationTimestamp":"2021-02-05T14:33:47Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"scylla","meta.helm.sh/release-namespace":"database"}},"spec":{"version":"4.2.3","repository":"scylladb/scylla","agentVersion":"2.2.1","agentRepository":"scylladb/scylla-manager-agent","genericUpgrade":{"failureStrategy":"Retry","pollInterval":"1s"},"datacenter":{"name":"us-central1","racks":[{"name":"us-central1-c","members":2,"storage":{"capacity":"40G"},"placement":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["us-central1-c"]}]}]}},"tolerations":[{"key":"role","operator":"Equal","value":"scylla-clusters","effect":"NoSchedule"}]},"resources":{"limits":{"cpu":"1","memory":"5G"},"requests":{"cpu":"1","memory":"5G"}},"agentResources":{},"volumes":[{"name":"scylla-manager-service-account","secret":{"secretName":"scylla-manager-service-account"}}],"agentVolumeMounts":[{"name":"scylla-manager-service-account","readOnly":true,"mountPath":"/var/run/secret/scylla/"}],"scyllaConfig":"scylla-config","scyllaAgentConfig":"scylla-agent-config"}]},"sysctls":["fs.aio-max-nr=2097152"],"network":{"hostNetworking":true},"repairs":[{"name":"test cluster repair","startDate":"now","interval":"2d","numRetries":3,"intensity":"2","parallel":0,"smallTableThreshold":"1GiB"}],"backups":[{"name":"test42","startDate":"now","interval":"1d","numRetries":3,"location":["gcs:scylla-backups"],"retention":3}]},"status":{"racks":{"us-central1-c":{"version":"4.2.3","members":2,"readyMembers":2}}}},"_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.911Z","N":"cluster-controller","M":"All StatefulSets are up-to-date!","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"DEBUG","T":"2021-02-05T22:57:52.912Z","N":"cluster-controller","M":"Cleanup: service list","cluster":"database/scylla","resourceVersion":"167462316","len":3,"items":[{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-0","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-0","uid":"ffc0a206-dcc1-48b1-a68d-9ed7337f4d97","resourceVersion":"167284599","creationTimestamp":"2021-02-05T14:33:48Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scylla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c","scylla/seed":""},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-0"},"clusterIP":"10.7.248.241","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}},{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-1","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-1","uid":"e63836a7-a4a5-41c0-b5d3-7f08128007d9","resourceVersion":"167285913","creationTimestamp":"2021-02-05T14:37:22Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scylla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c","scylla/seed":""},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-1"},"clusterIP":"10.7.247.196","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}},{"kind":"Service","apiVersion":"v1","metadata":{"name":"scylla-us-central1-us-central1-c-2","namespace":"database","selfLink":"/api/v1/namespaces/database/services/scylla-us-central1-us-central1-c-2","uid":"077ef56c-bc7a-438c-9643-ffa49283fb64","resourceVersion":"167461550","creationTimestamp":"2021-02-05T22:55:53Z","labels":{"app":"scylla","app.kubernetes.io/managed-by":"scylla-operator","app.kubernetes.io/name":"scylla","scylla/cluster":"scy
lla","scylla/datacenter":"us-central1","scylla/rack":"us-central1-c"},"ownerReferences":[{"apiVersion":"scylla.scylladb.com/v1","kind":"ScyllaCluster","name":"scylla","uid":"7f6d1afe-dcc2-4e6f-9a0f-e198fc43d8d6","controller":true,"blockOwnerDeletion":true}]},"spec":{"ports":[{"name":"inter-node-communication","protocol":"TCP","port":7000,"targetPort":7000},{"name":"ssl-inter-node-communication","protocol":"TCP","port":7001,"targetPort":7001},{"name":"jmx-monitoring","protocol":"TCP","port":7199,"targetPort":7199},{"name":"agent-api","protocol":"TCP","port":10001,"targetPort":10001},{"name":"cql","protocol":"TCP","port":9042,"targetPort":9042},{"name":"cql-ssl","protocol":"TCP","port":9142,"targetPort":9142},{"name":"thrift","protocol":"TCP","port":9160,"targetPort":9160}],"selector":{"statefulset.kubernetes.io/pod-name":"scylla-us-central1-us-central1-c-2"},"clusterIP":"10.7.247.150","type":"ClusterIP","sessionAffinity":"None","publishNotReadyAddresses":true},"status":{"loadBalancer":{}}}],"_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"INFO","T":"2021-02-05T22:57:52.913Z","N":"cluster-controller","M":"Calculating cluster status...","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"SUmka5abT1-No6s2JxZ1nA"}
{"L":"INFO","T":"2021-02-05T22:57:52.913Z","N":"cluster-controller","M":"Writing cluster status.","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}
{"L":"INFO","T":"2021-02-05T22:57:52.920Z","N":"cluster-controller","M":"Reconciliation successful","cluster":"database/scylla","resourceVersion":"167462316","_trace_id":"yAjjrAPxTeq7mXrxROB0nQ"}
I'm not sure what the expected output is here (if any), but it seems that no action is scheduled at all.
You won't spot anything in the Operator logs regarding backups/repairs. There is a pod called "Scylla Manager Controller" which watches these fields and synchronizes them with the Scylla Manager state. You can read about it in our documentation: https://operator.docs.scylladb.com/stable/manager.html Your backup was registered and ran successfully, see the Scylla Manager logs:
{"L":"INFO","T":"2021-02-05T16:44:13.701Z","N":"scheduler","M":"Task started","cluster_id":"ce142f17-ac1a-4f9d-9720-e1a850d78be7","task_type":"backup","task_id":"3495b4e7-ab35-47ce-b4cc-1a79d03ffd94","run_id":"4f856cf8-67d1-11eb-bee8-468ba01b8bf3","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:14.734Z","N":"backup.snapshot","M":"Taking snapshots...","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:17.406Z","N":"backup.upload","M":"Uploading snapshot files...","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
[...]
{"L":"INFO","T":"2021-02-05T16:44:25.306Z","N":"scheduler","M":"Task ended","cluster_id":"ce142f17-ac1a-4f9d-9720-e1a850d78be7","task_type":"backup","task_id":"3495b4e7-ab35-47ce-b4cc-1a79d03ffd94","run_id":"4f856cf8-67d1-11eb-bee8-468ba01b8bf3","status":"DONE","_trace_id":"ieQE4_2eTRKMjgDv5yfmLw"}
But since the scylla-manager-0 logs are almost empty, it looks like the resources allocated to this pod were too small and the pod is getting killed. Check the events in the Scylla Manager namespace.
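For example (sketch only; the namespace is wherever Scylla Manager is deployed):
kubectl get events -n <scylla-manager-namespace> --sort-by=.lastTimestamp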
Those lines are the result of a manual backup task made using sctool directly.
During the lifetime of the pods I did a manual backup from sctool.
What doesn't work are the tasks declared directly in the yaml; these are ignored. In the logs I provided earlier I made at least 4 changes to the backups and not a single task was created or started.
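For reference, this is the relevant part of the ScyllaCluster spec that appears to be ignored (extracted from the cluster state dump above, rendered here as yaml):
repairs:
  - name: "test cluster repair"
    startDate: now
    interval: 2d
    numRetries: 3
    intensity: "2"
    parallel: 0
    smallTableThreshold: 1GiB
backups:
  - name: test42
    startDate: now
    interval: 1d
    numRetries: 3
    location:
      - gcs:scylla-backups
    retention: 3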
About the pod scylla-manager-0: it is still running since I started it a few days ago and no new line has appeared. It hasn't been killed either.
I made some progress based on your comment. After focusing on the manager, I changed the logLevel of scylla-manager to debug, tried to add a new backup rule to the cluster, and this is what I got from scylla-manager-0:
{"L":"DEBUG","T":"2021-02-08T14:46:09.827Z","N":"scylla-manager-controller","M":"ignoring reconcile","cluster":"database/scylla"}
I guess that's not the expected behavior?
Do you deploy Scylla Manager and your Scylla cluster in the same namespace?
@zimnx yes I do, in a "database" namespace. Does the manager ignore events from its own namespace?
So that's the reason. Because SM also uses Scylla as its internal database, Scylla Clusters deployed in the same namespace are ignored. This filter should be either stricter or perhaps removed. As a workaround you can deploy SM in a different namespace than your Scylla Cluster.
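For example, something along these lines should work with the standard ScyllaDB Helm charts (chart and namespace names are assumptions):
helm install scylla-manager scylla/scylla-manager --create-namespace --namespace scylla-manager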
Deploying scylla-manager in its own namespace did the trick. Thanks @zimnx. The tasks declared in the yaml now show up in sctool task list -a:
Cluster: database/scylla (90651744-9f03-4c5e-9e8c-fbf6cb51349d)
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+
| Task | Arguments | Next run | Status |
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+
| healthcheck/ea1b2f3b-3b93-4e8b-851c-f46959b79e58 | | 08 Feb 21 15:21:08 UTC (+15s) | DONE |
| healthcheck_alternator/149f84f6-294a-40b2-9561-9dd94ae89f34 | | 08 Feb 21 15:21:08 UTC (+15s) | DONE |
| healthcheck_rest/268a3ea6-3a08-43bf-a84e-1c5221a229dc | | 08 Feb 21 15:31:08 UTC (+1m) | NEW |
| repair/42b96cb1-cb93-4eeb-a5d3-d9d390872986 | --intensity 2 --parallel 0 --small-table-threshold 1.00GiB | 10 Feb 21 15:19:39 UTC (+2d) | DONE |
| backup/4a8d2de5-5367-4a74-8f19-2723563beee1 | -L gcs:scylla-backups --retention 3 | 11 Feb 21 15:19:39 UTC (+3d) | DONE |
+-------------------------------------------------------------+------------------------------------------------------------+-------------------------------+--------+
To avoid other people wasting time debugging the same issue, I think it would be a nice improvement to add an annotation to the manager's internal Scylla cluster and ignore it based on that, instead of ignoring events from the whole namespace.
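Something along these lines on the manager's internal ScyllaCluster, where the annotation key is purely hypothetical:
metadata:
  annotations:
    # hypothetical marker the chart/operator could set on the manager's own cluster
    scylla-operator.scylladb.com/skip-manager-sync: "true"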
Describe the bug
Backups and repairs declared in the yaml are not working or are being ignored. I see no relevant log in the operator, the agent or the manager after applying the update. sctool does not report the newly added backup or repair operation. However, I can start a backup or repair manually right from sctool (from within a scylla-manager pod) and it works as expected; I can see the backed-up files in the GCS bucket.

To Reproduce
Steps to reproduce the behavior:
Apply a values.yaml override of the scylla cluster that declares the backup/repair tasks (an example command is sketched after this section).

Expected behavior
The declared tasks show up in sctool task list -a.
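A minimal sketch of how the override is applied, assuming the release name (scylla), namespace (database), and chart (scylla/scylla) suggested by the Helm metadata in the cluster state above:
helm upgrade scylla scylla/scylla -n database -f values.yaml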
Config Files
Cluster configuration (helm)

Output of kubectl -n database get scyllacluster scylla -o yaml after applying the update.

Output of sctool task list -a
Environment:
Additional context
I'm using auth_token, but since I can use sctool I think it's not relevant to this issue.