scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0
339 stars 175 forks

Errors like `alternator: get node info: no host config available` and `CQL: no host config available` when running `sctools status` after an update #2016

Open gdubicki opened 4 months ago

gdubicki commented 4 months ago

What happened?

After updating Scylla from 5.2.9 to 5.4.7, Scylla Operator from 1.9.x to 1.12.2 (the latest version that supports both Scylla 5.2.x and 5.4.x), and Scylla Manager from 3.1.x to 3.2.8, we started to observe that `sctool status` no longer provides all the node info and returns errors:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla/scylla
Datacenter: XXX
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST     | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.241.130 | -      | -    | -      | -      | -     | 8a24c600-5525-490e-a3cd-314f6062d6a1 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (6ms) | 10.7.241.174 | -      | -    | -      | -      | -     | f14fcd59-8d90-4d8e-af22-ace87ceced22 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.241.175 | -      | -    | -      | -      | -     | 050dcc67-7bb8-4d5d-89b1-5dbe0bcbb8b2 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (5ms) | 10.7.243.109 | -      | -    | -      | -      | -     | 4a3ff045-bba2-4537-a4d7-a213d25ae713 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.248.124 | -      | -    | -      | -      | -     | 028023f5-9d4e-404c-8537-467ac3d4538c |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.249.238 | -      | -    | -      | -      | -     | b8f68c62-c462-4a30-a505-5ece9ae1ab0b |
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.252.229 | -      | -    | -      | -      | -     | 1ff1b8df-7a90-4321-a309-7cd69e20bd70 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
- 10.7.241.174 alternator: get node info: no host config available
- 10.7.241.174 CQL: no host config available
- 10.7.241.175 alternator: get node info: no host config available
- 10.7.241.175 CQL: no host config available
- 10.7.243.109 alternator: get node info: no host config available
- 10.7.243.109 CQL: no host config available
- 10.7.248.124 alternator: get node info: no host config available
- 10.7.248.124 CQL: no host config available
- 10.7.249.238 alternator: get node info: no host config available
- 10.7.249.238 CQL: no host config available
- 10.7.252.229 alternator: get node info: no host config available
- 10.7.252.229 CQL: no host config available

Note that our scylla.yaml didn't have any config for TLS up to that point.

This problem has been worked around by setting this:

client_encryption_options:
  optional: true
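For context, with the operator this kind of override is usually delivered as a scylla.yaml fragment in a user-managed ConfigMap. A minimal sketch of how the workaround above might be packaged, assuming the operator's custom-configuration convention of a ConfigMap named scylla-config in the cluster namespace (verify the name against your setup and the operator docs):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scylla-config   # name the operator merges from; assumption, check your install
  namespace: scylla
data:
  scylla.yaml: |
    client_encryption_options:
      optional: true
```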

However, we still have a problem with the Scylla Manager's cluster:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla-manager/scylla-manager
Datacenter: manager-dc
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST      | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (92ms) | 10.7.255.190 | -      | -    | -      | -      | -     | 8ec8a729-8225-4278-a9da-ad0f23f47e01 |
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.255.190 alternator: get node info: no host config available
- 10.7.255.190 CQL: no host config available

...and it seems to have only a generated ConfigMap, named scylla-managed-config:

apiVersion: v1
data:
  scylladb-managed-config.yaml: |
    cluster_name: "scylla"
    rpc_address: "0.0.0.0"
    endpoint_snitch: "GossipingPropertyFileSnitch"
    internode_compression: "all"
    native_transport_port_ssl: 9142
    native_shard_aware_transport_port_ssl: 19142
    client_encryption_options:
      enabled: true
      optional: false
      certificate: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/serving-certs/tls.crt"
      keyfile: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/serving-certs/tls.key"
      require_client_auth: true
      truststore: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/client-ca/tls.crt"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: scylla
    meta.helm.sh/release-namespace: scylla
    scylla-operator.scylladb.com/managed-hash: <redacted>
  creationTimestamp: "<redacted>"
  labels:
    app.kubernetes.io/managed-by: Helm
    scylla/cluster: scylla
  name: scylla-managed-config
  namespace: scylla
  ownerReferences:
  - apiVersion: scylla.scylladb.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ScyllaCluster
    name: scylla
    uid: <redacted>
  resourceVersion: "<redacted>"
  uid: <redacted>

...and I can't find anything about modifying it in https://operator.docs.scylladb.com/stable/helm.html...

Since then we have updated Scylla to 5.4.9, Operator to 1.13.0, and Manager to 3.3.0 but it did not help.

What did you expect to happen?

sctool status should work without errors for both the main cluster and Scylla Manager's cluster after an update.

I shouldn't have to reconfigure TLS, as the defaults shown in https://github.com/scylladb/scylladb/blob/scylla-5.4.7/conf/scylla.yaml#L472-L474 say that it should be disabled.
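For reference, the upstream defaults linked above amount to client encryption being off; roughly (paraphrased from the linked scylla.yaml, check the pinned file for the exact wording):

```yaml
# Upstream scylla.yaml default (5.4.x): client-to-node encryption disabled.
client_encryption_options:
    enabled: false
```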

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up the older versions mentioned above
  2. Use this scylla.yaml, as we had before:

     read_request_timeout_in_ms: 5000
     write_request_timeout_in_ms: 2000
     cas_contention_timeout_in_ms: 1000
     consistent_cluster_management: true

  3. Update to the newer versions mentioned above
  4. Check `sctool status`

### Scylla Operator version

1.13.0

### Kubernetes platform name and version

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-gke.1192000

### Please attach the must-gather archive.

[scylla-operator-must-gather-77t6kvnghzss.zip](https://github.com/user-attachments/files/16196482/scylla-operator-must-gather-77t6kvnghzss.zip)

### Anything else we need to know?

The must-gather archive has been anonymized additionally by me manually, see https://github.com/scylladb/scylla-operator/issues/2015.

This problem was originally reported in https://github.com/scylladb/scylla-manager/issues/3889, but that issue was about a (probably?) different problem, so I was asked to create a new one.
gdubicki commented 4 months ago

I am also seeing this in the scylladb-api-status-probe container logs of the Scylla pod:

I0712 14:17:47.251251       1 operator/cmd.go:21] maxprocs: Leaving GOMAXPROCS=[1]: CPU quota undefined
I0712 14:17:47.251718       1 probeserver/scylladbapistatus.go:133] scylladb-api-status version "v1.13.0-rc.0-2-g7f37771"
I0712 14:17:47.251740       1 flag/flags.go:64] FLAG: --address=""
I0712 14:17:47.251749       1 flag/flags.go:64] FLAG: --burst="75"
I0712 14:17:47.251754       1 flag/flags.go:64] FLAG: --feature-gates=""
I0712 14:17:47.251758       1 flag/flags.go:64] FLAG: --help="false"
I0712 14:17:47.251762       1 flag/flags.go:64] FLAG: --kubeconfig=""
I0712 14:17:47.251764       1 flag/flags.go:64] FLAG: --loglevel="2"
I0712 14:17:47.251767       1 flag/flags.go:64] FLAG: --namespace="scylla"
I0712 14:17:47.251770       1 flag/flags.go:64] FLAG: --port="8080"
I0712 14:17:47.251773       1 flag/flags.go:64] FLAG: --qps="50"
I0712 14:17:47.251777       1 flag/flags.go:64] FLAG: --service-name="scylla-us-west1-us-west1-b-0"
I0712 14:17:47.251780       1 flag/flags.go:64] FLAG: --v="2"
I0712 14:17:47.252016       1 cache/shared_informer.go:311] Waiting for caches to sync for Prober
I0712 14:17:47.258338       1 cache/reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229
I0712 14:17:47.353007       1 cache/shared_informer.go:318] Caches are synced for Prober
I0712 14:17:47.353249       1 probeserver/serveprobes.go:78] "Starting probe server" Address=":8080"
E0712 14:17:55.645952       1 scylladbapistatus/prober.go:82] "readyz probe: can't get scylla node status" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0"
E0712 14:18:05.073105       1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"
E0712 14:18:14.999478       1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"

...and in the scylla-manager-agent container logs, occasionally this:

{"L":"INFO","T":"2024-07-12T16:01:53.797Z","M":"http: TLS handshake error from 10.6.241.80:48086: EOF"}
{"L":"INFO","T":"2024-07-12T16:02:05.297Z","M":"http: TLS handshake error from 10.6.241.80:49746: read tcp 10.138.0.93:10001->10.6.241.80:49746: read: connection reset by peer"}

This is for both the main cluster and the Scylla Manager's cluster, although the former has the workaround applied.

gdubicki commented 4 months ago

Also in the scylla-operator Deployment I am seeing this in the logs:

I0712 14:14:38.692544       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:14:48.696285       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:14:48.709939       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
E0712 14:14:52.702896       1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-2" not found
E0712 14:14:52.741036       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.746390       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.756784       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
I0712 14:14:52.777037       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 created"
I0712 14:15:03.717394       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 updated"
I0712 14:15:08.716438       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:08.725014       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:18.729079       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:18.743043       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:15:28.746705       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:28.755023       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:58.765715       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:58.773584       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:16:16.164708       1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-1" not found
E0712 14:16:16.205687       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.210974       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.221368       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
I0712 14:16:16.241720       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 created"
I0712 14:16:28.783327       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:16:28.797779       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:16:29.192472       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 updated"
I0712 14:17:18.817963       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:18.826596       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:17:34.627808       1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-0" not found
E0712 14:17:34.675797       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.681062       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.691344       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.711688       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
I0712 14:17:34.755596       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 created"
I0712 14:17:48.651971       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 updated"
I0712 14:17:48.834780       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:48.849950       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:18:38.865922       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
gdubicki commented 4 months ago

We are also seeing disk usage growing constantly on all the nodes since the update, even though our cluster usage has not changed; apart from that, the cluster itself seems to be working rather normally.

(I reported this issue separately at https://github.com/scylladb/scylladb/issues/19793 as I don't think it's related to this one.)

tnozicka commented 4 months ago

Scylla Operator from 1.9.x to 1.12.2

scylla operator only supports n+1 upgrades, otherwise you may miss a migration step

Alternator should be configured through the API, see:

Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.

gdubicki commented 4 months ago

Scylla Operator from 1.9.x to 1.12.2

scylla operator only supports n+1 upgrades, otherwise you may miss a migration step

Oh, got it now, but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html...

But it's a fact that I forgot about CRD updates completely. 😞

Alternator should be configured through the API, see:

* https://operator.docs.scylladb.com/stable/clients/alternator.html

* https://operator.docs.scylladb.com/stable/api-reference/groups/scylla.scylladb.com/scyllaclusters.html#api-scylla-scylladb-com-scyllaclusters-v1-spec-alternator

Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.

We are not using Alternator.

gdubicki commented 4 months ago

How to fix this now, @tnozicka? Should I apply the CRDs from each version (1.10.x, 1.11.x, ..., 1.13.x), as documented in the 2nd step of https://operator.docs.scylladb.com/stable/upgrade.html#upgrade-via-helm?

tnozicka commented 4 months ago

Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html

It only shows the X.Y.Z to X.Y+1.Z upgrades https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0, but I thought we had it somewhere generically too

How to fix this now

Roll back the operator deployment manifest and image to where it started and follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump)
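The stepwise procedure described above can be sketched as a dry-run script. The chart/repo names and the exact patch versions below are assumptions (substitute your own); the script only prints the commands, so it is safe to run as-is:

```shell
#!/bin/sh
# Sketch of a "one minor version at a time" operator upgrade.
# Versions and chart names are illustrative assumptions.
set -eu

print_upgrade_plan() {
  for version in 1.10.5 1.11.3 1.12.2 1.13.0; do
    echo "== operator ${version} =="
    # Per the Helm upgrade guide: apply that release's CRDs first,
    # then upgrade the chart, then wait for the rollout to finish.
    echo "kubectl apply --server-side -f <crds-for-v${version}.yaml>"
    echo "helm upgrade scylla-operator scylla/scylla-operator --version ${version} -n scylla-operator"
    echo "kubectl -n scylla-operator rollout status deployment/scylla-operator"
  done
}

print_upgrade_plan
```

Replace the echoed commands with real invocations only after confirming each version's CRD manifests against the upgrade guide.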

gdubicki commented 3 months ago

Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html

It only shows the X.Y.Z to X.Y+1.Z upgrades https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0, but I thought we had it somewhere generically too

Oh, you were right, in https://operator.docs.scylladb.com/stable/upgrade.html#upgrade-via-helm there is a step with the CRD updates. 🤦‍♂️ Sorry!

gdubicki commented 3 months ago

How to fix this now

Roll back the operator deployment manifest and image to where it started and follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump)

We did this but I am still seeing:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla-manager/scylla-manager
Datacenter: manager-dc
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST     | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.255.190 | -      | -    | -      | -      | -     | 8ec8a729-8225-4278-a9da-ad0f23f47e01 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.255.190 alternator: get node info: no host config available
- 10.7.255.190 CQL: no host config available

What's next?

gdubicki commented 2 months ago

We have updated the Scylla Manager to 3.3.1 and we are still having this problem.

I don't really care that much about the ugly output of sctool status for the Scylla Manager's cluster, but we also see that updates to the Scylla Manager cluster's task configs are not working, perhaps because of this.

The relevant logs from the scylla-manager-controller Deployment:

E0827 09:34:57.898551 1 manager/controller.go:154] syncing key 'scylla-manager/scylla-manager' failed: can't execute action: can't update task "manager-daily-backup": [PUT /cluster/{cluster_id}/task/{task_type}/{task_id}][404] PutClusterClusterIDTaskTaskTypeTaskID default &{Details: Message:get resource: create backup target: create cluster session: TLS/SSL key/cert is not registered: not found TraceID:s-j603PPTLC2kyO2xXY6hA}
E0827 09:34:58.328620 1 manager/sync.go:136] "Failed to execute action" err="can't update task \"manager-daily-backup\": [PUT /cluster/{cluster_id}/task/{task_type}/{task_id}][404] PutClusterClusterIDTaskTaskTypeTaskID default &{Details: Message:get resource: create backup target: create cluster session: TLS/SSL key/cert is not registered: not found TraceID:Ii-0uK49T3-K70PTOVmD5Q}" action="update task &{ClusterID: Enabled:true ID:0db86eed-6ec-4aa2-879d-05e1b84fb428 Name:manager-daily-backup Properties:map[dc:[manager-dc] location:[gcs:fetlife-scylla-manager-backups] retention:7] Schedule:0xc000213dc0 Tags:[] Type:backup}"

gdubicki commented 2 months ago

The backups themselves are not working either:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool tasks --cluster scylla-manager/scylla-manager
+------------------------------+--------+----------+--------+----------+---------+-------+------------------------+------------------------+--------+------------------------+
| Task                         | Labels | Schedule | Window | Timezone | Success | Error | Last Success           | Last Error             | Status | Next                   |
+------------------------------+--------+----------+--------+----------+---------+-------+------------------------+------------------------+--------+------------------------+
| backup/manager-daily-backup  |        | 1d       |        |          | 658     | 60    | 12 Jul 24 11:00:23 UTC | 01 Sep 24 11:00:00 UTC | ERROR  | 02 Sep 24 11:00:00 UTC |
| healthcheck/rest             |        | 1m       |        |          | 1093493 | 0     | 02 Sep 24 08:28:56 UTC |                        | DONE   | 02 Sep 24 08:29:56 UTC |
| healthcheck/alternator       |        | 15s      |        |          | 4373968 | 1     | 02 Sep 24 08:29:26 UTC | 17 Apr 23 02:15:41 UTC | DONE   | 02 Sep 24 08:29:41 UTC |
| healthcheck/cql              |        | 15s      |        |          | 4373936 | 1     | 02 Sep 24 08:29:26 UTC | 17 Apr 23 02:15:41 UTC | DONE   | 02 Sep 24 08:29:41 UTC |
| repair/manager-weekly-repair |        | 7d       |        |          | 101     | 0     | 31 Aug 24 11:30:02 UTC |                        | DONE   | 07 Sep 24 11:30:00 UTC |
+------------------------------+--------+----------+--------+----------+---------+-------+------------------------+------------------------+--------+------------------------+
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool progress --cluster scylla-manager/scylla-manager backup/manager-daily-backup
Run:        550809c3-6851-11ef-a3b5-b2c3114a5b19
Status:     ERROR (initialising)
Cause:      get backup target: create cluster session: TLS/SSL key/cert is not registered: not found
Start time: 01 Sep 24 11:00:00 UTC
End time:   01 Sep 24 11:00:00 UTC
Duration:   0s
Progress:   -
scylla-operator-bot[bot] commented 1 month ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle stale

scylla-operator-bot[bot] commented 3 weeks ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle rotten