Open gdubicki opened 4 months ago
I am also seeing this in the scylladb-api-status-probe container logs of the Scylla pod:
I0712 14:17:47.251251 1 operator/cmd.go:21] maxprocs: Leaving GOMAXPROCS=[1]: CPU quota undefined
I0712 14:17:47.251718 1 probeserver/scylladbapistatus.go:133] scylladb-api-status version "v1.13.0-rc.0-2-g7f37771"
I0712 14:17:47.251740 1 flag/flags.go:64] FLAG: --address=""
I0712 14:17:47.251749 1 flag/flags.go:64] FLAG: --burst="75"
I0712 14:17:47.251754 1 flag/flags.go:64] FLAG: --feature-gates=""
I0712 14:17:47.251758 1 flag/flags.go:64] FLAG: --help="false"
I0712 14:17:47.251762 1 flag/flags.go:64] FLAG: --kubeconfig=""
I0712 14:17:47.251764 1 flag/flags.go:64] FLAG: --loglevel="2"
I0712 14:17:47.251767 1 flag/flags.go:64] FLAG: --namespace="scylla"
I0712 14:17:47.251770 1 flag/flags.go:64] FLAG: --port="8080"
I0712 14:17:47.251773 1 flag/flags.go:64] FLAG: --qps="50"
I0712 14:17:47.251777 1 flag/flags.go:64] FLAG: --service-name="scylla-us-west1-us-west1-b-0"
I0712 14:17:47.251780 1 flag/flags.go:64] FLAG: --v="2"
I0712 14:17:47.252016 1 cache/shared_informer.go:311] Waiting for caches to sync for Prober
I0712 14:17:47.258338 1 cache/reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229
I0712 14:17:47.353007 1 cache/shared_informer.go:318] Caches are synced for Prober
I0712 14:17:47.353249 1 probeserver/serveprobes.go:78] "Starting probe server" Address=":8080"
E0712 14:17:55.645952 1 scylladbapistatus/prober.go:82] "readyz probe: can't get scylla node status" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0"
E0712 14:18:05.073105 1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"
E0712 14:18:14.999478 1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"
...and occasionally this in the scylla-manager-agent container logs:
{"L":"INFO","T":"2024-07-12T16:01:53.797Z","M":"http: TLS handshake error from 10.6.241.80:48086: EOF"}
{"L":"INFO","T":"2024-07-12T16:02:05.297Z","M":"http: TLS handshake error from 10.6.241.80:49746: read tcp 10.138.0.93:10001->10.6.241.80:49746: read: connection reset by peer"}
This is for both the main cluster and the Scylla Manager's cluster, although the former has the workaround applied.
I am also seeing this in the logs of the scylla-operator Deployment:
I0712 14:14:38.692544 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:14:48.696285 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:14:48.709939 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
E0712 14:14:52.702896 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-2" not found
E0712 14:14:52.741036 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.746390 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.756784 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
I0712 14:14:52.777037 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 created"
I0712 14:15:03.717394 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 updated"
I0712 14:15:08.716438 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:08.725014 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:18.729079 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:18.743043 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:15:28.746705 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:28.755023 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:58.765715 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:58.773584 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:16:16.164708 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-1" not found
E0712 14:16:16.205687 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.210974 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.221368 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
I0712 14:16:16.241720 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 created"
I0712 14:16:28.783327 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:16:28.797779 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:16:29.192472 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 updated"
I0712 14:17:18.817963 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:18.826596 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:17:34.627808 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-0" not found
E0712 14:17:34.675797 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.681062 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.691344 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.711688 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
I0712 14:17:34.755596 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 created"
I0712 14:17:48.651971 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 updated"
I0712 14:17:48.834780 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:48.849950 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:18:38.865922 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
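Operator logs like the ones above can be summarised by counting error lines per source location. The following is a minimal sketch, assuming the standard klog header layout (severity letter, MMDD date, time, thread id, `file:line]`); the helper name `count_errors_by_source` is hypothetical, and the sample lines are shortened copies of entries from the log above.

```python
import re
from collections import Counter

# klog header: severity letter, MMDD, time, thread id, "file:line]", then the message.
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)


def count_errors_by_source(lines):
    """Count E-level klog lines per source location (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = KLOG_RE.match(line.strip())
        if match and match.group("sev") == "E":
            counts[match.group("src")] += 1
    return counts


# Shortened sample lines taken from the operator log above.
sample = [
    'I0712 14:14:48.709939 1 scyllacluster/status.go:36] "Status updated"',
    'E0712 14:14:52.702896 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-2" not found',
    "E0712 14:14:52.741036 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed",
]

for source, count in count_errors_by_source(sample).most_common():
    print(count, source)
```

Feeding it the full output of the operator Deployment's logs should make repeated failures, such as the nodeconfigpod sync errors, stand out from the routine status updates.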
Since the update, disk usage has been growing constantly on all the nodes, although our cluster usage has not changed. Apart from that, the cluster itself seems to be working rather normally.
(I reported this issue separately at https://github.com/scylladb/scylladb/issues/19793, as I don't think it's related to this one.)
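> Scylla Operator from 1.9.x to 1.12.2
Scylla Operator only supports n+1 upgrades; otherwise you may miss a migration step.
Alternator should be configured through the API, see:
* https://operator.docs.scylladb.com/stable/clients/alternator.html
* https://operator.docs.scylladb.com/stable/api-reference/groups/scylla.scylladb.com/scyllaclusters.html#api-scylla-scylladb-com-scyllaclusters-v1-spec-alternator
Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.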
> Scylla Operator from 1.9.x to 1.12.2
> Scylla Operator only supports n+1 upgrades; otherwise you may miss a migration step.
Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html...
But it's a fact that I forgot about CRD updates completely. 😞
> Alternator should be configured through the API, see:
> * https://operator.docs.scylladb.com/stable/clients/alternator.html
> * https://operator.docs.scylladb.com/stable/api-reference/groups/scylla.scylladb.com/scyllaclusters.html#api-scylla-scylladb-com-scyllaclusters-v1-spec-alternator
> Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.
We are not using Alternator.
How to fix this now, @tnozicka? Should I apply CRDs from each version 1.10.
> Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html
It only shows the X.Y.Z to X.Y+1.Z upgrades (https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0), but I thought we had it in some place generically too.
> How to fix this now
Roll back the operator deployment manifest and image to where it started, then follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump).
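The "one Y+1 bump at a time" rule can be sketched as a small helper that lists the minor versions to step through. This is only an illustration under stated assumptions: the function name `minor_upgrade_path` is hypothetical, it only considers the major and minor components, and it does not know which patch release of each minor version to pick, so the upgrade guide remains the authority for each step.

```python
def minor_upgrade_path(current: str, target: str) -> list[str]:
    """List every minor version between current and target, one bump at a time.

    Hypothetical helper: only the major and minor components are considered.
    """
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target version is older than the current one")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


# Versions taken from this issue: operator 1.9.x rolled forward to 1.12.2.
for step in minor_upgrade_path("1.9.x", "1.12.2"):
    print(f"upgrade operator + CRDs to {step}, then wait for rollouts")
```

For the versions in this issue, that means three separate bumps rather than one jump, each including the CRD update.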
> Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html
> It only shows the X.Y.Z to X.Y+1.Z upgrades (https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0), but I thought we had it in some place generically too.
Oh, you were right, in https://operator.docs.scylladb.com/stable/upgrade.html#upgrade-via-helm there is a step with the CRD updates. 🤦‍♂️ Sorry!
> How to fix this now
> Roll back the operator deployment manifest and image to where it started, then follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump).
We did this but I am still seeing:
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla-manager/scylla-manager
Datacenter: manager-dc
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| | Alternator | CQL | REST | Address | Uptime | CPUs | Memory | Scylla | Agent | Host ID |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.255.190 | - | - | - | - | - | 8ec8a729-8225-4278-a9da-ad0f23f47e01 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.255.190 alternator: get node info: no host config available
- 10.7.255.190 CQL: no host config available
What's next?
We have updated the Scylla Manager to 3.3.1 and we are still having this problem.
I don't really care that much about the ugly output of sctool status for the Scylla Manager's cluster, but we also see that an update to the Scylla Manager's cluster task configs is not working, perhaps because of this.
The relevant logs of scylla-manager-controller from the scylla-manager-controller Deployment:
E0827 09:34:57.898551 1 manager/controller.go:154] syncing key 'scylla-manager/scylla-manager' failed: can't execute action: can't update task "manager-daily-backup": [PUT /cluster/{cluster_id}/task/{task_type}/{task_id}][404] PutClusterClusterIDTaskTaskTypeTaskID default &{Details: Message:get resource: create backup target: create cluster session: TLS/SSL key/cert is not registered: not found TraceID:s-j603PPTLC2kyO2xXY6hA}
E0827 09:34:58.328620 1 manager/sync.go:136] "Failed to execute action" err="can't update task \"manager-daily-backup\": [PUT /cluster/{cluster_id}/task/{task_type}/{task_id}][404] PutClusterClusterIDTaskTaskTypeTaskID default &{Details: Message:get resource: create backup target: create cluster session: TLS/SSL key/cert is not registered: not found TraceID:Ii-0uK49T3-K70PTOVmD5Q}" action="update task &{ClusterID: Enabled:true ID:0db86eed-6ec-4aa2-879d-05e1b84fb428 Name:manager-daily-backup Properties:map[dc:[manager-dc] location:[gcs:fetlife-scylla-manager-backups] retention:7] Schedule:0xc000213dc0 Tags:[] Type:backup}"
The backups themselves are not working either:
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool tasks --cluster scylla-manager/scylla-manager
+------------------------------+--------+----------+--------+----------+---------+-------+------------------------+------------------------+--------+------------------------+
| Task | Labels | Schedule | Window | Timezone | Success | Error | Last Success | Last Error | Status | Next |
+------------------------------+--------+----------+--------+----------+---------+-------+------------------------+------------------------+--------+------------------------+
| backup/manager-daily-backup | | 1d | | | 658 | 60 | 12 Jul 24 11:00:23 UTC | 01 Sep 24 11:00:00 UTC | ERROR | 02 Sep 24 11:00:00 UTC |
| healthcheck/rest | | 1m | | | 1093493 | 0 | 02 Sep 24 08:28:56 UTC | | DONE | 02 Sep 24 08:29:56 UTC |
| healthcheck/alternator | | 15s | | | 4373968 | 1 | 02 Sep 24 08:29:26 UTC | 17 Apr 23 02:15:41 UTC | DONE | 02 Sep 24 08:29:41 UTC |
| healthcheck/cql | | 15s | | | 4373936 | 1 | 02 Sep 24 08:29:26 UTC | 17 Apr 23 02:15:41 UTC | DONE | 02 Sep 24 08:29:41 UTC |
| repair/manager-weekly-repair | | 7d | | | 101 | 0 | 31 Aug 24 11:30:02 UTC | | DONE | 07 Sep 24 11:30:00 UTC |
+------------------------------+--------+----------+--------+----------+---------+-------+------------------------+------------------------+--------+------------------------+
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool progress --cluster scylla-manager/scylla-manager backup/manager-daily-backup
Run: 550809c3-6851-11ef-a3b5-b2c3114a5b19
Status: ERROR (initialising)
Cause: get backup target: create cluster session: TLS/SSL key/cert is not registered: not found
Start time: 01 Sep 24 11:00:00 UTC
End time: 01 Sep 24 11:00:00 UTC
Duration: 0s
Progress: -
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
* lifecycle/stale is applied
* lifecycle/stale was applied, lifecycle/rotten is applied
* lifecycle/rotten was applied, the issue is closed
You can:
* /remove-lifecycle stale
* /close
/lifecycle stale
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
* lifecycle/stale is applied
* lifecycle/stale was applied, lifecycle/rotten is applied
* lifecycle/rotten was applied, the issue is closed
You can:
* /remove-lifecycle rotten
* /close
/lifecycle rotten
What happened?
After an update of Scylla from 5.2.9 to 5.4.7, Scylla Operator from 1.9.x to 1.12.2 (latest that supports Scylla 5.2.x and 5.4.x), Scylla Manager from 3.1.x to 3.2.8, we started to observe that sctool status doesn't provide all the node info anymore and returns errors:
Note that our scylla.yaml didn't have any config for TLS up to that point.
This problem has been worked around by setting this:
However, we still have a problem with the Scylla Manager's cluster:
...and it seems to only have a generated ConfigMap named scylladb-managed-config:
...and I can't find anything about modifying it in the https://operator.docs.scylladb.com/stable/helm.html...
Since then we have updated Scylla to 5.4.9, Operator to 1.13.0, and Manager to 3.3.0 but it did not help.
What did you expect to happen?
sctool status should work without errors for both the main cluster and the Scylla Manager's one after an update.
I shouldn't have to reconfigure TLS, as the defaults shown in https://github.com/scylladb/scylladb/blob/scylla-5.4.7/conf/scylla.yaml#L472-L474 say that it should be disabled.
How can we reproduce it (as minimally and precisely as possible)?
scylla.yaml, as we had before:
consistent_cluster_management: true
$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-gke.1192000