Open gdubicki opened 1 month ago
I am also seeing this in the scylladb-api-status-probe container logs of the Scylla pod:
I0712 14:17:47.251251 1 operator/cmd.go:21] maxprocs: Leaving GOMAXPROCS=[1]: CPU quota undefined
I0712 14:17:47.251718 1 probeserver/scylladbapistatus.go:133] scylladb-api-status version "v1.13.0-rc.0-2-g7f37771"
I0712 14:17:47.251740 1 flag/flags.go:64] FLAG: --address=""
I0712 14:17:47.251749 1 flag/flags.go:64] FLAG: --burst="75"
I0712 14:17:47.251754 1 flag/flags.go:64] FLAG: --feature-gates=""
I0712 14:17:47.251758 1 flag/flags.go:64] FLAG: --help="false"
I0712 14:17:47.251762 1 flag/flags.go:64] FLAG: --kubeconfig=""
I0712 14:17:47.251764 1 flag/flags.go:64] FLAG: --loglevel="2"
I0712 14:17:47.251767 1 flag/flags.go:64] FLAG: --namespace="scylla"
I0712 14:17:47.251770 1 flag/flags.go:64] FLAG: --port="8080"
I0712 14:17:47.251773 1 flag/flags.go:64] FLAG: --qps="50"
I0712 14:17:47.251777 1 flag/flags.go:64] FLAG: --service-name="scylla-us-west1-us-west1-b-0"
I0712 14:17:47.251780 1 flag/flags.go:64] FLAG: --v="2"
I0712 14:17:47.252016 1 cache/shared_informer.go:311] Waiting for caches to sync for Prober
I0712 14:17:47.258338 1 cache/reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229
I0712 14:17:47.353007 1 cache/shared_informer.go:318] Caches are synced for Prober
I0712 14:17:47.353249 1 probeserver/serveprobes.go:78] "Starting probe server" Address=":8080"
E0712 14:17:55.645952 1 scylladbapistatus/prober.go:82] "readyz probe: can't get scylla node status" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0"
E0712 14:18:05.073105 1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"
E0712 14:18:14.999478 1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"
...and in the scylla-manager-agent container logs, occasionally this:
{"L":"INFO","T":"2024-07-12T16:01:53.797Z","M":"http: TLS handshake error from 10.6.241.80:48086: EOF"}
{"L":"INFO","T":"2024-07-12T16:02:05.297Z","M":"http: TLS handshake error from 10.6.241.80:49746: read tcp 10.138.0.93:10001->10.6.241.80:49746: read: connection reset by peer"}
This is for both the main cluster and the Scylla Manager's cluster, although the former has the workaround applied.
Also, in the scylla-operator Deployment I am seeing this in the logs:
I0712 14:14:38.692544 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:14:48.696285 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:14:48.709939 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
E0712 14:14:52.702896 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-2" not found
E0712 14:14:52.741036 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.746390 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.756784 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
I0712 14:14:52.777037 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 created"
I0712 14:15:03.717394 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 updated"
I0712 14:15:08.716438 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:08.725014 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:18.729079 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:18.743043 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:15:28.746705 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:28.755023 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:58.765715 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:58.773584 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:16:16.164708 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-1" not found
E0712 14:16:16.205687 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.210974 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.221368 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
I0712 14:16:16.241720 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 created"
I0712 14:16:28.783327 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:16:28.797779 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:16:29.192472 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 updated"
I0712 14:17:18.817963 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:18.826596 1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:17:34.627808 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-0" not found
E0712 14:17:34.675797 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.681062 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.691344 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.711688 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
I0712 14:17:34.755596 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 created"
I0712 14:17:48.651971 1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 updated"
I0712 14:17:48.834780 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:48.849950 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:18:38.865922 1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
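To make the repeated errors in a log like the one above easier to spot, here is a minimal Python sketch that counts error-level lines per source location. It assumes every line follows the header shape visible in these logs (severity letter and date, time, process id, `file:line]`, then the message). The regex and the `count_errors_by_source` helper are illustrative names, not part of Scylla Operator or any of its tooling.

```python
import re
from collections import Counter

# Assumed header shape, taken from the log lines in this thread:
# <severity><MMDD> <HH:MM:SS.micros> <pid> <file>:<line>] <message>
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<pid>\d+) "
    r"(?P<source>[^\]\s]+:\d+)\] (?P<message>.*)$"
)


def count_errors_by_source(lines):
    """Count error-level (E) lines per source location (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = KLOG_LINE.match(line.strip())
        if match and match.group("severity") == "E":
            counts[match.group("source")] += 1
    return counts


# Sample lines copied from the operator log above; the two longest
# messages are shortened here to keep the example readable.
sample = [
    'I0712 14:14:38.692544 1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"',
    'E0712 14:14:52.702896 1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-2" not found',
    "E0712 14:14:52.741036 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed",
    "E0712 14:14:52.746390 1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed",
]

for source, count in count_errors_by_source(sample).most_common():
    print(f"{count:3d}  {source}")
```

In practice the lines could come from a saved log file instead of the inline `sample` list.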
We are also seeing disk usage constantly growing on all the nodes since the update, although our cluster usage has not changed, but apart from that the cluster itself seems to be working rather normally.
(I reported this issue separately at https://github.com/scylladb/scylladb/issues/19793 as I don't think it's related to this one.)
Scylla Operator from 1.9.x to 1.12.2
scylla operator only supports n+1 upgrades, otherwise you may miss a migration step
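Alternator should be configured through the API, see:
* https://operator.docs.scylladb.com/stable/clients/alternator.html
* https://operator.docs.scylladb.com/stable/api-reference/groups/scylla.scylladb.com/scyllaclusters.html#api-scylla-scylladb-com-scyllaclusters-v1-spec-alternator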
Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.
Scylla Operator from 1.9.x to 1.12.2
scylla operator only supports n+1 upgrades, otherwise you may miss a migration step
Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html...
But it's a fact that I forgot about CRD updates completely. 😞
Alternator should be configured through the API, see:
* https://operator.docs.scylladb.com/stable/clients/alternator.html
* https://operator.docs.scylladb.com/stable/api-reference/groups/scylla.scylladb.com/scyllaclusters.html#api-scylla-scylladb-com-scyllaclusters-v1-spec-alternator
Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.
We are not using Alternator.
How to fix this now, @tnozicka? Should I apply CRDs from each version 1.10.
Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html
It only shows the X.Y.Z to X.Y+1.Z upgrades (https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0), but I thought we had it in some place generically too
How to fix this now
Roll back the operator deployment manifest and image to where it started and follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump)
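As a rough illustration of the n+1 rule described here, the following Python sketch lists the minor releases to step through between two operator versions. It assumes versions have the form MAJOR.MINOR or MAJOR.MINOR.PATCH and that both sit in the same major series. `minor_upgrade_path` is a hypothetical helper, not something shipped with Scylla Operator, and the exact patch release to use at each step still has to come from the upgrade guide.

```python
def minor_upgrade_path(current, target):
    """Return the minor releases to step through, one bump at a time.

    Hypothetical helper: only the MAJOR.MINOR part of each version is used.
    """
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


# The versions from this issue: operator 1.9.x upgraded to 1.12.2.
for version in minor_upgrade_path("1.9", "1.12.2"):
    print(f"upgrade operator and CRDs to {version}, then wait for rollouts")
```

For the versions in this issue the loop walks through 1.10, 1.11 and 1.12, that is, three separate bumps instead of one jump.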
Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html
It only shows the X.Y.Z to X.Y+1.Z upgrades (https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0), but I thought we had it in some place generically too
Oh, you were right, in https://operator.docs.scylladb.com/stable/upgrade.html#upgrade-via-helm there is a step with the CRD updates. 🤦‍♂️ Sorry!
How to fix this now
Roll back the operator deployment manifest and image to where it started and follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump)
We did this but I am still seeing:
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla-manager/scylla-manager
Datacenter: manager-dc
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| | Alternator | CQL | REST | Address | Uptime | CPUs | Memory | Scylla | Agent | Host ID |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.255.190 | - | - | - | - | - | 8ec8a729-8225-4278-a9da-ad0f23f47e01 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.255.190 alternator: get node info: no host config available
- 10.7.255.190 CQL: no host config available
What's next?
What happened?
After an update of Scylla from 5.2.9 to 5.4.7, Scylla Operator from 1.9.x to 1.12.2 (latest that supports Scylla 5.2.x and 5.4.x), and Scylla Manager from 3.1.x to 3.2.8, we started to observe that sctool status doesn't provide all the node info anymore and returns errors:
Note that our scylla.yaml didn't have any config for TLS up to that point.
This problem has been worked around by setting this:
However, we still have a problem with the Scylla Manager's cluster:
...and it seems to only have a generated ConfigMap named scylladb-managed-config:
...and I can't find anything about modifying it in the https://operator.docs.scylladb.com/stable/helm.html...
Since then we have updated Scylla to 5.4.9, Operator to 1.13.0, and Manager to 3.3.0 but it did not help.
What did you expect to happen?
sctool status should work without errors for both the main cluster and the Scylla Manager's one after an update.
I shouldn't have to reconfigure TLS, as the defaults shown in https://github.com/scylladb/scylladb/blob/scylla-5.4.7/conf/scylla.yaml#L472-L474 say that it should be disabled.
How can we reproduce it (as minimally and precisely as possible)?
scylla.yaml, as we had before:
consistent_cluster_management: true
$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-gke.1192000