scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

Errors like `alternator: get node info: no host config available` and `CQL: no host config available` when running `sctool status` after an update #2016

Open gdubicki opened 1 month ago

gdubicki commented 1 month ago

What happened?

After updating Scylla from 5.2.9 to 5.4.7, Scylla Operator from 1.9.x to 1.12.2 (the latest that supports Scylla 5.2.x and 5.4.x), and Scylla Manager from 3.1.x to 3.2.8, we started to observe that sctool status no longer provides all the node info and returns errors:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla/scylla
Datacenter: XXX
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST     | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.241.130 | -      | -    | -      | -      | -     | 8a24c600-5525-490e-a3cd-314f6062d6a1 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (6ms) | 10.7.241.174 | -      | -    | -      | -      | -     | f14fcd59-8d90-4d8e-af22-ace87ceced22 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.241.175 | -      | -    | -      | -      | -     | 050dcc67-7bb8-4d5d-89b1-5dbe0bcbb8b2 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (5ms) | 10.7.243.109 | -      | -    | -      | -      | -     | 4a3ff045-bba2-4537-a4d7-a213d25ae713 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.248.124 | -      | -    | -      | -      | -     | 028023f5-9d4e-404c-8537-467ac3d4538c |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.249.238 | -      | -    | -      | -      | -     | b8f68c62-c462-4a30-a505-5ece9ae1ab0b |
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.252.229 | -      | -    | -      | -      | -     | 1ff1b8df-7a90-4321-a309-7cd69e20bd70 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
- 10.7.241.174 alternator: get node info: no host config available
- 10.7.241.174 CQL: no host config available
- 10.7.241.175 alternator: get node info: no host config available
- 10.7.241.175 CQL: no host config available
- 10.7.243.109 alternator: get node info: no host config available
- 10.7.243.109 CQL: no host config available
- 10.7.248.124 alternator: get node info: no host config available
- 10.7.248.124 CQL: no host config available
- 10.7.249.238 alternator: get node info: no host config available
- 10.7.249.238 CQL: no host config available
- 10.7.252.229 alternator: get node info: no host config available
- 10.7.252.229 CQL: no host config available
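
When comparing runs, the Errors section above can be summarized per host with a small script. This is a minimal sketch, assuming every entry keeps the `- <address> <check>: <message>` layout seen in the output above; the function name `summarize_errors` is hypothetical and not part of sctool.

```python
import re
from collections import defaultdict

# Matches lines such as "- 10.7.241.130 CQL: no host config available".
# The check name ends at the first colon; the rest is the message.
ERROR_LINE = re.compile(r"^- (?P<host>\S+) (?P<check>[^:]+): (?P<message>.+)$")


def summarize_errors(status_output: str) -> dict[str, dict[str, str]]:
    """Group the entries that follow the "Errors:" line by host address."""
    errors = defaultdict(dict)
    in_errors = False
    for line in status_output.splitlines():
        line = line.strip()
        if line == "Errors:":
            in_errors = True
            continue
        if not in_errors:
            continue  # skip the table printed before the Errors section
        match = ERROR_LINE.match(line)
        if match:
            errors[match["host"]][match["check"]] = match["message"]
    return dict(errors)
```

Feeding it the captured output of the `kubectl exec ... sctool status` command returns one entry per node address, each mapping the failing check (`alternator`, `CQL`) to its message.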

Note that our scylla.yaml didn't have any config for TLS up to that point.

This problem has been worked around by setting this:

client_encryption_options:
  optional: true

However, we still have a problem with the Scylla Manager's cluster:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla-manager/scylla-manager
Datacenter: manager-dc
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST      | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (92ms) | 10.7.255.190 | -      | -    | -      | -      | -     | 8ec8a729-8225-4278-a9da-ad0f23f47e01 |
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.255.190 alternator: get node info: no host config available
- 10.7.255.190 CQL: no host config available

...and it seems to only have a generated ConfigMap named scylladb-managed-config:

apiVersion: v1
data:
  scylladb-managed-config.yaml: |
    cluster_name: "scylla"
    rpc_address: "0.0.0.0"
    endpoint_snitch: "GossipingPropertyFileSnitch"
    internode_compression: "all"
    native_transport_port_ssl: 9142
    native_shard_aware_transport_port_ssl: 19142
    client_encryption_options:
      enabled: true
      optional: false
      certificate: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/serving-certs/tls.crt"
      keyfile: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/serving-certs/tls.key"
      require_client_auth: true
      truststore: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/client-ca/tls.crt"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: scylla
    meta.helm.sh/release-namespace: scylla
    scylla-operator.scylladb.com/managed-hash: <redacted>
  creationTimestamp: "<redacted>"
  labels:
    app.kubernetes.io/managed-by: Helm
    scylla/cluster: scylla
  name: scylla-managed-config
  namespace: scylla
  ownerReferences:
  - apiVersion: scylla.scylladb.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ScyllaCluster
    name: scylla
    uid: <redacted>
  resourceVersion: "<redacted>"
  uid: <redacted>

...and I can't find anything about modifying it in https://operator.docs.scylladb.com/stable/helm.html.

Since then we have updated Scylla to 5.4.9, Operator to 1.13.0, and Manager to 3.3.0 but it did not help.

What did you expect to happen?

After an update, sctool status should work without errors for both the main cluster and the Scylla Manager's cluster.

I shouldn't have to reconfigure TLS as the defaults shown in https://github.com/scylladb/scylladb/blob/scylla-5.4.7/conf/scylla.yaml#L472-L474 say that it should be disabled.

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up the versions as mentioned above
  2. Use this scylla.yaml, as we had before:

    read_request_timeout_in_ms: 5000
    write_request_timeout_in_ms: 2000
    cas_contention_timeout_in_ms: 1000

    consistent_cluster_management: true

  3. Update to the versions mentioned above
  4. Check `sctool status`

Scylla Operator version

1.13.0

Kubernetes platform name and version

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-gke.1192000

Please attach the must-gather archive.

[scylla-operator-must-gather-77t6kvnghzss.zip](https://github.com/user-attachments/files/16196482/scylla-operator-must-gather-77t6kvnghzss.zip)

Anything else we need to know?

I additionally anonymized the must-gather archive manually, see https://github.com/scylladb/scylla-operator/issues/2015.

This problem was originally reported at https://github.com/scylladb/scylla-manager/issues/3889, but that issue was about a (probably) different problem, so it was suggested that I create a new one.

gdubicki commented 1 month ago

I am also seeing this in the scylladb-api-status-probe container logs of the Scylla pod:

I0712 14:17:47.251251       1 operator/cmd.go:21] maxprocs: Leaving GOMAXPROCS=[1]: CPU quota undefined
I0712 14:17:47.251718       1 probeserver/scylladbapistatus.go:133] scylladb-api-status version "v1.13.0-rc.0-2-g7f37771"
I0712 14:17:47.251740       1 flag/flags.go:64] FLAG: --address=""
I0712 14:17:47.251749       1 flag/flags.go:64] FLAG: --burst="75"
I0712 14:17:47.251754       1 flag/flags.go:64] FLAG: --feature-gates=""
I0712 14:17:47.251758       1 flag/flags.go:64] FLAG: --help="false"
I0712 14:17:47.251762       1 flag/flags.go:64] FLAG: --kubeconfig=""
I0712 14:17:47.251764       1 flag/flags.go:64] FLAG: --loglevel="2"
I0712 14:17:47.251767       1 flag/flags.go:64] FLAG: --namespace="scylla"
I0712 14:17:47.251770       1 flag/flags.go:64] FLAG: --port="8080"
I0712 14:17:47.251773       1 flag/flags.go:64] FLAG: --qps="50"
I0712 14:17:47.251777       1 flag/flags.go:64] FLAG: --service-name="scylla-us-west1-us-west1-b-0"
I0712 14:17:47.251780       1 flag/flags.go:64] FLAG: --v="2"
I0712 14:17:47.252016       1 cache/shared_informer.go:311] Waiting for caches to sync for Prober
I0712 14:17:47.258338       1 cache/reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229
I0712 14:17:47.353007       1 cache/shared_informer.go:318] Caches are synced for Prober
I0712 14:17:47.353249       1 probeserver/serveprobes.go:78] "Starting probe server" Address=":8080"
E0712 14:17:55.645952       1 scylladbapistatus/prober.go:82] "readyz probe: can't get scylla node status" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0"
E0712 14:18:05.073105       1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"
E0712 14:18:14.999478       1 scylladbapistatus/prober.go:101] "readyz probe: can't get scylla native transport" err="agent [HTTP 404] Not found" Service="scylla/scylla-us-west1-us-west1-b-0" Node="10.7.252.229"

..and in the scylla-manager-agent container logs occasionally this:

{"L":"INFO","T":"2024-07-12T16:01:53.797Z","M":"http: TLS handshake error from 10.6.241.80:48086: EOF"}
{"L":"INFO","T":"2024-07-12T16:02:05.297Z","M":"http: TLS handshake error from 10.6.241.80:49746: read tcp 10.138.0.93:10001->10.6.241.80:49746: read: connection reset by peer"}

This is for both the main cluster and the Scylla Manager's cluster, although the former has the workaround applied.

gdubicki commented 1 month ago

Also in the scylla-operator Deployment I am seeing this in the logs:

I0712 14:14:38.692544       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:14:48.696285       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:14:48.709939       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
E0712 14:14:52.702896       1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-2" not found
E0712 14:14:52.741036       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.746390       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
E0712 14:14:52.756784       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-2' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-2": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-2"
I0712 14:14:52.777037       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 created"
I0712 14:15:03.717394       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-fd90882c-f1e1-4050-ae6b-ef294b5d4cb5 updated"
I0712 14:15:08.716438       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:08.725014       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:18.729079       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:18.743043       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:15:28.746705       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:28.755023       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
I0712 14:15:58.765715       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:15:58.773584       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:16:16.164708       1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-1" not found
E0712 14:16:16.205687       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.210974       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
E0712 14:16:16.221368       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-1' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-1": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-1"
I0712 14:16:16.241720       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 created"
I0712 14:16:28.783327       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:16:28.797779       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:16:29.192472       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-7c4ac91a-f439-4869-8cc0-ad4f1fdfea81 updated"
I0712 14:17:18.817963       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:18.826596       1 scyllacluster/controller.go:257] "Hit conflict, will retry in a bit" Key="scylla/scylla" Error="Operation cannot be fulfilled on scyllaclusters.scylla.scylladb.com \"scylla\": the object has been modified; please apply your changes to the latest version and try again"
E0712 14:17:34.627808       1 controllerhelpers/handlers.go:117] pod "scylla-us-west1-us-west1-b-0" not found
E0712 14:17:34.675797       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.681062       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.691344       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
E0712 14:17:34.711688       1 nodeconfigpod/controller.go:291] syncing key 'scylla/scylla-us-west1-us-west1-b-0' failed: can't make configmap for pod "scylla/scylla-us-west1-us-west1-b-0": can't get container id: no scylla container found in pod "scylla/scylla-us-west1-us-west1-b-0"
I0712 14:17:34.755596       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapCreated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 created"
I0712 14:17:48.651971       1 record/event.go:376] "Event occurred" object="scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="ConfigMapUpdated" message="ConfigMap scylla/nodeconfig-podinfo-784f0acf-f384-4efb-b2af-4dfbeecaf684 updated"
I0712 14:17:48.834780       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"
I0712 14:17:48.849950       1 scyllacluster/status.go:36] "Status updated" ScyllaCluster="scylla/scylla"
I0712 14:18:38.865922       1 scyllacluster/status.go:29] "Updating status" ScyllaCluster="scylla/scylla"

gdubicki commented 1 month ago

Since the update, we are also seeing disk usage growing constantly on all the nodes, although our cluster usage has not changed. Apart from that, the cluster itself seems to be working rather normally.

(I reported this issue separately at https://github.com/scylladb/scylladb/issues/19793, as I don't think it's related to this one.)

tnozicka commented 1 month ago

Scylla Operator from 1.9.x to 1.12.2

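Scylla Operator only supports n+1 upgrades; otherwise you may miss a migration step.

Alternator should be configured through the API, see:

* https://operator.docs.scylladb.com/stable/clients/alternator.html

* https://operator.docs.scylladb.com/stable/api-reference/groups/scylla.scylladb.com/scyllaclusters.html#api-scylla-scylladb-com-scyllaclusters-v1-spec-alternator

Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.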

gdubicki commented 1 month ago

Scylla Operator from 1.9.x to 1.12.2

Scylla Operator only supports n+1 upgrades; otherwise you may miss a migration step.

Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html...

But it's a fact that I forgot about CRD updates completely. 😞

Alternator should be configured through the API, see:

* https://operator.docs.scylladb.com/stable/clients/alternator.html

* https://operator.docs.scylladb.com/stable/api-reference/groups/scylla.scylladb.com/scyllaclusters.html#api-scylla-scylladb-com-scyllaclusters-v1-spec-alternator

Is the Alternator API working on its own? I'd expect you need to take some extra steps to configure the certificates with it. For the manager integration, CQL and Alternator certs are not supported yet.

We are not using Alternator.

gdubicki commented 1 month ago

How do I fix this now, @tnozicka? Should I apply the CRDs from each version (1.10.x, 1.11.x, ..., 1.13.x), as documented in the 2nd step of https://operator.docs.scylladb.com/stable/upgrade.html#upgrade-via-helm?

tnozicka commented 1 month ago

Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html

It only shows the X.Y.Z to X.Y+1.Z upgrades (https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0), but I thought we had it in some place generically too.

How to fix this now

Roll back the operator deployment manifest and image to where they started, then follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump).
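
To make the one-minor-at-a-time sequence concrete, here is a minimal sketch that lists the intermediate minor versions between two releases. The helper name `minor_upgrade_path` is hypothetical, and it assumes both versions share the same major number; which patch release to pick at each step is not decided here and should come from the upgrade guide.

```python
def minor_upgrade_path(current: str, target: str) -> list[str]:
    """Return every minor release after `current` up to and including `target`."""
    # Only the major and minor components matter; patch parts are ignored.
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


for step in minor_upgrade_path("1.9", "1.13"):
    # For each bump: update the CRDs, upgrade the operator, wait for rollouts.
    print(f"upgrade to {step}.x")
```

For the versions in this thread (1.9 to 1.13), that is four separate bumps: 1.10, 1.11, 1.12, and 1.13.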

gdubicki commented 3 weeks ago

Oh, got it now but I didn't do it this way as it was not documented at https://operator.docs.scylladb.com/stable/upgrade.html

It only shows the X.Y.Z to X.Y+1.Z upgrades (https://operator.docs.scylladb.com/stable/upgrade.html#v1-2-0-v1-3-0), but I thought we had it in some place generically too.

Oh, you were right, in https://operator.docs.scylladb.com/stable/upgrade.html#upgrade-via-helm there is a step with the CRD updates. 🤦‍♂️ Sorry!

gdubicki commented 3 weeks ago

How to fix this now

Roll back the operator deployment manifest and image to where they started, then follow the upgrade guide for each Y+1 from there (operator + CRD + wait for rollouts for each bump).

We did this but I am still seeing:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla-manager/scylla-manager
Datacenter: manager-dc
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST     | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.255.190 | -      | -    | -      | -      | -     | 8ec8a729-8225-4278-a9da-ad0f23f47e01 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.255.190 alternator: get node info: no host config available
- 10.7.255.190 CQL: no host config available

What's next?