Open mikliapko opened 1 week ago
We seem to be having this or a related problem after an upgrade to Scylla Manager 3.2.8 in production.
We are seeing:
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla/scylla
Datacenter: XXX
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| | Alternator | CQL | REST | Address | Uptime | CPUs | Memory | Scylla | Agent | Host ID |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.241.130 | - | - | - | - | - | 8a24c600-5525-490e-a3cd-314f6062d6a1 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (6ms) | 10.7.241.174 | - | - | - | - | - | f14fcd59-8d90-4d8e-af22-ace87ceced22 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.241.175 | - | - | - | - | - | 050dcc67-7bb8-4d5d-89b1-5dbe0bcbb8b2 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (5ms) | 10.7.243.109 | - | - | - | - | - | 4a3ff045-bba2-4537-a4d7-a213d25ae713 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.248.124 | - | - | - | - | - | 028023f5-9d4e-404c-8537-467ac3d4538c |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.249.238 | - | - | - | - | - | b8f68c62-c462-4a30-a505-5ece9ae1ab0b |
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.252.229 | - | - | - | - | - | 1ff1b8df-7a90-4321-a309-7cd69e20bd70 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
- 10.7.241.174 alternator: get node info: no host config available
- 10.7.241.174 CQL: no host config available
- 10.7.241.175 alternator: get node info: no host config available
- 10.7.241.175 CQL: no host config available
- 10.7.243.109 alternator: get node info: no host config available
- 10.7.243.109 CQL: no host config available
- 10.7.248.124 alternator: get node info: no host config available
- 10.7.248.124 CQL: no host config available
- 10.7.249.238 alternator: get node info: no host config available
- 10.7.249.238 CQL: no host config available
- 10.7.252.229 alternator: get node info: no host config available
- 10.7.252.229 CQL: no host config available
...while in the Scylla Manager logs we see entries like this:
"host": "10.7.241.174",
"service": "scylla-manager",
"attributes": {
"cluster": "b3580ac3-4e6d-4f1c-8217-2672280c0ab8",
"S": "github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateSingle.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:182",
"T": "2024-06-29T15:36:23.402Z",
"host": "10.7.241.174",
"_trace_id": "SuPC1lNRTSy78zeDrDV9JA",
"L": "ERROR",
"error": "retrieve cluster host configuration: building node config: unable to create TLS configuration for CQL session: client encryption is enabled, but certificate is missing: not found",
"M": "Couldn't read cluster host config",
"errorStack": "github.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).retrieveNodeConfig\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:240\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateSingle.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:180\nruntime.goexit\n\truntime/asm_amd64.s:1695\n",
"N": "Cluster config update.Cluster host config update"
}
We haven't enabled or even changed any SSL-related config during the update.
Should we just downgrade to 3.2.7 until it's resolved? Or is it fixed in 3.3.0 and we should upgrade?
@gdubicki
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
These errors come from the config cache service introduced in Manager 3.2.8. They indicate that the service couldn't update its cache with the latest (correct) Scylla node configuration, because some of the information is missing from the Manager DB.
"error": "retrieve cluster host configuration: building node config: unable to create TLS configuration for CQL session: client encryption is enabled, but certificate is missing: not found",
This means that even though scylla.yaml enables client encryption and requires client authentication, the certificates have not been provided to the manager, so it cannot establish a secure CQL session.
Check the scylla.yaml files on your nodes against:
client_encryption_options:
    enabled: true                     <- this seems to be enabled
    certificate: /etc/scylla/db.crt   <- missing in manager db
    keyfile: /etc/scylla/db.key       <- missing in manager db
    # truststore: <none, use system trust>
    require_client_auth: true         <- this seems to be enabled
    # priority_string: <not set, use default>
See https://manager.docs.scylladb.com/stable/sctool/cluster.html#ssl-user-cert-file and https://manager.docs.scylladb.com/stable/sctool/cluster.html#ssl-user-key-file for how to provide the certificate and key to the manager.
If you want to disable TLS instead, use this flag: https://manager.docs.scylladb.com/stable/sctool/cluster.html#force-tls-disabled , but please make sure that non-TLS CQL sessions are allowed in your Scylla configuration.
TL;DR: you don't need to downgrade to 3.2.7 or upgrade to 3.3.0. You only need to upgrade to 3.3.0 if your cluster uses Scylla OSS 6.0.
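To make the rule above concrete, here is a rough Python sketch of the decision as I described it. It is not the manager's actual code (that is written in Go); the type and function names are hypothetical, and the "tls" branch for nodes that do not require client auth is my assumption. Only the three flags and the error text come from this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeEncryption:
    """client_encryption_options as reported by a node (hypothetical type)."""
    enabled: bool
    require_client_auth: bool


@dataclass
class ClusterSecrets:
    """What was registered with the manager for the cluster (hypothetical type)."""
    ssl_user_cert: Optional[bytes] = None  # set via --ssl-user-cert-file
    ssl_user_key: Optional[bytes] = None   # set via --ssl-user-key-file
    force_tls_disabled: bool = False       # set via --force-tls-disabled


def cql_session_mode(node: NodeEncryption, secrets: ClusterSecrets) -> str:
    """Return "plain", "tls" or "mutual-tls"; raise if a required cert is absent."""
    # No encryption on the node, or TLS explicitly disabled on the manager side.
    if not node.enabled or secrets.force_tls_disabled:
        return "plain"
    # Assumption: encryption without client auth needs no client certificate.
    if not node.require_client_auth:
        return "tls"
    # Client auth is required, so both certificate and key must be registered.
    if secrets.ssl_user_cert is None or secrets.ssl_user_key is None:
        raise LookupError("client encryption is enabled, but certificate is missing")
    return "mutual-tls"
```

With a node reporting `enabled: true` and `require_client_auth: true` and nothing registered in the manager, the function raises, which corresponds to the situation in the log entry quoted above.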
Thanks @karol-kokoszka!
But the thing is that our scylla.yaml has only this config:
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
cas_contention_timeout_in_ms: 1000
consistent_cluster_management: true
...and if I am reading https://github.com/scylladb/scylladb/blob/scylla-5.4.7/conf/scylla.yaml#L474 right, the default setting is disabled. 😕
@gdubicki for some reason the node in your cluster reported encryption as enabled. Here is the SM code that builds the cached node configuration: https://github.com/scylladb/scylla-manager/blob/8d9190b5a0e12e0ec0ef611aa9295e703b30d741/pkg/service/configcache/tlsconfig.go#L22-L47
Here is the API call to scylla-server that checks the encryption options: https://github.com/scylladb/scylla-manager/blob/8d9190b5a0e12e0ec0ef611aa9295e703b30d741/swagger/gen/scylla/v2/client/config/config_client.go#L1153-L1181
You can log in to one of the cluster nodes and call
curl 127.0.0.1:10000/v2/config/client_encryption_options
to see what scylla-server reports.
Test description: The test starts a (second) cluster with SSL disabled and adds it to the manager.
Afterwards, the test enables SSL encryption for the cluster without updating the manager, so the manager can no longer communicate with the cluster over CQL.
At the end, the test requests the status of the cluster from the manager and verifies that the CQL status of every node is ERROR and that a proper error message is printed for each node, since the manager (or rather its agents) fails to communicate with the cluster due to the missing SSL keys.
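The final check can be sketched roughly as follows. This is an illustrative Python sketch, not the actual test code: `parse_status`, `check_cql_errors` and `SAMPLE` are hypothetical names, the sample is a trimmed version of the `sctool status` output from this issue, and the expected substring is an assumption taken from the manager log entry rather than from the exact 3.2.7 wording.

```python
import re

# Assumed substring of the expected CQL error (taken from the manager log
# entry quoted in this issue; the exact 3.2.7 status wording is not shown).
EXPECTED_CQL_ERROR = "client encryption is enabled, but certificate is missing"

# Matches lines such as "- 10.7.241.130 CQL: no host config available".
ERROR_LINE = re.compile(r"^- (\S+) (\w+): (.*)$")


def parse_status(output):
    """Parse `sctool status` text into (rows, errors).

    rows   -- list of dicts keyed by the table header (e.g. "CQL", "Address")
    errors -- dict mapping (address, lower-cased service) to the error text
    """
    rows, errors, header = [], {}, None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Datacenter:"):
            header = None  # each datacenter prints its own table
        elif line.startswith("|"):
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            if header is None:
                header = cells  # first table row after "Datacenter:" is the header
            else:
                rows.append(dict(zip(header, cells)))
        else:
            match = ERROR_LINE.match(line)
            if match:
                host, service, message = match.groups()
                errors[(host, service.lower())] = message
    return rows, errors


def check_cql_errors(output, expected=EXPECTED_CQL_ERROR):
    """Return a list of problems; an empty list means the check passed."""
    rows, errors = parse_status(output)
    problems = []
    for row in rows:
        host = row["Address"]
        if not row["CQL"].startswith("ERROR"):
            problems.append(f"{host}: CQL status is {row['CQL']}, expected ERROR")
        message = errors.get((host, "cql"), "")
        if expected not in message:
            problems.append(f"{host}: unexpected CQL error {message!r}")
    return problems


# Trimmed sample (fewer columns, one node) based on the output in this issue.
SAMPLE = """\
Datacenter: XXX
| | Alternator | CQL | REST | Address | Host ID |
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.241.130 | 8a24c600-5525-490e-a3cd-314f6062d6a1 |
Errors:
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
"""

if __name__ == "__main__":
    for problem in check_cql_errors(SAMPLE):
        print(problem)
```

Run against the sample, the check reports one problem for 10.7.241.130: the CQL status is ERROR as expected, but the message is the generic "no host config available" instead of the certificate error.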
Actual result: Manager 3.2.8 returns incorrect error messages:
Expected result: The manager should return error messages indicating that SSL client encryption is enabled but the certificate is missing. Example from version 3.2.7:
Environment: Scylla manager - 3.2.8
Additional Info: