scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Incorrect error messages in health_check for SSL configuration issues in Manager 3.2.8 #3889

Open mikliapko opened 1 week ago

mikliapko commented 1 week ago

Test description: The test starts a (second) cluster with ssl disabled, and adds it to the manager.

Afterwards, the test enables ssl encryption for the cluster, without updating the manager, and because of that the manager cannot communicate with the cluster through cql.

Finally, the test requests the cluster status from the manager and verifies that the CQL status of every node is ERROR and that a proper error message is printed for each node, since the manager (or rather its agents) cannot communicate with the cluster without the SSL keys.

Actual result: Manager 3.2.8 returns incorrect error messages:

- 127.0.71.1 alternator: get node info: no host config available
- 127.0.71.1 CQL: no host config available

Expected result: The manager should return error messages indicating that SSL client encryption is enabled but the certificate is missing. Example from version 3.2.7:

- 127.0.70.1 alternator: get node info: client encryption is enabled, but certificate is missing: get SSL user cert from secrets store: not found
- 127.0.70.1 CQL: client encryption is enabled, but certificate is missing: get SSL user cert from secrets store: not found

Environment: Scylla manager - 3.2.8

gdubicki commented 3 days ago

We seem to be having this or a related problem after an upgrade to Scylla Manager 3.2.8 in production.

We are seeing:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla/scylla
Datacenter: XXX
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST     | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.241.130 | -      | -    | -      | -      | -     | 8a24c600-5525-490e-a3cd-314f6062d6a1 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (6ms) | 10.7.241.174 | -      | -    | -      | -      | -     | f14fcd59-8d90-4d8e-af22-ace87ceced22 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.241.175 | -      | -    | -      | -      | -     | 050dcc67-7bb8-4d5d-89b1-5dbe0bcbb8b2 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (5ms) | 10.7.243.109 | -      | -    | -      | -      | -     | 4a3ff045-bba2-4537-a4d7-a213d25ae713 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.248.124 | -      | -    | -      | -      | -     | 028023f5-9d4e-404c-8537-467ac3d4538c |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.249.238 | -      | -    | -      | -      | -     | b8f68c62-c462-4a30-a505-5ece9ae1ab0b |
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.252.229 | -      | -    | -      | -      | -     | 1ff1b8df-7a90-4321-a309-7cd69e20bd70 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
- 10.7.241.174 alternator: get node info: no host config available
- 10.7.241.174 CQL: no host config available
- 10.7.241.175 alternator: get node info: no host config available
- 10.7.241.175 CQL: no host config available
- 10.7.243.109 alternator: get node info: no host config available
- 10.7.243.109 CQL: no host config available
- 10.7.248.124 alternator: get node info: no host config available
- 10.7.248.124 CQL: no host config available
- 10.7.249.238 alternator: get node info: no host config available
- 10.7.249.238 CQL: no host config available
- 10.7.252.229 alternator: get node info: no host config available
- 10.7.252.229 CQL: no host config available

...while in the Scylla Manager logs we see entries like this:

"host": "10.7.241.174",
"service": "scylla-manager",
"attributes": {
    "cluster": "b3580ac3-4e6d-4f1c-8217-2672280c0ab8",
    "S": "github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateSingle.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:182",
    "T": "2024-06-29T15:36:23.402Z",
    "host": "10.7.241.174",
    "_trace_id": "SuPC1lNRTSy78zeDrDV9JA",
    "L": "ERROR",
    "error": "retrieve cluster host configuration: building node config: unable to create TLS configuration for CQL session: client encryption is enabled, but certificate is missing: not found",
    "M": "Couldn't read cluster host config",
    "errorStack": "github.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).retrieveNodeConfig\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:240\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateSingle.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:180\nruntime.goexit\n\truntime/asm_amd64.s:1695\n",
    "N": "Cluster config update.Cluster host config update"
}

We haven't enabled or even changed any SSL-related config during the update.

Should we just downgrade to 3.2.7 until it's resolved? Or is it fixed in 3.3.0 and we should upgrade?

karol-kokoszka commented 1 day ago

@gdubicki

- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available

These errors come from the config cache service introduced in Manager 3.2.8. They indicate that the service couldn't update its cache with the latest, correct Scylla node configuration because some of the required information is missing from the manager DB.
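
A minimal sketch of why the status output shows only a generic message while the root cause appears in the logs, assuming the cache simply keeps no entry for a host whose config could not be built. `HostConfigCache`, `build_config`, and `failing_build` are hypothetical names for illustration, not the actual scylla-manager implementation (which is written in Go):

```python
import logging

logger = logging.getLogger("configcache-sketch")


class HostConfigCache:
    """Hypothetical per-host config cache; not the real scylla-manager code."""

    def __init__(self, build_config):
        # build_config(host) returns a config dict or raises on failure.
        self._build_config = build_config
        self._configs = {}

    def update(self, host):
        try:
            self._configs[host] = self._build_config(host)
        except Exception as err:
            # The root cause is only logged; the cache entry stays missing.
            logger.error("Couldn't read cluster host config: %s", err)
            self._configs.pop(host, None)

    def read(self, host):
        try:
            return self._configs[host]
        except KeyError:
            # Callers such as a health check only ever see this message.
            raise LookupError("no host config available") from None


def failing_build(host):
    # Stand-in for a config build that fails on the TLS step.
    raise RuntimeError(
        "client encryption is enabled, but certificate is missing: not found"
    )


cache = HostConfigCache(failing_build)
cache.update("10.7.241.174")
try:
    cache.read("10.7.241.174")
except LookupError as err:
    message = str(err)
```

In this sketch `message` ends up holding the generic text, while the TLS error is visible only through the logger.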

    "error": "retrieve cluster host configuration: building node config: unable to create TLS configuration for CQL session: client encryption is enabled, but certificate is missing: not found",

This means that even though scylla.yaml enables client encryption and requires client authentication, the certificates have not been provided to the manager. As a result, it cannot establish a secure CQL session.

Check your scylla.yaml files on nodes against:

client_encryption_options:
  enabled: true     <------ this seems to be enabled
  certificate: /etc/scylla/db.crt     <- missing in manager db
  keyfile: /etc/scylla/db.key   <- missing in manager db
#    truststore: <none, use system trust>
  require_client_auth: true    <---- this seems to be enabled
#    priority_string: <not set, use default>
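
The condition described here can be sketched as follows. This is an illustration based on this comment only, assuming the outcome depends on the two flags marked in the snippet and on whether the manager has a certificate stored for the cluster. `tls_requirement` and its parameters are hypothetical names:

```python
def tls_requirement(enabled, require_client_auth, manager_has_cert):
    """Classify how a CQL session could be opened (hypothetical helper)."""
    if not enabled:
        return "plain CQL session"
    if require_client_auth and not manager_has_cert:
        # The situation reported in the log entry quoted earlier.
        return "error: client encryption is enabled, but certificate is missing"
    if require_client_auth:
        return "TLS session with client certificate"
    return "TLS session without client certificate"
```

With both flags enabled and no certificate registered in the manager, the helper returns the error case, which is the combination the log entry points to.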

See the sctool docs for the relevant flags: https://manager.docs.scylladb.com/stable/sctool/cluster.html#ssl-user-cert-file and https://manager.docs.scylladb.com/stable/sctool/cluster.html#ssl-user-key-file

If you want to disable TLS, use this flag: https://manager.docs.scylladb.com/stable/sctool/cluster.html#force-tls-disabled, but please make sure that non-TLS CQL sessions are allowed in your Scylla configuration.

TL;DR: you don't need to downgrade to 3.2.7 or upgrade to 3.3.0. An upgrade to 3.3.0 is needed if your cluster uses Scylla OSS 6.0.

gdubicki commented 16 hours ago

Thanks @karol-kokoszka!

But the thing is that our scylla.yaml has only this config:

read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
cas_contention_timeout_in_ms: 1000

consistent_cluster_management: true

...and if I am reading https://github.com/scylladb/scylladb/blob/scylla-5.4.7/conf/scylla.yaml#L474 right, the default setting is disabled. 😕

karol-kokoszka commented 13 hours ago

@gdubicki for some reason a node in your cluster reported encryption as enabled. Here is the part of SM that builds the cached node configuration: https://github.com/scylladb/scylla-manager/blob/8d9190b5a0e12e0ec0ef611aa9295e703b30d741/pkg/service/configcache/tlsconfig.go#L22-L47

Here is the API call to scylla-server checking the encryption options: https://github.com/scylladb/scylla-manager/blob/8d9190b5a0e12e0ec0ef611aa9295e703b30d741/swagger/gen/scylla/v2/client/config/config_client.go#L1153-L1181

You can log in to one of the cluster nodes and call

curl 127.0.0.1:10000/v2/config/client_encryption_options

to see what scylla-server reports.
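
The same check can be scripted if it has to be repeated on several nodes. The sketch below uses only the Python standard library. The response shape (a JSON object whose values may be strings such as "true") is an assumption, so compare it against what curl actually prints on your nodes. `as_bool`, `summarize`, and `fetch_client_encryption_options` are hypothetical helper names:

```python
import json
import urllib.request


def as_bool(value):
    """Accept real booleans as well as strings like "true" / "false"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def summarize(options):
    """Reduce the reported options to the two flags relevant here."""
    return {
        "enabled": as_bool(options.get("enabled", False)),
        "require_client_auth": as_bool(options.get("require_client_auth", False)),
    }


def fetch_client_encryption_options(host="127.0.0.1", port=10000):
    """Query the same endpoint as the curl command (run on the node itself)."""
    url = f"http://{host}:{port}/v2/config/client_encryption_options"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage on a node: print(summarize(fetch_client_encryption_options()))
```

If `enabled` comes back as true on a node whose scylla.yaml does not set it, that node is the place to look for the discrepancy.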