scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Restoring a snapshot into a cluster with authentication enabled fails: `unable to create session: unable to discover protocol version: authentication required` #3495

Closed ShlomiBalalis closed 1 year ago

ShlomiBalalis commented 1 year ago

Issue description

First, we added the cluster to the manager

< t:2023-07-20 10:21:37,239 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool cluster add --host=10.4.3.18  --name=longevity-200gb-48h-verify-limited--db-cluster-f296d884 --auth-token f296d884-316b-41b4-9406-a9090ec196ea --username cassandra --password cassandra" finished with status 0
< t:2023-07-20 10:21:37,244 f:cli.py          l:1114 c:sdcm.mgmt.cli        p:DEBUG > sctool output: ae818ceb-b9c2-405e-9d37-5021b423f487

Afterwards, we tried to restore a (previously created) snapshot, and the restore command failed due to missing authentication:

< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > Command: 'sudo sctool restore -c ae818ceb-b9c2-405e-9d37-5021b423f487 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230702173940UTC'
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > 
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > Exit code: 1
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > 
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > Stdout:
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > 
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > 
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > 
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > Stderr:
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > 
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > Error: create restore units: get CQL cluster session: gocql: unable to create session: unable to discover protocol version: authentication required (using "org.apache.cassandra.auth.PasswordAuthenticator")
< t:2023-07-20 10:21:43,898 f:nemesis.py      l:4591 c:sdcm.nemesis         p:ERROR > Trace ID: lZwtNmQ3QXqCUj1RrEgaEQ (grep in scylla-manager logs)

Grepped rows from the log:

monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.186Z","N":"backup","M":"GetRestoreTarget","cluster_id":"ae818ceb-b9c2-405e-9d37-5021b423f487","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.186Z","N":"backup","M":"No datacenter specified for location - using all nodes for this location","location":"s3:manager-backup-tests-permanent-snapshots-us-east-1","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.186Z","N":"cluster","M":"Creating new Scylla REST client","cluster_id":"ae818ceb-b9c2-405e-9d37-5021b423f487","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.205Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["eu-west"],"_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.234Z","N":"cluster.client","M":"Checking hosts connectivity","hosts":["10.4.1.140","10.4.3.18","10.4.3.190","10.4.3.39"],"_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.236Z","N":"cluster.client","M":"Host check OK","host":"10.4.3.39","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.236Z","N":"cluster.client","M":"Host check OK","host":"10.4.3.18","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.237Z","N":"cluster.client","M":"Host check OK","host":"10.4.3.190","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.243Z","N":"cluster.client","M":"Host check OK","host":"10.4.1.140","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:40 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:40.243Z","N":"cluster.client","M":"Done checking hosts connectivity","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:41 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:41.903Z","N":"cluster.client","M":"Host location access check OK","host":"10.4.3.190","location":"s3:manager-backup-tests-permanent-snapshots-us-east-1","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:41 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:41.930Z","N":"cluster.client","M":"Host location access check OK","host":"10.4.3.18","location":"s3:manager-backup-tests-permanent-snapshots-us-east-1","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:41 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:41.934Z","N":"cluster.client","M":"Host location access check OK","host":"10.4.3.39","location":"s3:manager-backup-tests-permanent-snapshots-us-east-1","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:42 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:42.039Z","N":"cluster.client","M":"Host location access check OK","host":"10.4.1.140","location":"s3:manager-backup-tests-permanent-snapshots-us-east-1","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:42 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:42.039Z","N":"cluster","M":"Creating new Scylla REST client","cluster_id":"ae818ceb-b9c2-405e-9d37-5021b423f487","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:42 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:42.054Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["eu-west"],"_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}
monitor-set-f296d884/longevity-200gb-48h-verify-limited--monitor-node-f296d884-1/scylla_manager.log:Jul 20 10:21:42 longevity-200gb-48h-verify-limited--monitor-node-f296d884-1 scylla-manager[30171]: {"L":"INFO","T":"2023-07-20T10:21:42.125Z","N":"http","M":"POST /api/v1/cluster/ae818ceb-b9c2-405e-9d37-5021b423f487/tasks","from":"127.0.0.1:54542","status":500,"bytes":251,"duration":"1941ms","error":"create restore units: get CQL cluster session: gocql: unable to create session: unable to discover protocol version: authentication required (using \"org.apache.cassandra.auth.PasswordAuthenticator\")","_trace_id":"lZwtNmQ3QXqCUj1RrEgaEQ"}

Impact

Users that have authentication enabled on their clusters may be unable to use the restore feature.

How frequently does it reproduce?

Reproduces consistently over the last several runs.

Installation details

Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 2022.2.11-20230705.27d29485de90 with build-id f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b
Client version: 3.1.2-0.20230704.bd349aa4
Server version: 3.1.2-0.20230704.bd349aa4

Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0e981bde054209883 (aws: eu-west-1)

Test: longevity-200gb-48h-test_restore-nemesis
Test id: f296d884-316b-41b4-9406-a9090ec196ea
Test name: scylla-staging/Shlomo/longevity-200gb-48h-test_restore-nemesis
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor f296d884-316b-41b4-9406-a9090ec196ea`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=f296d884-316b-41b4-9406-a9090ec196ea)
- Show all stored logs command: `$ hydra investigate show-logs f296d884-316b-41b4-9406-a9090ec196ea`

Logs:

- **db-cluster-f296d884.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/db-cluster-f296d884.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/db-cluster-f296d884.tar.gz)
- **sct-runner-events-f296d884.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/sct-runner-events-f296d884.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/sct-runner-events-f296d884.tar.gz)
- **sct-f296d884.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/sct-f296d884.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/sct-f296d884.log.tar.gz)
- **loader-set-f296d884.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/loader-set-f296d884.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/loader-set-f296d884.tar.gz)
- **monitor-set-f296d884.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/monitor-set-f296d884.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f296d884-316b-41b4-9406-a9090ec196ea/20230720_115224/monitor-set-f296d884.tar.gz)
- [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/Shlomo/job/longevity-200gb-48h-test_restore-nemesis/11/)
- [Argus](https://argus.scylladb.com/test/226c0f08-de6f-4d69-8f77-b01161019748/runs?additionalRuns[]=f296d884-316b-41b4-9406-a9090ec196ea)
karol-kokoszka commented 1 year ago

Hi @ShlomiBalalis, I double-checked the default Scylla configuration we use in our docker environment. From the 3.1 branch -> https://github.com/scylladb/scylla-manager/blob/bd349aa44d9bcd8352c15bfcf9c66a6ee5bb4534/testing/scylla/config/scylla.yaml#L222

`authenticator: PasswordAuthenticator` is enabled, and all our tests are executed against this configuration.

The Makefile copies this config file https://github.com/scylladb/scylla-manager/blob/bd349aa44d9bcd8352c15bfcf9c66a6ee5bb4534/testing/Makefile#L33 and docker-compose binds it into the container https://github.com/scylladb/scylla-manager/blob/bd349aa44d9bcd8352c15bfcf9c66a6ee5bb4534/testing/docker-compose.yaml#L35-L37

karol-kokoszka commented 1 year ago

@ShlomiBalalis OK, I found the reason. Whenever SM wants to create a CQL session, it first calls Scylla to get the node info. This call returns the scylla.yaml configuration values.

One of the checks validates whether CQL password protection is enabled. https://github.com/scylladb/scylla-manager/blob/22d7e33905c3fd91514619a039f7f634bbc94616/pkg/scyllaclient/config_client.go#L153-L162

To determine whether authentication is enabled, we compare the Scylla endpoint payload to the string `PasswordAuthenticator`. This is perfectly fine according to https://opensource.docs.scylladb.com/stable/operating-scylla/security/authentication.html#procedure

authenticator: PasswordAuthenticator

I realized that in your test, you use

authenticator: org.apache.cassandra.auth.PasswordAuthenticator

(see db-cluster-f296d884.tar.gz /db-cluster-f296d884/longevity-200gb-48h-verify-limited--db-node-f296d884-1/scylla.yaml)

You must change it to just

authenticator: PasswordAuthenticator

... to make it work.

Feel free to check, and close the issue if the config change solves it (I expect it will).

dkropachev commented 1 year ago

@karol-kokoszka, I think we need to take care of this in Scylla Manager itself, since we refer to Cassandra-style authenticator names in some of our own docs: https://github.com/scylladb/scylladb/blob/37ceef23a6877748379a76ac2c6462553275ab36/conf/scylla.yaml#L232C1-L243C39

Some of our customers who have migrated from Cassandra will also face this issue.