scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

(ipv6) Manager repair able to run on encrypted cluster without having cert file #3158

Open ShlomiBalalis opened 2 years ago

ShlomiBalalis commented 2 years ago

The last test function in the sanity test goes as follows:

  1. First, we create a repair task (using the fail-fast flag)
  2. We then enable client encryption in the cluster (without updating the manager)
  3. We wait for the repair to fail
  4. The manager is updated with the cert files
  5. We start the repair again
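The five steps above can be sketched as a small Python driver. This is only an illustration of the ordering: `ClusterOps` and every callable in it are hypothetical stand-ins for whatever helpers the test framework provides, not real SCT or sctool APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClusterOps:
    """Hypothetical hooks; a real test would wire these to its own helpers."""
    create_repair_task: Callable[[], str]         # returns a task id such as "repair/<uuid>"
    enable_client_encryption: Callable[[], None]
    wait_for_status: Callable[[str], str]         # blocks, then returns the final status
    update_manager_certs: Callable[[], None]
    start_task: Callable[[str], None]


def run_encryption_scenario(ops: ClusterOps) -> List[str]:
    """Run the five steps in order and return the two statuses observed."""
    task_id = ops.create_repair_task()            # 1. repair task with --fail-fast
    ops.enable_client_encryption()                # 2. the manager is NOT updated yet
    first = ops.wait_for_status(task_id)          # 3. expected to end in ERROR
    ops.update_manager_certs()                    # 4. give the manager the cert files
    ops.start_task(task_id)                       # 5. start the repair again
    second = ops.wait_for_status(task_id)
    return [first, second]
```

With this shape, the behaviour reported in this issue would show up as the first status being `DONE` instead of `ERROR`.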

Here is a proper working run in the ipv6 sanity of 2.6:

sctool repair -c 39bc1f68-6fa6-4b54-93ad-802d6ef58061 --fail-fast
repair/7b236f58-abb8-47a0-8032-73257ca8f8be

After we enable client encryption throughout the cluster:

sudo sctool task progress repair/7b236f58-abb8-47a0-8032-73257ca8f8be -c 39bc1f68-6fa6-4b54-93ad-802d6ef58061

Arguments:        --fail-fast
Status:           ERROR
Cause:            repair 768 token ranges out of 9984
Start time:       09 May 22 20:33:45 UTC
End time:         09 May 22 20:35:09 UTC
Duration:         1m24s
Progress:         73%/9%
Datacenters:      
  - eu-west

╭─────────────────────────┬─────────────────────────────┬──────────┬──────────╮
│ Keyspace                │                       Table │ Progress │ Duration │
├─────────────────────────┼─────────────────────────────┼──────────┼──────────┤
│ keyspace1               │                   standard1 │ 0%       │ 0s       │
├─────────────────────────┼─────────────────────────────┼──────────┼──────────┤
│ simplestrategy_keyspace │               example_table │ 100%     │ 0s       │
├─────────────────────────┼─────────────────────────────┼──────────┼──────────┤
│ system_auth             │                role_members │ 100%     │ 0s       │
│ system_auth             │                       roles │ 100%     │ 1s       │
├─────────────────────────┼─────────────────────────────┼──────────┼──────────┤
│ system_distributed      │ cdc_generation_descriptions │ 100%     │ 1s       │
│ system_distributed      │   cdc_generation_timestamps │ 100%     │ 10s      │
│ system_distributed      │ cdc_streams_descriptions_v2 │ 0%/100%  │ 11s      │
│ system_distributed      │           view_build_status │ 0%       │ 0s       │
├─────────────────────────┼─────────────────────────────┼──────────┼──────────┤
│ system_traces           │                      events │ 100%     │ 0s       │
│ system_traces           │               node_slow_log │ 100%     │ 0s       │
│ system_traces           │      node_slow_log_time_idx │ 100%     │ 0s       │
│ system_traces           │                    sessions │ 100%     │ 0s       │
│ system_traces           │           sessions_time_idx │ 100%     │ 0s       │
╰─────────────────────────┴─────────────────────────────┴──────────┴──────────╯
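When comparing runs like the one above, it can help to pull the header fields out of the pasted output programmatically. The following is a minimal sketch that assumes the header is a sequence of `Key: value` lines ending at the first table border, as in the output shown here; the function names are made up for illustration and are not part of sctool.

```python
import re
from typing import Dict, Optional, Tuple


def parse_progress_header(text: str) -> Dict[str, str]:
    """Collect the 'Key: value' lines that precede the progress table."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if line.startswith("╭"):          # top border of the table: header is over
            break
        key, sep, value = line.partition(":")
        # Skip blank lines and indented list items such as "  - eu-west".
        if sep and key.strip() and not key.startswith((" ", "-")):
            fields[key.strip()] = value.strip()
    return fields


def cause_ranges(cause: str) -> Optional[Tuple[int, int]]:
    """Return the two numbers from a cause like 'repair 768 token ranges out of 9984'."""
    match = re.search(r"(\d+) token ranges out of (\d+)", cause)
    return (int(match.group(1)), int(match.group(2))) if match else None
```

A test could then assert on `fields["Status"]` rather than matching the raw text.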

(Logs of the run):

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                             Log links for testrun with test id 9281ece4-2f09-4448-b1e7-2a47ba580e09                                             |
+-----------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+
| Date            | Log type    | Link                                                                                                                            |
+-----------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+
| 20220509_204209 | db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/9281ece4-2f09-4448-b1e7-2a47ba580e09/20220509_204209/db-cluster-9281ece4.tar.gz  |
| 20220509_204209 | loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/9281ece4-2f09-4448-b1e7-2a47ba580e09/20220509_204209/loader-set-9281ece4.tar.gz  |
| 20220509_204209 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/9281ece4-2f09-4448-b1e7-2a47ba580e09/20220509_204209/monitor-set-9281ece4.tar.gz |
| 20220509_204209 | sct         | https://cloudius-jenkins-test.s3.amazonaws.com/9281ece4-2f09-4448-b1e7-2a47ba580e09/20220509_204209/sct-runner-9281ece4.tar.gz  |
+-----------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+

Oddly enough, in 3.0 (where the test uses the same data size) the repair seemingly finishes before the encryption is activated:

Run:            876ea6dd-d175-11ec-b915-0a19f20d9d03
Status:         DONE
Start time:     11 May 22 21:58:45 UTC
End time:       11 May 22 21:59:00 UTC
Duration:       15s
Progress:       100%
Datacenters:    
  - eu-west

╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
│ Keyspace                      │                          Table │ Progress │ Duration │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ keyspace1                     │                      standard1 │ 100%     │ 6s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ simplestrategy_keyspace       │                  example_table │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_auth                   │                role_attributes │ 100%     │ 1s       │
│ system_auth                   │                   role_members │ 100%     │ 1s       │
│ system_auth                   │                          roles │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 100%     │ 1s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed            │      cdc_generation_timestamps │ 100%     │ 1s       │
│ system_distributed            │    cdc_streams_descriptions_v2 │ 100%     │ 1s       │
│ system_distributed            │                 service_levels │ 100%     │ 1s       │
│ system_distributed            │              view_build_status │ 100%     │ 1s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_traces                 │                         events │ 100%     │ 0s       │
│ system_traces                 │                  node_slow_log │ 100%     │ 0s       │
│ system_traces                 │         node_slow_log_time_idx │ 100%     │ 0s       │
│ system_traces                 │                       sessions │ 100%     │ 0s       │
│ system_traces                 │              sessions_time_idx │ 100%     │ 0s       │
╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯

So I created a manual scenario where we create the task before the encryption is enabled, but only start it after the encryption is on:

< t:2022-05-09 17:27:03,592 f:cli.py l:1056 c:sdcm.mgmt.cli p:DEBUG > Issuing: 'sctool repair -c c4f0ef1c-8a0e-4e3b-beec-5c91738c18af --fail-fast --cron '35 17 * * *' '
repair/00cbc57d-d92a-476b-90af-96ff8a9d7a9c

At this point we activate the encryption across the cluster (see the excerpt from a node's log below), and then start the repair:

2022-05-09T17:28:47+00:00 manager-regression-fix-ipv6-db-node-272aa238-1 !    INFO |  [shard 0] cql_server_controller - Enabling encrypted CQL connections between client and server
2022-05-09T17:28:47+00:00 manager-regression-fix-ipv6-db-node-272aa238-1 !    INFO |  [shard 0] cql_server_controller - Enabling encrypted CQL connections between client and server
2022-05-09T17:28:47+00:00 manager-regression-fix-ipv6-db-node-272aa238-1 !    INFO |  [shard 0] cql_server_controller - Starting listening for CQL clients on [2a05:d018:eb8:3100:dd76:1cb7:6ea:e48e]:9042 (encrypted, non-shard-aware)
2022-05-09T17:28:47+00:00 manager-regression-fix-ipv6-db-node-272aa238-1 !    INFO |  [shard 0] cql_server_controller - Starting listening for CQL clients on [2a05:d018:eb8:3100:dd76:1cb7:6ea:e48e]:9042 (encrypted, non-shard-aware)
2022-05-09T17:28:47+00:00 manager-regression-fix-ipv6-db-node-272aa238-1 !    INFO |  [shard 0] cql_server_controller - Starting listening for CQL clients on [2a05:d018:eb8:3100:dd76:1cb7:6ea:e48e]:19042 (encrypted, shard-aware)
2022-05-09T17:28:47+00:00 manager-regression-fix-ipv6-db-node-272aa238-1 !    INFO |  [shard 0] cql_server_controller - Starting listening for CQL clients on [2a05:d018:eb8:3100:dd76:1cb7:6ea:e48e]:19042 (encrypted, shard-aware)
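To check the same thing on every node without reading the logs by eye, the "Starting listening" lines can be parsed. This is a rough sketch that assumes the bracketed `[address]:port (mode, sharding)` form shown in the lines above; other log formats are not handled, and `CqlListener` and `parse_listener` are illustrative names only.

```python
import ipaddress
import re
from typing import NamedTuple, Optional

# Matches only the bracketed address form seen in the log excerpt above.
_LISTEN_RE = re.compile(
    r"Starting listening for CQL clients on "
    r"\[(?P<addr>[^\]]+)\]:(?P<port>\d+) "
    r"\((?P<enc>[\w-]+), (?P<shard>[\w-]+)\)"
)


class CqlListener(NamedTuple):
    address: str
    ip_version: int
    port: int
    encrypted: bool      # True only when the log says exactly "encrypted"
    shard_aware: bool    # True only when the log says exactly "shard-aware"


def parse_listener(line: str) -> Optional[CqlListener]:
    """Return listener details from one log line, or None if it does not match."""
    match = _LISTEN_RE.search(line)
    if match is None:
        return None
    addr = match.group("addr")
    return CqlListener(
        address=addr,
        ip_version=ipaddress.ip_address(addr).version,  # raises ValueError on a bad address
        port=int(match.group("port")),
        encrypted=match.group("enc") == "encrypted",
        shard_aware=match.group("shard") == "shard-aware",
    )
```

Applied to the excerpt above, this would report ports 9042 and 19042 as encrypted on the IPv6 address.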

< t:2022-05-09 17:34:08,795 f:cli.py l:1056 c:sdcm.mgmt.cli p:DEBUG > Issuing: 'sctool start repair/00cbc57d-d92a-476b-90af-96ff8a9d7a9c -c c4f0ef1c-8a0e-4e3b-beec-5c91738c18af'

< t:2022-05-09 17:34:32,635 f:cli.py          l:1056 c:sdcm.mgmt.cli        p:DEBUG > Issuing: 'sctool  -c c4f0ef1c-8a0e-4e3b-beec-5c91738c18af progress repair/00cbc57d-d92a-476b-90af-96ff8a9d7a9c'
...
Run:            3b6a9637-cfbe-11ec-8942-0a71c06d7417
Status:         DONE
Start time:     09 May 22 17:34:08 UTC
End time:       09 May 22 17:34:22 UTC
Duration:       13s
Progress:       100%
Datacenters:    
  - eu-west

╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
│ Keyspace                      │                          Table │ Progress │ Duration │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ keyspace1                     │                      standard1 │ 100%     │ 6s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ simplestrategy_keyspace       │                  example_table │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_auth                   │                role_attributes │ 100%     │ 0s       │
│ system_auth                   │                   role_members │ 100%     │ 1s       │
│ system_auth                   │                          roles │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 100%     │ 1s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed            │      cdc_generation_timestamps │ 100%     │ 0s       │
│ system_distributed            │    cdc_streams_descriptions_v2 │ 100%     │ 0s       │
│ system_distributed            │                 service_levels │ 100%     │ 1s       │
│ system_distributed            │              view_build_status │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_traces                 │                         events │ 100%     │ 0s       │
│ system_traces                 │                  node_slow_log │ 100%     │ 0s       │
│ system_traces                 │         node_slow_log_time_idx │ 100%     │ 0s       │
│ system_traces                 │                       sessions │ 100%     │ 0s       │
│ system_traces                 │              sessions_time_idx │ 100%     │ 0s       │
╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯

How can the repair run when the client encryption is on?

Logs:

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                             Log links for testrun with test id 272aa238-a063-413c-ac09-453b9597d5f4                                             |
+-----------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+
| Date            | Log type    | Link                                                                                                                            |
+-----------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+
| 20220509_173902 | db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/272aa238-a063-413c-ac09-453b9597d5f4/20220509_173902/db-cluster-272aa238.tar.gz  |
| 20220509_173902 | loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/272aa238-a063-413c-ac09-453b9597d5f4/20220509_173902/loader-set-272aa238.tar.gz  |
| 20220509_173902 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/272aa238-a063-413c-ac09-453b9597d5f4/20220509_173902/monitor-set-272aa238.tar.gz |
| 20220509_173902 | sct         | https://cloudius-jenkins-test.s3.amazonaws.com/272aa238-a063-413c-ac09-453b9597d5f4/20220509_173902/sct-runner-272aa238.tar.gz  |
+-----------------+-------------+---------------------------------------------------------------------------------------------------------------------------------+

It's important to note that the repair only succeeds in the ipv6 sanity. In every other sanity the repair still fails, as expected. An example from the ipv4 centos sanity of 3.0:

sctool output:

Run:              8233f648-d0a3-11ec-b945-02b2f266c52d
Status:           ERROR
Cause:            repair 1536 token ranges out of 11520
Start time:       10 May 22 20:55:22 UTC
End time:         10 May 22 21:00:26 UTC
Duration:         5m4s
Progress:         31%/15%
Datacenters:      
  - us-eastscylla_node_east
  - us-west-2scylla_node_west

╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
│ Keyspace                      │                          Table │ Progress │ Duration │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ keyspace1                     │                      standard1 │ 0%       │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ simplestrategy_keyspace       │                  example_table │ 100%     │ 44s      │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_auth                   │                role_attributes │ 0%/100%  │ 40s      │
│ system_auth                   │                   role_members │ 0%/100%  │ 1m1s     │
│ system_auth                   │                          roles │ 0%       │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 0%       │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed            │      cdc_generation_timestamps │ 0%       │ 0s       │
│ system_distributed            │    cdc_streams_descriptions_v2 │ 0%       │ 0s       │
│ system_distributed            │                 service_levels │ 0%       │ 0s       │
│ system_distributed            │              view_build_status │ 0%       │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_traces                 │                         events │ 100%     │ 25s      │
│ system_traces                 │                  node_slow_log │ 100%     │ 44s      │
│ system_traces                 │         node_slow_log_time_idx │ 100%     │ 45s      │
│ system_traces                 │                       sessions │ 100%     │ 42s      │
│ system_traces                 │              sessions_time_idx │ 100%     │ 22s      │
╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯

ShlomiBalalis commented 2 years ago

Also, since the repair takes a lot less time to run as of 3.0, I ran the same scenario with a larger data size, so that the repair would again take about 2 minutes (roughly the same time as it did in the 2.6 run). The repair succeeds all the same:

< t:2022-05-10 13:54:40,527 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool repair -c 0e90fbc1-a3b1-4482-97eb-1e76b6f967f0 --fail-fast" finished with status 0
repair/a5725013-7b3d-43ce-8dd9-d3776d8b0999

After the encryption was activated:

< t:2022-05-10 14:01:50,853 f:cli.py          l:1056 c:sdcm.mgmt.cli        p:DEBUG > Issuing: 'sctool  -c 0e90fbc1-a3b1-4482-97eb-1e76b6f967f0 progress repair/a5725013-7b3d-43ce-8dd9-d3776d8b0999'

Run:              bc8ea0de-d068-11ec-b262-0a4fbbd9ca01
Status:           DONE
Start time:       10 May 22 13:54:40 UTC
End time:         10 May 22 13:56:41 UTC
Duration:         2m1s
Progress:         100%
Datacenters:      
  - eu-west

╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
│ Keyspace                      │                          Table │ Progress │ Duration │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ keyspace1                     │                      standard1 │ 100%     │ 19s      │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ simplestrategy_keyspace       │                  example_table │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_auth                   │                role_attributes │ 100%     │ 0s       │
│ system_auth                   │                   role_members │ 100%     │ 0s       │
│ system_auth                   │                          roles │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 100%     │ 1s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_distributed            │      cdc_generation_timestamps │ 100%     │ 1s       │
│ system_distributed            │    cdc_streams_descriptions_v2 │ 100%     │ 0s       │
│ system_distributed            │                 service_levels │ 100%     │ 0s       │
│ system_distributed            │              view_build_status │ 100%     │ 0s       │
├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
│ system_traces                 │                         events │ 100%     │ 0s       │
│ system_traces                 │                  node_slow_log │ 100%     │ 0s       │
│ system_traces                 │         node_slow_log_time_idx │ 100%     │ 0s       │
│ system_traces                 │                       sessions │ 100%     │ 0s       │
│ system_traces                 │              sessions_time_idx │ 100%     │ 0s       │
╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯