scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Restore task of a Kubernetes cluster fails: `restore data: set gc_grace_seconds: Cannot ALTER <table system_distributed_everywhere.cdc_generation_descriptions_v2>` #3347

Closed: ShlomiBalalis closed this issue 1 year ago

ShlomiBalalis commented 1 year ago

Client version: 3.1.0-rc0-0.20230327.e0369c82
Server version: 3.1.0-rc0-0.20230327.e0369c82

At first, the restore ran just fine, until it tried to restore the system_distributed_everywhere keyspace; at that point the task failed with the following:

< t:2023-04-02 00:13:05,655 f:base.py         l:142  c:KubernetesCmdRunner  p:DEBUG > Command "sctool  -c ca96f672-b48f-4c66-b563-bf36b17ab897 progress restore/09c0491e-03f5-443b-af51-b204fa790ec7" finished with status 0
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               a1d18c85-d0ea-11ed-bfa5-b63ba1aede39
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Status:           ERROR
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Cause:            restore data: set gc_grace_seconds: Cannot ALTER <table system_distributed_everywhere.cdc_generation_descriptions_v2>
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Start time:       02 Apr 23 00:09:27 UTC
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > End time: 02 Apr 23 00:12:55 UTC
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Duration: 3m28s
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Progress: 33% | 33%
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20230401235852UTC
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > +-------------------------------+-----------+---------+---------+------------+--------+
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > | Keyspace                      |  Progress |    Size | Success | Downloaded | Failed |
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > +-------------------------------+-----------+---------+---------+------------+--------+
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > | system_traces                 |      100% |       0 |       0 |          0 |      0 |
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > | audit                         |      100% |       0 |       0 |          0 |      0 |
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > | keyspace1                     | 33% | 33% | 13.960G |  4.653G |     4.653G |      0 |
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > | system_distributed_everywhere |   0% | 0% |  1.115M |       0 |          0 |      0 |
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > | system_distributed            |   0% | 0% |  1.119M |       0 |          0 |      0 |
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > | system_auth                   |   0% | 0% | 15.622k |       0 |          0 |      0 |
< t:2023-04-02 00:13:05,655 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > +-------------------------------+-----------+---------+---------+------------+--------+

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_001455/grafana-screenshot-longevity-scylla-operator-3h-eks-backup_restore-scylla-per-server-metrics-nemesis-20230402_001608-longevity-scylla-operator-3h-restor-monitor-node-2d5d40d0-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_001455/grafana-screenshot-overview-20230402_001455-longevity-scylla-operator-3h-restor-monitor-node-2d5d40d0-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_002346/db-cluster-2d5d40d0.tar.gz
kubernetes - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_002346/kubernetes-2d5d40d0.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_002346/loader-set-2d5d40d0.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_002346/monitor-set-2d5d40d0.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_002346/sct-2d5d40d0.log.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/2d5d40d0-6c48-47ef-8046-efd1a23f2b0a/20230402_002346/sct-runner-events-2d5d40d0.tar.gz

tzach commented 1 year ago

@ShlomiBalalis, this does not look K8s-related to me. Can you reproduce without K8s?

ShlomiBalalis commented 1 year ago

It's true that this keyspace doesn't seem to be specifically related to k8s, but it did not appear in any other run that we had (i.e. it was not reproduced without k8s), so there does seem to be some connection to it.

vponomaryov commented 1 year ago

It's true that this keyspace doesn't seem to be specifically related to k8s, but it did not appear in any other run that we had (i.e. it was not reproduced without k8s), so there does seem to be some connection to it.

The Operator uses separate, mostly self-handled scylla.yaml options. So I think it is worth comparing the Scylla config and trying out the same one, or a very close one, in a VM setup.

Michal-Leszczynski commented 1 year ago

It's true that this keyspace doesn't seem to be specifically related to k8s, but it did not appear in any other run that we had (i.e. it was not reproduced without k8s), so there does seem to be some connection to it.

I guess that in the other runs the cdc_generation_descriptions_v2 table was empty, so it wasn't backed up/restored and didn't cause any problems.

I think that it's not related to k8s, but to the fact that SM needs to alter the restored tables (set gc_grace_seconds) in order to prevent data resurrection. In this case, it looks like the table system_distributed_everywhere.cdc_generation_descriptions_v2 cannot be altered (similar to system_schema.* tables), so SM has to provide a workaround for that.
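
For context, the failing step can be reproduced manually outside of SM with cqlsh. This is a minimal sketch; the node address is borrowed from the scylla.yaml quoted later in this thread, keyspace1.standard1 is assumed to be the stress table, and the exact gc_grace_seconds value SM sets is an assumption:

# A plain ALTER of gc_grace_seconds succeeds on a regular user table:
cqlsh 10.12.2.17 -e "ALTER TABLE keyspace1.standard1 WITH gc_grace_seconds = 0"
# The same statement against the CDC description table is rejected with:
# Cannot ALTER <table system_distributed_everywhere.cdc_generation_descriptions_v2>
cqlsh 10.12.2.17 -e "ALTER TABLE system_distributed_everywhere.cdc_generation_descriptions_v2 WITH gc_grace_seconds = 0"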

Restoring system_schema.* tables uses the workaround described here. I am not sure if this approach is suitable for system_distributed_everywhere.cdc_generation_descriptions_v2, as it introduces great data duplication (which is not that big of an issue for schema files, because they shouldn't be big anyway). I will try to get more information about this case.

Moreover, a similar problem should exist for all backed-up internal tables which cannot be altered. I will try to get a complete list of those tables and see what can be done about them (if anyone knows where to look for them, I would be grateful).

Michal-Leszczynski commented 1 year ago

@tzach just to make sure, it is important for us to back up and restore CDC, right?

Michal-Leszczynski commented 1 year ago

From what I can see, these are the problematic tables:

system_distributed.cdc_generation_timestamps - Cannot ALTER <table system_distributed.cdc_generation_timestamps>
system_distributed.cdc_streams_descriptions_v2 - Cannot ALTER <table system_distributed.cdc_streams_descriptions_v2>
system_distributed_everywhere.cdc_generation_descriptions_v2 - Cannot ALTER <table system_distributed_everywhere.cdc_generation_descriptions_v2>

Generally there are more tables that SM cannot alter, but they are either from system_schema (which has a workaround) or from system (which is not backed up in the first place).
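
To enumerate candidates, one can list everything under the internal distributed keyspaces and try the ALTER against each. A rough sketch (which tables reject ALTER is version-dependent, so this only narrows the search; the node address is a placeholder):

# List all tables in the two internal distributed keyspaces:
cqlsh 10.12.2.17 -e "SELECT keyspace_name, table_name FROM system_schema.tables WHERE keyspace_name IN ('system_distributed', 'system_distributed_everywhere')"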

ShlomiBalalis commented 1 year ago

The issue was reproduced in a scenario that uses a regular (non-K8s) cluster:
Client version: 3.1.0-rc0-0.20230403.21e14177
Server version: 3.1.0-rc0-0.20230403.21e14177

Scylla version: 5.1.7-0.20230312.5c5a9633eab8 with build-id c1e264fdf53efb6a13a38f915ca3bf37ca66eebf

The restore failed when I tried to restore a previously created snapshot, containing one keyspace (10gb_sizetiered):

< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               16f11be0-d2ea-11ed-a4cb-025121dd0539
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Status:           ERROR
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Cause:            restore data: disable gc_grace_seconds: Cannot ALTER <table system_distributed.cdc_streams_descriptions_v2>
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Start time:       04 Apr 23 13:10:36 UTC
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > End time: 04 Apr 23 13:10:38 UTC
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Duration: 2s
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Progress: 0% | 0%
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20230223105105UTC
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────────────────────────┬──────────┬──────────┬─────────┬────────────┬────────╮
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace                      │ Progress │     Size │ Success │ Downloaded │ Failed │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────────────┼──────────┼──────────┼─────────┼────────────┼────────┤
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ 10gb_sizetiered               │  0% | 0% │  34.250G │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed            │  0% | 0% │ 872.271k │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed_everywhere │  0% | 0% │ 862.076k │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth                   │  0% | 0% │  16.102k │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces                 │     100% │        0 │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────────────────────────┴──────────┴──────────┴─────────┴────────────┴────────╯

Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/178e72ae-d2e2-4e0b-a207-9a260ba5c972/20230404_131154/db-cluster-178e72ae.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/178e72ae-d2e2-4e0b-a207-9a260ba5c972/20230404_131154/loader-set-178e72ae.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/178e72ae-d2e2-4e0b-a207-9a260ba5c972/20230404_131154/monitor-set-178e72ae.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/178e72ae-d2e2-4e0b-a207-9a260ba5c972/20230404_131154/sct-runner-178e72ae.tar.gz

Build URL

The thing is, previously the exact same restore worked perfectly:
Client version: 3.1.0-rc0-0.20230306.e99afd24
Server version: 3.1.0-rc0-0.20230306.e99afd24

Scylla version: 5.1.6-0.20230223.530600a64674 with build-id c38178ef635d4ac7f67a8a3ce666056c81f5afb9

< t:2023-03-09 02:52:34,224 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool  -c a090ed57-c818-4135-80ed-703006a055d1 progress restore/41f26e49-2ae5-48f7-ad02-ec044d09a37e"...
< t:2023-03-09 02:52:34,390 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Run:            de0269bd-be24-11ed-8a81-02f9a7a920bb
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status:         DONE - repair required (see restore docs)
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Start time:     09 Mar 23 02:48:27 UTC
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > End time:       09 Mar 23 02:52:27 UTC
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Duration:       4m0s
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Progress:       100% | 100%
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag:   sm_20230223105105UTC
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╭─────────────────┬─────────────┬─────────┬─────────┬────────────┬──────────────┬────────╮
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ Keyspace        │    Progress │    Size │ Success │ Downloaded │ Deduplicated │ Failed │
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ├─────────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼────────┤
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10gb_sizetiered │ 100% | 100% │ 34.250G │ 34.250G │    34.250G │            0 │      0 │
< t:2023-03-09 02:52:34,391 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╰─────────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴────────╯

Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/328ba86f-699f-4a60-8bc2-7538e427f087/20230310_072216/db-cluster-328ba86f.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/328ba86f-699f-4a60-8bc2-7538e427f087/20230310_072216/loader-set-328ba86f.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/328ba86f-699f-4a60-8bc2-7538e427f087/20230310_072216/monitor-set-328ba86f.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/328ba86f-699f-4a60-8bc2-7538e427f087/20230310_072216/sct-runner-328ba86f.tar.gz

roydahan commented 1 year ago

@ShlomiBalalis compare the scylla.yaml you had in both runs; the answer is probably there. The problem is with CDC system tables.
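
A quick way to compare them straight from the SCT db-cluster archives linked above; a sketch assuming GNU tar and a single scylla.yaml per archive (the member paths inside the tarballs are a guess):

# Extract scylla.yaml from each run's archive to stdout and diff them:
diff \
  <(tar -xzOf db-cluster-178e72ae.tar.gz --wildcards '*/scylla.yaml') \
  <(tar -xzOf db-cluster-328ba86f.tar.gz --wildcards '*/scylla.yaml')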

ShlomiBalalis commented 1 year ago

@ShlomiBalalis compare the scylla.yaml you had in both runs; the answer is probably there. The problem is with CDC system tables.

There is no difference in the scylla.yaml between the successful run and the failing run. The scylla.yaml of the failing run:

api_address: 127.0.0.1
api_doc_dir: /opt/scylladb/api/api-doc/
api_ui_dir: /opt/scylladb/swagger-ui/dist/
batch_size_fail_threshold_in_kb: 1024
batch_size_warn_threshold_in_kb: 128
broadcast_rpc_address: 10.12.2.17
cluster_name: manager-regression-manager--db-cluster-178e72ae
commitlog_segment_size_in_mb: 32
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
endpoint_snitch: org.apache.cassandra.locator.Ec2Snitch
experimental: true
listen_address: 10.12.2.17
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
prometheus_address: 0.0.0.0
rpc_address: 10.12.2.17
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
  - seeds: 10.12.2.17

The scylla.yaml of the successful run:

api_address: 127.0.0.1
api_doc_dir: /opt/scylladb/api/api-doc/
api_ui_dir: /opt/scylladb/swagger-ui/dist/
batch_size_fail_threshold_in_kb: 1024
batch_size_warn_threshold_in_kb: 128
broadcast_rpc_address: 10.12.2.177
cluster_name: manager-regression-manager--db-cluster-328ba86f
commitlog_segment_size_in_mb: 32
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
endpoint_snitch: org.apache.cassandra.locator.Ec2Snitch
experimental: true
listen_address: 10.12.2.177
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
prometheus_address: 0.0.0.0
rpc_address: 10.12.2.177
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
  - seeds: 10.12.2.177

Looking at the schema restore progress output of the successful run, the problematic keyspaces were not even included/specified:

< t:2023-03-09 02:37:19,853 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool  -c a090ed57-c818-4135-80ed-703006a055d1 progress restore/23a27e6e-b187-4fa7-9eae-697e902be1de" finished with status 0
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               38f75a27-be23-11ed-8a80-02f9a7a920bb
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Status:           DONE - restart required (see restore docs)
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Start time:       09 Mar 23 02:36:40 UTC
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > End time: 09 Mar 23 02:36:52 UTC
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Duration: 11s
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Progress: 100% | 100%
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20230223105105UTC
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────────┬─────────────┬──────────┬──────────┬────────────┬──────────────┬────────╮
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace      │    Progress │     Size │  Success │ Downloaded │ Deduplicated │ Failed │
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────┼─────────────┼──────────┼──────────┼────────────┼──────────────┼────────┤
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_schema │ 100% | 100% │ 319.279k │ 319.279k │   319.279k │            0 │      0 │
< t:2023-03-09 02:37:19,854 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────────┴─────────────┴──────────┴──────────┴────────────┴──────────────┴────────╯

And both runs restore the same snapshot, so nothing changed on that front. Thus, I don't see how CDC was supposedly involved in the mix.

karol-kokoszka commented 1 year ago

@ShlomiBalalis CDC tables don't come from the system_schema keyspace; they come from the system_distributed keyspace. And the system_distributed keyspace is part of the backup that you want to restore; it seems it's included in the backup manifest files.
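
This can be checked against the backup location directly, as the progress output below also shows; a sketch, assuming this sctool version supports the --show-tables flag of backup list (the cluster ID is a placeholder):

# Show which keyspaces/tables each snapshot manifest contains:
sctool backup list -c <cluster-id> --show-tables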

< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               16f11be0-d2ea-11ed-a4cb-025121dd0539
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Status:           ERROR
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Cause:            restore data: disable gc_grace_seconds: Cannot ALTER <table system_distributed.cdc_streams_descriptions_v2>
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Start time:       04 Apr 23 13:10:36 UTC
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > End time: 04 Apr 23 13:10:38 UTC
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Duration: 2s
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Progress: 0% | 0%
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20230223105105UTC
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────────────────────────┬──────────┬──────────┬─────────┬────────────┬────────╮
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace                      │ Progress │     Size │ Success │ Downloaded │ Failed │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────────────┼──────────┼──────────┼─────────┼────────────┼────────┤
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ 10gb_sizetiered               │  0% | 0% │  34.250G │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed            │  0% | 0% │ 872.271k │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed_everywhere │  0% | 0% │ 862.076k │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth                   │  0% | 0% │  16.102k │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces                 │     100% │        0 │       0 │          0 │      0 │
< t:2023-04-04 13:10:41,021 f:cli.py          l:1110 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────────────────────────┴──────────┴──────────┴─────────┴────────────┴────────╯

The error message from the task output is clear as well: Cannot ALTER <table system_distributed.cdc_streams_descriptions_v2>. It directly points to CDC tables.

The point is that these tables must be skipped at the Manager restore level, and the fix that skips these tables will come shortly. @Michal-Leszczynski is working on that.
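
Until that fix lands, a possible user-side workaround is to exclude the problematic keyspaces from the restore via the keyspace glob filter; a sketch, assuming sctool restore honors the same -K negation patterns as backup (cluster ID and location are placeholders):

# Restore everything except the non-alterable CDC keyspaces:
sctool restore -c <cluster-id> -L s3:<bucket> -T sm_20230223105105UTC \
  --restore-tables -K '*,!system_distributed,!system_distributed_everywhere'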

Can I restart this test by myself, somehow choosing the build hash of the Manager?

ShlomiBalalis commented 1 year ago

@ShlomiBalalis CDC tables don't come from the system_schema keyspace; they come from the system_distributed keyspace. And the system_distributed keyspace is part of the backup that you want to restore; it seems it's included in the backup manifest files.


The error message from the task output is clear as well: Cannot ALTER <table system_distributed.cdc_streams_descriptions_v2>. It directly points to CDC tables.

The point is that these tables must be skipped at the Manager restore level, and the fix that skips these tables will come shortly. @Michal-Leszczynski is working on that.

Can I restart this test by myself, somehow choosing the build hash of the Manager?

The test is here: https://jenkins.scylladb.com/job/manager-3.1/job/sct-feature-test-restore-multiple-snapshot-restoring/
Rebuild the last build and replace scylla_mgmt_address and scylla_mgmt_agent_address with the latest build's list files.