scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

After a successful schema and data restoration *to a different region*, the restored keyspace is completely empty #3525

Open ShlomiBalalis opened 1 year ago

ShlomiBalalis commented 1 year ago

Issue description

At 2023-08-14 13:27:09,663, we started two restore tasks that use a pre-created snapshot, which includes the keyspace 5gb_sizetiered_2022_1. First, a task to restore the schema:

< t:2023-08-14 13:27:13,826 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c a92d1307-4ac0-43df-874a-98667733d8ae --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230702201949UTC" finished with status 0
< t:2023-08-14 13:27:13,826 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > sctool output: restore/256d69cd-92e9-49d7-bed5-e82928acf970

The restore task has ended successfully:

< t:2023-08-14 13:28:17,197 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool  -c a92d1307-4ac0-43df-874a-98667733d8ae progress restore/256d69cd-92e9-49d7-bed5-e82928acf970" finished with status 0
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Restore progress
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Run:              4779b377-3aa6-11ee-a65d-0afbd2966d0b
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Status:           DONE - restart required (see restore docs)
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Start time:       14 Aug 23 13:27:13 UTC
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > End time: 14 Aug 23 13:28:07 UTC
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Duration: 54s
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Progress: 100% | 100%
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20230702201949UTC
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────────┬─────────────┬──────────┬──────────┬────────────┬────────╮
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace      │    Progress │     Size │  Success │ Downloaded │ Failed │
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────┼─────────────┼──────────┼──────────┼────────────┼────────┤
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_schema │ 100% | 100% │ 474.478k │ 474.478k │   474.478k │      0 │
< t:2023-08-14 13:28:17,197 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────────┴─────────────┴──────────┴──────────┴────────────┴────────╯

At that point, we restarted all of the nodes' Scylla services in the cluster, one by one:

< t:2023-08-14 13:28:18,164 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:28:18,539 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:28:18+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1   !NOTICE | sudo[13833]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:29:41,945 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:29:42,734 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:29:43,110 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:29:43+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1   !NOTICE | sudo[13875]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:29:47,335 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0
< t:2023-08-14 13:30:49,093 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:30:49,149 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:30:49+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-2   !NOTICE | sudo[11111]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:32:15,063 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:32:15,403 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:32:15,846 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:32:15+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-2   !NOTICE | sudo[11168]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:32:20,003 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0
< t:2023-08-14 13:33:21,198 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:33:21,638 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:33:21+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-3   !NOTICE | sudo[11148]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:34:46,992 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:34:47,310 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:34:47,687 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:34:47+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-3   !NOTICE | sudo[11199]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:34:51,981 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0
< t:2023-08-14 13:35:53,665 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:35:54,077 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:35:53+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-4   !NOTICE | sudo[11277]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:37:09,635 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:37:10,549 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:37:11,016 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > 2023-08-14T13:37:10+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-4   !NOTICE | sudo[11324]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:37:15,151 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0

Afterwards, we restored the data:

< t:2023-08-14 13:38:20,276 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c a92d1307-4ac0-43df-874a-98667733d8ae --restore-tables --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230702201949UTC" finished with status 0
< t:2023-08-14 13:38:20,282 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > sctool output: restore/ba67ff65-3170-4aa8-af74-efa2b694d89f

Which also passed:

< t:2023-08-14 13:44:08,432 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool  -c a92d1307-4ac0-43df-874a-98667733d8ae progress restore/ba67ff65-3170-4aa8-af74-efa2b694d89f" finished with status 0
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Restore progress
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Run:              d4fef9a2-3aa7-11ee-a65e-0afbd2966d0b
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Status:           DONE
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Start time:       14 Aug 23 13:38:20 UTC
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > End time: 14 Aug 23 13:43:47 UTC
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Duration: 5m27s
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Progress: 100% | 100%
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20230702201949UTC
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────────────────┬─────────────┬─────────┬─────────┬────────────┬────────╮
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace              │    Progress │    Size │ Success │ Downloaded │ Failed │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────┼─────────────┼─────────┼─────────┼────────────┼────────┤
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces         │        100% │       0 │       0 │          0 │      0 │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ 5gb_sizetiered_2022_1 │ 100% | 100% │ 17.133G │ 17.133G │    17.133G │      0 │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth           │ 100% | 100% │ 26.021k │ 26.021k │    26.021k │      0 │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed    │        100% │       0 │       0 │          0 │      0 │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ audit                 │        100% │       0 │       0 │          0 │      0 │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────────────────┴─────────────┴─────────┴─────────┴────────────┴────────╯
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Post-restore repair progress:
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Run:              d4fef9a2-3aa7-11ee-a65e-0afbd2966d0b
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Status:           DONE
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Start time:       14 Aug 23 13:38:20 UTC
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > End time: 14 Aug 23 13:43:47 UTC
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Duration: 5m27s
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Progress: 100%
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Datacenters:      
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG >   - eu-west
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╭────────────────────┬────────────────────────┬──────────┬──────────╮
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace           │                  Table │ Progress │ Duration │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├────────────────────┼────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth        │        role_attributes │ 100%     │ 4s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth        │           role_members │ 100%     │ 4s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth        │                  roles │ 100%     │ 4s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├────────────────────┼────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed │         service_levels │ 100%     │ 6s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed │      view_build_status │ 100%     │ 5s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├────────────────────┼────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces      │                 events │ 100%     │ 12s      │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces      │          node_slow_log │ 100%     │ 4s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces      │ node_slow_log_time_idx │ 100%     │ 2s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces      │               sessions │ 100%     │ 2s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces      │      sessions_time_idx │ 100%     │ 2s       │
< t:2023-08-14 13:44:08,432 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╰────────────────────┴────────────────────────┴──────────┴──────────╯

Afterwards, we also created a general repair task (since this test code had not yet been adjusted for the automatic post-restore repair):

< t:2023-08-14 13:44:11,117 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool repair -c a92d1307-4ac0-43df-874a-98667733d8ae" finished with status 0
< t:2023-08-14 13:44:11,117 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > sctool output: repair/9d82ba8f-053c-41f8-83dd-798d2e49bf4a

Which passed:

< t:2023-08-14 13:49:24,031 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool  -c a92d1307-4ac0-43df-874a-98667733d8ae progress repair/9d82ba8f-053c-41f8-83dd-798d2e49bf4a" finished with status 0
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               a5d19efb-3aa8-11ee-a661-0afbd2966d0b
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Status:           DONE
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Start time:       14 Aug 23 13:44:10 UTC
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > End time: 14 Aug 23 13:48:58 UTC
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Duration: 4m47s
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Progress: 100%
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > Datacenters:      
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG >   - eu-west
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace                      │                          Table │ Progress │ Duration │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ keyspace1                     │                      standard1 │ 100%     │ 3m39s    │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth                   │                role_attributes │ 100%     │ 1s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth                   │                   role_members │ 100%     │ 1s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth                   │               role_permissions │ 100%     │ 1s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_auth                   │                          roles │ 100%     │ 1s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 100%     │ 0s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed            │      cdc_generation_timestamps │ 100%     │ 5s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed            │    cdc_streams_descriptions_v2 │ 100%     │ 5s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed            │                 service_levels │ 100%     │ 6s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_distributed            │              view_build_status │ 100%     │ 5s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces                 │                         events │ 100%     │ 17s      │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces                 │                  node_slow_log │ 100%     │ 5s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces                 │         node_slow_log_time_idx │ 100%     │ 3s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces                 │                       sessions │ 100%     │ 3s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > │ system_traces                 │              sessions_time_idx │ 100%     │ 3s       │
< t:2023-08-14 13:49:24,036 f:cli.py          l:1122 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯

Then, we executed a cassandra-stress read command to validate the data, which was DOA:

< t:2023-08-14 13:50:12,080 f:stress_thread.py l:287  c:sdcm.stress_thread   p:INFO  > cassandra-stress read no-warmup cl=QUORUM n=5242880 -schema 'keyspace=5gb_sizetiered_2022_1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native   user=cassandra password=cassandra -rate threads=50 -col 'size=FIXED(64) n=FIXED(16)' -pop seq=1..5242880 -transport 'truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra' -node 10.4.3.146,10.4.0.171,10.4.0.236,10.4.0.248 -errors skip-unsupported-columns
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
WARN  13:50:19,052 Not using advanced port-based shard awareness with /10.4.0.171:9042 because we're missing port-based shard awareness port on the server
WARN  13:50:19,222 Not using advanced port-based shard awareness with /10.4.0.236:9042 because we're missing port-based shard awareness port on the server
java.io.IOException: Operation x10 on key(s) [4c4c3637324b38334f30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
Failed to connect over JMX; not collecting these stats

com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.Operation.error(Operation.java:141)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
java.io.IOException: Operation x10 on key(s) [343550504e4f30353430]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)

com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.Operation.error(Operation.java:141)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)

Looking into the data directories on the machines as well, it seems that they are completely empty:

scyllaadm@longevity-200gb-48h-verify-limited--db-node-84dfb4de-1:/var/lib/scylla/data$ ll 5gb_sizetiered_2022_1/standard1-e08b7420191411ee8ec98425b74f1f5d/
total 0
drwxr-xr-x 4 scylla scylla 47 Aug 14 13:29 ./
drwxr-xr-x 3 scylla scylla 64 Aug 14 13:29 ../
drwxr-xr-x 2 scylla scylla 10 Aug 14 13:29 staging/
drwxr-xr-x 2 scylla scylla 10 Aug 14 13:41 upload/

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2022.2.12-20230727.f4448d5b0265 with build-id a87bfeb65d24abf65d074a3ba2e5b9664692d716

Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0624755b4db06e567 (aws: eu-west-1)

Test: longevity-200gb-48h-test_restore-nemesis
Test id: 84dfb4de-0573-4a01-8806-8b832bcafd91
Test name: scylla-staging/Shlomo/longevity-200gb-48h-test_restore-nemesis
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 84dfb4de-0573-4a01-8806-8b832bcafd91`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=84dfb4de-0573-4a01-8806-8b832bcafd91)
- Show all stored logs command: `$ hydra investigate show-logs 84dfb4de-0573-4a01-8806-8b832bcafd91`

## Logs:

- **db-cluster-84dfb4de.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/db-cluster-84dfb4de.tar.gz
- **sct-runner-events-84dfb4de.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/sct-runner-events-84dfb4de.tar.gz
- **sct-84dfb4de.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/sct-84dfb4de.log.tar.gz
- **loader-set-84dfb4de.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/loader-set-84dfb4de.tar.gz
- **monitor-set-84dfb4de.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/monitor-set-84dfb4de.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/Shlomo/job/longevity-200gb-48h-test_restore-nemesis/16/)
[Argus](https://argus.scylladb.com/test/226c0f08-de6f-4d69-8f77-b01161019748/runs?additionalRuns[]=84dfb4de-0573-4a01-8806-8b832bcafd91)
mykaul commented 1 year ago

@ShlomiBalalis - where can I find the manager log, so we can see what was restored?

mykaul commented 1 year ago

Out of curiosity, why is it using LCS?

2023-08-14T14:03:55+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1     !INFO | scylla[13934]:  [shard 12] LeveledManifest - Leveled compaction strategy is restoring invariant of level 1 by compacting 2 sstables on behalf of keyspace1.standard1
ShlomiBalalis commented 1 year ago

@ShlomiBalalis - where can I find the manager log, so we can see what was restored?

The server log is in the monitor tarball; the agent logs are in the db node tarballs.

ShlomiBalalis commented 1 year ago

Out of curiosity, why is it using LCS?

2023-08-14T14:03:55+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1     !INFO | scylla[13934]:  [shard 12] LeveledManifest - Leveled compaction strategy is restoring invariant of level 1 by compacting 2 sstables on behalf of keyspace1.standard1

This is simply part of the longevity scenario, but it is not the problematic keyspace anyway.

Michal-Leszczynski commented 1 year ago

So the logs show that some data has actually been downloaded and loaded into the cluster. The problem is that both the automatic and the manual repair (still present in this test scenario) didn't repair the restored table.

So right now I'm checking whether it's a restore or a repair problem (the tested version of SM does not contain the repair refactor, so this is not connected to those changes).

ShlomiBalalis commented 1 year ago

Leading theory: I tried to restore the keyspace manually, the old-fashioned way: downloading the sstables and refreshing. We noticed something odd. At first, I tried to query the keyspace right after the restore, and it consistently failed:

cassandra@cqlsh:5gb_sizetiered_2022_1> select * from standard1;
NoHostAvailable: 

Then, I tried to change the replication factor of the keyspace, and noticed that while the region of the cluster under test is eu-west:

$ nodetool status

Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.3.146  52.54 GB   256          ?       c37bdb3d-7a3b-477a-b4fb-a4a98684a2c5  1a
UN  10.4.0.236  49.08 GB   256          ?       5d5bd234-9aec-4146-b3cb-b8e2e1729fa4  1a
UN  10.4.0.248  44.61 GB   256          ?       2747eaee-4803-490a-ad78-03467dd1f7cc  1a
UN  10.4.0.171  44.33 GB   256          ?       1434be9a-d258-4dd2-9579-2cec850786c1  1a

The keyspace was set to replicate in us-east, which is probably the region of the originally backed up cluster:

cassandra@cqlsh> SELECT * FROM system_schema.keyspaces;

 keyspace_name                 | durable_writes | replication
-------------------------------+----------------+-------------------------------------------------------------------------------------
                   system_auth |           True |   {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '4'}
                 system_schema |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
                     keyspace1 |           True |   {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'}
            system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
                        system |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
                         audit |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
                 system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}
 system_distributed_everywhere |           True |                        {'class': 'org.apache.cassandra.locator.EverywhereStrategy'}
         5gb_sizetiered_2022_1 |           True |   {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3'}

Once I altered the keyspace's replication to the cluster's region, I was able to query it just fine:

cassandra@cqlsh> ALTER KEYSPACE "5gb_sizetiered_2022_1" WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': '1'};
cassandra@cqlsh> use "5gb_sizetiered_2022_1";
cassandra@cqlsh:5gb_sizetiered_2022_1> select * from standard1;

 key                    | C0                                                                                                                                 | C1                                                                                                                                 | C10                                                                                                                                | C11                                                                                                                                | C12                                                                                                                                | C13                                                                                                                                | C14                                                                                                                                | C15                                                                                                                                | C2                                                                                                                                 | C3                                                                                                                                 | C4                                                                                                                                 | C5                                                                                                                                 | C6                                                                                                                                 | C7                                                                                                                                 | C8                                                                                                                                 | C9
------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------
 0x343831364b4b33324e30 | 0x88cfad6776a64370624fca6a579b22909dd7d10500537a183446f9f429dea1d13fcb3716aba68f5023a974a1f6fef5fd3e3eb2856c8a08a38fc019de9fe45b90 | 0x68a108ecfcca46efb65d38269f75d23a1f43456a1cbe033baefc0cd2f3edbfa4289874dc57ae085e9cd830ae0644351a3c32c6d49140f81e714d715f2324cb75 | 0xa0a206e6889d81d48edda04842b35a248f3a608bd4588619ca39176f64ca53238913a404fba7ec3e67c071b35e13e2a39773610a5b541dc7a8f32cadc7c7eedf | 0xb7a13a1ad602380b4577dc5bb64865e54922862cf670bf288d3fb9afd69091477c623e9255e1d81068bd0707e01c0680cc306cb3693be8688c0db1c948ea38c3 | 0x7e1119d0cb8f34cd7141ba1ec7cf71eb64f254b0d46fc0f78b31fa3c1fe336eae57412dbaad94d4c728ca51140438d5e2521587f657d7dcfffdeeeb1218b2357 | 0x5dfd6b7923a1025f0085ecf43516aec54c25ced79dc5217267c060ddc927de711b0eec16116eeb2380f184bb1a7d6f9482bcdd1f4c75d7c4cacd42950746e4fa | 0x1ddb930018f516dc7e3ffafddc7ac358df4f3d2352931ae31982c55cc7e0d7dc9ea6067de7218a9e61f735f69ab3eb1ceb3b27e6300deee70c6c455cd20e6a14 | 0x6812d3acf717ba682498373953e77d64792930f029e5ab2b2c4a477098f9b49f0e6d35615e9e65b7736ee992ab3ff027227c73595e71f355b6b89e1ab1c7fb9d | 0x51ae72d57ce76acc6c69c90713f7d4fb9261efcfc833e73b30e383e70eb56aea4e11c2b51053e7479142041df5bd832fd6417e835a851378433e0de71bbdaee0 | 0xa9fa0041d3270b15f1700778bea29b99a7ea7c2172e338157ca41593f99e3a04a6a649c698bc01b888f6038b8740678554f41de84a3fb66390d300328068d204 | 0xec0375be958914ee5a7797c6921dfa0b309d95cf98fc9dd846dbfcc982d2ad0da27a7d17f7b1ff6c6fcca1c816fc47f5b96af5a50e1c28ae9e31351d250b6aab | 0x405d6d52c7782e3b8271a809e4138ece48bb4c0c203c65368008e778c23d1c2fe2a8105b89cf2141ddbb9090b1f69192af21afeba81c05d70880179a6300b745 | 0x3e4ed6b0621aa8ebfdf0035d417727357ef13ccc7e20bb8489f00dce99ce5b3690ccf2ec7759a4f0d5134fa3ac0471dad663a1a934cfc3cafe621f39dcf9c112 | 0xd046c4ed74a6821b7739342f48419f07b1a0d69175c239ccc0a504ddd0c440f02f233c9a898e2d59a3111479e0166cb4b7745b1322f9fddefd3bb197f8c60a34 | 0x553fff6a14fc45ba273ae9549962324615e90d9d79933eb7121eb5741c1c773da5503824f4d8b6584a4407cabbd5d6862f7eeb1bb76690fd14442834a45afd7f | 0xbb9b3a6ac0acfe1e67bb87870b45be7f9119439c691d39c8b69412c4ec8b513bcf6b4965af98e61711cb7da504b252cd716ce29a0d8772c3f1b89d349c467f8f
...

So, the difference in regions is probably the cause of the failures.
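For reference, the mismatch can be double-checked with the same two commands used above. This is just a convenience sketch; the cassandra/cassandra credentials are the ones used by the stress command in this run:

$ # DC name the cluster actually runs in
$ nodetool status | grep Datacenter
$ # DC names the restored keyspace replicates to
$ cqlsh -u cassandra -p cassandra -e "SELECT keyspace_name, replication FROM system_schema.keyspaces WHERE keyspace_name = '5gb_sizetiered_2022_1'"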

tzach commented 1 year ago

So, restore only works in the same region, and there is a procedure to restore to a different region? This is acceptable, but it needs to be explicitly documented.

Michal-Leszczynski commented 1 year ago

Restoring tables requires a schema identical to the one in the backup. The DCs are also a part of the keyspace schema, so the fact that restore does not work when you try to restore data into a DC that has no nodes in the destination cluster seems logical.

The strange part here is that load&stream does not complain when it has to upload sstables to nodes from an empty DC (we can add manual checks for that in SM). I would suspect that in this scenario the uploaded sstables are lost, as they don't belong to any node in the cluster, but maybe L&S still stores them somewhere, even though it's impossible to query the data because of the "unavailable replicas" error.

In your example you said that you used nodetool refresh for uploading sstables, but did you use it with the -las option (see the sketch below)? I'm curious if a workaround in this case should look like:

* restore schema
* (change replication of keyspace with non-existing dc - but do we have a guarantee that the restore tables will work when using different keyspace schema?)
* restore tables
* (or maybe here is the right place for changing keyspace replication - perhaps uploaded data is still stored somewhere in the cluster and now it is safe to alter keyspace)
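For reference, a rough sketch of what that could look like on a node, assuming "-las" refers to nodetool refresh's --load-and-stream option; the table directory is the one from the listing above, and the source path is a placeholder:

$ # place the downloaded sstables in the table's upload directory (placeholder source path)
$ sudo cp /path/to/downloaded/sstables/* /var/lib/scylla/data/5gb_sizetiered_2022_1/standard1-e08b7420191411ee8ec98425b74f1f5d/upload/
$ # load them with load-and-stream, which streams each partition to its owning replicas
$ # instead of keeping only the partitions owned by the local node
$ nodetool refresh --load-and-stream 5gb_sizetiered_2022_1 standard1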

Michal-Leszczynski commented 1 year ago

But at least we know that this issue is not a regression and that IMHO restore works as described in the docs.

ShlomiBalalis commented 1 year ago

Restoring tables has a requirement of having identical schema as in the backup. The dcs are also a part of the keyspace schema. So the fact that that restore does not work when you try to restore data into empty dc seems logical.

The strange part here is that load&stream does not complain when it has to upload sstables to nodes from empty dc (we can add manual checks for that in SM). I would suspect, that in this scenario uploaded sstables should be lost, as they don't belong to any node in the cluster, but maybe L&S still stores it somewhere, even though it's impossible to query the data because of the "unavailable replicas" error.

In your example you said that you used nodetool refresh for uploading sstables, but did you use it with the -las option?

Nope. a simple nodetool refresh -- 5gb_sizetiered_2022_1 standard1

I'm curious if a work around in this case should look like:

* restore schema

* (change replication of keyspace with non-existing dc - but do we have a guarantee that the restore tables will work when using different keyspace schema?)

* restore tables

* (or maybe here is the right place for changing keyspace replication - perhaps uploaded data is still stored somewhere in the cluster and now it is safe to alter keyspace)

In my case, I first loaded the data with refresh and only then altered the keyspace, and everything seemed fine afterwards (of course, it was only a preliminary check that the table contains data at all).

Michal-Leszczynski commented 1 year ago

Nope. a simple nodetool refresh -- 5gb_sizetiered_2022_1 standard1

That's strange, because the nodetool refresh docs say:

Scylla node will ignore the partitions in the sstables which are not assigned to this node. For example, if sstable are copied from a different node.

So I would expect that it worked only partially / that it's not reliable to use it in this way. So the approach with:

  • restore schema
  • alter restored keyspace replication strategy (change dc names)
  • restore data

seems more promising. @asias, do you think that this approach is safe and should work?

Context: We have a backup from some cluster with only dc1. We want to restore it to a different cluster with only dc2. Normally, SM would first restore the whole schema from the backup (this requires a cluster restart) and then it would proceed with restoring non-schema SSTables via load&stream. The problem is that we restore SSTables into a keyspace replicated only in dc1, and we don't have any nodes from this DC in the restore destination cluster, so even though the restore procedure ends "successfully", the data is not there. Is it safe to use load&stream on SSTables when the backed-up and restore destination clusters have identical table schemas but different keyspace schemas (the keyspace name is the same, but there are different dc names in the replication strategies)?

ShlomiBalalis commented 1 year ago

Nope. a simple nodetool refresh -- 5gb_sizetiered_2022_1 standard1

That's strange because nodetool refresh docs says:

Scylla node will ignore the partitions in the sstables which are not assigned to this node. For example, if sstable are copied from a different node.

So I would expect that it worked partially / it's not reliable to use it in this way. So the approach with:

  • restore schema
  • alter restored keyspace replication strategy (change dc names)
  • restore data

Yeah, regardless of the fact that it worked (and I agree, it's strange it worked at all), this is probably the correct course of action as far as I can tell.

Michal-Leszczynski commented 1 year ago

My local experiments confirm that the approach:

  • restore schema
  • alter restored keyspace replication strategy (change dc names)
  • restore data

works fine, but they are just experiments and not proofs of reliability. @ShlomiBalalis could we rerun this test scenario with the additional alter keyspace step in the middle of both restores?
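To spell that sequence out, a sketch using the commands from this run (the cluster ID, location, snapshot tag and credentials are the ones above; the eu-west RF of 3 is an assumed choice, not something SM produces):

$ # 1. restore the schema
$ sudo sctool restore -c a92d1307-4ac0-43df-874a-98667733d8ae --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20230702201949UTC
$ # ...rolling restart of scylla-server on all nodes, as required after a schema restore...
$ # 2. re-point the restored keyspace at the DC that exists in the destination cluster
$ cqlsh -u cassandra -p cassandra -e "ALTER KEYSPACE \"5gb_sizetiered_2022_1\" WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': '3'}"
$ # 3. restore the data
$ sudo sctool restore -c a92d1307-4ac0-43df-874a-98667733d8ae --restore-tables --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20230702201949UTC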

Michal-Leszczynski commented 1 year ago

@ShlomiBalalis ping

fgelcer commented 1 year ago

@ShlomiBalalis ?

bhalevy commented 10 months ago

My local experiments confirms that the approach:

  • restore schema
  • alter restored keyspace replication strategy (change dc names)
  • restore data

works fine, but they are just experiments and not proofs of reliability. @ShlomiBalalis could we rerun this test scenario with the additional alter keyspace step in the middle of both restores?

@Mark-Gurevich can you please take over this? If needed, let's open an issue in SCT to add this as a workaround until this issue is fixed.

@Michal-Leszczynski mind taking ownership of this issue?

Mark-Gurevich commented 10 months ago

IIUC, we need to add an additional alter keyspace step to the disrupt_mgmt_restore nemesis code, in the middle of the two restores? From a brief look at the code I didn't find where this can be added. Needs a further deep dive.

Michal-Leszczynski commented 7 months ago

@mikliapko is this something that you could take care of? I mean validating that the procedure described in https://github.com/scylladb/scylla-manager/issues/3525#issuecomment-1693310241 works fine with a proper test. Once it's validated, we can add it to the SM docs.

roydahan commented 1 month ago

Still happens: https://argus.scylladb.com/test/f1ff65fd-8324-4264-8d28-8c7122fca836/runs?additionalRuns[]=5986619f-8479-4267-a92f-19c6b604f84b

mikliapko commented 1 month ago

@mikliapko is this something that you could take care of? I mean validating that procedure described in #3525 (comment) works fine with some proper test. When it's validated, we can add it to SM docs.

Yep, as it's still happening, I will take a look into it

fruch commented 3 weeks ago

@mikliapko is this something that you could take care of? I mean validating that procedure described in #3525 (comment) works fine with some proper test. When it's validated, we can add it to SM docs.

Yep, as it's still happening, I will take a look into it

@mikliapko it's happening in a test that disables raft topology; is the schema restore dependent on raft topology?

Packages

Scylla version: 6.3.0~dev-20240927.c17d35371846 with build-id a9b08d0ce1f3cf99eb39d7a8372848fa2840dc1d
Kernel Version: 6.8.0-1016-aws

Installation details

Cluster size: 5 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

OS / Image: ami-087d814d9b6773015 (aws: undefined_region)

Test: longevity-mv-si-4days-streaming-test
Test id: 34c4d009-73b1-490b-83e5-03f6705be5eb
Test name: scylla-master/tier1/longevity-mv-si-4days-streaming-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 34c4d009-73b1-490b-83e5-03f6705be5eb`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=34c4d009-73b1-490b-83e5-03f6705be5eb)
- Show all stored logs command: `$ hydra investigate show-logs 34c4d009-73b1-490b-83e5-03f6705be5eb`

## Logs:

- **longevity-mv-si-4d-master-db-node-34c4d009-4** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-4-34c4d009.tar.gz
- **longevity-mv-si-4d-master-db-node-34c4d009-1** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-1-34c4d009.tar.gz
- **longevity-mv-si-4d-master-db-node-34c4d009-6** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-6-34c4d009.tar.gz
- **longevity-mv-si-4d-master-db-node-34c4d009-8** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-8-34c4d009.tar.gz
- **longevity-mv-si-4d-master-db-node-34c4d009-3** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-3-34c4d009.tar.gz
- **longevity-mv-si-4d-master-db-node-34c4d009-10** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-10-34c4d009.tar.gz
- **db-cluster-34c4d009.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/db-cluster-34c4d009.tar.gz
- **sct-runner-events-34c4d009.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/sct-runner-events-34c4d009.tar.gz
- **sct-34c4d009.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/sct-34c4d009.log.tar.gz
- **loader-set-34c4d009.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/loader-set-34c4d009.tar.gz
- **monitor-set-34c4d009.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/monitor-set-34c4d009.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-mv-si-4days-streaming-test/3/)
[Argus](https://argus.scylladb.com/test/f1ff65fd-8324-4264-8d28-8c7122fca836/runs?additionalRuns[]=34c4d009-73b1-490b-83e5-03f6705be5eb)
Michal-Leszczynski commented 3 weeks ago

Starting from SM 3.3 and Scylla 6.0, SM restores the schema by applying the output of DESC SCHEMA WITH INTERNALS. The problem is that the keyspace definition contains DC names - that's why this test fails with the following error:

"M":"Run ended with ERROR","task":"restore/09af96b8-68b1-4bf6-928b-7fd01aa266f4","status":"ERROR","cause":"restore data: create \"100gb_sizetiered_6_0\" (\"100gb_sizetiered_6_0\") with CREATE KEYSPACE \"100gb_sizetiered_6_0\" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3'} AND durable_writes = true: Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy for keyspace 100gb_sizetiered_6_0","duration":"5.618998928s"

So right now this is a documented limitation, but we should make it possible to restore the schema into a different DC setup, or make it easier for the user to modify just the DC part of the keyspace schema.
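Until that exists, one hypothetical manual path (not an SM feature, just a sketch) is to take the DESC SCHEMA WITH INTERNALS output from the source cluster or backup, rewrite the DC names, apply it on the destination cluster, and only then run the --restore-tables task:

$ # dump the schema on the source cluster (or use the schema file stored in the backup)
$ cqlsh -e "DESC SCHEMA WITH INTERNALS" > schema.cql
$ # swap the source DC name for the destination DC name (names as in this run)
$ sed -i "s/'us-east'/'eu-west'/g" schema.cql
$ # apply the adjusted schema on the destination cluster, then run sctool restore --restore-tables
$ cqlsh -f schema.cql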

Michal-Leszczynski commented 3 weeks ago

Created an issue for that: #4049.