Open vponomaryov opened 8 months ago
@roydahan, @fruch, @ShlomiBalalis ^
I think it's a known issue: the backups were created in a different region, so their network topology doesn't match. It's been known for quite some time, but no one has handled it.
So, if it is known that 6 out of 7 backups are not compatible, then shouldn't we, as a quick fix, stop using them until compatible ones are added?
As far as I know, 100% of the validation stress commands should fail if you are not in the same region as the backup.

As for the restore command itself, if it's failing in 1 out of 7 runs, it's probably a different Manager issue and should be raised there, until proven otherwise.
> As far as I know, 100% of the validation stress commands should fail if you are not in the same region as the backup.
The `defaults/manager_persistent_snapshots.yaml` file doesn't say anything about the region. So, if it is region-dependent, then the SCT logic for the `mgmt restore` must be updated to take region values into account.
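For illustration only, here is a minimal sketch of how region-aware snapshot selection could look on the SCT side. The `region` field and the helper are hypothetical; today `defaults/manager_persistent_snapshots.yaml` has no such key, which is exactly the gap being pointed out:

```python
# Hypothetical sketch: filter persistent snapshots by an (assumed) "region"
# field before the mgmt_restore nemesis picks one. The "region" key does not
# exist in defaults/manager_persistent_snapshots.yaml today.
import yaml

def pick_region_compatible_snapshots(path: str, cluster_region: str) -> dict:
    with open(path) as f:
        snapshots = yaml.safe_load(f)  # assumed layout: {snapshot_name: metadata}
    return {
        name: meta
        for name, meta in snapshots.items()
        if meta.get("region", cluster_region) == cluster_region
    }

# Hypothetical usage:
# compatible = pick_region_compatible_snapshots(
#     "defaults/manager_persistent_snapshots.yaml", cluster_region="eu-north-1")
```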
> As for the restore command itself, if it's failing in 1 out of 7 runs, it's probably a different Manager issue and should be raised there, until proven otherwise.
It fails on 6 backups out of 7, not runs. I tried it in different regions. The following warning:

`WARN 15:01:15,947 Error while computing token map for keyspace 5gb_sizetiered_5_2 with datacenter us-east: could not achieve replication factor 3 (found 0 replicas only), check your keyspace replication settings.`

exists even in the us-east-1 region, and even with the working backup. Verified here: https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-mgmt-restore/12/consoleFull

So I don't think the region changes the picture here. There is some other problem with the backups.
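If the root cause is the source DC name baked into the restored keyspace (the NetworkTopologyStrategy options referencing `us-east` while the cluster only has its own datacenter), it can be cross-checked directly. A rough sketch with the Python driver; the contact point is a placeholder and the keyspace name is taken from the warning above:

```python
# Rough sketch: compare the DC names referenced by the restored keyspace's
# replication options with the DCs that actually exist in the cluster.
# The contact point is a placeholder.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

cluster_dcs = {r.data_center for r in session.execute("SELECT data_center FROM system.local")}
cluster_dcs |= {r.data_center for r in session.execute("SELECT data_center FROM system.peers")}

row = session.execute(
    "SELECT replication FROM system_schema.keyspaces "
    "WHERE keyspace_name = '5gb_sizetiered_5_2'"
).one()
keyspace_dcs = {dc for dc in row.replication if dc != "class"}

# Any DC listed here is one the keyspace wants replicas in but the cluster
# doesn't have, matching the "found 0 replicas only" warning.
print("DCs missing from the cluster:", keyspace_dcs - cluster_dcs)
```

If that is what's happening, either the keyspace replication would have to be altered after restore to reference the actual DC, or the snapshot would have to be restored in a matching region.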
So some specific backups can't be restored correctly?

@ShlomiBalalis, are all of those backups validated, and are we working with known versions of Scylla and Manager? If so, with which versions?
@vponomaryov FYI, I don't know where this warning is coming from, but since the region-related issue was never handled, I don't know which other issues are in there.

If you found that a specific backup is broken, that's a report for scylla-manager, and the team/group handling Manager should address it.
> So some specific backups can't be restored correctly?

> If you found that a specific backup is broken, that's a report for scylla-manager, and the team/group handling Manager should address it.
All the backups get applied successfully. The stress commands fail on 6 out of 7 such restored backups.
## Issue description
When we run the `mgmt_restore` nemesis, it picks up some of the existing persistent snapshots specified in the `defaults/manager_persistent_snapshots.yaml` file:

When we use any of the snapshots, the `mgmt restore` operation finishes successfully. But the stress commands work only for 1 (one) out of 7 (5gb and 10gb) snapshots - `10gb_sizetiered`. All the other snapshots (6/7) cause the following failure of the stress commands triggered inside the nemesis:
This was not observed before only because the nemesis doesn't check the stress command results. So we have false-positive nemesis results in 6 cases out of 7.

The problematic snapshots can be identified by the following error messages in the main SCT log:
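As a rough illustration of how the false positives could be eliminated, the nemesis would need to actually inspect the verification stress results after the restore and fail when they are bad. The method and helper names below are hypothetical, not the real SCT API:

```python
# Illustrative only -- the helper names are hypothetical, not the actual
# SCT/nemesis API. The idea: after `mgmt restore`, run the verification
# stress command and raise if it failed, instead of ignoring its result.
def verify_restored_data(self, stress_cmd: str):
    queue = self.tester.run_stress_thread(stress_cmd=stress_cmd)   # hypothetical helper
    results, errors = self.tester.get_stress_results(queue)        # hypothetical helper
    if errors or not results:
        raise AssertionError(
            f"Verification stress failed after mgmt restore: {errors or 'no results'}")
    return results
```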
## Impact

The restored data cannot be validated; hence, the workability of the `mgmt restore` operation is in question.

## How frequently does it reproduce?

100% of cases
## Installation details

- Kernel Version: 5.10.205-195.804.amzn2.x86_64
- Scylla version (or git commit hash): `2023.1.3-20231219.b890271f125b` with build-id `26667e5fe9023e6f688ca3bab2ee5f910abfd2cb`
- Operator Image: scylladb/scylla-operator:1.11.1
- Operator Helm Version: v1.11.1
- Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/stable
- Cluster size: 3 nodes (i4i.4xlarge)
- Scylla Nodes used in this run: No resources left at the end of the run
- OS / Image: (k8s-eks: `eu-north-1`)
- Test: `vp-longevity-scylla-operator-3h-eks-mgmt-restore`
- Test id: `7a950b38-c70b-4fef-8250-eddc1be0975a`
- Test name: `scylla-staging/valerii/vp-longevity-scylla-operator-3h-eks-mgmt-restore`
- Test config file(s):

## Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 7a950b38-c70b-4fef-8250-eddc1be0975a`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7a950b38-c70b-4fef-8250-eddc1be0975a)
- Show all stored logs command: `$ hydra investigate show-logs 7a950b38-c70b-4fef-8250-eddc1be0975a`

## Logs:

- **kubernetes-7a950b38.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/kubernetes-7a950b38.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/kubernetes-7a950b38.tar.gz)
- **kubernetes-must-gather-7a950b38.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/kubernetes-must-gather-7a950b38.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/kubernetes-must-gather-7a950b38.tar.gz)
- **db-cluster-7a950b38.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/db-cluster-7a950b38.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/db-cluster-7a950b38.tar.gz)
- **sct-runner-events-7a950b38.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/sct-runner-events-7a950b38.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/sct-runner-events-7a950b38.tar.gz)
- **sct-7a950b38.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/sct-7a950b38.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/sct-7a950b38.log.tar.gz)
- **loader-set-7a950b38.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/loader-set-7a950b38.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/loader-set-7a950b38.tar.gz)
- **monitor-set-7a950b38.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/monitor-set-7a950b38.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/monitor-set-7a950b38.tar.gz)
- **parallel-timelines-report-7a950b38.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/parallel-timelines-report-7a950b38.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/parallel-timelines-report-7a950b38.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-mgmt-restore/8/)
[Argus](https://argus.scylladb.com/test/03dd93e7-d5c2-4e3d-b4b4-81ac06f41b96/runs?additionalRuns[]=7a950b38-c70b-4fef-8250-eddc1be0975a)