scylladb / scylla-cluster-tests


Data validation stress commands in the `mgmt_restore` nemesis fail on all 5gb persistent snapshots and 3 out of 4 10gb ones #7122

Open vponomaryov opened 8 months ago

vponomaryov commented 8 months ago

Issue description

When we run the `mgmt_restore` nemesis, it picks one of the existing persistent snapshots specified in the `defaults/manager_persistent_snapshots.yaml` file:

```yaml
aws:
  bucket: "manager-backup-tests-permanent-snapshots-us-east-1"
  confirmation_stress_template: "cassandra-stress read cl=QUORUM n={num_of_rows} -schema 'keyspace={keyspace_name} replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native  -rate threads=50 -col 'size=FIXED(64) n=FIXED(16)' -pop seq={sequence_start}..{sequence_end}"
  snapshots_sizes:
    5:
      number_of_rows: 5242880
      expected_timeout: 1800  # 30 minutes
      snapshots:
        'sm_20230702185929UTC':
          keyspace_name: "5gb_sizetiered_2022_2"
          scylla_version: "2022.2.9"
          scylla_product: "enterprise"
          number_of_nodes: 5
        'sm_20230702201949UTC':
          keyspace_name: "5gb_sizetiered_2022_1"
          scylla_version: "2022.1.7"
          scylla_product: "enterprise"
          number_of_nodes: 5
        'sm_20230702190638UTC':
          keyspace_name: "5gb_sizetiered_5_2"
          scylla_version: "5.2.3"
          scylla_product: "oss"
          number_of_nodes: 5
    10:
      number_of_rows: 10485760
      expected_timeout: 3600  # 60 minutes
      snapshots:
        'sm_20230223105105UTC':
          keyspace_name: "10gb_sizetiered"
          scylla_version: "5.1.6"
          scylla_product: "oss"
          number_of_nodes: 3
        'sm_20230702173347UTC':
          keyspace_name: "10gb_sizetiered_2022_2"
          scylla_version: "2022.2.9"
          scylla_product: "enterprise"
          number_of_nodes: 4
        'sm_20230702173940UTC':
          keyspace_name: "10gb_sizetiered_2022_1"
          scylla_version: "2022.1.7"
          scylla_product: "enterprise"
          number_of_nodes: 4
        'sm_20230702173329UTC':
          keyspace_name: "10gb_sizetiered_5_2"
          scylla_version: "5.2.3"
          scylla_product: "oss"
          number_of_nodes: 4
...
```
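
For reference, here is a minimal sketch (not the actual SCT code) of how the `confirmation_stress_template` placeholders get filled from one snapshot entry. The helper name and the assumption that the `-pop seq` range starts at 1 and covers `number_of_rows` keys are illustrative only:

```python
# Minimal illustration (not the actual SCT implementation) of rendering
# the confirmation_stress_template for one snapshot entry from the YAML above.
SNAPSHOT_CONFIG = {
    "keyspace_name": "5gb_sizetiered_5_2",
    "number_of_rows": 5242880,
}

CONFIRMATION_STRESS_TEMPLATE = (
    "cassandra-stress read cl=QUORUM n={num_of_rows} "
    "-schema 'keyspace={keyspace_name} "
    "replication(strategy=NetworkTopologyStrategy,replication_factor=3) "
    "compaction(strategy=SizeTieredCompactionStrategy)' "
    "-mode cql3 native -rate threads=50 "
    "-col 'size=FIXED(64) n=FIXED(16)' "
    "-pop seq={sequence_start}..{sequence_end}"
)


def render_confirmation_stress(num_of_rows: int, keyspace_name: str) -> str:
    """Fill the template placeholders so the whole row range is read back."""
    return CONFIRMATION_STRESS_TEMPLATE.format(
        num_of_rows=num_of_rows,
        keyspace_name=keyspace_name,
        sequence_start=1,          # assumption: the key range starts at 1
        sequence_end=num_of_rows,  # assumption: it covers all restored rows
    )


if __name__ == "__main__":
    print(render_confirmation_stress(
        SNAPSHOT_CONFIG["number_of_rows"], SNAPSHOT_CONFIG["keyspace_name"]))
```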

Whichever snapshot we use, the mgmt restore operation itself finishes successfully. However, the stress commands work for only 1 (one) out of the 7 (5gb and 10gb) snapshots: 10gb_sizetiered.

All the others (6 out of 7) cause the following failure in the stress commands triggered inside the nemesis:

===== Using optimized driver!!! =====
WARN  19:28:52,417 Error while computing token map for keyspace 5gb_sizetiered_5_2 with datacenter us-east: could not achieve replication factor 3 (found 0 replicas only), check your keyspace replication settings.
Connected to cluster: sct-cluster, max pending requests per connection null, max connections per host 8
Datatacenter: eu-north-1; Host: /172.20.49.36; Rack: rack-1
Datatacenter: eu-north-1; Host: /172.20.74.81; Rack: rack-1
Datatacenter: eu-north-1; Host: /10.0.10.45; Rack: rack-1
Failed to connect over JMX; not collecting these stats
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
...
    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x10 on key(s) [4d4c34314f4f50343531]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
FAILURE
java.lang.RuntimeException: Failed to execute stress action
    at org.apache.cassandra.stress.StressAction.run(StressAction.java:101)
    at org.apache.cassandra.stress.Stress.run(Stress.java:143)
    at org.apache.cassandra.stress.Stress.main(Stress.java:62)
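
Note that the token-map warning above complains about datacenter us-east, while the connected hosts report eu-north-1, which would be consistent with the restored keyspace's replication settings referencing a datacenter the new cluster doesn't have. A small diagnostic sketch (not part of the nemesis) to confirm that, assuming the Python cassandra-driver and direct CQL access to one of the nodes:

```python
# Diagnostic sketch: check which datacenters the restored keyspace replicates
# to and compare with the datacenters actually present in the cluster.
# Assumes the Python cassandra-driver and direct CQL access to a node.
from cassandra.cluster import Cluster

KEYSPACE = "5gb_sizetiered_5_2"      # one of the restored keyspaces from the YAML

cluster = Cluster(["172.20.49.36"])  # any reachable contact point (example IP)
session = cluster.connect()

row = session.execute(
    "SELECT replication FROM system_schema.keyspaces WHERE keyspace_name = %s",
    (KEYSPACE,),
).one()
print("keyspace replication:", row.replication)

live_dcs = {host.datacenter for host in cluster.metadata.all_hosts()}
print("datacenters in cluster:", live_dcs)

# If the replication map names a DC (e.g. 'us-east') that is not in live_dcs,
# QUORUM reads will fail with UnavailableException, matching the stress output.
cluster.shutdown()
```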

This was not observed before only because the nemesis does not check the stress command results. So, we have false-positive nemesis results in 6 out of 7 cases.

The problematic snapshots can be identified by the following error messages in the main SCT log:

2024-01-18 21:37:00,614 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:CRITICAL > \
    java.io.IOException: Operation x10 on key(s) [4c393550333638353930]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
2024-01-18 21:37:00,623 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:CRITICAL > \
    2024-01-18 21:37:00.621: (CassandraStressLogEvent Severity.ERROR) period_type=one-time event_id=2cca77da-64ee-4805-912d-8bc9cf7e23c5 \
        during_nemesis=MgmtRestore: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=725 node=Node sct-loaders-eu-north-1-1 [None | None] (seed: False)
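
As an aside, a small helper (not part of SCT; the log file name is an assumption) that pulls these stress-validation failures out of a downloaded SCT log:

```python
# Sketch: list the stress-validation failure lines from the main SCT log.
# The default log path is an assumption for illustration.
import re

PATTERN = re.compile(
    r"CassandraStressLogEvent .*type=OperationOnKey|Operation x10 on key\(s\)")


def failing_stress_events(sct_log_path: str = "sct-7a950b38.log") -> list[str]:
    """Return log lines matching the stress failure patterns shown above."""
    with open(sct_log_path, errors="replace") as log:
        return [line.rstrip() for line in log if PATTERN.search(line)]


if __name__ == "__main__":
    for line in failing_stress_events():
        print(line)
```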

Impact

The restored data cannot be validated; hence, the workability of the mgmt restore operation is in question.

How frequently does it reproduce?

100% of cases

Installation details

Kernel Version: 5.10.205-195.804.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.3-20231219.b890271f125b with build-id 26667e5fe9023e6f688ca3bab2ee5f910abfd2cb

Operator Image: scylladb/scylla-operator:1.11.1
Operator Helm Version: v1.11.1
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/stable
Cluster size: 3 nodes (i4i.4xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: (k8s-eks: eu-north-1)

Test: vp-longevity-scylla-operator-3h-eks-mgmt-restore
Test id: 7a950b38-c70b-4fef-8250-eddc1be0975a
Test name: scylla-staging/valerii/vp-longevity-scylla-operator-3h-eks-mgmt-restore
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 7a950b38-c70b-4fef-8250-eddc1be0975a`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7a950b38-c70b-4fef-8250-eddc1be0975a)
- Show all stored logs command: `$ hydra investigate show-logs 7a950b38-c70b-4fef-8250-eddc1be0975a`

Logs:

- **kubernetes-7a950b38.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/kubernetes-7a950b38.tar.gz
- **kubernetes-must-gather-7a950b38.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/kubernetes-must-gather-7a950b38.tar.gz
- **db-cluster-7a950b38.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/db-cluster-7a950b38.tar.gz
- **sct-runner-events-7a950b38.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/sct-runner-events-7a950b38.tar.gz
- **sct-7a950b38.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/sct-7a950b38.log.tar.gz
- **loader-set-7a950b38.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/loader-set-7a950b38.tar.gz
- **monitor-set-7a950b38.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/monitor-set-7a950b38.tar.gz
- **parallel-timelines-report-7a950b38.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7a950b38-c70b-4fef-8250-eddc1be0975a/20240118_220013/parallel-timelines-report-7a950b38.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-mgmt-restore/8/)
[Argus](https://argus.scylladb.com/test/03dd93e7-d5c2-4e3d-b4b4-81ac06f41b96/runs?additionalRuns[]=7a950b38-c70b-4fef-8250-eddc1be0975a)
vponomaryov commented 8 months ago

@roydahan , @fruch , @ShlomiBalalis ^

fruch commented 8 months ago

@roydahan , @fruch , @ShlomiBalalis ^

I think it's a known issue: the backups were created in a different region, and their network topology doesn't match.

It's been known for quite some time, but no one has handled it.

vponomaryov commented 8 months ago

@roydahan , @fruch , @ShlomiBalalis ^

I think it's a known issue: the backups were created in a different region, and their network topology doesn't match.

It's been known for quite some time, but no one has handled it.

So, if it is known that 6 out of 7 backups are not compatible, then shouldn't we stop using them, as a fast solution, until compatible ones are added?

fruch commented 8 months ago

@roydahan , @fruch , @ShlomiBalalis ^

I think it's a known issue: the backups were created in a different region, and their network topology doesn't match.

It's been known for quite some time, but no one has handled it.

So, if it is known that 6 out of 7 backups are not compatible, then shouldn't we stop using them, as a fast solution, until compatible ones are added?

As far as I know, it is 100% of the validation stress commands that should fail if you are not in the same region as the backup.

As for the restore command itself, if it's failing in 1 out of 7 runs, it's probably a different Manager issue and should be raised there, until proven otherwise.

vponomaryov commented 8 months ago

@roydahan , @fruch , @ShlomiBalalis ^

I think it's a known issue: the backups were created in a different region, and their network topology doesn't match. It's been known for quite some time, but no one has handled it.

So, if it is known that 6 out of 7 backups are not compatible, then shouldn't we stop using them, as a fast solution, until compatible ones are added?

As far as I know, it is 100% of the validation stress commands that should fail if you are not in the same region as the backup.

The `defaults/manager_persistent_snapshots.yaml` file doesn't say anything about the region. So, if this is region-dependent, then the SCT logic for the mgmt restore must be updated to take the region values into account (see the sketch below).
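
For illustration only, here is a hypothetical sketch of such region-aware filtering. The per-snapshot `region` key and the `pick_compatible_snapshots` helper do not exist in SCT or in `defaults/manager_persistent_snapshots.yaml` today; both are assumptions for the sake of the example:

```python
# Hypothetical sketch of region-aware snapshot selection for the mgmt_restore
# nemesis. The per-snapshot "region" key does not exist in
# defaults/manager_persistent_snapshots.yaml today; it is an assumption here.
import yaml


def pick_compatible_snapshots(config_path: str, test_region: str, size_gb: int) -> dict:
    """Return only snapshots whose (assumed) region matches the test region."""
    with open(config_path) as config_file:
        config = yaml.safe_load(config_file)
    snapshots = config["aws"]["snapshots_sizes"][size_gb]["snapshots"]
    return {
        tag: info
        for tag, info in snapshots.items()
        # Skip snapshots created in another region; their NetworkTopologyStrategy
        # settings would reference a datacenter the restored cluster doesn't have.
        # Default region is an assumption based on the bucket name above.
        if info.get("region", "us-east-1") == test_region
    }


if __name__ == "__main__":
    compatible = pick_compatible_snapshots(
        "defaults/manager_persistent_snapshots.yaml", "eu-north-1", 5)
    print(compatible or "no region-compatible snapshots; skip the validation step")
```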

As for the restore command itself, if it's failing in 1 out of 7 runs, it's probably a different Manager issue and should be raised there, until proven otherwise.

It fails on 6 backups out of 7. Not runs.

vponomaryov commented 8 months ago

Tried it in different regions. The following warning:

WARN  15:01:15,947 Error while computing token map for keyspace 5gb_sizetiered_5_2 with datacenter us-east: could not achieve replication factor 3 (found 0 replicas only), check your keyspace replication settings.

Exists even in the us-east-1 region and even on the working backup. Verified here: https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-mgmt-restore/12/consoleFull

So, I don't think the region changes the picture here. There is some other problem with the backups.

fruch commented 8 months ago

Tried it in different regions. The following warning:

WARN  15:01:15,947 Error while computing token map for keyspace 5gb_sizetiered_5_2 with datacenter us-east: could not achieve replication factor 3 (found 0 replicas only), check your keyspace replication settings.

Exists even in the us-east-1 region and even on the working backup. Verified here: https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-mgmt-restore/12/consoleFull

So, I don't think the region changes the picture here. There is some other problem with the backups.

So some specific backups can't be restored correctly?

@ShlomiBalalis are all of those backups validated, and were they created with known versions of Scylla and Manager? And if so, with which versions?

@vponomaryov FYI, I don't know where this warning is coming from, but since the region-related issue was never handled, I don't know which other issues are there.

If you found that a specific backup is broken, that's a report for scylla-manager, and the team/group handling Manager should address it.

vponomaryov commented 8 months ago

So some specific backups can't be restored correctly?

If you found that a specific backup is broken, that's a report for scylla-manager, and the team/group handling Manager should address it.

All the backups get applied successfully. The stress commands fail on 6 out of 7 of the restored backups.