scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

SLA nemesis are failing on create SLA with Unauthorized error #5793

Closed fruch closed 1 year ago

fruch commented 1 year ago

Issue description

It seems the newly introduced SLA nemeses are failing to create SLAs:

2023-01-29 18:30:07.560: (DisruptionEvent Severity.ERROR) period_type=end event_id=56dae367-3ff9-4234-a5f4-871aeffeb2ae duration=2s: nemesis_name=SevenSlWithMaxSharesDuringLoad target_node=Node longevity-lwt-3h-2023-1-db-node-a2a98e71-3 [3.238.194.95 | 10.12.0.132] (seed: False) errors=Error from server: code=2100 [Unauthorized] message="You have to be logged in and not anonymous to perform this request"
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4010, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3901, in disrupt_seven_sl_with_max_shares_during_load
    error_events = sla_tests.test_seven_sl_with_max_shares_during_load(
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 428, in test_seven_sl_with_max_shares_during_load
    roles.append(create_sla_auth(session=session, shares=every_role_shares, index=auth_entity_name_index))
  File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 396, in create_sla_auth
    password=STRESS_ROLE_PASSWORD_TEMPLATE % shares or '', login=True).create()
  File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 338, in create
    self.session.execute(query)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1644, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5018, in cassandra.cluster.ResponseFuture.result
cassandra.Unauthorized: Error from server: code=2100 [Unauthorized] message="You have to be logged in and not anonymous to perform this request"

Installation details

Kernel Version: 5.15.0-1028-aws

Scylla version (or git commit hash): 2023.1.0~rc0-20230129.ede545df8387 with build-id 4c248c07c60412023055e2ec9bf216be90f27580

Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-003714750f94c80ee (aws: us-east-1)

Test: longevity-lwt-3h-test

Test id: a2a98e71-0925-4d06-8f5c-939edce41269

Test name: enterprise-2023.1/longevity/longevity-lwt-3h-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor a2a98e71-0925-4d06-8f5c-939edce41269`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a2a98e71-0925-4d06-8f5c-939edce41269)
- Show all stored logs command: `$ hydra investigate show-logs a2a98e71-0925-4d06-8f5c-939edce41269`

## Logs:

- **db-cluster-a2a98e71.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/db-cluster-a2a98e71.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/db-cluster-a2a98e71.tar.gz)
- **sct-runner-a2a98e71.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/sct-runner-a2a98e71.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/sct-runner-a2a98e71.tar.gz)
- **monitor-set-a2a98e71.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/monitor-set-a2a98e71.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/monitor-set-a2a98e71.tar.gz)
- **loader-set-a2a98e71.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/loader-set-a2a98e71.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/loader-set-a2a98e71.tar.gz)
- **parallel-timelines-report-a2a98e71.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/parallel-timelines-report-a2a98e71.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a2a98e71-0925-4d06-8f5c-939edce41269/20230129_213328/parallel-timelines-report-a2a98e71.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-lwt-3h-test/2/)

juliayakovlev commented 1 year ago

This nemesis should not be run in those tests. It looks like we need to filter those nemeses out of non-SLA tests.
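As a rough illustration of the filtering idea (not the actual scylla-cluster-tests code), the sketch below tags each nemesis with a hypothetical `sla` flag and drops the tagged ones when the test is not an SLA test. `NemesisInfo`, `select_nemeses`, and the `RestartNode` name are assumptions made up for the example; only `SevenSlWithMaxSharesDuringLoad` comes from the log above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class NemesisInfo:
    """Hypothetical description of a nemesis; not an SCT class."""
    name: str
    sla: bool = False  # assumed tag: True when the nemesis creates service levels


def select_nemeses(candidates: Iterable[NemesisInfo], sla_test: bool) -> List[NemesisInfo]:
    """Keep every nemesis for SLA tests; drop SLA-tagged ones otherwise."""
    return [nemesis for nemesis in candidates if sla_test or not nemesis.sla]


pool = [
    NemesisInfo("RestartNode"),  # made-up non-SLA nemesis
    NemesisInfo("SevenSlWithMaxSharesDuringLoad", sla=True),
]
non_sla_selection = select_nemeses(pool, sla_test=False)
```

With a filter like this, a longevity test that is not an SLA test would never pick `SevenSlWithMaxSharesDuringLoad`, so the `Unauthorized` error would not be reached.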

fruch commented 1 year ago

> This nemesis should not be run in this test. It looks like we need to filter those nemeses out of non-SLA tests.

Let's use the same solution you used for the first SLA nemesis; later we can think of a better way to skip those by default.

juliayakovlev commented 1 year ago

> > This nemesis should not be run in this test. It looks like we need to filter those nemeses out of non-SLA tests.
>
> Let's use the same solution you used for the first SLA nemesis; later we can think of a better way to skip those by default.

I suggest another solution. We can run those nemeses in every test that runs with authentication: the SLA nemesis will check whether the test is running with authentication and proceed only if it is. That may even be a good thing, since it tests SLA in different situations.
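A minimal sketch of such a guard, assuming the test configuration is available as a dict and that the keys `authenticator`, `authenticator_user`, and `authenticator_password` describe the login setup (these key names, `UnsupportedNemesis`, and `ensure_authenticated_test` are assumptions for illustration, not confirmed project code). `AllowAllAuthenticator` is the Scylla/Cassandra authenticator that accepts anonymous clients, which is the situation the `Unauthorized` error above complains about.

```python
class UnsupportedNemesis(Exception):
    """Hypothetical signal that a nemesis should be skipped rather than failed."""


# Authenticator that admits anonymous clients; role creation is rejected under it.
ANONYMOUS_AUTHENTICATOR = "AllowAllAuthenticator"


def ensure_authenticated_test(params: dict) -> None:
    """Raise UnsupportedNemesis unless the test is configured to log in."""
    authenticator = params.get("authenticator")  # assumed config key
    if not authenticator or authenticator == ANONYMOUS_AUTHENTICATOR:
        raise UnsupportedNemesis("SLA nemesis needs a cluster with authentication enabled")
    # Both credentials are needed to open a non-anonymous session.
    if not (params.get("authenticator_user") and params.get("authenticator_password")):
        raise UnsupportedNemesis("SLA nemesis needs a user and password to log in")


# Passes silently for an authenticated configuration.
ensure_authenticated_test({
    "authenticator": "PasswordAuthenticator",
    "authenticator_user": "cassandra",
    "authenticator_password": "cassandra",
})
```

Calling the guard at the start of each SLA disruption would turn the hard failure into a skip on clusters without authentication, while still letting the nemesis run in any test that does log in.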

fruch commented 1 year ago

@juliayakovlev

Are we done with the fixes needed to make sure this won't happen?

juliayakovlev commented 1 year ago

> @juliayakovlev
>
> Are we done with the fixes needed to make sure this won't happen?

Yes, I made the needed fixes. But there is an issue, https://github.com/scylladb/scylla-enterprise/issues/2552, that causes SLA nemesis failures; most of the nemeses that run fail because of it. I do not want to release the SLA nemeses in this situation. I raised it with Eliran again.

eliransin commented 1 year ago

I will look into the mentioned issue. Any reason not to close this one?

juliayakovlev commented 1 year ago

> I will look into the mentioned issue. Any reason not to close this one?

If we merge the new SLA nemeses, all tests will start to fail.

fruch commented 1 year ago

Now it should be working