soyacz opened 2 years ago
@soyacz did it work with the previous manager version / scylla version? (What did we run when we tested 5.0 and/or 2022.1?)
@karol-kokoszka is this a known issue?
We didn't run this nemesis in this test for 5.0/2022.1
A test should not run for the first time on a branch. It should be run on master first, so that the question "do we have a regression or is the test at fault" is clearer.
@avikivity this is not trivial even if we tested all our jobs on master (and we are not even close to doing that). The problem is with the nemesis approach: it used to be random, and now it's less random, but with changes to the nemesis set it becomes different from branch to branch (IOW, the order of nemeses can change between releases).
I looked at the history since 4.0 and didn't see this nemesis ever run on this job. @ShlomiBalalis I think there is a (kind of) scale test for manager (probably not that big); please check that and report back. If not, you can use this longevity as a base test for it and it will serve as a reproducer for this issue.
@soyacz for now, let's change the seed of this test and run it with different set of nemesis.
Started test with different seed: https://jenkins.scylladb.com/job/enterprise-2022.2/job/scale/job/scale-5000-tables-test/5/
I see that the manager repair nemesis will be nemesis 107 in the nemesis list, so the test will most likely finish before reaching it...
No worries about that - in this test there's a 60-minute nemesis_interval. I have a code snippet that quickly shows the nemesis list for a given seed (and other settings), so I didn't shoot blindly.
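For reference, the point that the nemesis order is a deterministic function of the seed can be sketched like this (a minimal illustration with hypothetical nemesis names, not SCT's actual selection code):

```python
import random

def nemesis_order(nemeses, seed):
    """Return the disruption order a given seed produces."""
    rng = random.Random(seed)  # private RNG, independent of global state
    order = list(nemeses)      # copy so the caller's list isn't mutated
    rng.shuffle(order)
    return order

# Hypothetical names for illustration only.
nemeses = ["MgmtRepair", "DrainerMonkey", "DecommissionMonkey", "RestartMonkey"]
print(nemesis_order(nemeses, seed=1))
print(nemesis_order(nemeses, seed=2))  # a different seed reorders the list
```

The same seed always reproduces the same order, which is why changing the seed is enough to move a given nemesis earlier or later in the run.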
Hey, the issue is related to the scylla instance that scylla-manager uses to track the progress of the repair task.
`"error":"gocql: no response received from cassandra within timeout period"`
is returned when the manager tries to update its own database with the new progress value.
Scylla-manager keeps its state in its own instance of scylla-db, which by default is installed on the same node as the manager.
I wonder if I can find anything in the logs that comes from the manager-owned scylla instance? I would need some help here.
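Since this is a client-side gocql timeout on the progress update, one possible mitigation (if it turns out to be transient under the 5000-table load) is retrying the write with backoff. A generic sketch of the idea, a hypothetical helper written in Python for brevity rather than the manager's actual Go code:

```python
import time

def retry_on_timeout(fn, attempts=3, base_delay=0.1):
    """Call fn(), retrying on TimeoutError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

Whether retrying is appropriate here depends on why the manager's backend scylla stopped responding in time, which is exactly what its logs should tell us.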
Closing, as this issue https://github.com/scylladb/qa-tasks/issues/886 is supposed to add the manager backend's scylla logs to the test output. Having them, we will be able to investigate what happened and which limit was hit.
The logs of the manager's backend are now collected (https://github.com/scylladb/scylla-cluster-tests/pull/6116); we should really keep this bug open and try to reproduce it.
OK, please update the issue with logs whenever `gocql: no response received from cassandra within timeout period` appears.
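Until that happens, occurrences can be pulled out of collected log text with a small scanner like this (a hypothetical helper, not part of SCT):

```python
import re

# The exact gocql error string reported in this issue.
TIMEOUT_RE = re.compile(
    r"gocql: no response received from cassandra within timeout period"
)

def find_timeout_lines(log_text):
    """Return (line_number, line) pairs matching the gocql timeout error."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if TIMEOUT_RE.search(line)
    ]
```

Running it over the extracted manager log would give the timestamps to correlate against the manager backend's scylla logs.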
Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc1-20220902.a9bc6d191031 with build-id 074a0cb9e6a5ab36ba5e7f81385e68079ab6eeda
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc1.0.20220902.a9bc6d191031.tar.gz
Cluster size: 1 node (i3.8xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0492fd8e302e5af45 (aws: eu-west-1)

Test: scale-5000-tables-test
Test id: e63869e7-14a8-4815-aee1-d12107a6f01d
Test name: enterprise-2022.2/scale/scale-5000-tables-test
Test config file(s):

Issue description
During a test with 5000 tables, scylla-manager was failing to obtain the repair task status due to timeouts. In the manager logs we can see errors like:
Scylla manager version: 3.0.0-0.20220523.5501e5d7f53
$ hydra investigate show-logs e63869e7-14a8-4815-aee1-d12107a6f01d
Logs:
| 20220915_203649 | critical | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/critical-e63869e7.log.tar.gz |
| 20220915_203649 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/db-cluster-e63869e7.tar.gz |
| 20220915_203649 | debug | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/debug-e63869e7.log.tar.gz |
| 20220915_203649 | email_data | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/email_data-e63869e7.json.tar.gz |
| 20220915_203649 | error | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/error-e63869e7.log.tar.gz |
| 20220915_203649 | event | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/events-e63869e7.log.tar.gz |
| 20220915_203649 | left_processes | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/left_processes-e63869e7.log.tar.gz |
| 20220915_203649 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/loader-set-e63869e7.tar.gz |
| 20220915_203649 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/monitor-set-e63869e7.tar.gz |
| 20220915_203649 | normal | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/normal-e63869e7.log.tar.gz |
| 20220915_203649 | output | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/output-e63869e7.log.tar.gz |
| 20220915_203649 | event | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/raw_events-e63869e7.log.tar.gz |
| 20220915_203649 | summary | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/summary-e63869e7.log.tar.gz |
| 20220915_203649 | warning | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/warning-e63869e7.log.tar.gz |
Jenkins job URL