scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Timeouts on repair task for test with 5000 tables #3237

Open soyacz opened 2 years ago

soyacz commented 2 years ago

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc1-20220902.a9bc6d191031 with build-id 074a0cb9e6a5ab36ba5e7f81385e68079ab6eeda
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc1.0.20220902.a9bc6d191031.tar.gz
Cluster size: 1 node (i3.8xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0492fd8e302e5af45 (aws: eu-west-1)

Test: scale-5000-tables-test
Test id: e63869e7-14a8-4815-aee1-d12107a6f01d
Test name: enterprise-2022.2/scale/scale-5000-tables-test
Test config file(s):

Issue description

During the test with 5000 tables, scylla-manager was failing to obtain the repair task status due to timeouts. In the manager logs we can see errors like:

Sep 14 15:29:14 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"INFO","T":"2022-09-14T15:29:14.243Z","N":"http","M":"GET /api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74/task/repair/453d69fe-b4a6-42da-92c9-0109c0a3126e","from":"127.0.0.1:49510","status":500,"bytes":170,"duration":"3580ms","error":"load cluster \"06c2a9f5-6c93-4add-a0c2-12b5635c6b74\": gocql: no response received from cassandra within timeout period","_trace_id":"pMslOLNsTVGWXksqm8kAkw"}
Sep 14 15:53:05 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"INFO","T":"2022-09-14T15:53:04.976Z","N":"http","M":"GET /api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74/tasks?all=false","from":"127.0.0.1:46612","status":500,"bytes":170,"duration":"2094ms","error":"load cluster \"06c2a9f5-6c93-4add-a0c2-12b5635c6b74\": gocql: no response received from cassandra within timeout period","_trace_id":"eXMTtzFoTyWqS-N2vb-zCw"}
Sep 14 15:53:09 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"ERROR","T":"2022-09-14T15:53:08.248Z","N":"repair","M":"Update repair progress","run_id":"c955ffe0-3441-11ed-aec6-0abcb6784b61","key":{},"error":"gocql: no response received from cassandra within timeout period","_trace_id":"Fvvwe_8LQvO5WkX1E6117A","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/repair.(*dbProgressManager).OnJobResult\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/repair/progress.go:279\ngithub.com/scylladb/scylla-manager/v3/pkg/service/repair.(*worker).Run\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/repair/worker.go:71\ngithub.com/scylladb/scylla-manager/v3/pkg/service/repair.(*Service).Repair.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/repair/service.go:408\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57"}
Sep 14 15:55:31 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"INFO","T":"2022-09-14T15:55:31.331Z","N":"http","M":"GET /api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74/task/repair/453d69fe-b4a6-42da-92c9-0109c0a3126e/latest","from":"127.0.0.1:49324","status":500,"bytes":180,"duration":"2065ms","error":"load progress for task \"453d69fe-b4a6-42da-92c9-0109c0a3126e\": gocql: no response received from cassandra within timeout period","_trace_id":"_KG9EIcPS_-T00bX7Efkbw"}
Sep 14 15:56:44 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"INFO","T":"2022-09-14T15:56:44.253Z","N":"http","M":"GET /api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74/task/repair/453d69fe-b4a6-42da-92c9-0109c0a3126e/latest","from":"127.0.0.1:51140","status":500,"bytes":180,"duration":"4993ms","error":"load progress for task \"453d69fe-b4a6-42da-92c9-0109c0a3126e\": gocql: no response received from cassandra within timeout period","_trace_id":"UmqoYjWAR42QnUkNoPr2Gg"}
Sep 14 15:57:57 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"INFO","T":"2022-09-14T15:57:57.428Z","N":"http","M":"GET /api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74/task/repair/453d69fe-b4a6-42da-92c9-0109c0a3126e/latest","from":"127.0.0.1:52462","status":500,"bytes":180,"duration":"7116ms","error":"load progress for task \"453d69fe-b4a6-42da-92c9-0109c0a3126e\": gocql: no response received from cassandra within timeout period","_trace_id":"461OIyGDRjuQCGClwT-1cA"}
Sep 14 16:00:39 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"INFO","T":"2022-09-14T16:00:39.009Z","N":"http","M":"GET /api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74/task/repair/453d69fe-b4a6-42da-92c9-0109c0a3126e/latest","from":"127.0.0.1:54806","status":500,"bytes":180,"duration":"3859ms","error":"load progress for task \"453d69fe-b4a6-42da-92c9-0109c0a3126e\": gocql: no response received from cassandra within timeout period","_trace_id":"YwcE9_t_SfKLi5MbXi3Dhg"}
Sep 14 16:01:46 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"INFO","T":"2022-09-14T16:01:46.771Z","N":"http","M":"GET /api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74/task/repair/453d69fe-b4a6-42da-92c9-0109c0a3126e/latest","from":"127.0.0.1:55992","status":500,"bytes":180,"duration":"1943ms","error":"load progress for task \"453d69fe-b4a6-42da-92c9-0109c0a3126e\": gocql: no response received from cassandra within timeout period","_trace_id":"a-vLMdjtRxK_XvcpA7P4Zw"}
Sep 14 16:01:47 longevity-5000-tables-2022-2-monitor-node-e63869e7-1 scylla-manager[15438]: {"L":"ERROR","T":"2022-09-14T16:01:47.403Z","N":"scheduler.06c2a9f5","M":"Cannot update the run","task":"healthcheck/8bc18090-4d4f-4302-abc9-75429eeb695e","run":{"cluster_id":"06c2a9f5-6c93-4add-a0c2-12b5635c6b74","type":"healthcheck","task_id":"8bc18090-4d4f-4302-abc9-75429eeb695e","id":"13814000-1dd2-11b2-aec5-0abcb6784b61","status":"DONE","owner":"","start_time":"2022-09-14T16:01:45.029198308Z","end_time":"2022-09-14T16:01:45.535494291Z"},"error":"gocql: no response received from cassandra within timeout period","_trace_id":"5AeBnn25Tmail6x_OFCT9g","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/scheduler.(*Service).run.func3\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/scheduler/service.go:414\ngithub.com/scylladb/scylla-manager/v3/pkg/service/scheduler.(*Service).run\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/scheduler/service.go:432\ngithub.com/scylladb/scylla-manager/v3/pkg/scheduler.(*Scheduler).asyncRun.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/scheduler/scheduler.go:412"}
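
For reference, the failing requests above are status polls against the manager's REST API. Below is a minimal sketch of such a poll; the listen address 127.0.0.1:5080 is assumed to be the manager's local HTTP endpoint (not confirmed from this run), and the cluster/task UUIDs are copied from the logs.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Assumed address: scylla-manager's HTTP API typically listens on
	// 127.0.0.1:5080 (see the "http:" setting in scylla-manager.yaml).
	// The cluster and repair-task UUIDs are the ones from the logs above.
	url := "http://127.0.0.1:5080/api/v1/cluster/06c2a9f5-6c93-4add-a0c2-12b5635c6b74" +
		"/task/repair/453d69fe-b4a6-42da-92c9-0109c0a3126e/latest"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// During the failure window the manager answered these polls with
	// HTTP 500 and a gocql timeout error in the body.
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```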

Scylla manager version: 3.0.0-0.20220523.5501e5d7f53

Logs:

| Date | Log type | Link |
| --- | --- | --- |
| 20220915_203649 | critical | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/critical-e63869e7.log.tar.gz |
| 20220915_203649 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/db-cluster-e63869e7.tar.gz |
| 20220915_203649 | debug | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/debug-e63869e7.log.tar.gz |
| 20220915_203649 | email_data | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/email_data-e63869e7.json.tar.gz |
| 20220915_203649 | error | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/error-e63869e7.log.tar.gz |
| 20220915_203649 | event | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/events-e63869e7.log.tar.gz |
| 20220915_203649 | left_processes | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/left_processes-e63869e7.log.tar.gz |
| 20220915_203649 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/loader-set-e63869e7.tar.gz |
| 20220915_203649 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/monitor-set-e63869e7.tar.gz |
| 20220915_203649 | normal | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/normal-e63869e7.log.tar.gz |
| 20220915_203649 | output | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/output-e63869e7.log.tar.gz |
| 20220915_203649 | event | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/raw_events-e63869e7.log.tar.gz |
| 20220915_203649 | summary | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/summary-e63869e7.log.tar.gz |
| 20220915_203649 | warning | https://cloudius-jenkins-test.s3.amazonaws.com/e63869e7-14a8-4815-aee1-d12107a6f01d/20220915_203649/warning-e63869e7.log.tar.gz |

Jenkins job URL

roydahan commented 2 years ago

@soyacz did it work with previous manager version / scylla version? (What did we run with when we tested 5.0 and/or 2022.1)

slivne commented 2 years ago

@karol-kokoszka is this a known issue?

soyacz commented 2 years ago

We didn't run this nemesis in this test for 5.0/2022.1

avikivity commented 2 years ago

A test should not run for the first time on a branch. It should be run on master first, so that the question "do we have a regression or is the test at fault" is clearer.

roydahan commented 2 years ago

@avikivity this is not trivial even if we tested all our jobs on master (and we're not even close to doing that). The problem is the nemesis approach: it used to be random; now it's less random, but with changes to the nemeses it becomes different from branch to branch (IOW, the order of nemeses can change between releases).

roydahan commented 2 years ago

> We didn't run this nemesis in this test for 5.0/2022.1

I looked at the history since 4.0 and didn't see this nemesis ever run on this job. @ShlomiBalalis I think there is a (kind of) scale test for manager (probably not that big); please check that and report back - if not, you can use this longevity as a base test for it and it'll be a reproducer for this issue.

@soyacz for now, let's change the seed of this test and run it with a different set of nemeses.

soyacz commented 2 years ago

> @soyacz for now, let's change the seed of this test and run it with a different set of nemeses.

Started a test with a different seed: https://jenkins.scylladb.com/job/enterprise-2022.2/job/scale/job/scale-5000-tables-test/5/

fgelcer commented 2 years ago

> @soyacz for now, let's change the seed of this test and run it with a different set of nemeses.
>
> Started a test with a different seed: https://jenkins.scylladb.com/job/enterprise-2022.2/job/scale/job/scale-5000-tables-test/5/

I see that the manager repair nemesis will be nemesis 107 in the nemesis list, so the test will most likely finish before reaching it...

soyacz commented 2 years ago

> @soyacz for now, let's change the seed of this test and run it with a different set of nemeses.
>
> Started a test with a different seed: https://jenkins.scylladb.com/job/enterprise-2022.2/job/scale/job/scale-5000-tables-test/5/
>
> I see that the manager repair nemesis will be nemesis 107 in the nemesis list, so the test will most likely finish before reaching it...

No worries about that - the test uses a 60-minute nemesis_interval. I have a code snippet that quickly shows the nemesis list for a given seed (and other settings), so I didn't shoot blindly.
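
For reference, a minimal sketch of the kind of helper meant here, assuming the nemesis order is derived from a pseudo-random shuffle seeded with the test seed; the function name, seed value, and nemesis names are illustrative and not SCT's actual code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// previewNemesisOrder shows the order nemeses would run in for a given seed,
// assuming the order comes from a seeded pseudo-random shuffle. This only
// illustrates the idea; SCT's real selection logic differs.
func previewNemesisOrder(seed int64, nemeses []string) []string {
	out := append([]string(nil), nemeses...)
	r := rand.New(rand.NewSource(seed))
	r.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	// Hypothetical nemesis names, for illustration only.
	nemeses := []string{"MgmtRepair", "DrainerMonkey", "RestartThenRepairNode", "NoCorruptRepair"}
	for i, name := range previewNemesisOrder(2022, nemeses) {
		fmt.Printf("%3d. %s\n", i+1, name)
	}
}
```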

karol-kokoszka commented 1 year ago

Hey, the issue is related to the Scylla instance that scylla-manager uses to track the progress of the repair task. "error":"gocql: no response received from cassandra within timeout period" is returned when the manager tries to update its own database with a new progress value. Scylla-manager keeps its state in its own Scylla instance, which by default is installed on the same node as the manager.
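
For reference, a minimal gocql sketch showing where this error comes from: the driver returns gocql.ErrTimeoutNoResponse when a query gets no reply within ClusterConfig.Timeout, which is exactly the message in the logs above. The host, keyspace name, timeout value, and query below are placeholders, not the manager's actual configuration or schema.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder connection details: the manager's backend Scylla normally
	// runs locally and the manager keeps its state in its own keyspace.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "scylla_manager" // assumed keyspace name
	// Queries that get no response within this timeout fail with
	// "gocql: no response received from cassandra within timeout period"
	// (gocql.ErrTimeoutNoResponse) -- the error seen in the manager logs.
	cluster.Timeout = 600 * time.Millisecond

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Illustrative query only; the real progress-update statements live in
	// the manager's repair progress manager (pkg/service/repair/progress.go).
	var version string
	if err := session.Query(`SELECT release_version FROM system.local`).Scan(&version); err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println("backend release:", version)
}
```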

I wonder if I can find anything in the logs that comes from the manager-owned Scylla instance? I would need some help here.

karol-kokoszka commented 1 year ago

Closing, as https://github.com/scylladb/qa-tasks/issues/886 is supposed to add the manager backend's Scylla logs to the test output. With those logs, we will be able to investigate what happened and which limit was hit.

ShlomiBalalis commented 1 year ago

> Closing, as scylladb/qa-tasks#886 is supposed to add the manager backend's Scylla logs to the test output. With those logs, we will be able to investigate what happened and which limit was hit.

The logs of the manager's backend are now collected (https://github.com/scylladb/scylla-cluster-tests/pull/6116); we should really keep this bug open and try to reproduce it.

karol-kokoszka commented 1 year ago

> The logs of the manager's backend are now collected (https://github.com/scylladb/scylla-cluster-tests/pull/6116); we should really keep this bug open and try to reproduce it.

OK, please update the issue with logs whenever "gocql: no response received from cassandra within timeout period" appears.