scylladb / scylla-cluster-tests

Tests for Scylla Clusters

Test duration error: "Test started at 2023-07-17 15:12:43, reached it's timeout (720 minute)" #6414

Open · yarongilor opened this issue 11 months ago

yarongilor commented 11 months ago

Issue description

Test reported the following durations:

17:34:43  < t:2023-07-17 14:34:43,121 f:sct_config.py   l:1822 c:sdcm.sct_config      p:INFO  > test_duration: 6550
17:34:44  < t:2023-07-17 14:34:43,936 f:tester.py       l:384  c:LongevityTest        p:INFO  > SCT Test duration: 720; Stress duration: 360; Prepare duration: 300

Test got an error event of:

18:12:43  < t:2023-07-17 15:12:43,388 f:tester.py       l:637  c:LongevityTest        p:INFO  > Test start time 2023-07-17 15:12:43, duration is 720 and timeout set to 2023-07-18 03:12:43
2023-07-18 03:12:43.770: (TestTimeoutEvent Severity.CRITICAL) period_type=not-set event_id=a50df069-493f-4c3d-8793-0123bd7916b2, Test started at 2023-07-17 15:12:43, reached it's timeout (720 minute)
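
For context, here is a minimal Python sketch of the arithmetic behind these two log lines (illustrative only, not SCT's actual implementation): the deadline is simply the test start time plus the configured duration in minutes, and a critical timeout event fires once that deadline passes.

```python
# Illustrative sketch only (not SCT code): reproduce the deadline arithmetic
# shown in the log lines above.
from datetime import datetime, timedelta

start_time = datetime(2023, 7, 17, 15, 12, 43)   # "Test start time 2023-07-17 15:12:43"
duration_min = 720                                # "duration is 720"

deadline = start_time + timedelta(minutes=duration_min)
print(deadline)  # 2023-07-18 03:12:43 -> matches "timeout set to 2023-07-18 03:12:43"

now = datetime(2023, 7, 18, 3, 12, 44)            # any moment past the deadline
if now >= deadline:
    # The real framework raises a critical TestTimeoutEvent at this point.
    print(f"Test started at {start_time:%Y-%m-%d %H:%M:%S}, "
          f"reached its timeout ({duration_min} minute)")
```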

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe how frequently this issue can be reproduced.

Installation details

Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): `5.4.0~dev-20230717.567b4536892f` with build-id `98731517a6828085106956bea9f922fbb96697dc`

Cluster size: 5 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: `` (aws: undefined_region)

Test: `longevity-mv-si-4days-test`
Test id: `8138988d-92f9-4d06-82cd-bce1289b84c8`
Test name: `scylla-master/longevity/longevity-mv-si-4days-test`
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 8138988d-92f9-4d06-82cd-bce1289b84c8`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=8138988d-92f9-4d06-82cd-bce1289b84c8)
- Show all stored logs command: `$ hydra investigate show-logs 8138988d-92f9-4d06-82cd-bce1289b84c8`

Logs:

- **db-cluster-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/db-cluster-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/db-cluster-8138988d.tar.gz)
- **sct-runner-events-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-runner-events-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-runner-events-8138988d.tar.gz)
- **sct-8138988d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-8138988d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-8138988d.log.tar.gz)
- **loader-set-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/loader-set-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/loader-set-8138988d.tar.gz)
- **monitor-set-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/monitor-set-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/monitor-set-8138988d.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-mv-si-4days-test/3/)
[Argus](https://argus.scylladb.com/test/b49dc529-c8e7-4117-a01d-2c6669c4dc05/runs?additionalRuns[]=8138988d-92f9-4d06-82cd-bce1289b84c8)
fgelcer commented 11 months ago

It is intended that the test times out at that point. The problem here is that something got stuck, so the test ended with a timeout error instead of finishing gracefully.

@yarongilor, please update the issue title and description to explain what the error is.

yarongilor commented 11 months ago

@fgelcer, I'm not sure. There are several possible ways to address this error: increasing the test duration, changing the nemesis seed, or something else.

fruch commented 11 months ago

> @fgelcer, I'm not sure. There are several possible ways to address this error: increasing the test duration, changing the nemesis seed, or something else.

@yarongilor, none of those is a solution: 1) the duration is a given criterion for these tests; 2) changing the seed is only temporary, since adding a new nemesis would scramble it again.

You still haven't explained what the root cause of the issue is, so a solution can't be discussed yet.

yarongilor commented 11 months ago

As seen above, the prepare duration is 300 minutes:

< t:2023-07-17 14:34:43,936 f:tester.py       l:384  c:LongevityTest        p:INFO  > SCT Test duration: 720; Stress duration: 360; Prepare duration: 300

But the prepare stress actually ran for more than double that period:

< t:2023-07-17 15:16:37,057 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo  docker exec c4003b01f7a904bf00c5a7216217f6245cd9c7c541a7d885e92b1d33c005b4a4 /bin/sh -c 'echo TAG: loader_idx:0-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=YYEG9D23N312HI1YAU1C; cassandra-stress user no-warmup profile=/tmp/c-s_profile_3si_5queries.yaml ops'"'"'(insert=25,si_read1=1,si_read2=1,si_read3=1,si_read4=1,si_read5=1)'"'"' cl=QUORUM n=100000000 -mode cql3 native -rate threads=10 -node 10.12.11.133,10.12.8.59,10.12.10.249,10.12.9.45,10.12.8.140 -errors skip-unsupported-columns'"...
< t:2023-07-18 01:35:19,632 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress user profile=/tmp/c-s_profile_3si_5queries.yaml ops'(insert=25,si_read1=1,si_read2=1,si_read3=1,si_read4=1,si_read5=1)' cl=QUORUM n=100000000 -mode cql3 native -rate threads=10
< t:2023-07-18 01:35:20,601 f:loader_utils.py l:216  c:LongevityTest        p:INFO  > Prepare finished

So the prepare phase actually finished after roughly 680 minutes rather than 300, leaving only about 720 - 680 = 40 minutes for the stress phase, far less than the planned stress duration of 360 minutes:

< t:2023-07-18 01:38:17,406 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo  docker exec 041a5cc45f9f5efbc782a9460f3202434a567d982526c56f2b6e2f140e58293b /bin/sh -c 'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=NTSYVKH5UMSQFU7VR1FA; cassandra-stress user no-warmup profile=/tmp/c-s_profile_2si_2queries.yaml ops'"'"'(insert=10,si_p_read1=1,si_p_read2=1)'"'"' cl=QUORUM duration=360m -mode cql3 native -rate threads=10 -node 10.12.8.59,10.12.9.45,10.12.8.140,10.12.9.48,10.12.9.107 -errors skip-unsupported-columns'"...
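
The budget shortfall can be sanity-checked with the round numbers quoted above (values taken from this comment, not recomputed from the raw logs):

```python
# Rough budget check using the approximate figures quoted in this comment.
test_duration_min = 720     # overall SCT test duration
planned_prepare_min = 300   # "Prepare duration: 300"
planned_stress_min = 360    # "Stress duration: 360" (duration=360m in the stress command)
actual_prepare_min = 680    # approximate observed prepare runtime in this run

overrun = actual_prepare_min - planned_prepare_min
left_for_stress = test_duration_min - actual_prepare_min
print(overrun)                               # 380 minutes over the planned prepare budget
print(left_for_stress)                       # 40 minutes left for a 360-minute stress phase
print(left_for_stress < planned_stress_min)  # True -> the run must hit the test timeout
```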

@fruch, @fgelcer, there are several possible ways to address this, following the fix in a76ed862a (feature(pipeline_test_duration): Generic test_duration for longevity):

  1. increase the prepare duration from 300 to 700 minutes;
  2. decrease the load of the prepare command.
fruch commented 11 months ago

I think that's what the prepare_duration is meant for.

A third option is to optimize the commands and split them into ranges so that prepare runs faster.
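
For illustration only, here is a sketch of what that splitting could look like, assuming the prepare population can be partitioned with cassandra-stress's `-pop seq=start..end` option and run on several loaders in parallel. The command shape is borrowed from the log above; the helper itself is hypothetical and not part of SCT:

```python
# Hypothetical helper (not SCT code): split one big prepare insert into disjoint
# -pop seq ranges so several loaders can populate the data set in parallel.
BASE_CMD = (
    "cassandra-stress user no-warmup profile=/tmp/c-s_profile_3si_5queries.yaml "
    "ops'(insert=1)' cl=QUORUM -mode cql3 native -rate threads=10"
)

def split_prepare_cmd(total_rows: int, chunks: int) -> list[str]:
    """Return one cassandra-stress command per chunk, each writing a disjoint key range."""
    step = total_rows // chunks
    cmds = []
    for i in range(chunks):
        start = i * step + 1
        end = total_rows if i == chunks - 1 else (i + 1) * step
        cmds.append(f"{BASE_CMD} n={end - start + 1} -pop seq={start}..{end}")
    return cmds

# Example: 100M rows split across 4 loaders instead of one 100M-row command.
for cmd in split_prepare_cmd(100_000_000, 4):
    print(cmd)
```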

fgelcer commented 11 months ago

> I think that's what the prepare_duration is meant for.
>
> A third option is to optimize the commands and split them into ranges so that prepare runs faster.

I have been working on reducing the prepare time in this job (https://github.com/scylladb/scylla-cluster-tests/pull/6435).