Open yarongilor opened 1 year ago
it is intended that the test will timeout earlier.. what is the problem here, is that something was stuck, and test ended with timeout error, instead of gracefully end..
@yarongilor , please update issue title and description to explain what the error is
@fgelcer , i'm not sure, there may be several alternatives to solve this error - either increasing test-duration or changing nemesis seed, or any other solution
@fgelcer , i'm not sure, there may be several alternatives to solve this error - either increasing test-duration or changing nemesis seed, or any other solution
@yarongilor none of those are not a solution 1) duration is a given criteria to those tests 2) changing seed is temporary - since adding of new nemesis would scramble it again.
you still didn't explained here what is the root cause of the issue, so no solving can be discussed yet.
As seen above the prepare duration is 300 mins:
< t:2023-07-17 14:34:43,936 f:tester.py l:384 c:LongevityTest p:INFO > SCT Test duration: 720; Stress duration: 360; Prepare duration: 300
But the prepare stress actually runs more than double this period:
< t:2023-07-17 15:16:37,057 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo docker exec c4003b01f7a904bf00c5a7216217f6245cd9c7c541a7d885e92b1d33c005b4a4 /bin/sh -c 'echo TAG: loader_idx:0-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=YYEG9D23N312HI1YAU1C; cassandra-stress user no-warmup profile=/tmp/c-s_profile_3si_5queries.yaml ops'"'"'(insert=25,si_read1=1,si_read2=1,si_read3=1,si_read4=1,si_read5=1)'"'"' cl=QUORUM n=100000000 -mode cql3 native -rate threads=10 -node 10.12.11.133,10.12.8.59,10.12.10.249,10.12.9.45,10.12.8.140 -errors skip-unsupported-columns'"...
< t:2023-07-18 01:35:19,632 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > stress_cmd=cassandra-stress user profile=/tmp/c-s_profile_3si_5queries.yaml ops'(insert=25,si_read1=1,si_read2=1,si_read3=1,si_read4=1,si_read5=1)' cl=QUORUM n=100000000 -mode cql3 native -rate threads=10
< t:2023-07-18 01:35:20,601 f:loader_utils.py l:216 c:LongevityTest p:INFO > Prepare finished
So prepare finished after ~ 680 minutes and not 300 minutes, so the stress_duration only left 720 - 680 = 40 minutes which is less than the "planned" stress-duration above of 360:
< t:2023-07-18 01:38:17,406 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo docker exec 041a5cc45f9f5efbc782a9460f3202434a567d982526c56f2b6e2f140e58293b /bin/sh -c 'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=NTSYVKH5UMSQFU7VR1FA; cassandra-stress user no-warmup profile=/tmp/c-s_profile_2si_2queries.yaml ops'"'"'(insert=10,si_p_read1=1,si_p_read2=1)'"'"' cl=QUORUM duration=360m -mode cql3 native -rate threads=10 -node 10.12.8.59,10.12.9.45,10.12.8.140,10.12.9.48,10.12.9.107 -errors skip-unsupported-columns'"...
@fruch , @fgelcer , there could be multiple alternatives to solve this, following the fix of a76ed862a - feature(pipeline_test_duration): Generic test_duration for longevity
:
I think that what the prepare_duration is meant for.
3rd option is to optimize the commands and split them to ranges to make prepare work faster
I think that what the prepare_duration is meant for.
3rd option is to optimize the commands and split them to ranges to make prepare work faster
i have been working in reducing the prepare time in this job (https://github.com/scylladb/scylla-cluster-tests/pull/6435)
Issue description
Test reported the following durations:
Test got an error event of:
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Describe the frequency with how this issue can be reproduced.
Installation details
Kernel Version: 5.15.0-1039-aws Scylla version (or git commit hash):
5.4.0~dev-20230717.567b4536892f
with build-id98731517a6828085106956bea9f922fbb96697dc
Cluster size: 5 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: `` (aws: undefined_region)
Test:
longevity-mv-si-4days-test
Test id:8138988d-92f9-4d06-82cd-bce1289b84c8
Test name:scylla-master/longevity/longevity-mv-si-4days-test
Test config file(s):Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 8138988d-92f9-4d06-82cd-bce1289b84c8` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=8138988d-92f9-4d06-82cd-bce1289b84c8) - Show all stored logs command: `$ hydra investigate show-logs 8138988d-92f9-4d06-82cd-bce1289b84c8` ## Logs: - **db-cluster-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/db-cluster-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/db-cluster-8138988d.tar.gz) - **sct-runner-events-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-runner-events-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-runner-events-8138988d.tar.gz) - **sct-8138988d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-8138988d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/sct-8138988d.log.tar.gz) - **loader-set-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/loader-set-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/loader-set-8138988d.tar.gz) - **monitor-set-8138988d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/monitor-set-8138988d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8138988d-92f9-4d06-82cd-bce1289b84c8/20230718_024447/monitor-set-8138988d.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-mv-si-4days-test/3/) [Argus](https://argus.scylladb.com/test/b49dc529-c8e7-4117-a01d-2c6669c4dc05/runs?additionalRuns[]=8138988d-92f9-4d06-82cd-bce1289b84c8)