scylladb / scylla-tools-java

Apache Cassandra, supplying tools for Scylla
Apache License 2.0

cassandra-stress can keep running even if thread had failed #168

Open dkropachev opened 4 years ago

dkropachev commented 4 years ago

Steps to reproduce:

  1. Run c-s with 40 threads:

    cassandra-stress read cl=QUORUM duration=240m -schema keyspace=keyspace1 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=40 -pop seq=1..20971520 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5 -node 10.0.2.221 -errors skip-unsupported-columns

  2. Make one thread fail. In this test the thread failed due to a CQL error (QUORUM inconsistency).

Result:

c-s hung for 1 hour until it produced:

    FAILURE
    java.lang.RuntimeException: Failed to execute stress action
        at org.apache.cassandra.stress.StressAction.run(StressAction.java:101)
        at org.apache.cassandra.stress.Stress.run(Stress.java:143)
        at org.apache.cassandra.stress.Stress.main(Stress.java:62)


Test-id: 6bb58cd8-dd28-4afd-8a0d-dbc73e2489a4
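
For illustration only, here is a minimal sketch of the behavior one would expect instead: the first failed worker flips a shared stop flag, so the remaining workers exit promptly rather than running on until the configured duration ends. This is not the cassandra-stress implementation. `FailFastWorkers`, `Op`, and `runAll` are hypothetical names invented for this sketch, and the failing operation is simulated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class FailFastWorkers {
    /** Hypothetical stand-in for a single stress operation. */
    interface Op {
        void run(int workerId, long iteration) throws Exception;
    }

    /**
     * Runs `threads` workers until `durationMillis` elapses or any worker fails.
     * Returns the first failure, or null if every worker finished cleanly.
     */
    static Throwable runAll(int threads, long durationMillis, Op op) throws InterruptedException {
        AtomicBoolean stop = new AtomicBoolean(false);
        AtomicReference<Throwable> firstFailure = new AtomicReference<>();
        long deadline = System.nanoTime() + durationMillis * 1_000_000L;
        List<Thread> workers = new ArrayList<>();

        for (int i = 0; i < threads; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                long iteration = 0;
                try {
                    // Every worker checks the shared flag before each operation.
                    while (!stop.get() && deadline - System.nanoTime() > 0) {
                        op.run(id, iteration++);
                    }
                } catch (Throwable e) {
                    // Record only the first failure and tell all other workers to stop.
                    firstFailure.compareAndSet(null, e);
                    stop.set(true);
                }
            }, "worker-" + id);
            workers.add(t);
            t.start();
        }

        for (Thread t : workers) {
            t.join();
        }
        return firstFailure.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // Simulated failure: worker 2 throws on its fourth operation.
        Throwable failure = runAll(4, 60_000, (id, iteration) -> {
            if (id == 2 && iteration == 3) {
                throw new RuntimeException("simulated QUORUM failure");
            }
            Thread.sleep(10);
        });
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("failure=" + failure + " elapsedMs=" + elapsedMs);
    }
}
```

With this pattern the run ends shortly after the simulated failure instead of after the full 60-second duration. Whether the hang reported here comes from a missing stop signal of this kind or from something else has not been established in this issue.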

[c-s.log](https://github.com/scylladb/scylla-tools-java/files/4677672/c-s.log)

dkropachev commented 3 years ago

Another occurrence, with debug output: [Uploading cassandra-stress-l0-c0-k1-01665285-0ef1-408f-9325-484098e432a4.log…]()

fruch commented 1 year ago

happened during testing of 2023.1

Installation details

- Kernel Version: 5.15.0-1036-aws
- Scylla version (or git commit hash): 2023.1.0~rc6-20230517.ca8d6a0d4fa7 with build-id 3c3e22ad787d01bbfda9da05aa4a62beb1004157

Cluster size: 3 nodes (i3en.large)

Scylla Nodes used in this run:

OS / Image: ami-094190108e73c7d8e (aws: eu-west-1)

- Test: longevity-schema-changes-3h-test
- Test id: 7db11cad-2048-48e0-8e19-c416184fa6d2
- Test name: enterprise-2023.1/SCT_Enterprise_Features/audit/longevity-schema-changes-3h-test
- Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 7db11cad-2048-48e0-8e19-c416184fa6d2`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7db11cad-2048-48e0-8e19-c416184fa6d2)
- Show all stored logs command: `$ hydra investigate show-logs 7db11cad-2048-48e0-8e19-c416184fa6d2`

## Logs:

- **db-cluster-7db11cad.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/db-cluster-7db11cad.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/db-cluster-7db11cad.tar.gz)
- **sct-runner-events-7db11cad.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-runner-events-7db11cad.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-runner-events-7db11cad.tar.gz)
- **sct-7db11cad.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-7db11cad.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/sct-7db11cad.log.tar.gz)
- **monitor-set-7db11cad.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/monitor-set-7db11cad.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/monitor-set-7db11cad.tar.gz)
- **loader-set-7db11cad.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/loader-set-7db11cad.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/loader-set-7db11cad.tar.gz)
- **parallel-timelines-report-7db11cad.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/parallel-timelines-report-7db11cad.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7db11cad-2048-48e0-8e19-c416184fa6d2/20230629_111120/parallel-timelines-report-7db11cad.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/SCT_Enterprise_Features/job/audit/job/longevity-schema-changes-3h-test/3/)

[Argus](https://argus.scylladb.com/test/32eea484-9fbe-48bd-a0a8-ed9e202706ad/runs?additionalRuns[]=7db11cad-2048-48e0-8e19-c416184fa6d2)

fruch commented 1 year ago

happened also in multi-dc case: https://github.com/scylladb/scylladb/issues/13667

seems like it happens when there are lots of errors during the run

roydahan commented 1 year ago

@mykaul can you please help us assign this issue? It makes our longevities hard to investigate.

dkropachev commented 1 year ago

@roydahan, @mykaul, I will take a look at it

roydahan commented 1 year ago

@dkropachev any chance you looked at this one?