
[AMI 1.0-rc1 m3.medium] - Longevity Tests - drainer_monkey - cassandra-stress does not survive #1129

Closed: lmr closed this issue 8 years ago

lmr commented 8 years ago

Installation details

Scylla version (or git commit hash): 1.0-rc1
Cluster size: m3.medium
OS (RHEL/CentOS/Ubuntu/AWS AMI): CentOS

Hardware details (for performance issues)

Platform (physical/VM/cloud instance type/docker): AWS
Hardware: sockets= cores= hyperthreading= memory=
Disks: (SSD/HDD, count)

Running the latest AMI for 1.0-rc1 on m3.medium instances, using our Drainer monkey, cassandra-stress survives for only about 22 hours of the 24-hour (1440m) run before failing:

17:52:12 remote           L0773 DEBUG| Remote [centos@52.53.253.67]: Running 'cassandra-stress write cl=QUORUM duration=1440m -schema 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=1000 -node 172.31.24.179'

15:02:24 remote           L0182 DEBUG| [52.53.253.67] [stdout] java.io.IOException: Operation x10 on key(s) [39393832303835504e30]: Error executing: (WriteTimeoutException): Cassandra timeout during write query at consistency QUORUM (2 replica were required but only 1 acknowledged the write)
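
For reference, with the replication(factor=3) schema used above, a QUORUM write needs a majority of the replicas to acknowledge before the write timeout:

quorum = floor(RF / 2) + 1 = floor(3 / 2) + 1 = 2

so the exception above means only 1 of the 2 required replicas acknowledged the write in time.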

As an exception, this also failed for the c3.large instance size (see issue #962). The other nemeses did pass our previous test runs on those larger instance sizes.

slivne commented 8 years ago

@lmr aside from the timeout, is there another issue ...

Seeing some timeouts while some of the load is passing (e.g. c-s printing stats lines with some ops completing) may be OK.

If c-s prints 0 completed operations, then that is probably not OK.

Another way to validate that this is OK is to run the same load against a Cassandra cluster using the same instance type - if it "works" on C*, then we certainly need to keep digging.
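
As a rough sketch of that check, the identical c-s invocation could be pointed at a Cassandra node running on the same m3.medium instance type (the node address below is a placeholder, not a real host):

cassandra-stress write cl=QUORUM duration=1440m -schema 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=1000 -node <cassandra-node-ip>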

If we reduce the load concurrency (fewer threads), will it work?
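
For example, a sketch of the same command with an arbitrary lower thread count of 250 (any value well below the original 1000 would do):

cassandra-stress write cl=QUORUM duration=1440m -schema 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=250 -node 172.31.24.179

If the timeouts disappear at the lower concurrency, that would point at the m3.medium simply being saturated rather than at a correctness problem.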

lmr commented 8 years ago

This issue is caused by insufficient resources to handle the c-s command thrown at the cluster. I'm closing this, as there is currently no reasonable solution for it.