scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

[dtest] test_replace_node_using_the_same_ip_then_shut_down fails with consistency errors #15334

Open fruch opened 1 year ago

fruch commented 1 year ago

cassandra-stress in replace_address_test.TestReplaceAddress.test_replace_node_using_the_same_ip_then_shut_down fails with a consistency error, even though only 1 of 3 nodes should be down:

ccmlib.node.ToolError: Subprocess ['stress', 'write', 'cl=QUORUM', 'duration=6m', '-rate', 'threads=10 throttle=1000/s', '-log', 'interval=5', '-schema', 'replication(factor=3) keyspace=keyspace2'] exited with non-zero status; exit status: 1; 
stdout: ******************** Stress Settings ********************
Command:
  Type: write
  Count: -1
  Duration: 6 MINUTES
  No Warmup: false
  Consistency Level: QUORUM
  Serial Consistency Level: SERIAL
  Target Uncertainty: not applicable
  Key Size (bytes): 10
  Counter Increment Distibution: add=fixed(1)
Rate:
  Auto: false
  Thread Count: 10
  OpsPer Sec: 1000
Population:
  Sequence: 1..1000000
  Order: ARBITRARY
  Wrap: true
Insert:
  Revisits: Uniform:  min=1,max=1000000
  Visits: Fixed:  key=1
  Row Population Ratio: Ratio: divisor=1.000000;delegate=Fixed:  key=1
  Batch Type: not batching
Columns:
  Max Columns Per Key: 5
  Column Names: [C0, C1, C2, C3, C4]
  Comparator: AsciiType
  Timestamp: null
  Variable Column Count: false
  Slice: false
  Size Distribution: Fixed:  key=34
  Count Distribution: Fixed:  key=5
Errors:
  Ignore: false
  Tries: 10
Log:
  No Summary: false
  No Settings: false
  File: null
  Interval Millis: 5000
  Level: NORMAL
Mode:
  API: JAVA_DRIVER_NATIVE
  Connection Style: CQL_PREPARED
  CQL Version: CQL3
  Protocol Version: V4
  Username: null
  Password: null
  Auth Provide Class: null
  Max Pending Per Connection: null
  Connections Per Host: 8
  Compression: NONE
Node:
  Nodes: [127.0.34.1]
  Is White List: false
  Datacenter: null
  Rack: null
Schema:
  Keyspace: keyspace2
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Replication Strategy Options: {replication_factor=3}
  Table Compression: null
  Table Compaction Strategy: null
  Table Compaction Strategy Options: {}
Transport:
  factory=org.apache.cassandra.thrift.TFramedTransportFactory; truststore=null; truststore-password=null; keystore=null; keystore-password=null; ssl-protocol=TLS; ssl-alg=SunX509; store-type=JKS; ssl-ciphers=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA; 
Port:
  Native Port: 9042
  Thrift Port: 9160
  JMX Port: 7199
Send To Daemon:
  *not set*
Graph:
  File: null
  Revision: unknown
  Title: null
  Operation: WRITE
TokenRange:
  Wrap: false
  Split Factor: 1
CloudConf:
  File: null

===== Using optimized driver!!! =====
Connected to cluster: test, max pending requests per connection null, max connections per host 8
Datatacenter: datacenter1; Host: /127.0.34.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.34.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.34.3; Rack: rack1
Created keyspaces. Sleeping 1s for propagation.
Sleeping 2s...
Warming up WRITE with 50000 iterations...
Running WRITE with 10 threads 6 minutes
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
total,          4118,     824,     824,     824,     0.3,     0.3,     0.4,     0.4,     0.6,     0.6,    5.0,  0.00000,      0,      0,       0,       0,       0,       0
total,          9118,    1000,    1000,    1000,     0.3,     0.3,     0.3,     0.4,     0.5,     4.0,   10.0,  0.06864,      0,      0,       0,       0,       0,       0
total,         14118,    1000,    1000,    1000,     0.3,     0.3,     0.4,     0.4,     0.5,     0.7,   15.0,  0.05110,      0,      0,       0,       0,       0,       0
total,         19118,    1000,    1000,    1000,     0.3,     0.3,     0.4,     0.4,     0.4,     0.5,   20.0,  0.04000,      0,      0,       0,       0,       0,       0
total,         24118,    1000,    1000,    1000,     0.3,     0.2,     0.3,     1.7,     2.1,     2.1,   25.0,  0.03275,      0,      0,       0,       0,       0,       0
total,         29118,    1000,    1000,    1000,     0.3,     0.2,     0.3,     2.3,     2.6,     2.6,   30.0,  0.02769,      0,      0,       0,       0,       0,       0
total,         34118,    1000,    1000,    1000,     0.2,     0.2,     0.3,     0.3,     0.4,     0.4,   35.0,  0.02396,      0,      0,       0,       0,       0,       0
total,         39118,    1000,    1000,    1000,     0.4,     0.3,     0.9,     2.0,     2.7,     4.9,   40.0,  0.02112,      0,      0,       0,       0,       0,       0
total,         44118,    1000,    1000,    1000,     0.3,     0.3,     0.3,     0.4,     0.9,     1.4,   45.0,  0.01887,      0,      0,       0,       0,       0,       0
total,         49118,    1000,    1000,    1000,     0.3,     0.3,     0.4,     0.5,     1.0,     2.0,   50.0,  0.01706,      0,      0,       0,       0,       0,       0
total,         50465,     269,     269,     269,     0.3,     0.3,     0.4,     0.4,     0.8,     1.0,   55.0,  0.06937,      0,      0,       0,       0,       0,       0
java.io.IOException: Operation x10 on key(s) [32334c35364c37373230]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x10 on key(s) [38343137373834333631]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
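The arithmetic behind the UnavailableException is simple: with replication(factor=3), CL=QUORUM needs a majority of the replicas, i.e. 2 of 3 alive. A small illustrative sketch (not ScyllaDB or driver code) of the quorum calculation:

```python
# Illustrative sketch of the quorum arithmetic behind the error above
# (not ScyllaDB source): QUORUM requires a majority of the replication factor.
def quorum(replication_factor: int) -> int:
    """Number of live replicas required for CL=QUORUM."""
    return replication_factor // 2 + 1

# With replication(factor=3), one node down still leaves 2 of 3 replicas,
# so QUORUM should succeed; the error "2 required but only 1 alive" means
# the coordinator saw two replicas as dead.
print(quorum(3))  # 2
```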

Spotted for the first time

Logs:

https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4364/testReport/junit/replace_address_test/TestReplaceAddress/FullDtest___full_split014___test_replace_node_using_the_same_ip_then_shut_down_rbo_disabled_/

kbr-scylla commented 1 year ago

I got this on my PR; it looks different but related: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3753/testReport/junit/replace_address_test/TestReplaceAddress/Tests___Sanity_Tests___test_replace_node_using_the_same_ip_then_shut_down_rbo_disabled_/

ccmlib.node.ToolError: Subprocess ['stress', 'write', 'cl=QUORUM', 'duration=6m', '-rate', 'threads=10 throttle=1000/s', '-log', 'interval=5', '-schema', 'replication(factor=3) keyspace=keyspace2'] exited with non-zero status; exit status: 1; 
stdout: ******************** Stress Settings ********************
Command:
  Type: write
  Count: -1
  Duration: 6 MINUTES
  No Warmup: false
  Consistency Level: QUORUM
  Serial Consistency Level: SERIAL
  Target Uncertainty: not applicable
  Key Size (bytes): 10
  Counter Increment Distibution: add=fixed(1)
Rate:
  Auto: false
  Thread Count: 10
  OpsPer Sec: 1000
Population:
  Sequence: 1..1000000
  Order: ARBITRARY
  Wrap: true
Insert:
  Revisits: Uniform:  min=1,max=1000000
  Visits: Fixed:  key=1
  Row Population Ratio: Ratio: divisor=1.000000;delegate=Fixed:  key=1
  Batch Type: not batching
Columns:
  Max Columns Per Key: 5
  Column Names: [C0, C1, C2, C3, C4]
  Comparator: AsciiType
  Timestamp: null
  Variable Column Count: false
  Slice: false
  Size Distribution: Fixed:  key=34
  Count Distribution: Fixed:  key=5
Errors:
  Ignore: false
  Tries: 10
Log:
  No Summary: false
  No Settings: false
  File: null
  Interval Millis: 5000
  Level: NORMAL
Mode:
  API: JAVA_DRIVER_NATIVE
  Connection Style: CQL_PREPARED
  CQL Version: CQL3
  Protocol Version: V4
  Username: null
  Password: null
  Auth Provide Class: null
  Max Pending Per Connection: null
  Connections Per Host: 8
  Compression: NONE
Node:
  Nodes: [127.0.62.1]
  Is White List: false
  Datacenter: null
  Rack: null
Schema:
  Keyspace: keyspace2
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Replication Strategy Options: {replication_factor=3}
  Table Compression: null
  Table Compaction Strategy: null
  Table Compaction Strategy Options: {}
Transport:
  factory=org.apache.cassandra.thrift.TFramedTransportFactory; truststore=null; truststore-password=null; keystore=null; keystore-password=null; ssl-protocol=TLS; ssl-alg=SunX509; store-type=JKS; ssl-ciphers=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA; 
Port:
  Native Port: 9042
  Thrift Port: 9160
  JMX Port: 7199
Send To Daemon:
  *not set*
Graph:
  File: null
  Revision: unknown
  Title: null
  Operation: WRITE
TokenRange:
  Wrap: false
  Split Factor: 1
CloudConf:
  File: null

===== Using optimized driver!!! =====
Connected to cluster: test, max pending requests per connection null, max connections per host 8
Datatacenter: datacenter1; Host: /127.0.62.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.62.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.62.3; Rack: rack1
; 
stderr: java.lang.RuntimeException: Encountered exception creating schema
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:100)
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:69)
    at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:230)
    at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
    at org.apache.cassandra.stress.Stress.run(Stress.java:143)
    at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.TransportException: [/127.0.62.3:9042] Connection has been closed
    at com.datastax.driver.core.exceptions.TransportException.copy(TransportException.java:38)
    at com.datastax.driver.core.exceptions.TransportException.copy(TransportException.java:24)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
    at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:214)
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:86)
    ... 5 more
Caused by: com.datastax.driver.core.exceptions.TransportException: [/127.0.62.3:9042] Connection has been closed
    at com.datastax.driver.core.Connection$ConnectionCloseFuture.force(Connection.java:1737)
    at com.datastax.driver.core.Connection$ConnectionCloseFuture.force(Connection.java:1718)
    at com.datastax.driver.core.Connection.defunct(Connection.java:815)
    at com.datastax.driver.core.Connection$Dispatcher.exceptionCaught(Connection.java:1631)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at com.datastax.shaded.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at com.datastax.shaded.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
    at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)
    at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:177)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
    at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
mykaul commented 11 months ago

@kbr-scylla - is your team looking at it?

kbr-scylla commented 11 months ago

No. Would you like me to put this item in the backlog?

mykaul commented 11 months ago

> No. Would you like me to put this item in the backlog?

Please do. If it's not ARM, we should probably unflake this.

fruch commented 11 months ago

> No. Would you like me to put this item in the backlog?

> Please do. If it's not ARM, we should probably unflake this.

It's not ARM; we don't run dtest on ARM yet.

kbr-scylla commented 11 months ago

This one was not caused by slow disks: in https://jenkins.scylladb.com/blue/organizations/jenkins/scylla-master%2Fbyo%2Fdtest-byo/detail/dtest-byo/30/pipeline/112 there was a failure on a c5ad.8xlarge instance with fast NVMe disks.

mykaul commented 10 months ago

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4388/testReport/junit/replace_address_test/TestReplaceAddress/Tests___Sanity_Tests___test_replace_node_using_the_same_ip_then_shut_down_rbo_disabled_/ ?

fruch commented 10 months ago

Looks like a client-side issue to me:

1) The error is that one specific node is defunct; that shouldn't end in a failure (the test takes node3 down right after starting the stress).

2) Looking at the cassandra-stress code:

```java
    public void createKeySpacesNative(StressSettings settings)
    {

        JavaDriverClient client  = settings.getJavaDriverClient(false);

        try
        {
            //Keyspace
            client.execute(createKeyspaceStatementCQL3(), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

            client.execute("USE \""+keyspace+"\"", org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

            //Add standard1
            client.execute(createStandard1StatementCQL3(settings), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);

            System.out.println(String.format("Created keyspaces. Sleeping %ss for propagation.", settings.node.nodes.size()));
            Thread.sleep(settings.node.nodes.size() * 1000L); // seconds
        }
        catch (AlreadyExistsException e)
        {
            //Ok.
        }
        catch (Exception e)
        {
            throw new RuntimeException("Encountered exception creating schema", e);
        }
    }
```
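The createKeySpacesNative code above runs the schema statements once and turns any failure, including a transient TransportException from a single dead contact point, into a fatal RuntimeException. A minimal sketch (hypothetical, in Python, not cassandra-stress code) of how a retry loop could tolerate one defunct node during schema creation:

```python
# Hypothetical sketch, not cassandra-stress code: retry a schema statement a
# few times instead of aborting the whole run on the first connection error.
import time

def execute_with_retry(execute, statement, tries=3, delay_s=1.0):
    """Run execute(statement), retrying on transient connection failures.

    `execute` is any callable that raises ConnectionError when the node it
    talks to is down (a stand-in for the driver's TransportException).
    """
    last_exc = None
    for _attempt in range(tries):
        try:
            return execute(statement)
        except ConnectionError as exc:
            last_exc = exc          # remember the failure and retry
            time.sleep(delay_s)
    # all attempts exhausted: surface the same error the stress tool reports
    raise RuntimeError("Encountered exception creating schema") from last_exc
```

With a loop like this, one closed connection (as in the 127.0.62.3 trace above) would cost a retry instead of failing the stress run before any load is applied.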
fruch commented 10 months ago

For now, the test should perhaps wait a bit before taking node3 down; that might be quicker than waiting for a cassandra-stress fix.
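On the dtest side, the workaround could look roughly like this (all names here are hypothetical, not the actual test code): start the stress, give cassandra-stress a grace period to finish schema creation, and only then stop node3.

```python
# Hypothetical dtest-side workaround sketch (names are illustrative): delay
# stopping node3 so cassandra-stress can finish its schema-creation phase
# before one of its contact points goes away.
import time

def stop_node_after_grace(node, grace_s=30, sleep=time.sleep):
    """Wait grace_s seconds, then stop the node.

    `sleep` is injectable so tests can avoid real waiting; `node` is assumed
    to expose a ccm-style stop(wait_other_notice=...) method.
    """
    sleep(grace_s)
    node.stop(wait_other_notice=True)
```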

Annamikhlin commented 10 months ago

Failed also in 2024.1/next: https://jenkins.scylladb.com/job/enterprise-2024.1/job/next/16/testReport/junit/replace_address_test/TestReplaceAddress/Tests___Sanity_Tests___test_replace_node_using_the_same_ip_then_shut_down_rbo_enabled_/

Annamikhlin commented 7 months ago

seen on: https://jenkins.scylladb.com/job/scylla-5.4/job/next/150/testReport/junit/replace_address_test/TestReplaceAddress/Tests___Sanity_Tests_RAFT___test_replace_node_using_the_same_ip_then_shut_down_rbo_disabled_/

patjed41 commented 1 month ago

This test is constantly failing in the dtest-with-gossip-topology-changes job. The failures started with https://jenkins.scylladb.com/job/scylla-master/job/dtest-with-gossip-topology-changes/26/ (July 23) and have occurred on every run since.