Open fruch opened 1 year ago
I got this on my PR; it looks different but related: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3753/testReport/junit/replace_address_test/TestReplaceAddress/Tests___Sanity_Tests___test_replace_node_using_the_same_ip_then_shut_down_rbo_disabled_/
ccmlib.node.ToolError: Subprocess ['stress', 'write', 'cl=QUORUM', 'duration=6m', '-rate', 'threads=10 throttle=1000/s', '-log', 'interval=5', '-schema', 'replication(factor=3) keyspace=keyspace2'] exited with non-zero status; exit status: 1;
stdout: ******************** Stress Settings ********************
Command:
Type: write
Count: -1
Duration: 6 MINUTES
No Warmup: false
Consistency Level: QUORUM
Serial Consistency Level: SERIAL
Target Uncertainty: not applicable
Key Size (bytes): 10
Counter Increment Distibution: add=fixed(1)
Rate:
Auto: false
Thread Count: 10
OpsPer Sec: 1000
Population:
Sequence: 1..1000000
Order: ARBITRARY
Wrap: true
Insert:
Revisits: Uniform: min=1,max=1000000
Visits: Fixed: key=1
Row Population Ratio: Ratio: divisor=1.000000;delegate=Fixed: key=1
Batch Type: not batching
Columns:
Max Columns Per Key: 5
Column Names: [C0, C1, C2, C3, C4]
Comparator: AsciiType
Timestamp: null
Variable Column Count: false
Slice: false
Size Distribution: Fixed: key=34
Count Distribution: Fixed: key=5
Errors:
Ignore: false
Tries: 10
Log:
No Summary: false
No Settings: false
File: null
Interval Millis: 5000
Level: NORMAL
Mode:
API: JAVA_DRIVER_NATIVE
Connection Style: CQL_PREPARED
CQL Version: CQL3
Protocol Version: V4
Username: null
Password: null
Auth Provide Class: null
Max Pending Per Connection: null
Connections Per Host: 8
Compression: NONE
Node:
Nodes: [127.0.62.1]
Is White List: false
Datacenter: null
Rack: null
Schema:
Keyspace: keyspace2
Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Replication Strategy Options: {replication_factor=3}
Table Compression: null
Table Compaction Strategy: null
Table Compaction Strategy Options: {}
Transport:
factory=org.apache.cassandra.thrift.TFramedTransportFactory; truststore=null; truststore-password=null; keystore=null; keystore-password=null; ssl-protocol=TLS; ssl-alg=SunX509; store-type=JKS; ssl-ciphers=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA;
Port:
Native Port: 9042
Thrift Port: 9160
JMX Port: 7199
Send To Daemon:
*not set*
Graph:
File: null
Revision: unknown
Title: null
Operation: WRITE
TokenRange:
Wrap: false
Split Factor: 1
CloudConf:
File: null
===== Using optimized driver!!! =====
Connected to cluster: test, max pending requests per connection null, max connections per host 8
Datatacenter: datacenter1; Host: /127.0.62.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.62.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.62.3; Rack: rack1
;
stderr: java.lang.RuntimeException: Encountered exception creating schema
at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:100)
at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:69)
at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:230)
at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
at org.apache.cassandra.stress.Stress.run(Stress.java:143)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.TransportException: [/127.0.62.3:9042] Connection has been closed
at com.datastax.driver.core.exceptions.TransportException.copy(TransportException.java:38)
at com.datastax.driver.core.exceptions.TransportException.copy(TransportException.java:24)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:214)
at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:86)
... 5 more
Caused by: com.datastax.driver.core.exceptions.TransportException: [/127.0.62.3:9042] Connection has been closed
at com.datastax.driver.core.Connection$ConnectionCloseFuture.force(Connection.java:1737)
at com.datastax.driver.core.Connection$ConnectionCloseFuture.force(Connection.java:1718)
at com.datastax.driver.core.Connection.defunct(Connection.java:815)
at com.datastax.driver.core.Connection$Dispatcher.exceptionCaught(Connection.java:1631)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)
at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:177)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
@kbr-scylla - is your team looking at it?
No. Would you like me to put this item in the backlog?
Please do. If it's not ARM, we should probably unflake this.
It's not ARM; we don't have dtest running on ARM yet.
This one was not caused by slow disks: here https://jenkins.scylladb.com/blue/organizations/jenkins/scylla-master%2Fbyo%2Fdtest-byo/detail/dtest-byo/30/pipeline/112
there was a failure on a c5ad.8xlarge instance with fast NVMe disks.
This looks like a client-side issue to me:
1) the error is that one specific node is defunct, which shouldn't fail the whole run (and the test takes node3 down right after starting the stress);
2) looking at the cassandra-stress code:
public void createKeySpacesNative(StressSettings settings)
{
    JavaDriverClient client = settings.getJavaDriverClient(false);
    try
    {
        //Keyspace
        client.execute(createKeyspaceStatementCQL3(), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);
        client.execute("USE \""+keyspace+"\"", org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);
        //Add standard1
        client.execute(createStandard1StatementCQL3(settings), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);
        System.out.println(String.format("Created keyspaces. Sleeping %ss for propagation.", settings.node.nodes.size()));
        Thread.sleep(settings.node.nodes.size() * 1000L); // seconds
    }
    catch (AlreadyExistsException e)
    {
        //Ok.
    }
    catch (Exception e)
    {
        throw new RuntimeException("Encountered exception creating schema", e);
    }
}
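As the code above shows, any exception other than AlreadyExistsException aborts schema creation, so a single transient TransportException from a node that was just taken down kills the whole run. A stress-side fix would retry the schema statements on transient transport errors. A minimal sketch of that idea (hypothetical, not the actual cassandra-stress code; `TransportError` stands in for the driver's TransportException and `execute` for any CQL-executing callable):

```python
import time


class TransportError(Exception):
    """Stand-in for the driver's TransportException (hypothetical)."""


def create_schema_with_retry(execute, statements, attempts=5, delay=2.0):
    """Run schema statements, retrying each one on transient transport
    errors instead of aborting the whole run on the first failure."""
    for statement in statements:
        for attempt in range(attempts):
            try:
                execute(statement)
                break  # this statement succeeded, move to the next one
            except TransportError:
                if attempt == attempts - 1:
                    raise  # out of retries: give up for real
                time.sleep(delay)
```

With this shape, one node dropping its connection mid-schema-creation only costs a few retries instead of failing the test.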
For now, the test should maybe wait a bit before taking node3 down; that might be quicker than waiting for a cassandra-stress fix.
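On the test side, rather than sleeping a fixed amount, the dtest could poll until the stress tool's keyspace actually exists before stopping node3. A sketch under the assumption that the test holds a cassandra-driver `Session` (the helper name and parameters are made up for illustration):

```python
import time


def wait_for_keyspace(session, keyspace, timeout=30.0, poll=1.0):
    """Poll system_schema.keyspaces until `keyspace` appears, so the
    node is only taken down after schema creation has gone through."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        rows = session.execute(
            "SELECT keyspace_name FROM system_schema.keyspaces "
            "WHERE keyspace_name = %s", (keyspace,))
        if rows.one() is not None:
            return True
        time.sleep(poll)
    raise TimeoutError(f"keyspace {keyspace} not created within {timeout}s")
```

The test would call e.g. `wait_for_keyspace(session, "keyspace2")` after starting the stress process and before `node3.stop()`, which removes the race without relying on a magic sleep.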
This test is constantly failing in the dtest-with-gossip-topology-changes job. It started with https://jenkins.scylladb.com/job/scylla-master/job/dtest-with-gossip-topology-changes/26/ (July 23); since then this test has been failing every time.
cassandra-stress in replace_address_test.TestReplaceAddress.test_replace_node_using_the_same_ip_then_shut_down fails with a consistency error, even though only 1 of 3 nodes should be down:
Spotted for the first time
Logs:
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4364/testReport/junit/replace_address_test/TestReplaceAddress/FullDtest___full_split014___test_replace_node_using_the_same_ip_then_shut_down_rbo_disabled_/