scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
57 stars 94 forks source link

Cassandra-stress failed during AddRemoveDc nemesis while targeting a potentially yet uncreated keyspace #4668

Closed KnifeyMoloko closed 2 years ago

KnifeyMoloko commented 2 years ago

Installation details

Kernel Version: 5.13.0-1022-aws Scylla version (or git commit hash): 5.1.dev-20220421.cc40685c288f with build-id fde9a351743afef6ce27c7787712211d9ee8f41f Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-07d353a9e8649b3a1 (aws: eu-west-1)

Test: longevity-50gb-3days Test id: ab770923-a43f-419a-b47a-c3457ba2b0c0 Test name: scylla-master/longevity/longevity-50gb-3days Test config file(s):

Issue description

>>>>>>> Cassandra-stress failed during AddRemoveDc nemesis. From what I can tell we've started a c-s thread targeting tables in the keyspace_new_dc keyspace, before it was actually created on the (new?) dc nodes. Filtering out sct.log for the name of the keyspace we get:

< t:2022-04-22 18:27:39,426 f:stress_thread.py l:168  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc replication(strategy=NetworkTopologyStrategy,eu-west=3,eu-west_nemesis_dc=1) compression=LZ4Compressor compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native compression=lz4  user=cassandra password=cassandra -rate threads=5 -pop seq=1..10000 -log interval=5 -transport "truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra" -node 10.0.2.95 -errors skip-unsupported-columns
< t:2022-04-22 18:27:43,141 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "echo TAG: loader_idx:0-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=PDZFBB4M9MWJB8AT0LGD; cassandra-stress write no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc replication(strategy=NetworkTopologyStrategy,eu-west=3,eu-west_nemesis_dc=1) compression=LZ4Compressor compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native compression=lz4  user=cassandra password=cassandra -rate threads=5 -pop seq=1..10000 -log interval=5 -transport "truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra" -node 10.0.2.95 -errors skip-unsupported-columns"...
< t:2022-04-22 18:27:43,141 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress write no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc replication(strategy=NetworkTopologyStrategy,eu-west=3,eu-west_nemesis_dc=1) compression=LZ4Compressor compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
< t:2022-04-22 18:27:44,131 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   Keyspace: keyspace_new_dc
< t:2022-04-22 18:27:52,291 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:27:51+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-5 !    INFO |  [shard 7] schema_tables - Creating keyspace keyspace_new_dc
< t:2022-04-22 18:27:52,295 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 022-04-22T18:27:51+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-1 !    INFO |  [shard 7] migration_manager - Create new Keyspace: KSMetaData{name=keyspace_new_dc, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={eu-west=3, eu-west_nemesis_dc=1}, cfMetaData={}, durable_writes=1, userTypes=org.apache.cassandra.config.UTMetaData@0x609008b3ca98}
< t:2022-04-22 18:27:52,296 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:27:51+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-1 !    INFO |  [shard 7] schema_tables - Creating keyspace keyspace_new_dc
< t:2022-04-22 18:27:52,365 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:27:51+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 7] schema_tables - Creating keyspace keyspace_new_dc
< t:2022-04-22 18:27:52,492 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:27:52+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-7 !    INFO |  [shard 7] schema_tables - Creating keyspace keyspace_new_dc
< t:2022-04-22 18:27:52,588 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:27:52+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-8 !    INFO |  [shard 7] schema_tables - Creating keyspace keyspace_new_dc
< t:2022-04-22 18:27:52,592 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:27:52+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-3 !    INFO |  [shard 7] schema_tables - Creating keyspace keyspace_new_dc
< t:2022-04-22 18:28:03,218 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-1 !    INFO |  [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
< t:2022-04-22 18:28:03,290 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
< t:2022-04-22 18:28:03,310 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-8 !    INFO |  [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
< t:2022-04-22 18:28:03,314 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-7 !    INFO |  [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
< t:2022-04-22 18:28:03,315 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-3 !    INFO |  [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
< t:2022-04-22 18:28:03,316 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-5 !    INFO |  [shard 11] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x6060080c4380[cfId=f1eb5850-c269-11ec-8b40-a9f81feddadf,ksName=keyspace_new_dc,cfName=standard1,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.BytesType,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=key, type=org.apache.cassandra.db.marshal.BytesType, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=C0, type=org.apache.cassandra.db.marshal.BytesType, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}, ColumnDefinition{name=C1, type=org.apache.cassandra.db.marshal.BytesType, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}, ColumnDefinition{name=C2, type=org.apache.cassandra.db.marshal.BytesType, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}, ColumnDefinition{name=C3, type=org.apache.cassandra.db.marshal.BytesType, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}, ColumnDefinition{name=C4, type=org.apache.cassandra.db.marshal.BytesType, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=f1eb5851-c269-11ec-8b40-a9f81feddadf,droppedColumns={},collections={},indices={}]
< t:2022-04-22 18:28:03,318 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-5 !    INFO |  [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
< t:2022-04-22 18:28:04,001 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'keyspace_new_dc' does not exist

and:

< t:2022-04-22 18:28:14,510 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   Keyspace: keyspace_new_dc
< t:2022-04-22 18:28:14,856 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   Keyspace: keyspace_new_dc
< t:2022-04-22 18:28:19,126 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 8] schema_tables - Dropping keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
< t:2022-04-22 18:28:19,126 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 8] schema_tables - Creating keyspace_new_dc.standard1 id=fb700560-c269-11ec-bd10-b9606a99d97a version=b8f41ed5-75fa-36e5-a0a1-6aab6bc408ae
< t:2022-04-22 18:28:19,128 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 9] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table keyspace_new_dc.standard1 due to table removal 

and:

< t:2022-04-22 18:28:57,644 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "echo TAG: loader_idx:0-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=CKWADK8XD82XEUO51DCG; cassandra-stress read no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -port jmx=6868 -mode cql3 native compression=lz4  user=cassandra password=cassandra -rate threads=5 -pop seq=1..10000 -log interval=5 -transport "truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra" -node 10.0.2.95 -errors skip-unsupported-columns"...
< t:2022-04-22 18:28:57,645 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress read no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -port jmx=6868 -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
< t:2022-04-22 18:28:58,636 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   Keyspace: keyspace_new_dc
< t:2022-04-22 18:29:05,961 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 022-04-22T18:29:05+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-4 ! WARNING |  [shard 11] storage_proxy - Exception when communicating with 10.0.2.53, to read from keyspace_new_dc.standard1: Can't find a column family with UUID f1eb5850-c269-11ec-8b40-a9f81feddadf

and:

< t:2022-04-22 18:31:30,343 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   Keyspace: keyspace_new_dc
< t:2022-04-22 18:31:32,632 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=QUTX9FY641M59373CP0X; cassandra-stress read no-warmup cl=QUORUM n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -port jmx=6868 -mode cql3 native compression=lz4  user=cassandra password=cassandra -rate threads=5 -pop seq=1..10000 -log interval=5 -transport "truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra" -node 10.0.2.95 -errors skip-unsupported-columns"...
< t:2022-04-22 18:31:32,633 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress read no-warmup cl=QUORUM n=10000 -schema 'keyspace=keyspace_new_dc compression=LZ4Compressor' -port jmx=6868 -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5
< t:2022-04-22 18:31:33,695 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   Keyspace: keyspace_new_dc
< t:2022-04-22 18:31:34,598 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > WARN  18:31:34,575 Error while computing token map for keyspace keyspace_new_dc with datacenter eu-west_nemesis_dc: could not achieve replication factor 1 (found 0 replicas only), check your keyspace replication settings.

<<<<<<<

Logs:

Jenkins job URL

KnifeyMoloko commented 2 years ago

Looking at the node logs for the node that was created during the AddRemoveDc nemesis running it looks like it's creating and dropping the missing keyspace:

2022-04-22T18:27:51+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 7] schema_tables - Creating keyspace keyspace_new_dc
2022-04-22T18:27:52+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 0] compaction - [Compact system_schema.keyspaces eb7300e0-c269-11ec-bbba-95be33d2f0b4] Compacted 2 sstables to [/var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/me-252-big-Data.db:level=0]. 26kB to 13kB (~50% of original) in 57ms = 233kB/s. ~256 total partitions merged to 20.
2022-04-22T18:27:52+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 7] schema_tables - Schema version changed to 97383513-c776-36b9-a2e1-bcca6ba647be
2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 8] schema_tables - Dropping keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 8] schema_tables - Creating keyspace_new_dc.standard1 id=fb700560-c269-11ec-bd10-b9606a99d97a version=b8f41ed5-75fa-36e5-a0a1-6aab6bc408ae
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 9] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table keyspace_new_dc.standard1 due to table removal
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 0] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table keyspace_new_dc.standard1 due to table removal
....
....
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 8] schema_tables - Schema version changed to fccfbfe8-b54b-3ce2-be9b-12ccf8fb81ad
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 !    INFO |  [shard 9] compaction - [Compact system.truncated fb795430-c269-11ec-bd2d-95c133d2f0b4] Compacted 3 sstables to [/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-121-big-Data.db:level=0]. 86kB to 59kB (~69% of original) in 8ms = 7MB/s. ~384 total partitions merged to 2.
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! WARNING |  [shard 7] storage_proxy - Failed to apply mutation from 10.0.0.69#7: data_dictionary::no_such_column_family (Can't find a column family with UUID f1eb5850-c269-11ec-8b40-a9f81feddadf)

@roydahan I'm not sure if this is a SCT or Scylla issue at this moment. Looks as if the c-s thread failed due to this keyspace being added and dropped. At the same time Scylla itself seems to be confused on what is the state of the schema. What do you think?

fruch commented 2 years ago

@KnifeyMoloko if it wasn't SCT that dropped the table, it's a scylla issue.

roydahan commented 2 years ago

Please check the code of the nemesis, I think it's SCT.

fgelcer commented 2 years ago

@soyacz , i'm seeing it happening on another test, longevity-10gb-3h-gce-test: the nemesis itself did not fail, but seeing side effects of it inside the c-s output log itself:

WARN  11:20:37,693 Error while computing token map for keyspace keyspace_new_dc with datacenter us-east1_nemesis_dc: could not achieve replication factor 1 (found 0 replicas only), check your keyspace replication settings.
WARN  11:31:47,932 Error while computing token map for keyspace keyspace_new_dc with datacenter us-east1_nemesis_dc: could not achieve replication factor 1 (found 0 replicas only), check your keyspace replication settings.

this is the nemesis time:

2022-05-10 11:08:25.050: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=61327c90-9047-4456-bd44-54786fb2a9ba: nemesis_name=AddRemoveDc target_node=Node longevity-10gb-3h-5-0-db-node-391b02ec-0-2 [34.75.64.79 | 10.142.0.16] (seed: False)
2022-05-10 11:23:51.629: (DisruptionEvent Severity.NORMAL) period_type=end event_id=61327c90-9047-4456-bd44-54786fb2a9ba duration=15m26s: nemesis_name=AddRemoveDc target_node=Node longevity-10gb-3h-5-0-db-node-391b02ec-0-2 [34.75.64.79 | 10.142.0.16] (seed: False)

so it means that one of the messages in the c-s log happened after the nemesis has already finished... are we rolling back the cluster configuration at the end of the nemesis correctly?

soyacz commented 2 years ago

fix: https://github.com/scylladb/scylla-cluster-tests/pull/4744