Closed KnifeyMoloko closed 2 years ago
Looking at the logs of the node that was created while the AddRemoveDc
nemesis was running, it looks like it's creating and dropping the missing keyspace:
2022-04-22T18:27:51+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 7] schema_tables - Creating keyspace keyspace_new_dc
2022-04-22T18:27:52+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 0] compaction - [Compact system_schema.keyspaces eb7300e0-c269-11ec-bbba-95be33d2f0b4] Compacted 2 sstables to [/var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/me-252-big-Data.db:level=0]. 26kB to 13kB (~50% of original) in 57ms = 233kB/s. ~256 total partitions merged to 20.
2022-04-22T18:27:52+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 7] schema_tables - Schema version changed to 97383513-c776-36b9-a2e1-bcca6ba647be
2022-04-22T18:28:02+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 11] schema_tables - Creating keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 8] schema_tables - Dropping keyspace_new_dc.standard1 id=f1eb5850-c269-11ec-8b40-a9f81feddadf version=215e22c4-8623-3b18-be6f-0a3b84ee8313
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 8] schema_tables - Creating keyspace_new_dc.standard1 id=fb700560-c269-11ec-bd10-b9606a99d97a version=b8f41ed5-75fa-36e5-a0a1-6aab6bc408ae
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 9] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table keyspace_new_dc.standard1 due to table removal
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 0] compaction_manager - Stopping 1 tasks for 0 ongoing compactions for table keyspace_new_dc.standard1 due to table removal
....
....
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 8] schema_tables - Schema version changed to fccfbfe8-b54b-3ce2-be9b-12ccf8fb81ad
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! INFO | [shard 9] compaction - [Compact system.truncated fb795430-c269-11ec-bd2d-95c133d2f0b4] Compacted 3 sstables to [/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-121-big-Data.db:level=0]. 86kB to 59kB (~69% of original) in 8ms = 7MB/s. ~384 total partitions merged to 2.
2022-04-22T18:28:18+00:00 longevity-tls-50gb-3d-master-db-node-ab770923-9 ! WARNING | [shard 7] storage_proxy - Failed to apply mutation from 10.0.0.69#7: data_dictionary::no_such_column_family (Can't find a column family with UUID f1eb5850-c269-11ec-8b40-a9f81feddadf)
@roydahan I'm not sure at this point whether this is an SCT or a Scylla issue. It looks as if the c-s thread failed because this keyspace was added and dropped. At the same time, Scylla itself seems to be confused about the state of the schema. What do you think?
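To make the ordering in the excerpt above easier to check, here is a minimal sketch that scans node log lines and reports `no_such_column_family` warnings whose UUID belongs to a table that was already dropped. The function name and the regular expressions are my own assumptions, based only on the line format shown in this excerpt; they are not part of SCT or Scylla.

```python
import re

# Table-level schema_tables lines, e.g.
# "... schema_tables - Dropping keyspace_new_dc.standard1 id=<uuid> version=<uuid>"
# (keyspace-level lines have no "id=" right after the name, so they do not match).
SCHEMA_RE = re.compile(
    r"^(?P<ts>\S+) .*schema_tables - (?P<op>Creating|Dropping) "
    r"(?P<table>\S+) id=(?P<id>[0-9a-f-]+)"
)
# storage_proxy warnings about a table UUID the node no longer knows.
MISSING_RE = re.compile(
    r"^(?P<ts>\S+) .*no_such_column_family "
    r"\(Can't find a column family with UUID (?P<id>[0-9a-f-]+)\)"
)


def find_writes_to_dropped_tables(lines):
    """Return (warning_ts, table, table_id, dropped_ts) for each warning
    that refers to a table id seen in an earlier "Dropping" line."""
    dropped = {}  # table id -> (table name, timestamp of the drop)
    hits = []
    for line in lines:
        schema = SCHEMA_RE.match(line)
        if schema:
            if schema["op"] == "Dropping":
                dropped[schema["id"]] = (schema["table"], schema["ts"])
            continue
        missing = MISSING_RE.match(line)
        if missing and missing["id"] in dropped:
            table, dropped_ts = dropped[missing["id"]]
            hits.append((missing["ts"], table, missing["id"], dropped_ts))
    return hits
```

Fed the lines quoted above, this should flag the 18:28:18 `storage_proxy` warning as a write aimed at the first incarnation of `keyspace_new_dc.standard1` (id `f1eb5850-...`), which had been dropped and recreated under a new id in the same second.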
@KnifeyMoloko if it wasn't SCT that dropped the table, it's a Scylla issue.
Please check the code of the nemesis; I think it's SCT.
@soyacz, I'm seeing it happen on another test, longevity-10gb-3h-gce-test:
the nemesis itself did not fail, but I'm seeing side effects of it in the c-s output log:
WARN 11:20:37,693 Error while computing token map for keyspace keyspace_new_dc with datacenter us-east1_nemesis_dc: could not achieve replication factor 1 (found 0 replicas only), check your keyspace replication settings.
WARN 11:31:47,932 Error while computing token map for keyspace keyspace_new_dc with datacenter us-east1_nemesis_dc: could not achieve replication factor 1 (found 0 replicas only), check your keyspace replication settings.
this is the nemesis time:
2022-05-10 11:08:25.050: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=61327c90-9047-4456-bd44-54786fb2a9ba: nemesis_name=AddRemoveDc target_node=Node longevity-10gb-3h-5-0-db-node-391b02ec-0-2 [34.75.64.79 | 10.142.0.16] (seed: False)
2022-05-10 11:23:51.629: (DisruptionEvent Severity.NORMAL) period_type=end event_id=61327c90-9047-4456-bd44-54786fb2a9ba duration=15m26s: nemesis_name=AddRemoveDc target_node=Node longevity-10gb-3h-5-0-db-node-391b02ec-0-2 [34.75.64.79 | 10.142.0.16] (seed: False)
So one of the messages in the c-s log (11:31:47) appeared after the nemesis had already finished (11:23:51). Are we rolling back the cluster configuration correctly at the end of the nemesis?
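The comparison above can be sketched as a small script that classifies each c-s warning as before, during, or after the nemesis window. It assumes the c-s log and the SCT events share the same clock and calendar day (the c-s lines carry only a time of day); the helper names are hypothetical and not part of SCT.

```python
from datetime import datetime


def parse_event_time(text):
    # SCT event timestamps look like "2022-05-10 11:08:25.050".
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_cs_time(text, day):
    # c-s log lines carry only a time of day, e.g. "11:20:37,693".
    # Assumption: the line was logged on the same day as the nemesis.
    time_of_day = datetime.strptime(text, "%H:%M:%S,%f").time()
    return datetime.combine(day, time_of_day)


def classify(cs_times, begin, end):
    """Map each c-s timestamp to 'before', 'during' or 'after' the nemesis."""
    begin_dt = parse_event_time(begin)
    end_dt = parse_event_time(end)
    result = {}
    for text in cs_times:
        ts = parse_cs_time(text, begin_dt.date())
        if ts < begin_dt:
            result[text] = "before"
        elif ts <= end_dt:
            result[text] = "during"
        else:
            result[text] = "after"
    return result


if __name__ == "__main__":
    print(classify(
        ["11:20:37,693", "11:31:47,932"],
        begin="2022-05-10 11:08:25.050",
        end="2022-05-10 11:23:51.629",
    ))
```

With the timestamps quoted in this comment, the first warning falls inside the nemesis window and the second one lands after the end event.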
Installation details
Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 5.1.dev-20220421.cc40685c288f with build-id fde9a351743afef6ce27c7787712211d9ee8f41f
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-07d353a9e8649b3a1 (aws: eu-west-1)
Test: longevity-50gb-3days
Test id: ab770923-a43f-419a-b47a-c3457ba2b0c0
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):
Issue description
>>>>>>>
Cassandra-stress failed during the AddRemoveDc nemesis. From what I can tell, we started a c-s thread targeting tables in the keyspace_new_dc keyspace before it was actually created on the (new?) dc nodes. Filtering sct.log for the name of the keyspace we get:
and:
and:
and:
<<<<<<<
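For anyone who wants to repeat the filtering step described above, a minimal sketch follows. The function name and the default search string are assumptions for illustration; the path to `sct.log` depends on where the logs were downloaded.

```python
def filter_log(path, needle="keyspace_new_dc"):
    """Yield every line of the log at `path` that mentions `needle`."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if needle in line:
                yield line.rstrip("\n")
```

Iterating over `filter_log("sct.log")` yields the matching lines in file order, so the create, drop and c-s start events can be read as a timeline.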
$ hydra investigate show-monitor ab770923-a43f-419a-b47a-c3457ba2b0c0
$ hydra investigate show-logs ab770923-a43f-419a-b47a-c3457ba2b0c0
Logs:
Jenkins job URL