scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

2 nodes in 7 node Scylla cluster missing in the middle of StatefulSet and we can't bootstrap new nodes there #2068

Closed: gdubicki closed this issue 1 month ago

gdubicki commented 2 months ago

What happened?

We originally had 7 nodes in our Scylla cluster, n2d-standard-32 with 3TB local SSDs, running Scylla 5.2.9, Scylla Operator 1.9.x, Scylla Manager 3.1.x.

We updated Scylla to 5.4.7, Scylla Operator to 1.13.0 and Scylla Manager to 3.3.0 on June 25th. (Note that since that update we have noticed https://github.com/scylladb/scylladb/issues/19793.)

Then, from June 25th to July 4th, we migrated our Scylla cluster to a new node pool with the same machine size but a slightly different config - switching from the default service account to a custom one and using the recommended OAuth scopes. (Nothing that would affect Scylla directly, I think.)

On July 12th we updated Scylla to 5.4.9.

On July 19th there was a hardware issue on one GCP node running Scylla that caused it to be restarted, and the local SSD contents were lost. This was pod 3 in the StatefulSet. The nodetool status after it happened was:

Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.7.252.229  1.53 TB    256          ?       1ff1b8df-7a90-4321-a309-7cd69e20bd70  us-west1-b
UN  10.7.241.130  1.28 TB    256          ?       8a24c600-5525-490e-a3cd-314f6062d6a1  us-west1-b
UN  10.7.241.175  1.43 TB    256          ?       050dcc67-7bb8-4d5d-89b1-5dbe0bcbb8b2  us-west1-b
UN  10.7.241.174  1.43 TB    256          ?       05bb205f-9475-43b1-a609-c9846f55ee2f  us-west1-b
UN  10.7.249.238  1.43 TB    256          ?       b8f68c62-c462-4a30-a505-5ece9ae1ab0b  us-west1-b
UN  10.7.243.109  1.54 TB    256          ?       5ddc2724-001a-4f06-8987-47633577263f  us-west1-b
DN  10.7.248.124  1.35 TB    256          ?       15253b54-8f30-4583-b08e-469c10c58aa2  us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

...but when I ran the node replace procedure, it failed with:

ERROR 2024-07-19 15:55:51,956 [shard  0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID 67322b56-9f99-4375-98d2-b14ce167bf7a not found)
2024-07-19 15:55:52,025 INFO exited: scylla (exit status 1; not expected)

(Note that the host ID from this message was never part of the cluster; some time later, when we followed the procedure to remove the ghost nodes from our cluster, that ID did not appear anywhere either.)

Anyway, we decided to remove node 15253b54-8f30-4583-b08e-469c10c58aa2 using the procedure https://opensource.docs.scylladb.com/branch-5.4/operating-scylla/procedures/cluster-management/remove-node.html#removing-an-unavailable-node and then bootstrap the replacement as a completely new node.
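
For reference, the removal boiled down to running something like the following from one of the live nodes (a rough sketch; the host ID is the one of the dead node from the nodetool status output above):

nodetool removenode 15253b54-8f30-4583-b08e-469c10c58aa2
nodetool removenode status   # to check progress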

Node removal took ~13 hours but it succeeded.

But we didn't succeed in bootstrapping a new node. We guessed that this was because, although nodetool status showed 6 nodes in the cluster at that point, the ScyllaCluster object still had 7 of them:

Screenshot 2024-07-20 at 10 25 36 (1)

(Sorry for the screenshots of text, that's the only way some of the states were preserved.)

Then we tried to replace node 3 with node 6, so that we could later scale the cluster down "officially" by updating the ScyllaCluster values to make it a 6-node cluster (we wanted to do that anyway, assuming that with our RF=3 this is a better way to get a more balanced cluster).

But then we most probably hit an issue with the PV / PVC from the old GCP node being left attached to pod 3, and probably because of that we weren't able to start the replace procedure. :( (I think this was the cause because only yesterday I thought about it, and today, after cleaning up those PV / PVC, I was able to make pod 3 schedule a new Scylla pod again.)
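
The cleanup itself was roughly the following (a sketch, assuming the volume claim template is named data, so the PVC follows the usual <template>-<pod> naming; the PV name comes from the PVC's spec.volumeName):

kubectl -n scylla get pvc data-scylla-us-west1-us-west1-b-3 -o jsonpath='{.spec.volumeName}'
kubectl -n scylla delete pvc data-scylla-us-west1-us-west1-b-3
kubectl delete pv <volume-name-printed-by-the-first-command>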

Ultimately we just left it as it was, but the next day we hit another issue: because of https://github.com/scylladb/scylladb/issues/19793 one node hit the limit of our 3TB local disk space and was killed. (Back then we didn't yet know that we had hit the 90% limit, not 100%, because we didn't yet have the workaround from https://github.com/scylladb/scylla-operator/issues/2056.)

This happened to pod 4 and we were afraid to try to restart it, because we assumed that the StatefulSet would not allow it to start until pod 3 was running.

So we had pod 3 unready and pod 4 not starting. Ultimately we hacked around the StatefulSet to make it work: we created a copy of it, deleted it while orphaning the pods, and recreated it with podManagementPolicy: Parallel. The hack worked (we had done this once before to successfully work around the STS limitation) and we managed to start pod 4, but it fell into a restart loop because its disk was filling up quickly.
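
The hack was roughly the following (the StatefulSet name is assumed from the Service naming; this is not something the operator supports officially):

kubectl -n scylla get sts scylla-us-west1-us-west1-b -o yaml > sts.yaml
kubectl -n scylla delete sts scylla-us-west1-us-west1-b --cascade=orphan
# edit sts.yaml: set spec.podManagementPolicy: Parallel and drop status, metadata.uid and metadata.resourceVersion
kubectl -n scylla apply -f sts.yaml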

That was already July 22nd. We decided to start migrating our Scylla cluster to a new node pool, with n2d-highmem-32 and 6TB disks - more memory to later tune Scylla closer to the recommended CPU-to-memory ratio, and bigger disks to buy more time to fix https://github.com/scylladb/scylladb/issues/19793.

The node replace worked this time and pod 4 was successfully moved to the new node pool.

In the meantime we did a cleanup of the ghost nodes in our cluster, but we noticed some strangeness during it, so we reported https://github.com/scylladb/scylladb/issues/20020. Our node IDs and the ghosts are listed in that issue.

Anyway, we continued to migrate the remaining Scylla pods to the new node pool and it all worked well; we migrated all of them except pod 0.

Note that pod 3 was still down; we hadn't fixed that yet. We planned to look into it after completing the migration to the new node pool, as disk space was slowly running out on the nodes with 3TB disks.

So yesterday we started to migrate pod 0 according to the procedure, but during the bootstrap, a few minutes after it started, we accidentally triggered a delete of its PVC. :| It was shown as Terminating, so we were worried that we would wait ~12-30 hours (this is how long it took for us recently) for the bootstrap only to have it terminated and all of that progress lost, so we manually deleted the pod, deleted the PVC and PV, and resumed.

But then the new Scylla didn't restart the bootstrap as expected.

Unfortunately, because of the stress and the conditions we worked in, I don't have exact info right now on what we did and in what order. :(

I do have the logs from all Scyllas from that period, so I can query for things you want to know.

What I know is that:

  1. At some point we did see the node we were trying to replace show up in nodetool status with a null host ID, like in https://github.com/scylladb/scylladb/issues/19975
  2. We tried removing both the node we were trying to replace (1ff1b8df-7a90-4321-a309-7cd69e20bd70) and the IDs of the nodes that were trying to replace it.
  3. We got this in Scylla logs when it was trying to replace the node:
    ERROR 2024-08-08 17:18:21,500 [shard  0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID 1ff1b8df-7a90-4321-a309-7cd69e20bd70 not found)
  4. We tried setting and unsetting the node-replace label on the Services for pods 0 and 3 (roughly as sketched after this list). We also tried removing the "internal" label there holding the ID of the node being replaced (hoping that the right one would be set on the next try).
  5. We tried deleting the new GCP nodes that were being created for the bootstrap, so as to start with a clean disk.
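
The label juggling from point 4 was more or less the following (a rough sketch, not the exact commands we ran; the same was done for the ...-0 Service):

kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 scylla/replace=""
kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 scylla/replace-
kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 internal.scylla-operator.scylladb.com/replacing-node-hostid-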

So ultimately now we are left with both pod 3 and pod 0 not working and a cluster with this state:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.7.241.130  1.44 TB    256          ?       787555a6-89d6-4b33-941c-940415380062  us-west1-b
UN  10.7.241.175  1.61 TB    256          ?       5342afaf-c19c-4be2-ada1-929698a4c398  us-west1-b
UN  10.7.241.174  1.34 TB    256          ?       813f49f9-e397-4d70-8300-79fa91817f11  us-west1-b
UN  10.7.249.238  1.35 TB    256          ?       5cc72b36-6fcf-4790-a540-930e544d59d2  us-west1-b
UN  10.7.243.109  1.37 TB    256          ?       880977bf-7cbb-4e0f-be82-ded853da57aa  us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

cqlsh> select host_id, up from system.cluster_status;

 host_id                              | up
--------------------------------------+------
 787555a6-89d6-4b33-941c-940415380062 | True
 5342afaf-c19c-4be2-ada1-929698a4c398 | True
 813f49f9-e397-4d70-8300-79fa91817f11 | True
 5cc72b36-6fcf-4790-a540-930e544d59d2 | True
 880977bf-7cbb-4e0f-be82-ded853da57aa | True

(5 rows)
cqlsh> select server_id,group_id from system.raft_state ;

 server_id                            | group_id
--------------------------------------+--------------------------------------
 5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
 5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
 787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
 813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
 880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
 aa434e8c-84a1-43a3-a398-28e4e6949e56 | 904c8960-2c68-11ee-979c-be9922839fd2
 ceb76652-0f39-4018-9fa5-dd8f0b25e85a | 904c8960-2c68-11ee-979c-be9922839fd2

(7 rows)

Now we can't bootstrap the missing pods 0 and 3 as new nodes (which I guess is the only option now, if there are no nodes to replace).

At first we get seemingly good info in the logs, like:

INFO  2024-08-09 08:50:48,814 [shard  0:stre] gossip - Gossip shadow round finished with nodes_talked={10.7.241.130, 10.7.243.109, 10.7.249.238, 10.7.241.174, 10.7.241.175}

...but then we get a bunch of:

WARN  2024-08-09 08:39:10,120 [shard  0:stre] storage_service - bootstrap[c4ddd61b-3ecb-4cce-a17a-d6f5c4525f20]: Found pending node ops = {{10.7.241.130 -> {2214835c-54fa-4763-8fa8-4c9d421f7670}}, {10.7.243.109 -> {2214835c-54fa-4763-8fa8-4c9d421f7670}}, {10.7.241.174 -> {2214835c-54fa-4763-8fa8-4c9d421f7670}}, {10.7.249.238 -> {2214835c-54fa-4763-8fa8-4c9d421f7670}}, {10.7.241.175 -> {2214835c-54fa-4763-8fa8-4c9d421f7670}}}, sleep 5 seconds and check again

...and it ends with repeating:

INFO  2024-08-09 08:51:43,923 [shard  0:stre] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...

When we try to remove any (ghost) nodes now, we can't:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool removenode ceb76652-0f39-4018-9fa5-dd8f0b25e85a
nodetool: Scylla API server HTTP POST to URL '/storage_service/remove_node' failed: std::runtime_error (Operation removenode is in progress, try again)
See 'nodetool help' or 'nodetool help <command>'.

...but if we check what node is still being deleted, we see:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool removenode status
RemovalStatus: No token removals in process.

Please let us know if you need any more info! Fixing this is the top priority for us.

What did you expect to happen?

To be able to bootstrap the 2 missing nodes so we can restore our cluster to a fully healthy state.

How can we reproduce it (as minimally and precisely as possible)?

It's really hard for me to say which steps of the above history were the main contributors to our current situation...

Scylla Operator version

1.13.0

Kubernetes platform name and version

$ kubectl version
Client Version: v1.29.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6-gke.1254000

(Note: this is the current version. When the whole story started we probably had a little older version.)

Please attach the must-gather archive.

scylla-operator-must-gather-dn7mtkvkjwvg.zip

Anything else we need to know?

This was initiated in this Slack thread

zimnx commented 2 months ago

Please try to get logs from a longer time period. The ones available in the must-gather dump only cover the last 2 hours. Logs from the commands you executed and had issues with are not present.

gdubicki commented 2 months ago

Thanks @zimnx. That's 8 million rows of logs though, and I can't export more than 100k from Datadog in one shot. Could you tell me what strings I should search for, so we can limit the size of the export?

gdubicki commented 2 months ago

I have done an export of logs with <100k rows with the following query:

index:scylla image_name:"scylladb/scylla" -large_data -compaction -querier -query_processor -"seastar::rpc::closed_error" (raft OR BOOTSTRAP OR WARN OR ERROR) -snitch_logger -iotune -repair 

I hope that this will be helpful at least for a start. Of course I can make more queries if needed.

extract-2024-08-09T14_34_53.679Z.csv.zip

gdubicki commented 2 months ago

@zimnx: Would performing a rolling restart of our cluster to apply some safe optimizations - only providing Scylla and Scylla Manager with more memory per node - be safe to do now (well, on Monday morning to be precise)?

I am also hoping that, if this is related to https://github.com/scylladb/scylladb/issues/19975, it might also resolve some or all of our current cluster scaling issues. Or that at least it won't do more harm here, while possibly improving the performance (which we will need when our peak hours traffic hits with only 5 nodes in the cluster).

zimnx commented 2 months ago

I have done an export of logs with <100k rows with the following query:

index:scylla image_name:"scylladb/scylla" -large_data -compaction -querier -query_processor -"seastar::rpc::closed_error" (raft OR BOOTSTRAP OR WARN OR ERROR) -snitch_logger -iotune -repair 

I hope that this will be helpful at least for a start. Of course I can make more queries if needed.

extract-2024-08-09T14_34_53.679Z.csv.zip

Filtering is not going to help, as there might be a log explaining the root cause of the issue that the filter does not pick up. I think it would be best to collect logs from +/- 10 minutes around the time when the commands were executed.

Please also attach recent output of nodetool status and nodetool gossipinfo

Would performing a rolling restart of our cluster to apply some safe optimizations - only providing Scylla and Scylla Manager with more memory per node - be safe to do now (well, on Monday morning to be precise)?

I can't guarantee it would be safe as your cluster is in a borked state. If you're running RF=3 and CL=QUORUM queries, then it should be safe. Just don't add more CPUs, as it would trigger a resharding we don't want at this point. Make sure to give each node enough time after the restart to warm the cache before proceeding to the next one if you're worried about the traffic.

gdubicki commented 2 months ago

As of now:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.7.241.130  1.56 TB    256          ?       787555a6-89d6-4b33-941c-940415380062  us-west1-b
UN  10.7.241.175  1.66 TB    256          ?       5342afaf-c19c-4be2-ada1-929698a4c398  us-west1-b
UN  10.7.241.174  1.48 TB    256          ?       813f49f9-e397-4d70-8300-79fa91817f11  us-west1-b
UN  10.7.249.238  1.51 TB    256          ?       5cc72b36-6fcf-4790-a540-930e544d59d2  us-west1-b
UN  10.7.243.109  1.43 TB    256          ?       880977bf-7cbb-4e0f-be82-ded853da57aa  us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool gossipinfo
/10.7.241.175
  generation:1721596585
  heartbeat:2435731
  NET_VERSION:0
  RACK:us-west1-b
  LOAD:1823919621838
  STATUS:NORMAL,-8864119814958549968
  DC:us-west1
  RPC_ADDRESS:10.7.241.175
  X4:1
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  HOST_ID:5342afaf-c19c-4be2-ada1-929698a4c398
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  X6:31
  X7:12
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  X2:system_auth.roles:0.000000;system_traces.node_slow_log:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.feed_truncations:0.000000;system_traces.sessions:0.000000;production.activities:0.792409;system_distributed.cdc_streams_descriptions_v2:0.000000;system_distributed.service_levels:1.000000;system_auth.role_members:0.000000;system_traces.events:0.000000;system_distributed.view_build_status:0.000000;production.feeds:0.893364;production.activities_v2:0.560822;system_traces.node_slow_log_time_idx:0.000000;test.heartrate_v1:0.000000;production.feed_counters:0.978368;system_traces.sessions_time_idx:0.000000;system_auth.role_attributes:0.000000;system_distributed.cdc_generation_timestamps:0.000000;
  RELEASE_VERSION:3.0.8
  X3:3
  X5:0:342255206:1721669718699
/10.7.249.238
  generation:1722496308
  heartbeat:1249240
  NET_VERSION:0
  RACK:us-west1-b
  LOAD:1664549588379
  STATUS:NORMAL,-9167251459053092449
  DC:us-west1
  RPC_ADDRESS:10.7.249.238
  X4:1
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  HOST_ID:5cc72b36-6fcf-4790-a540-930e544d59d2
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  X6:31
  X7:12
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  X2:production.feed_truncations:0.000000;system_traces.sessions:0.000000;test.heartrate_v1:0.000000;production.feed_counters:0.976706;system_distributed.cdc_generation_timestamps:0.000000;system_auth.role_members:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;production.activities:0.786895;system_traces.events:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.feeds:0.897525;system_distributed.view_build_status:0.000000;system_distributed.service_levels:1.000000;system_traces.node_slow_log_time_idx:0.000000;production.activities_v2:0.651030;system_auth.role_attributes:0.000000;system_traces.sessions_time_idx:0.000000;system_auth.roles:0.000000;system_traces.node_slow_log:0.000000;
  RELEASE_VERSION:3.0.8
  X3:3
  X5:0:342255206:1722544478221
/10.7.243.109
  generation:1723033937
  heartbeat:518066
  NET_VERSION:0
  RACK:us-west1-b
  LOAD:1574210785812
  STATUS:NORMAL,-8564353228911561110
  DC:us-west1
  RPC_ADDRESS:10.7.243.109
  X4:1
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  HOST_ID:880977bf-7cbb-4e0f-be82-ded853da57aa
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  X6:31
  X7:12
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  X2:production.feed_truncations:0.000000;system_traces.sessions:0.000000;test.heartrate_v1:0.000000;production.feed_counters:0.977903;system_distributed.cdc_generation_timestamps:0.000000;system_auth.role_members:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;production.activities:0.796932;system_traces.events:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.feeds:0.895235;system_distributed.view_build_status:0.000000;system_distributed.service_levels:1.000000;system_traces.node_slow_log_time_idx:0.000000;production.activities_v2:0.586408;system_auth.role_attributes:0.000000;system_traces.sessions_time_idx:0.000000;system_auth.roles:0.000000;system_traces.node_slow_log:0.000000;
  RELEASE_VERSION:3.0.8
  X3:3
  X5:0:342255206:1723103256746
/10.7.241.130
  generation:1721671188
  heartbeat:2373773
  NET_VERSION:0
  RACK:us-west1-b
  LOAD:1717705044311
  STATUS:NORMAL,-97971195482211408
  DC:us-west1
  RPC_ADDRESS:10.7.241.130
  X4:1
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  HOST_ID:787555a6-89d6-4b33-941c-940415380062
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  X6:31
  X7:12
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  X2:system_auth.roles:0.000000;system_traces.node_slow_log:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.feed_truncations:0.000000;system_traces.sessions:0.000000;production.activities:0.787918;system_distributed.cdc_streams_descriptions_v2:0.000000;system_distributed.service_levels:1.000000;system_auth.role_members:0.000000;system_traces.events:0.000000;system_distributed.view_build_status:0.000000;production.feeds:0.896144;production.activities_v2:0.429126;system_traces.node_slow_log_time_idx:0.000000;test.heartrate_v1:0.000000;production.feed_counters:0.974609;system_traces.sessions_time_idx:0.000000;system_auth.role_attributes:0.000000;system_distributed.cdc_generation_timestamps:0.000000;
  RELEASE_VERSION:3.0.8
  X3:3
  X5:0:342255206:1721720096847
/10.7.241.174
  generation:1722848828
  heartbeat:757739
  NET_VERSION:0
  RACK:us-west1-b
  LOAD:1628117103437
  STATUS:NORMAL,1116083325320868400
  DC:us-west1
  RPC_ADDRESS:10.7.241.174
  X4:1
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  HOST_ID:813f49f9-e397-4d70-8300-79fa91817f11
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  X6:31
  X7:12
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  X2:production.feed_truncations:0.000000;system_traces.sessions:0.000000;test.heartrate_v1:0.000000;production.feed_counters:0.977739;system_distributed.cdc_generation_timestamps:0.000000;system_auth.role_members:0.000000;system_distributed.cdc_streams_descriptions_v2:0.000000;production.activities:0.797675;system_traces.events:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.feeds:0.894565;system_distributed.view_build_status:0.000000;system_distributed.service_levels:1.000000;system_traces.node_slow_log_time_idx:0.000000;production.activities_v2:0.593607;system_auth.role_attributes:0.000000;system_traces.sessions_time_idx:0.000000;system_auth.roles:0.000000;system_traces.node_slow_log:0.000000;
  RELEASE_VERSION:3.0.8
  X3:3
  X5:0:342255206:1722923560164

gdubicki commented 2 months ago

Would performing a rolling restart of our cluster to apply some safe optimizations - only providing Scylla and Scylla Manager with more memory per node - be safe to do now (well, on Monday morning to be precise)?

I can't guarantee it would be safe as your cluster is in a borked state. If you're running RF=3 and CL=QUORUM queries, then it should be safe. Just don't add more CPUs, as it would trigger a resharding we don't want at this point. Make sure to give each node enough time after the restart to warm the cache before proceeding to the next one if you're worried about the traffic.

I did it and it worked; all the nodes restarted. As expected, it needed a manual delete to recreate pods 1 and 2, as the automatic rollout by the STS was stuck on pod 3, which is not coming up. What I didn't expect is that after applying the change, the ScyllaCluster object was updated with the new memory request and limit, but the StatefulSet was not. I am afraid they got out of sync when we did the hack of deleting and recreating the STS. :/ Ultimately we updated the STS manually so that it creates pods with more memory.
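
For the record, the manual STS update was essentially a patch of the Scylla container resources, something like the following (the memory values here are just placeholders, not our actual settings, and the container index assumes the scylla container is first in the pod template):

kubectl -n scylla patch sts scylla-us-west1-us-west1-b --type json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "100Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "100Gi"}
]'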

Here's an updated output of nodetool gossipinfo after the rolling restart:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool gossipinfo
/10.7.249.238
  generation:1723463402
  heartbeat:595
  RPC_ADDRESS:10.7.249.238
  X7:12
  X3:3
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  STATUS:NORMAL,9213597656330103393
  X2:system_distributed.cdc_generation_timestamps:0.000000;production.feed_truncations:0.000000;system_auth.roles:1.000000;system_traces.node_slow_log:0.000000;system_auth.role_attributes:0.000000;system_traces.sessions_time_idx:0.000000;system_traces.node_slow_log_time_idx:0.000000;production.activities_v2:0.253676;system_traces.events:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.feeds:0.868316;system_distributed.view_build_status:0.000000;system_distributed.service_levels:1.000000;system_traces.sessions:0.000000;production.feed_counters:0.924761;test.heartrate_v1:0.000000;production.activities:0.737744;system_distributed.cdc_streams_descriptions_v2:0.000000;system_auth.role_members:0.000000;
  X4:1
  LOAD:1670341483567
  X5:0:756442726:1723463415124
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  RELEASE_VERSION:3.0.8
  NET_VERSION:0
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  RACK:us-west1-b
  X6:31
  HOST_ID:5cc72b36-6fcf-4790-a540-930e544d59d2
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  DC:us-west1
/10.7.241.175
  generation:1723463490
  heartbeat:494
  RPC_ADDRESS:10.7.241.175
  X7:12
  X3:3
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  STATUS:NORMAL,9191245930443787145
  X2:system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_generation_timestamps:0.975906;test.heartrate_v1:0.000000;production.feed_counters:0.919281;production.feed_truncations:0.000000;system_distributed.service_levels:1.000000;production.activities:0.710310;system_distributed.cdc_streams_descriptions_v2:0.000000;system_traces.events:0.000000;system_distributed.view_build_status:0.000000;production.feeds:0.847015;production.activities_v2:0.203567;system_traces.node_slow_log_time_idx:0.000000;system_auth.role_members:0.000000;system_traces.sessions_time_idx:0.000000;system_auth.role_attributes:0.000000;system_traces.node_slow_log:0.000000;system_auth.roles:1.000000;
  X4:1
  LOAD:1827660219161
  X5:0:756442726:1723463502683
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  RELEASE_VERSION:3.0.8
  NET_VERSION:0
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  RACK:us-west1-b
  X6:31
  HOST_ID:5342afaf-c19c-4be2-ada1-929698a4c398
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  DC:us-west1
/10.7.243.109
  generation:1723463698
  heartbeat:235
  RPC_ADDRESS:10.7.243.109
  X7:12
  X3:3
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  STATUS:NORMAL,8942959036279233951
  X2:system_auth.role_attributes:0.000000;system_traces.sessions_time_idx:0.000000;production.feed_truncations:0.000000;system_auth.roles:1.000000;system_traces.node_slow_log:0.000000;system_auth.role_members:0.000000;system_traces.node_slow_log_time_idx:0.000000;production.activities_v2:0.107328;system_traces.events:0.000000;system_traces.sessions:0.000000;system_distributed.cdc_generation_timestamps:0.000000;production.feed_counters:0.777977;test.heartrate_v1:0.000000;production.feeds:0.542748;system_distributed.view_build_status:0.000000;system_distributed.service_levels:1.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.activities:0.432635;system_distributed.cdc_streams_descriptions_v2:0.000000;
  X4:1
  LOAD:1576482887375
  X5:0:756442726:1723463710868
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  RELEASE_VERSION:3.0.8
  NET_VERSION:0
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  RACK:us-west1-b
  X6:31
  HOST_ID:880977bf-7cbb-4e0f-be82-ded853da57aa
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  DC:us-west1
/10.7.241.174
  generation:1723463611
  heartbeat:346
  RPC_ADDRESS:10.7.241.174
  X7:12
  X3:3
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  STATUS:NORMAL,996896219089399285
  X2:system_distributed.cdc_generation_timestamps:0.000000;production.feed_truncations:0.000000;system_auth.roles:1.000000;system_traces.node_slow_log:0.000000;system_auth.role_attributes:0.000000;system_traces.sessions_time_idx:0.000000;system_traces.node_slow_log_time_idx:0.000000;production.activities_v2:0.142687;system_traces.events:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.791391;production.feeds:0.753341;system_distributed.view_build_status:0.000000;system_distributed.service_levels:1.000000;system_traces.sessions:0.000000;production.feed_counters:0.862901;test.heartrate_v1:0.000000;production.activities:0.617211;system_distributed.cdc_streams_descriptions_v2:0.000000;system_auth.role_members:0.000000;
  X4:1
  LOAD:1631596291649
  X5:0:756442726:1723463624075
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  RELEASE_VERSION:3.0.8
  NET_VERSION:0
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  RACK:us-west1-b
  X6:31
  HOST_ID:813f49f9-e397-4d70-8300-79fa91817f11
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  DC:us-west1
/10.7.241.130
  generation:1723463283
  heartbeat:746
  RPC_ADDRESS:10.7.241.130
  X7:12
  X3:3
  X8:v2;1723193419331;a0d4c151-c4d9-4ada-a801-c39a82eb9602
  STATUS:NORMAL,8992710247996941333
  X2:system_distributed.cdc_generation_timestamps:0.999277;production.feed_truncations:0.000000;system_auth.roles:0.967668;system_traces.node_slow_log:0.000000;system_auth.role_attributes:0.000000;system_traces.sessions_time_idx:0.000000;system_traces.node_slow_log_time_idx:0.000000;production.activities_v2:0.153191;system_traces.events:0.000000;system_distributed_everywhere.cdc_generation_descriptions_v2:0.000000;production.feeds:0.893740;system_distributed.view_build_status:0.000000;system_distributed.service_levels:1.000000;system_traces.sessions:0.000000;production.feed_counters:0.942885;test.heartrate_v1:0.000000;production.activities:0.768185;system_distributed.cdc_streams_descriptions_v2:0.000000;system_auth.role_members:0.000000;
  X4:1
  LOAD:1721391700392
  X5:0:756442726:1723463295645
  X9:org.apache.cassandra.locator.GossipingPropertyFileSnitch
  RELEASE_VERSION:3.0.8
  NET_VERSION:0
  X1:AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,HINTED_HANDOFF_SEPARATE_CONNECTION,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
  RACK:us-west1-b
  X6:31
  HOST_ID:787555a6-89d6-4b33-941c-940415380062
  SCHEMA:d2474336-9cf3-3195-9445-8c250dcadb2f
  DC:us-west1

I don't know if something significant changed here, so I am just pasting it as it was returned. (The nodetool status output didn't change, except for minimal changes in the load values.)

gdubicki commented 2 months ago

I attach all the logs from all the nodes from the period: Aug 8, 6:55 pm CEST – Aug 8, 7:20 pm CEST.

extract-2024-08-12T15_40_23.250Z.csv.zip

gdubicki commented 2 months ago

I attach all the logs from all the nodes from the period: Aug 8, 8:13 pm CEST – Aug 8, 8:24 pm CEST

extract-2024-08-12T15_53_10.287Z.csv.zip

gdubicki commented 2 months ago

I attach all the logs from all the nodes from the period: Aug 8, 8:31 pm CEST – Aug 8, 9:15 pm CEST

extract-2024-08-12T16_06_59.243Z.csv.zip

gdubicki commented 2 months ago

The time periods above were chosen because some things started or stopped happening at those times, and to stay below the 100k limit. This is a visualization of the amount of logs during that whole time:

Screenshot 2024-08-12 at 18 17 44

gdubicki commented 2 months ago

Please let me know if there's anything that stands out, @zimnx. I will share more logs tomorrow.

gdubicki commented 2 months ago

I attach all the logs from all the nodes from the period: Aug 8, 9:15 pm CEST – Aug 8, 9:31 pm CEST

extract-2024-08-13T08_43_17.999Z.csv.zip

gdubicki commented 2 months ago

I attach all the logs from all the nodes from the period: Aug 8, 9:31 pm CEST – Aug 8, 9:50 pm CEST

extract-2024-08-13T08_47_46.619Z.csv.zip

gdubicki commented 2 months ago

I attach all the logs from all the nodes from the period: Aug 8, 9:50 pm CEST – Aug 8, 10:05 pm CEST

extract-2024-08-13T08_51_55.757Z.csv.zip

gdubicki commented 2 months ago

If you prefer I can merge these files into one. Let me know if I can help in any other way!

kbr-scylla commented 2 months ago

@gdubicki do attempts to remove the 2 ghost nodes still give the "Operation in progress" error after rolling restart?

gdubicki commented 2 months ago

I successfully removed the first one! 🥳

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool removenode aa434e8c-84a1-43a3-a398-28e4e6949e56
root@gke-main-scylla-6-25fcbc5b-1mnq:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.19.dev2+g9d49b38 | Scylla 5.4.9-0.20240703.fdcbbb85adcd | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;

 server_id                            | group_id
--------------------------------------+--------------------------------------
 5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
 5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
 787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
 813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
 880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
 ceb76652-0f39-4018-9fa5-dd8f0b25e85a | 904c8960-2c68-11ee-979c-be9922839fd2

(6 rows)

Will try the other one in a few minutes.

Update: the second one got removed correctly too. :)

kbr-scylla commented 2 months ago

Ok, after you remove the second one, you can try booting the new node again. Just make sure you purge the old directories (under /var/lib/scylla: data, commitlog, etc.) on the new node first, if you are reusing the same machine you used in the previous boot attempt.
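
With the default layout that purge means roughly the following (run on the node before Scylla starts; adjust the paths if your data directory is different):

rm -rf /var/lib/scylla/data/* \
       /var/lib/scylla/commitlog/* \
       /var/lib/scylla/hints/* \
       /var/lib/scylla/view_hints/*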

gdubicki commented 2 months ago

No, we will be starting with a new machine, with empty disk.

But how do you provision a new node with Scylla Operator? Should we remove the label scylla/replace="" from the appropriate Service?

kbr-scylla commented 2 months ago

But how do you provision a new node with Scylla Operator? Should be remove the label scylla/replace="" from the appropriate Service?

@scylladb/rnd-cloud-operator please advise

zimnx commented 2 months ago

Make sure the StatefulSet/ScyllaCluster replica counts reflect the existing state of the cluster. Clear all replace labels from the Services. Bring back the OrderedReady podManagementPolicy. Then bump the number of members in the ScyllaCluster to the expected value. It will create new pods via the StatefulSet.
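
Roughly, using the resource names from your cluster (adjust as needed):

# clear the replace labels from the affected member Services
kubectl -n scylla label svc scylla-us-west1-us-west1-b-0 scylla/replace-
kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 scylla/replace-
# then raise .spec.datacenter.racks[0].members back to the desired count
kubectl -n scylla edit scyllacluster scylla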

gdubicki commented 2 months ago

Thanks a lot! We will do it tomorrow morning, as it's a bit of hacking and it's safer to do it outside of our peak hours, which start now.

gdubicki commented 2 months ago

Ok, I started to do this today and things got a bit strange.

Make sure StatefulSet/ScyllaCluster replicas reflects your existing state of the cluster.

I actually didn't notice this part so I didn't do this. :|

But I would be afraid to change the ScyllaCluster from 7 nodes to 5 now: wouldn't that remove the last two Services along with their pods 5 and 6? That would give us 2 nodes down out of 5 with RF=3, so we would be 1 node away from losing data, and the cluster performance would not hold up during our peak hours. :/

Clear all replace labels from Services. Bring back OrderedReady podManagementPolicy.

I did this.

But a few minutes later I noticed that while Service 0 remained without the labels, the labels on Service 3 were restored, and they look like this now:

apiVersion: v1
kind: Service
metadata:
  annotations:
    internal.scylla-operator.scylladb.com/current-token-ring-hash: qLFKP9ngpFWPAL0uyeS9L9UdydMoPcJqYY4vMLkJTAVpxCdHO0iN113JaZHXXC2aJMoEoOuWBogBdKo+sgVvoQ==
    internal.scylla-operator.scylladb.com/host-id: c5214c14-6fb6-4ade-b5c9-01bf9f5b2029
    internal.scylla-operator.scylladb.com/last-cleaned-up-token-ring-hash: qLFKP9ngpFWPAL0uyeS9L9UdydMoPcJqYY4vMLkJTAVpxCdHO0iN113JaZHXXC2aJMoEoOuWBogBdKo+sgVvoQ==
    meta.helm.sh/release-name: scylla
    meta.helm.sh/release-namespace: scylla
    scylla-operator.scylladb.com/managed-hash: faqxjG8nRXLfj9++wiv/LxhVTYY7U8B28B78f8/UTD6cFqAvLSR/bMLnt/m2guunOinrulwyh2c3RmQ8jXO4Ww==
  creationTimestamp: "2023-11-29T13:46:14Z"
  labels:
    app: scylla
    app.kubernetes.io/managed-by: scylla-operator
    app.kubernetes.io/name: scylla
    internal.scylla-operator.scylladb.com/replacing-node-hostid: c5214c14-6fb6-4ade-b5c9-01bf9f5b2029
    scylla-operator.scylladb.com/scylla-service-type: member
    scylla/cluster: scylla
    scylla/datacenter: us-west1
    scylla/rack: us-west1-b
    scylla/replace: ""
  name: scylla-us-west1-us-west1-b-3
  namespace: scylla

At the same time the pod for this service (3) has disappeared. (We still have pod 0 in the Pending state.)

I attach logs from Scyllas from the last 30 minutes extract-2024-08-15T08_50_35.820Z.csv.zip and from Scylla Manager, Operator, etc. from the same time period extract-2024-08-15T08_48_23.576Z.csv.zip.

Plus some outputs:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.7.241.130  1.85 TB    256          ?       787555a6-89d6-4b33-941c-940415380062  us-west1-b
UN  10.7.241.175  2 TB       256          ?       5342afaf-c19c-4be2-ada1-929698a4c398  us-west1-b
UN  10.7.241.174  1.79 TB    256          ?       813f49f9-e397-4d70-8300-79fa91817f11  us-west1-b
UN  10.7.249.238  1.91 TB    256          ?       5cc72b36-6fcf-4790-a540-930e544d59d2  us-west1-b
UN  10.7.243.109  1.72 TB    256          ?       880977bf-7cbb-4e0f-be82-ded853da57aa  us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
root@gke-main-scylla-6-25fcbc5b-1mnq:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.19.dev2+g9d49b38 | Scylla 5.4.9-0.20240703.fdcbbb85adcd | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;

 server_id                            | group_id
--------------------------------------+--------------------------------------
 5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
 5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
 787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
 813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
 880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2

(5 rows)
cqlsh> select host_id, up from system.cluster_status;

 host_id                              | up
--------------------------------------+------
 787555a6-89d6-4b33-941c-940415380062 | True
 5342afaf-c19c-4be2-ada1-929698a4c398 | True
 813f49f9-e397-4d70-8300-79fa91817f11 | True
 5cc72b36-6fcf-4790-a540-930e544d59d2 | True
 880977bf-7cbb-4e0f-be82-ded853da57aa | True

(5 rows)

...although there's nothing new here.

gdubicki commented 2 months ago

I don't know if it's related, but I just noticed that our Scylla Manager is not working, see https://github.com/scylladb/scylla-manager/issues/3972

gdubicki commented 2 months ago

Do you think we can still try to add one more node to our cluster by adding a node to our node pool and seeing if it gets added to the cluster, or do you consider it unsafe in the current state, @zimnx?

zimnx commented 2 months ago

Please collect a new must-gather and attach it here, as I don't understand what state the ScyllaCluster and the Pods are in.

gdubicki commented 2 months ago

We added the node before we noticed your reply 🤦‍♂️ ...but it looks like it's bootstrapping correctly! 🥳

It has started on node gke-main-scylla-6-25fcbc5b-bq2w and this looks ok-ish, I guess:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UJ  10.7.252.229  ?          256          ?       null                                  us-west1-b
UN  10.7.241.130  1.87 TB    256          ?       787555a6-89d6-4b33-941c-940415380062  us-west1-b
UN  10.7.241.175  1.95 TB    256          ?       5342afaf-c19c-4be2-ada1-929698a4c398  us-west1-b
UN  10.7.241.174  1.77 TB    256          ?       813f49f9-e397-4d70-8300-79fa91817f11  us-west1-b
UN  10.7.249.238  1.83 TB    256          ?       5cc72b36-6fcf-4790-a540-930e544d59d2  us-west1-b
UN  10.7.243.109  1.74 TB    256          ?       880977bf-7cbb-4e0f-be82-ded853da57aa  us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
root@gke-main-scylla-6-25fcbc5b-1mnq:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.19.dev2+g9d49b38 | Scylla 5.4.9-0.20240703.fdcbbb85adcd | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select host_id, up from system.cluster_status;

 host_id                              | up
--------------------------------------+------
 00000000-0000-0000-0000-000000000000 | True
 787555a6-89d6-4b33-941c-940415380062 | True
 5342afaf-c19c-4be2-ada1-929698a4c398 | True
 813f49f9-e397-4d70-8300-79fa91817f11 | True
 5cc72b36-6fcf-4790-a540-930e544d59d2 | True
 880977bf-7cbb-4e0f-be82-ded853da57aa | True

(6 rows)
cqlsh> select server_id,group_id from system.raft_state ;

 server_id                            | group_id
--------------------------------------+--------------------------------------
 5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
 5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
 60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
 787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
 813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
 880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2

(6 rows)

I attach the Scylla logs from last 30 minutes.

extract-2024-08-15T15_21_46.921Z.csv.zip

Please collect a new must-gather and attach it here, as I don't understand what state the ScyllaCluster and the Pods are in.

Sure, will do in a few minutes.

gdubicki commented 2 months ago

scylla-operator-must-gather-hlgph85ggm86.zip

Thank you for all the help, @zimnx! 🤗

zimnx commented 2 months ago

So your 6th node is joining (ordinal 0), and your ScyllaCluster has 7 desired replicas. The Pod with ordinal 3 is missing and has a replace label. To bring the cluster back to the expected state, I would suggest removing both replace labels (scylla/replace: "" and internal.scylla-operator.scylladb.com/replacing-node-hostid: c5214c14-6fb6-4ade-b5c9-01bf9f5b2029) from Service scylla-us-west1-us-west1-b-3. Since you removed the ghost nodes, the replace is no longer needed. Once the -0 node joins the cluster, the StatefulSet controller will recreate the -3 Pod and it will start joining the cluster. Once the last, 7th node joins, your cluster should be fully reconciled, and only then should you carry out further topology changes if you still want to.
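
Something like:

kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 scylla/replace-
kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 internal.scylla-operator.scylladb.com/replacing-node-hostid-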

gdubicki commented 1 month ago

Thank you again for all your help, @zimnx!

We managed to get our cluster into a healthy state with all 7 nodes up:

root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.7.252.229  2.45 TB    256          ?       60daa392-6362-423d-93b2-1ff747903287  us-west1-b
UN  10.7.241.130  2.67 TB    256          ?       787555a6-89d6-4b33-941c-940415380062  us-west1-b
UN  10.7.241.175  2.94 TB    256          ?       5342afaf-c19c-4be2-ada1-929698a4c398  us-west1-b
UN  10.7.241.174  2.6 TB     256          ?       813f49f9-e397-4d70-8300-79fa91817f11  us-west1-b
UN  10.7.249.238  2.59 TB    256          ?       5cc72b36-6fcf-4790-a540-930e544d59d2  us-west1-b
UN  10.7.243.109  2.66 TB    256          ?       880977bf-7cbb-4e0f-be82-ded853da57aa  us-west1-b
UN  10.7.248.124  2.06 TB    256          ?       dea17e3f-198a-4ab8-b246-ff29e103941a  us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

root@gke-main-scylla-6-25fcbc5b-bq2w:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.19.dev2+g9d49b38 | Scylla 5.4.9-0.20240703.fdcbbb85adcd | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;

 server_id                            | group_id
--------------------------------------+--------------------------------------
 5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
 5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
 60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
 787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
 813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
 880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
 dea17e3f-198a-4ab8-b246-ff29e103941a | 904c8960-2c68-11ee-979c-be9922839fd2

(7 rows)
cqlsh> select host_id, up from system.cluster_status;

 host_id                              | up
--------------------------------------+------
 60daa392-6362-423d-93b2-1ff747903287 | True
 787555a6-89d6-4b33-941c-940415380062 | True
 dea17e3f-198a-4ab8-b246-ff29e103941a | True
 5342afaf-c19c-4be2-ada1-929698a4c398 | True
 813f49f9-e397-4d70-8300-79fa91817f11 | True
 5cc72b36-6fcf-4790-a540-930e544d59d2 | True
 880977bf-7cbb-4e0f-be82-ded853da57aa | True

(7 rows)
cqlsh>