scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

Can't replace or remove a node #2133

Open gdubicki opened 1 month ago

gdubicki commented 1 month ago

Context

We had a 7-node Scylla cluster in GCP, n2d-highmem-32 with 6TB local SSDs, running Scylla 6.0.3, Scylla Operator 1.13.0, Scylla Manager 3.3.3.

(It is the same cluster that was the protagonist of https://github.com/scylladb/scylla-operator/issues/2068)

Before the current issue started, we:

  1. upgraded from ScyllaDB 5.4.9 to 6.0.3,
  2. enabled the Raft-based consistent topology updates feature (as documented here),

...and wanted to upgrade to the latest 6.1.1 (as we were trying to fix https://github.com/scylladb/scylladb/issues/19793).

What happened

Pod 3 loses its data on 16th of Sep

On Sep 16th at 19:14 UTC, pod 3 (host ID dea17e3f-198a-4ab8-b246-ff29e103941a) lost its data.

The local volume provisioner on that node logged:

2024-09-16T19:14:27.217439362Z I0916 19:14:27.217317       1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:14:30.717550046Z I0916 19:14:30.717390       1 deleter.go:195] Start cleanup for pv local-pv-a3d05a4c
2024-09-16T19:14:30.718020906Z I0916 19:14:30.717933       1 deleter.go:266] Deleting PV file volume "local-pv-a3d05a4c" contents at hostpath "/mnt/raid-disks/disk0", mountpath "/mnt/raid-disks/disk0"
2024-09-16T19:15:42.190376535Z I0916 19:15:42.190192       1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:16:50.731826090Z I0916 19:16:50.731625       1 deleter.go:165] Deleting pv local-pv-a3d05a4c after successful cleanup
2024-09-16T19:16:50.745132170Z I0916 19:16:50.744977       1 cache.go:64] Updated pv "local-pv-a3d05a4c" to cache
2024-09-16T19:16:50.757703739Z I0916 19:16:50.757577       1 cache.go:73] Deleted pv "local-pv-a3d05a4c" from cache
2024-09-16T19:17:00.746148946Z I0916 19:17:00.745992       1 discovery.go:384] Found new volume at host path "/mnt/raid-disks/disk0" with capacity 6438149685248, creating Local PV "local-pv-a3d05a4c", required volumeMode "Filesystem"
2024-09-16T19:17:00.752645076Z I0916 19:17:00.752537       1 discovery.go:418] Created PV "local-pv-a3d05a4c" for volume at "/mnt/raid-disks/disk0"
2024-09-16T19:17:00.752837656Z I0916 19:17:00.752744       1 cache.go:55] Added pv "local-pv-a3d05a4c" to cache
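
For reference, this is roughly how the PV/PVC state can be inspected after such an event (the PV name comes from the logs above; the PVC name here is only an assumption, following the usual data-<statefulset>-<ordinal> naming):

$ kubectl get pv local-pv-a3d05a4c -o yaml
$ kubectl -n scylla get pvc
$ kubectl -n scylla describe pvc data-scylla-us-west1-us-west1-b-3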

The nodetool status output at that time looked like nothing was wrong:

root@gke-main-scylla-6-25fcbc5b-8hgv:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address      Load    Tokens Owns Host ID                              Rack
UN 10.7.241.130 2.39 TB 256    ?    787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.62 TB 256    ?    813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.70 TB 256    ?    5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.51 TB 256    ?    880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
UN 10.7.248.124 2.54 TB 256    ?    dea17e3f-198a-4ab8-b246-ff29e103941a us-west1-b
UN 10.7.249.238 2.31 TB 256    ?    5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.82 TB 256    ?    60daa392-6362-423d-93b2-1ff747903287 us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

The logs from 2024-09-16 19:10-19:34 UTC: extract-2024-09-30T14_19_57.781Z.csv.zip

But as the disk usage on the nodes was constantly growing, we assumed that the node would automatically get recreated, so we left it like that for ~2 days.

Pod 3 replacement fails on 18th of Sep

Then we noticed that it was failing to start with:

ERROR 2024-09-18 07:52:08,434 [shard  0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID dea17e3f-198a-4ab8-b246-ff29e103941a not found)

...and that nodetool status was showing this:

root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address      Load    Tokens Owns Host ID                              Rack
UN 10.7.241.130 2.39 TB 256    ?    787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.61 TB 256    ?    813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.69 TB 256    ?    5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.51 TB 256    ?    880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 2.54 TB 256    ?    dff2a772-f3e3-4e64-a380-7deaa1bf96df us-west1-b
UN 10.7.249.238 2.32 TB 256    ?    5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.82 TB 256    ?    60daa392-6362-423d-93b2-1ff747903287 us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

The old host ID from the error message was nowhere to be found:

root@gke-main-scylla-6-25fcbc5b-bq2w:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.20 | Scylla 6.0.3-0.20240808.a56f7ce21ad4 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;

 server_id                            | group_id
--------------------------------------+--------------------------------------
 5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
 5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
 60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
 787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
 813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
 880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2
 dff2a772-f3e3-4e64-a380-7deaa1bf96df | 904c8960-2c68-11ee-979c-be9922839fd2

(7 rows)
cqlsh> select host_id, up from system.cluster_status;

 host_id                              | up
--------------------------------------+-------
 60daa392-6362-423d-93b2-1ff747903287 |  True
 787555a6-89d6-4b33-941c-940415380062 |  True
 dff2a772-f3e3-4e64-a380-7deaa1bf96df | False
 5342afaf-c19c-4be2-ada1-929698a4c398 |  True
 813f49f9-e397-4d70-8300-79fa91817f11 |  True
 5cc72b36-6fcf-4790-a540-930e544d59d2 |  True
 880977bf-7cbb-4e0f-be82-ded853da57aa |  True

(7 rows)

Another Pod 3 replacement attempt fails on 19th of Sep

We tried deleting the old node that had the local SSD issue, creating a new one in its place, and letting the cluster do the node replacement again, but it failed with a similar error to the one above:

ERROR 2024-09-19 14:06:53,099 [shard  0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID dff2a772-f3e3-4e64-a380-7deaa1bf96df not found)

Our cluster looked like this then:

root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address      Load    Tokens Owns Host ID                              Rack
UN 10.7.241.130 2.42 TB 256    ?    787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.65 TB 256    ?    813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 2.72 TB 256    ?    5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.54 TB 256    ?    880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 2.54 TB 256    ?    3ec289d5-5910-4759-93bc-6e26ab5cda9f us-west1-b
UN 10.7.249.238 2.34 TB 256    ?    5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 2.85 TB 256    ?    60daa392-6362-423d-93b2-1ff747903287 us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
root@gke-main-scylla-6-25fcbc5b-bq2w:/# cqlsh
Connected to scylla at 0.0.0.0:9042
[cqlsh 6.0.20 | Scylla 6.0.3-0.20240808.a56f7ce21ad4 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select server_id,group_id from system.raft_state ;

 server_id                            | group_id
--------------------------------------+--------------------------------------
 3ec289d5-5910-4759-93bc-6e26ab5cda9f | 904c8960-2c68-11ee-979c-be9922839fd2
 5342afaf-c19c-4be2-ada1-929698a4c398 | 904c8960-2c68-11ee-979c-be9922839fd2
 5cc72b36-6fcf-4790-a540-930e544d59d2 | 904c8960-2c68-11ee-979c-be9922839fd2
 60daa392-6362-423d-93b2-1ff747903287 | 904c8960-2c68-11ee-979c-be9922839fd2
 787555a6-89d6-4b33-941c-940415380062 | 904c8960-2c68-11ee-979c-be9922839fd2
 813f49f9-e397-4d70-8300-79fa91817f11 | 904c8960-2c68-11ee-979c-be9922839fd2
 880977bf-7cbb-4e0f-be82-ded853da57aa | 904c8960-2c68-11ee-979c-be9922839fd2

(7 rows)
cqlsh> select host_id, up from system.cluster_status;

 host_id                              | up
--------------------------------------+-------
 60daa392-6362-423d-93b2-1ff747903287 |  True
 787555a6-89d6-4b33-941c-940415380062 |  True
 3ec289d5-5910-4759-93bc-6e26ab5cda9f | False
 5342afaf-c19c-4be2-ada1-929698a4c398 |  True
 813f49f9-e397-4d70-8300-79fa91817f11 |  True
 5cc72b36-6fcf-4790-a540-930e544d59d2 |  True
 880977bf-7cbb-4e0f-be82-ded853da57aa |  True

(7 rows)

Node removal fails on 24th of Sep

At this point we decided to try to remove the down node with ID 3ec289d5-5910-4759-93bc-6e26ab5cda9f from the cluster, so that we could continue our original task of upgrading Scylla to 6.1.1, planning to go back to replacing the missing node after that.

However, the removenode operation also failed:

ERROR 2024-09-24 15:21:28,364 [shard  0:strm] raft_topology - Removenode failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (raft topology: exec_global_command(stream_ranges) failed with std::runtime_error (failed status returned from 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109))). Request ID: af34f102-7977-11ef-db22-982520c2c047

We couldn't find any meaningful errors before this message, so I'm attaching ~100k lines of logs from 2024-09-24 14:15-15:22 UTC here: extract-2024-09-30T14_10_55.525Z.csv.zip

Node removal fails after a retry on 27th of Sep

Retrying removenode didn't work:

root@gke-main-scylla-6-25fcbc5b-bq2w:/# nodetool removenode 3ec289d5-5910-4759-93bc-6e26ab5cda9f
error executing POST request to http://localhost:10000/storage_service/remove_node with parameters {"host_id": "3ec289d5-5910-4759-93bc-6e26ab5cda9f"}: remote replied with status code 500 Internal Server Error:
std::runtime_error (removenode: node 3ec289d5-5910-4759-93bc-6e26ab5cda9f is in 'removing' state. Wait for it to be in 'normal' state)

We tried a rolling restart of the cluster and a retry, similar to what we did in https://github.com/scylladb/scylla-operator/issues/2068, but that did not help this time. The error message was the same as before, just with a different timestamp:

ERROR 2024-09-27 18:36:15,901 [shard  0:strm] raft_topology - Removenode failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (raft topology: exec_global_command(stream_ranges) failed with std::runtime_error (failed status returned from 880977bf-7cbb-4e0f-be82-ded853da57aa/10.7.243.109))). Request ID: 28a3f924-7bef-11ef-17f5-dfc99f47da8e

Additional info

During this time we were repeatedly surprised to find our Scylla disks filling up with snapshots, getting dangerously close to 80% disk usage, for example:

root@gke-main-scylla-6-25fcbc5b-bq2w:/# du -hs /var/lib/scylla/data/production/activities-5de370f07b2011ed86b49f900e84b8e9/snapshots
2.9T    /var/lib/scylla/data/production/activities-5de370f07b2011ed86b49f900e84b8e9/snapshots

We cleared the snapshots when that happened using nodetool clearsnapshot.
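
A minimal sketch of that cleanup, in case it helps someone else (the keyspace name follows the path above; clearsnapshot without -t drops all snapshots for the given keyspace):

$ nodetool listsnapshots
$ nodetool clearsnapshot production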

must-gather output

scylla-operator-must-gather-w7rn9tspr85z.zip

gdubicki commented 1 month ago

Please note that Scylla Operator 1.14.0, ScyllaDB 6.0.4, and ScyllaDB 6.1.2 were all released after we started working on the original task here (upgrading to 6.0.3 and then to 6.1.1).

gdubicki commented 1 month ago

The main problem we have with this cluster state is that the backups are failing:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool progress   --cluster scylla/scylla backup/monday-backup-r2
Run:        53e4fb2c-7f1f-11ef-b9a8-c68ce6620235
Status:     ERROR (initialising)
Cause:      get backup target: create units: system_auth.role_members: the whole replica set [10.7.248.124] is filtered out, so the data owned by it can't be backed up
Start time: 30 Sep 24 11:30:00 UTC
End time:   30 Sep 24 11:35:59 UTC
Duration:   5m59s
Progress:   -
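
For completeness, Scylla Manager's own view of the cluster can be checked with something like this (same cluster name as in the command above):

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla/scylla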

We would really appreciate any hints on how to move forward!

zimnx commented 1 month ago

What happens now if you add a Kubernetes Node able to host Pod 3? It seems the Operator will try to replace the known 3ec289d5-5910-4759-93bc-6e26ab5cda9f. Please attach a new must-gather after you add it.

gdubicki commented 1 month ago

What happens now if you add a Kubernetes Node able to host Pod 3? It seems the Operator will try to replace the known 3ec289d5-5910-4759-93bc-6e26ab5cda9f.

Just did that.

The logs from the first ~10 minutes since starting the new node: extract-2024-10-10T10_43_13.937Z.csv.zip

The nodetool status now shows:

root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address      Load    Tokens Owns Host ID                              Rack
UN 10.7.241.130 2.78 TB 256    ?    787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.95 TB 256    ?    813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 3.22 TB 256    ?    5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.91 TB 256    ?    880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
UN 10.7.248.124 ?       256    ?    3ec289d5-5910-4759-93bc-6e26ab5cda9f us-west1-b
UN 10.7.249.238 2.70 TB 256    ?    5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 3.08 TB 256    ?    60daa392-6362-423d-93b2-1ff747903287 us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

Doesn't UN seem like a wrong state here? Shouldn't it be UJ? (Btw, it was DN before I added the k8s node.)

Please attach a new must-gather after you add it.

Generating it now...

gdubicki commented 1 month ago

scylla-operator-must-gather-4xpb5nbck5vj.zip

cc @zimnx

zimnx commented 1 month ago

From the logs it seems to be joining just fine

gdubicki commented 1 month ago

From the logs it seems to be joining just fine

True! I hope it will complete correctly. 🤞 I will report the results here; it will probably take many hours, or a day or two.

Thanks @zimnx!

gdubicki commented 1 month ago

Unfortunately, something went wrong again, @zimnx. :(

One of the errors I see is:

ERROR 2024-10-11 08:34:11,571 [shard  0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID 3ec289d5-5910-4759-93bc-6e26ab5cda9f not found)

The nodetool status output indeed doesn't show this ID anymore:

$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool status
Defaulted container "scylla" out of: scylla, scylladb-api-status-probe, scylla-manager-agent, sidecar-injection (init), sysctl-buddy (init)
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address      Load    Tokens Owns Host ID                              Rack
UN 10.7.241.130 2.74 TB 256    ?    787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.93 TB 256    ?    813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 3.18 TB 256    ?    5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.88 TB 256    ?    880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 ?       256    ?    c16ae0c9-33bf-4c99-8f44-d995eff274f2 us-west1-b
UN 10.7.249.238 2.66 TB 256    ?    5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 3.05 TB 256    ?    60daa392-6362-423d-93b2-1ff747903287 us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

Here's the must-gather generated a few minutes ago: scylla-operator-must-gather-q9lcjrmkkgrh.zip

I will also attach some logs in a few minutes; they are being exported now.

Please let me know if you need anything else!

gdubicki commented 1 month ago

Logs from the first ~6h of the bootstrap (Oct 10th, 10:30-16:15 UTC): extract-2024-10-11T08_49_50.794Z.csv.zip

zimnx commented 1 month ago

Looks like the replacing node crashed with a core dump:

2024-10-11T07:39:03.632035907Z WARN  2024-10-11 07:39:03,631 [shard 22: gms] token_metadata - topology version 19 held for 12114.809 [s] past expiry, released at: 0x6469d6e 0x646a380 0x646a668 0x3ff477a 0x3fe41d6 0x3f5ca4d 0x5f62a1f 0x5f63d07 0x5f87c70 0x5f2312a /opt/scylladb/libreloc/libc.so.6+0x8c946 /opt/scylladb/libreloc/libc.so.6+0x11296f
2024-10-11T07:39:03.632058147Z    --------
2024-10-11T07:39:03.632066967Z    seastar::internal::do_with_state<std::tuple<std::unordered_map<dht::token, utils::small_vector<utils::tagged_uuid<locator::host_id_tag>, 3ul>, std::hash<dht::token>, std::equal_to<dht::token>, std::allocator<std::pair<dht::token const, utils::small_vector<utils::tagged_uuid<locator::host_id_tag>, 3ul> > > >, boost::icl::interval_map<dht::token, std::unordered_set<utils::tagged_uuid<locator::host_id_tag>, std::hash<utils::tagged_uuid<locator::host_id_tag> >, std::equal_to<utils::tagged_uuid<locator::host_id_tag> >, std::allocator<utils::tagged_uuid<locator::host_id_tag> > >, boost::icl::partial_absorber, std::less, boost::icl::inplace_plus, boost::icl::inter_section, boost::icl::continuous_interval<dht::token, std::less>, std::allocator>, boost::icl::interval_map<dht::token, std::unordered_set<utils::tagged_uuid<locator::host_id_tag>, std::hash<utils::tagged_uuid<locator::host_id_tag> >, std::equal_to<utils::tagged_uuid<locator::host_id_tag> >, std::allocator<utils::tagged_uuid<locator::host_id_tag> > >, boost::icl::partial_absorber, std::less, boost::icl::inplace_plus, boost::icl::inter_section, boost::icl::continuous_interval<dht::token, std::less>, std::allocator>, seastar::lw_shared_ptr<locator::token_metadata const> >, seastar::future<void> >
2024-10-11T07:39:03.635583946Z WARN  2024-10-11 07:39:03,635 [shard  0: gms] token_metadata - topology version 19 held for 12114.812 [s] past expiry, released at: 0x6469d6e 0x646a380 0x646a668 0x3ff477a 0x3fe41d6 0x429ebb4 0x144f19a 0x5f62a1f 0x5f63d07 0x5f63068 0x5ef1017 0x5ef01dc 0x13deae8 0x13e0530 0x13dd0b9 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13da4a4
2024-10-11T07:39:03.635618196Z    --------
2024-10-11T07:39:03.635624426Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635628486Z    --------
2024-10-11T07:39:03.635632956Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635636786Z    --------
2024-10-11T07:39:03.635640506Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635644316Z    --------
2024-10-11T07:39:03.635648066Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635651616Z    --------
2024-10-11T07:39:03.635655256Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635659106Z    --------
2024-10-11T07:39:03.635663056Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635666656Z    --------
2024-10-11T07:39:03.635670336Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095728321Z ERROR 2024-10-11 07:39:04,095 [shard  0: gms] raft_topology - Cannot map id of a node being replaced 3ec289d5-5910-4759-93bc-6e26ab5cda9f to its ip, at: 0x6469d6e 0x646a380 0x646a668 0x5f2251e 0x5f226d7 0x4080b56 0x4295f91 0x4151d0a 0x5f62a1f 0x5f63d07 0x5f63068 0x5ef1017 0x5ef01dc 0x13deae8 0x13e0530 0x13dd0b9 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13da4a4
2024-10-11T07:39:04.095755801Z    --------
2024-10-11T07:39:04.095760711Z    seastar::internal::coroutine_traits_base<service::storage_service::nodes_to_notify_after_sync>::promise_type
2024-10-11T07:39:04.095764171Z    --------
2024-10-11T07:39:04.095768341Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095771381Z    --------
2024-10-11T07:39:04.095785011Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095788291Z    --------
2024-10-11T07:39:04.095791671Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095794951Z    --------
2024-10-11T07:39:04.095798571Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095801881Z    --------
2024-10-11T07:39:04.095805241Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095808311Z    --------
2024-10-11T07:39:04.095811341Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095814321Z    --------
2024-10-11T07:39:04.095817571Z    seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095957031Z Aborting on shard 0.
2024-10-11T07:39:04.095963631Z Backtrace:
2024-10-11T07:39:04.095967051Z   0x5f50de8
2024-10-11T07:39:04.095970331Z   0x5f87671
2024-10-11T07:39:04.095974061Z   /opt/scylladb/libreloc/libc.so.6+0x3dbaf
2024-10-11T07:39:04.095977171Z   /opt/scylladb/libreloc/libc.so.6+0x8e883
2024-10-11T07:39:04.095980631Z   /opt/scylladb/libreloc/libc.so.6+0x3dafd
2024-10-11T07:39:04.095984141Z   /opt/scylladb/libreloc/libc.so.6+0x2687e
2024-10-11T07:39:04.095987521Z   0x5f226dc
2024-10-11T07:39:04.095990761Z   0x4080b56
2024-10-11T07:39:04.095993821Z   0x4295f91
2024-10-11T07:39:04.095997091Z   0x4151d0a
2024-10-11T07:39:04.096000451Z   0x5f62a1f
2024-10-11T07:39:04.096003691Z   0x5f63d07
2024-10-11T07:39:04.096006851Z   0x5f63068
2024-10-11T07:39:04.096010071Z   0x5ef1017
2024-10-11T07:39:04.096013441Z   0x5ef01dc
2024-10-11T07:39:04.096016541Z   0x13deae8
2024-10-11T07:39:04.096019621Z   0x13e0530
2024-10-11T07:39:04.096022801Z   0x13dd0b9
2024-10-11T07:39:04.096026141Z   /opt/scylladb/libreloc/libc.so.6+0x27b89
2024-10-11T07:39:04.096029331Z   /opt/scylladb/libreloc/libc.so.6+0x27c4a
2024-10-11T07:39:04.096032511Z   0x13da4a4
2024-10-11T07:43:28.272751467Z 2024-10-11 07:43:28,271 INFO exited: scylla (terminated by SIGABRT (core dumped); not expected)
zimnx commented 1 month ago

Could you check whether a coredump was saved? /proc/sys/kernel/core_pattern on the gke-main-scylla-6-25fcbc5b-412m node should contain the location for coredumps. If it's there, please upload it following this guide: https://opensource.docs.scylladb.com/stable/troubleshooting/report-scylla-problem.html#send-files-to-scylladb-support
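
Something like this should show it (just a sketch; core_pattern is a node-global kernel setting, so reading it from any container on that node works too):

$ kubectl debug node/gke-main-scylla-6-25fcbc5b-412m -it --image=busybox -- cat /proc/sys/kernel/core_pattern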


At this point the node will crash in a loop because 3ec289d5-5910-4759-93bc-6e26ab5cda9f is no longer known, as it was replaced by c16ae0c9-33bf-4c99-8f44-d995eff274f2.

I would suggest retrying the replacement of c16ae0c9-33bf-4c99-8f44-d995eff274f2; maybe you won't hit the crash again.

To do so, remove the internal.scylla-operator.scylladb.com/replacing-node-hostid: 3ec289d5-5910-4759-93bc-6e26ab5cda9f label from the Service scylla-us-west1-us-west1-b-3. The Operator will then trigger the replacement of c16ae0c9-33bf-4c99-8f44-d995eff274f2.
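
For example (just a sketch; the trailing dash removes the label):

$ kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 internal.scylla-operator.scylladb.com/replacing-node-hostid-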

zimnx commented 1 month ago

Alternatively, you can try removing both scylla/replace and internal.scylla-operator.scylladb.com/replacing-node-hostid from the Service and restarting the Pod. It would boot without the replace parameter and might continue streaming the rest of the data. Probably quicker than repeating the entire replace procedure.
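
A sketch of that alternative (again, label removal uses the trailing-dash form):

$ kubectl -n scylla label svc scylla-us-west1-us-west1-b-3 scylla/replace- internal.scylla-operator.scylladb.com/replacing-node-hostid-
$ kubectl -n scylla delete pod scylla-us-west1-us-west1-b-3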

zimnx commented 1 month ago

Decoded backtrace:

2024-10-11T07:39:04.095963631Z Backtrace:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:825
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:855
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:867
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4071
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/release/seastar/./seastar/src/core/reactor.cc:4047
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/release/seastar/./seastar/src/core/reactor.cc:4043
/data/scylla-s3-reloc.cache/by-build-id/00ad3169bb53c452cf2ab93d97785dc56117ac3e/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=9148cab1b932d44ef70e306e9c02ee38d06cad51, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
seastar::on_fatal_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:81
service::storage_service::sync_raft_topology_nodes(seastar::lw_shared_ptr<locator::token_metadata>, std::optional<utils::tagged_uuid<locator::host_id_tag> >, std::unordered_set<utils::tagged_uuid<raft::server_id_tag>, std::hash<utils::tagged_uuid<raft::server_id_tag> >, std::equal_to<utils::tagged_uuid<raft::server_id_tag> >, std::allocator<utils::tagged_uuid<raft::server_id_tag> > >)::$_1::operator()(utils::tagged_uuid<raft::server_id_tag>, service::replica_state const&) const at ./service/storage_service.cc:?
service::storage_service::sync_raft_topology_nodes(seastar::lw_shared_ptr<locator::token_metadata>, std::optional<utils::tagged_uuid<locator::host_id_tag> >, std::unordered_set<utils::tagged_uuid<raft::server_id_tag>, std::hash<utils::tagged_uuid<raft::server_id_tag> >, std::equal_to<utils::tagged_uuid<raft::server_id_tag> >, std::allocator<utils::tagged_uuid<raft::server_id_tag> > >) at ./service/storage_service.cc:607
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<service::storage_service::nodes_to_notify_after_sync>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
 (inlined by) seastar::internal::coroutine_traits_base<service::storage_service::nodes_to_notify_after_sync>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:83
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2690
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3152
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3320
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3210
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2211
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?
gdubicki commented 1 month ago

Could you check whether a coredump was saved? /proc/sys/kernel/core_pattern on the gke-main-scylla-6-25fcbc5b-412m node should contain the location for coredumps.

The core dump apparently was not saved; I couldn't find it at that location (/core.%e.%p.%t) or in a few other standard places...

gdubicki commented 1 month ago

I would suggest retrying the replacement of c16ae0c9-33bf-4c99-8f44-d995eff274f2; maybe you won't hit the crash again.

To do so, remove the internal.scylla-operator.scylladb.com/replacing-node-hostid: 3ec289d5-5910-4759-93bc-6e26ab5cda9f label from the Service scylla-us-west1-us-west1-b-3. The Operator will then trigger the replacement of c16ae0c9-33bf-4c99-8f44-d995eff274f2.

Alternatively, you can try removing both scylla/replace and internal.scylla-operator.scylladb.com/replacing-node-hostid from the Service and restarting the Pod. It would boot without the replace parameter and might continue streaming the rest of the data. Probably quicker than repeating the entire replace procedure.

I think I already tried both of these approaches (see the issue description), but I will try again. I will probably start on Monday morning, though.

zimnx commented 1 month ago

In the meantime, please report an issue in the Scylla repo; it shouldn't crash during replacement. Attach ~3h of logs from before the crash (2024-10-11T07:39:04) and the backtrace.
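
If it helps, something like this should pull roughly the last ~3h of the crashed container's logs (names follow your earlier commands; --previous targets the container instance that crashed):

$ kubectl -n scylla logs scylla-us-west1-us-west1-b-3 -c scylla --previous --since-time=2024-10-11T04:39:00Z > pod3-pre-crash.log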

gdubicki commented 1 month ago

Alternatively, you can try removing both scylla/replace and internal.scylla-operator.scylladb.com/replacing-node-hostid from the Service and restarting the Pod. It would boot without the replace parameter and might continue streaming the rest of the data. Probably quicker than repeating the entire replace procedure.

Trying that now...

gdubicki commented 1 month ago

It didn't continue to stream the data, @zimnx. :( The disk usage stats suggest that the disk has been cleaned and it's bootstrapping the data from scratch.

[image: disk usage graph]

The nodetool status output:

$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool status
Defaulted container "scylla" out of: scylla, scylladb-api-status-probe, scylla-manager-agent, sidecar-injection (init), sysctl-buddy (init)
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address      Load    Tokens Owns Host ID                              Rack
UN 10.7.241.130 2.71 TB 256    ?    787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.91 TB 256    ?    813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 3.12 TB 256    ?    5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.85 TB 256    ?    880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
UN 10.7.248.124 ?       256    ?    c16ae0c9-33bf-4c99-8f44-d995eff274f2 us-west1-b
UN 10.7.249.238 2.61 TB 256    ?    5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 3.03 TB 256    ?    60daa392-6362-423d-93b2-1ff747903287 us-west1-b

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

The logs from pod 3 since I restarted it: logs.txt.gz

I can see quite a lot of exceptions there:

WARN  2024-10-14 07:47:33,810 [shard  7:stmt] storage_proxy - Failed to apply mutation from 10.7.243.109#2: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 928a90eb-dacc-3aca-934e-d51eb198c063): data_dictionary::no_such_keyspace (Can't find a keyspace production)
WARN  2024-10-14 07:47:33,985 [shard  2:strm] storage_proxy - Failed to apply mutation from 10.7.241.175#28: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 40ebc17c-74b9-3f0e-bf24-10491b26a1fc): exceptions::invalid_request_exception (Unknown type production.feed_id)
INFO  2024-10-14 08:03:10,915 [shard  2:mt2c] lsa - LSA allocation failure, increasing reserve in section 0x61b008cfc620 to 2 segments; trace: 0x6469d6e 0x646a380 0x646a668 0x215468d 0x1fe969c 0x2129655 0x6341216
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memta
ble&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#2}, seastar::future<void> >(row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preempti
on_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, auto:1, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, auto:1&&, (auto:2&&)...)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::fina
lly_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_
updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, auto:1, basic_preemption_source&)::{lambda()#1}::operator()() const::{lamb
da()#2}>(seastar::thread_attributes, auto:1&&, (auto:2&&)...)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false> >(seastar::future<void>::finally_body<row_ca
che::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, auto:1, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_0::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}::operator()()::{lambda(auto:1)#1}, seastar::future<void>::then_wrapped_nrvo<void, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_0::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator(
)<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}::operator()()::{lambda(auto:1)#1}>(std::function<seastar::future<void> ()>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_0::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(auto:1&, auto:2&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}::operator()()::{lambda(auto:1)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
(...)
scylla-operator-bot[bot] commented 1 day ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle stale