gdubicki opened this issue 1 month ago
Please note that Scylla Operator 1.14.0, ScyllaDB 6.0.4, and ScyllaDB 6.1.2 were all released after we started working on the original task here (upgrading to 6.0.3 and then 6.1.1).
The main problem we have with this cluster state is that the backups are failing:
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool progress --cluster scylla/scylla backup/monday-backup-r2
Run: 53e4fb2c-7f1f-11ef-b9a8-c68ce6620235
Status: ERROR (initialising)
Cause: get backup target: create units: system_auth.role_members: the whole replica set [10.7.248.124] is filtered out, so the data owned by it can't be backed up
Start time: 30 Sep 24 11:30:00 UTC
End time: 30 Sep 24 11:35:59 UTC
Duration: 5m59s
Progress: -
We would really appreciate any hints on how to move forward!
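(For context, the Manager's own view of node availability can be checked with sctool status; a minimal sketch, assuming the same Manager deployment and cluster name as in the progress command above:)
$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla/scylla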
What happens now if you add a Kubernetes Node able to host Pod 3? Seems like the Operator will try to replace the known 3ec289d5-5910-4759-93bc-6e26ab5cda9f.
Please attach a new must-gather after you add it.
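(A hedged sketch of collecting the archive; the must-gather subcommand and --all-resources flag are taken from the Operator docs from memory, so verify with scylla-operator must-gather --help:)
# assuming the scylla-operator binary is available locally and KUBECONFIG points at the affected cluster
$ scylla-operator must-gather --all-resources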
What happens now if you add a Kubernetes Node able to host Pod 3? Seems like the Operator will try to replace the known 3ec289d5-5910-4759-93bc-6e26ab5cda9f.

Just did that.
The logs from the first ~10 minutes since starting the new node: extract-2024-10-10T10_43_13.937Z.csv.zip
The nodetool status now shows:
root@gke-main-scylla-6-25fcbc5b-1mnq:/# nodetool status
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.78 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.95 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 3.22 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.91 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
UN 10.7.248.124 ? 256 ? 3ec289d5-5910-4759-93bc-6e26ab5cda9f us-west1-b
UN 10.7.249.238 2.70 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 3.08 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
Doesn't UN seem like a wrong state here? Shouldn't it be UJ? (Btw, it was DN before I added the k8s node.)
Please attach a new must-gather after you add it.
Generating it now...
From the logs it seems to be joining just fine
From the logs it seems to be joining just fine
True! I hope it will complete correctly. 🤞 I will report the results here, it will probably take many hours/a day or two.
Thanks @zimnx!
Unfortunately, something went wrong again, @zimnx. :(
One of the errors I see is:
ERROR 2024-10-11 08:34:11,571 [shard 0:main] init - Startup failed: std::runtime_error (Replaced node with Host ID 3ec289d5-5910-4759-93bc-6e26ab5cda9f not found)
The nodetool status in fact doesn't show this id anymore:
$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool status
Defaulted container "scylla" out of: scylla, scylladb-api-status-probe, scylla-manager-agent, sidecar-injection (init), sysctl-buddy (init)
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.74 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.93 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 3.18 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.88 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
DN 10.7.248.124 ? 256 ? c16ae0c9-33bf-4c99-8f44-d995eff274f2 us-west1-b
UN 10.7.249.238 2.66 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 3.05 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
Here's the must-gather generated a few minutes ago: scylla-operator-must-gather-q9lcjrmkkgrh.zip
I will also attach some logs in a few minutes, they are being exported now.
Please let me know if you need anything else!
Logs from the first ~6h of the bootstrap (Oct 10th, 10:30-16:15 UTC): extract-2024-10-11T08_49_50.794Z.csv.zip
Looks like the node that was doing the replace crashed with a core dump:
2024-10-11T07:39:03.632035907Z WARN 2024-10-11 07:39:03,631 [shard 22: gms] token_metadata - topology version 19 held for 12114.809 [s] past expiry, released at: 0x6469d6e 0x646a380 0x646a668 0x3ff477a 0x3fe41d6 0x3f5ca4d 0x5f62a1f 0x5f63d07 0x5f87c70 0x5f2312a /opt/scylladb/libreloc/libc.so.6+0x8c946 /opt/scylladb/libreloc/libc.so.6+0x11296f
2024-10-11T07:39:03.632058147Z --------
2024-10-11T07:39:03.632066967Z seastar::internal::do_with_state<std::tuple<std::unordered_map<dht::token, utils::small_vector<utils::tagged_uuid<locator::host_id_tag>, 3ul>, std::hash<dht::token>, std::equal_to<dht::token>, std::allocator<std::pair<dht::token const, utils::small_vector<utils::tagged_uuid<locator::host_id_tag>, 3ul> > > >, boost::icl::interval_map<dht::token, std::unordered_set<utils::tagged_uuid<locator::host_id_tag>, std::hash<utils::tagged_uuid<locator::host_id_tag> >, std::equal_to<utils::tagged_uuid<locator::host_id_tag> >, std::allocator<utils::tagged_uuid<locator::host_id_tag> > >, boost::icl::partial_absorber, std::less, boost::icl::inplace_plus, boost::icl::inter_section, boost::icl::continuous_interval<dht::token, std::less>, std::allocator>, boost::icl::interval_map<dht::token, std::unordered_set<utils::tagged_uuid<locator::host_id_tag>, std::hash<utils::tagged_uuid<locator::host_id_tag> >, std::equal_to<utils::tagged_uuid<locator::host_id_tag> >, std::allocator<utils::tagged_uuid<locator::host_id_tag> > >, boost::icl::partial_absorber, std::less, boost::icl::inplace_plus, boost::icl::inter_section, boost::icl::continuous_interval<dht::token, std::less>, std::allocator>, seastar::lw_shared_ptr<locator::token_metadata const> >, seastar::future<void> >
2024-10-11T07:39:03.635583946Z WARN 2024-10-11 07:39:03,635 [shard 0: gms] token_metadata - topology version 19 held for 12114.812 [s] past expiry, released at: 0x6469d6e 0x646a380 0x646a668 0x3ff477a 0x3fe41d6 0x429ebb4 0x144f19a 0x5f62a1f 0x5f63d07 0x5f63068 0x5ef1017 0x5ef01dc 0x13deae8 0x13e0530 0x13dd0b9 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13da4a4
2024-10-11T07:39:03.635618196Z --------
2024-10-11T07:39:03.635624426Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635628486Z --------
2024-10-11T07:39:03.635632956Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635636786Z --------
2024-10-11T07:39:03.635640506Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635644316Z --------
2024-10-11T07:39:03.635648066Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635651616Z --------
2024-10-11T07:39:03.635655256Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635659106Z --------
2024-10-11T07:39:03.635663056Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:03.635666656Z --------
2024-10-11T07:39:03.635670336Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095728321Z ERROR 2024-10-11 07:39:04,095 [shard 0: gms] raft_topology - Cannot map id of a node being replaced 3ec289d5-5910-4759-93bc-6e26ab5cda9f to its ip, at: 0x6469d6e 0x646a380 0x646a668 0x5f2251e 0x5f226d7 0x4080b56 0x4295f91 0x4151d0a 0x5f62a1f 0x5f63d07 0x5f63068 0x5ef1017 0x5ef01dc 0x13deae8 0x13e0530 0x13dd0b9 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13da4a4
2024-10-11T07:39:04.095755801Z --------
2024-10-11T07:39:04.095760711Z seastar::internal::coroutine_traits_base<service::storage_service::nodes_to_notify_after_sync>::promise_type
2024-10-11T07:39:04.095764171Z --------
2024-10-11T07:39:04.095768341Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095771381Z --------
2024-10-11T07:39:04.095785011Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095788291Z --------
2024-10-11T07:39:04.095791671Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095794951Z --------
2024-10-11T07:39:04.095798571Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095801881Z --------
2024-10-11T07:39:04.095805241Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095808311Z --------
2024-10-11T07:39:04.095811341Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095814321Z --------
2024-10-11T07:39:04.095817571Z seastar::internal::coroutine_traits_base<void>::promise_type
2024-10-11T07:39:04.095957031Z Aborting on shard 0.
2024-10-11T07:39:04.095963631Z Backtrace:
2024-10-11T07:39:04.095967051Z 0x5f50de8
2024-10-11T07:39:04.095970331Z 0x5f87671
2024-10-11T07:39:04.095974061Z /opt/scylladb/libreloc/libc.so.6+0x3dbaf
2024-10-11T07:39:04.095977171Z /opt/scylladb/libreloc/libc.so.6+0x8e883
2024-10-11T07:39:04.095980631Z /opt/scylladb/libreloc/libc.so.6+0x3dafd
2024-10-11T07:39:04.095984141Z /opt/scylladb/libreloc/libc.so.6+0x2687e
2024-10-11T07:39:04.095987521Z 0x5f226dc
2024-10-11T07:39:04.095990761Z 0x4080b56
2024-10-11T07:39:04.095993821Z 0x4295f91
2024-10-11T07:39:04.095997091Z 0x4151d0a
2024-10-11T07:39:04.096000451Z 0x5f62a1f
2024-10-11T07:39:04.096003691Z 0x5f63d07
2024-10-11T07:39:04.096006851Z 0x5f63068
2024-10-11T07:39:04.096010071Z 0x5ef1017
2024-10-11T07:39:04.096013441Z 0x5ef01dc
2024-10-11T07:39:04.096016541Z 0x13deae8
2024-10-11T07:39:04.096019621Z 0x13e0530
2024-10-11T07:39:04.096022801Z 0x13dd0b9
2024-10-11T07:39:04.096026141Z /opt/scylladb/libreloc/libc.so.6+0x27b89
2024-10-11T07:39:04.096029331Z /opt/scylladb/libreloc/libc.so.6+0x27c4a
2024-10-11T07:39:04.096032511Z 0x13da4a4
2024-10-11T07:43:28.272751467Z 2024-10-11 07:43:28,271 INFO exited: scylla (terminated by SIGABRT (core dumped); not expected)
Could you check if a coredump was saved? /proc/sys/kernel/core_pattern on the gke-main-scylla-6-25fcbc5b-412m node should contain the location for coredumps.
If it's there, please upload it following this guide: https://opensource.docs.scylladb.com/stable/troubleshooting/report-scylla-problem.html#send-files-to-scylladb-support
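(A minimal sketch of that check, assuming SSH access to the GKE node, e.g. via gcloud compute ssh:)
$ cat /proc/sys/kernel/core_pattern
/core.%e.%p.%t
$ ls -lh / | grep -i core   # look for core files at the location the pattern points to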
At this point the node will crash in a loop because 3ec289d5-5910-4759-93bc-6e26ab5cda9f is no longer known, as it was replaced by c16ae0c9-33bf-4c99-8f44-d995eff274f2.
I would suggest retrying the replacement of c16ae0c9-33bf-4c99-8f44-d995eff274f2; maybe you won't hit the crash again.
To do so, remove the internal.scylla-operator.scylladb.com/replacing-node-hostid: 3ec289d5-5910-4759-93bc-6e26ab5cda9f label from Service scylla-us-west1-us-west1-b-3. The Operator will then trigger the c16ae0c9-33bf-4c99-8f44-d995eff274f2 replacement.
Alternatively, you can try removing both scylla/replace and internal.scylla-operator.scylladb.com/replacing-node-hostid from the Service and restarting the Pod. It would boot without the replace parameter and might continue streaming the rest of the data, probably quicker than repeating the entire replace procedure. Both options are sketched below.
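(A minimal sketch of both options with kubectl; a trailing "-" removes a label, and the scylla namespace and Pod name scylla-us-west1-us-west1-b-3 are assumptions based on the commands shown earlier in this thread:)
# Option 1: retry the replacement
$ kubectl -n scylla label service scylla-us-west1-us-west1-b-3 internal.scylla-operator.scylladb.com/replacing-node-hostid-

# Option 2: drop the replace parameters and restart the Pod
$ kubectl -n scylla label service scylla-us-west1-us-west1-b-3 scylla/replace- internal.scylla-operator.scylladb.com/replacing-node-hostid-
$ kubectl -n scylla delete pod scylla-us-west1-us-west1-b-3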
Decoded backtrace:
2024-10-11T07:39:04.095963631Z Backtrace:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
(inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:825
(inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:855
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:867
(inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4071
(inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/release/seastar/./seastar/src/core/reactor.cc:4047
(inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/release/seastar/./seastar/src/core/reactor.cc:4043
/data/scylla-s3-reloc.cache/by-build-id/00ad3169bb53c452cf2ab93d97785dc56117ac3e/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=9148cab1b932d44ef70e306e9c02ee38d06cad51, for GNU/Linux 3.2.0, not stripped
__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
seastar::on_fatal_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:81
service::storage_service::sync_raft_topology_nodes(seastar::lw_shared_ptr<locator::token_metadata>, std::optional<utils::tagged_uuid<locator::host_id_tag> >, std::unordered_set<utils::tagged_uuid<raft::server_id_tag>, std::hash<utils::tagged_uuid<raft::server_id_tag> >, std::equal_to<utils::tagged_uuid<raft::server_id_tag> >, std::allocator<utils::tagged_uuid<raft::server_id_tag> > >)::$_1::operator()(utils::tagged_uuid<raft::server_id_tag>, service::replica_state const&) const at ./service/storage_service.cc:?
service::storage_service::sync_raft_topology_nodes(seastar::lw_shared_ptr<locator::token_metadata>, std::optional<utils::tagged_uuid<locator::host_id_tag> >, std::unordered_set<utils::tagged_uuid<raft::server_id_tag>, std::hash<utils::tagged_uuid<raft::server_id_tag> >, std::equal_to<utils::tagged_uuid<raft::server_id_tag> >, std::allocator<utils::tagged_uuid<raft::server_id_tag> > >) at ./service/storage_service.cc:607
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<service::storage_service::nodes_to_notify_after_sync>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
(inlined by) seastar::internal::coroutine_traits_base<service::storage_service::nodes_to_notify_after_sync>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:83
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2690
(inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3152
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3320
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3210
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2211
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?
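(For reference, a hedged sketch of how such a raw backtrace can be decoded; the decoder script ships with Scylla, but the exact paths inside the container are assumptions:)
# feed the raw addresses from the log to the decoder (paths may differ per build)
$ /opt/scylladb/scripts/seastar-addr2line -e /opt/scylladb/libexec/scylla < backtrace.txt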
Could you check if a coredump was saved? /proc/sys/kernel/core_pattern on the gke-main-scylla-6-25fcbc5b-412m node should contain the location for coredumps.

The core dump apparently was not saved; I couldn't find it at that location (/core.%e.%p.%t) or in a few other standard places...
I would suggest retrying the replacement of c16ae0c9-33bf-4c99-8f44-d995eff274f2; maybe you won't hit the crash again.
To do so, remove the internal.scylla-operator.scylladb.com/replacing-node-hostid: 3ec289d5-5910-4759-93bc-6e26ab5cda9f label from Service scylla-us-west1-us-west1-b-3. The Operator will then trigger the c16ae0c9-33bf-4c99-8f44-d995eff274f2 replacement.
Alternatively, you can try removing both scylla/replace and internal.scylla-operator.scylladb.com/replacing-node-hostid from the Service and restarting the Pod. It would boot without the replace parameter and might continue streaming the rest of the data, probably quicker than repeating the entire replace procedure.
I think I already tried both of these approaches (see the issue description), but I will try again, probably starting Monday morning.
In the meantime, please report an issue in the Scylla repo; it shouldn't crash during replacement. Attach ~3h of logs from before the crash (2024-10-11T07:39:04) and the backtrace.
Alternatively, you can try removing both scylla/replace and internal.scylla-operator.scylladb.com/replacing-node-hostid from the Service and restarting the Pod. It would boot without the replace parameter and might continue streaming the rest of the data, probably quicker than repeating the entire replace procedure.
Trying that now...
It didn't continue streaming the data, @zimnx. :( The disk usage stats suggest that the disk was wiped and the node is bootstrapping the data from scratch.
The nodetool status output:
$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool status
Defaulted container "scylla" out of: scylla, scylladb-api-status-probe, scylla-manager-agent, sidecar-injection (init), sysctl-buddy (init)
Datacenter: us-west1
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.7.241.130 2.71 TB 256 ? 787555a6-89d6-4b33-941c-940415380062 us-west1-b
UN 10.7.241.174 2.91 TB 256 ? 813f49f9-e397-4d70-8300-79fa91817f11 us-west1-b
UN 10.7.241.175 3.12 TB 256 ? 5342afaf-c19c-4be2-ada1-929698a4c398 us-west1-b
UN 10.7.243.109 2.85 TB 256 ? 880977bf-7cbb-4e0f-be82-ded853da57aa us-west1-b
UN 10.7.248.124 ? 256 ? c16ae0c9-33bf-4c99-8f44-d995eff274f2 us-west1-b
UN 10.7.249.238 2.61 TB 256 ? 5cc72b36-6fcf-4790-a540-930e544d59d2 us-west1-b
UN 10.7.252.229 3.03 TB 256 ? 60daa392-6362-423d-93b2-1ff747903287 us-west1-b
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The logs from pod 3 since I restarted it: logs.txt.gz
I can see quite a lot of exceptions there:
WARN 2024-10-14 07:47:33,810 [shard 7:stmt] storage_proxy - Failed to apply mutation from 10.7.243.109#2: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 928a90eb-dacc-3aca-934e-d51eb198c063): data_dictionary::no_such_keyspace (Can't find a keyspace production)
WARN 2024-10-14 07:47:33,985 [shard 2:strm] storage_proxy - Failed to apply mutation from 10.7.241.175#28: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 40ebc17c-74b9-3f0e-bf24-10491b26a1fc): exceptions::invalid_request_exception (Unknown type production.feed_id)
INFO 2024-10-14 08:03:10,915 [shard 2:mt2c] lsa - LSA allocation failure, increasing reserve in section 0x61b008cfc620 to 2 segments; trace: 0x6469d6e 0x646a380 0x646a668 0x215468d 0x1fe969c 0x2129655 0x6341216
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memta
ble&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#2}, seastar::future<void> >(row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preempti
on_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, auto:1, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, auto:1&&, (auto:2&&)...)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::fina
lly_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#3}, false> >(seastar::future<void>::finally_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_
updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, auto:1, basic_preemption_source&)::{lambda()#1}::operator()() const::{lamb
da()#2}>(seastar::thread_attributes, auto:1&&, (auto:2&&)...)::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false> >(seastar::future<void>::finally_body<row_ca
che::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, auto:1, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#3}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_0::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}::operator()()::{lambda(auto:1)#1}, seastar::future<void>::then_wrapped_nrvo<void, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_0::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(row_cache::external_updater&, std::function<seastar::future<void> ()>&) const::{lambda(auto:1)#1}::operator(
)<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}::operator()()::{lambda(auto:1)#1}>(std::function<seastar::future<void> ()>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, row_cache::do_update(row_cache::external_updater, std::function<seastar::future<void> ()>)::$_0::operator()<row_cache::external_updater, std::function<seastar::future<void> ()> >(auto:1&, auto:2&) const::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}::operator()()::{lambda(auto:1)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--------
seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
(...)
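(Given the schema_version_loading_failed warnings above, a hedged way to check whether the joining node eventually converges on the same schema version as the rest of the cluster:)
$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool describecluster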
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a period of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After a period of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
/lifecycle stale
Context
We had a 7-node Scylla cluster in GCP, n2d-highmem-32 with 6TB local SSDs, running Scylla 6.0.3, Scylla Operator 1.13.0, Scylla Manager 3.3.3.
(It is the same cluster that was the protagonist of https://github.com/scylladb/scylla-operator/issues/2068)
Before the current issue started, we did:
...and wanted to upgrade to the latest 6.1.1 (as we were trying to fix https://github.com/scylladb/scylladb/issues/19793).
What happened
Pod 3 loses its data on 16th of Sep
On Sep 16th, 19:14 UTC, pod 3 (id dea17e3f-198a-4ab8-b246-ff29e103941a) lost its data. The local provisioner on that node logged:
The nodetool status output at that time looked like nothing was wrong:
The logs from 2024-09-16 19:10-19:34 UTC: extract-2024-09-30T14_19_57.781Z.csv.zip
But as the disk usage on the nodes was constantly growing, we assumed that the node would automatically get recreated, so we left it like that for ~2 days. Then we noticed that it was failing to start with:
Pod 3 replacement fails on 18th of Sep
...and that the nodetool status is showing this:
The old node id from the error message was nowhere to be found:
Pod 3 replacement fails again on 19th of Sep
We tried deleting the old node that had the local SSD issue, creating a new one in its place, and letting the cluster do the node replacement again, but it failed with an error similar to the one above:
Our cluster looked like this then:
Node removal fails on 24th of Sep
At this point we decided to try to remove the down node, with id 3ec289d5-5910-4759-93bc-6e26ab5cda9f, from the cluster, so we could continue our original task of upgrading Scylla to 6.1.1, planning to go back to replacing the missing node after that. However, the removenode operation also failed, as sketched below.
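(The removal was presumably issued along these lines; a minimal sketch, assuming the same exec path as the other nodetool commands in this issue:)
$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool removenode 3ec289d5-5910-4759-93bc-6e26ab5cda9f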
We couldn't find meaningful errors from before this message, so I'm attaching ~100k lines of logs from 2024-09-24 14:15-15:22 UTC from that day here: extract-2024-09-30T14_10_55.525Z.csv.zip
Node removal fails after a retry on 27th of Sep
Retrying removenode didn't work:
We tried to do a rolling restart of the cluster and retry, similarly to what we did in https://github.com/scylladb/scylla-operator/issues/2068, but that did not help this time. The error message was as before, just with a different timestamp:
Additional info
During this time we had surprising moments when our Scylla disks were filling up with snapshots, getting dangerously close to 80% of disk use. Example:
We cleared the snapshots when that happened using the nodetool clearsnapshot command, as sketched below.
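(A minimal sketch, assuming the same exec path as above; clearsnapshot without arguments removes snapshots for all keyspaces:)
$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool listsnapshots
$ kubectl exec -it sts/scylla-us-west1-us-west1-b -n scylla -- nodetool clearsnapshot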
must-gather output
scylla-operator-must-gather-w7rn9tspr85z.zip