scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/

bad_alloc in scylladb container #2148

Open · kevinlmadison opened this issue 1 month ago

kevinlmadison commented 1 month ago

What happened?

I'm brand new to Scylla and trying to install the operator, scylla-manager, and scylla using the three Helm charts. The operator seems to work fine, but the manager and scylla both fail to come all the way up because the scylladb container keeps failing. I've looked at the logs and I'm not quite sure what I'm looking at: I see a std::bad_alloc but I'm not sure what to make of it. I'll attach the Helm values and a sample of the logs below.

I've tried adjusting the resource limits and requests as well as the PVC capacity, running in both developer mode and non-developer mode, and reverting to the previous version of the container image. In every case I get the same logs.
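For reference, this is roughly how I installed the three charts, following the operator docs — a sketch of my setup, where the release names, namespaces, and values file names are mine, not anything prescribed (chart repo URL per the operator docs):

```console
$ helm repo add scylla https://scylla-operator-charts.storage.googleapis.com/stable
$ helm repo update
$ helm install scylla-operator scylla/scylla-operator --create-namespace --namespace scylla-operator
$ helm install scylla-manager scylla/scylla-manager --create-namespace --namespace scylla-manager -f manager-values.yaml
$ helm install scylla scylla/scylla --create-namespace --namespace scylla -f scylla-values.yaml
```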

INFO  2024-10-03 21:45:06,969 [shard 0:strm] gossip - failure_detector_loop: Started main loop
INFO  2024-10-03 21:45:07,026 [shard 0:strm] storage_service - entering JOINING mode
INFO  2024-10-03 21:45:07,026 [shard 0:strm] raft_topology - topology changes are using raft
INFO  2024-10-03 21:45:07,026 [shard 0:strm] raft_topology - start topology coordinator fiber
INFO  2024-10-03 21:45:07,026 [shard 0:strm] init - starting system distributed keyspace
INFO  2024-10-03 21:45:07,026 [shard 0:strm] system_distributed_keyspace - system_distributed(_everywhere) keyspaces and tables are up-to-date. Not creating
INFO  2024-10-03 21:45:07,029 [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator bfe2bd6d-c6e0-4890-81c7-d105da4ccc3e
INFO  2024-10-03 21:45:07,046 [shard 0:strm] storage_service - entering NORMAL mode
INFO  2024-10-03 21:45:07,046 [shard 0:strm] raft_group0 - finish_setup_after_join: group 0 ID present, loading server info.
INFO  2024-10-03 21:45:07,046 [shard 0:strm] raft_group0 - finish_setup_after_join: SUPPORTS_RAFT feature enabled. Starting internal upgrade-to-raft procedure.
INFO  2024-10-03 21:45:07,055 [shard 0:strm] raft_group0_upgrade - Already upgraded.
INFO  2024-10-03 21:45:07,058 [shard 0:strm] storage_service - Starting the tablet split monitor...
INFO  2024-10-03 21:45:07,058 [shard 0:main] init - starting tracing
INFO  2024-10-03 21:45:07,058 [shard 0:main] init - SSTable data integrity checker is disabled.
INFO  2024-10-03 21:45:07,062 [shard 0:main] init - starting auth service
INFO  2024-10-03 21:45:07,082 [shard 0:main] init - starting batchlog manager
INFO  2024-10-03 21:45:07,082 [shard 0:main] init - starting load meter
INFO  2024-10-03 21:45:07,082 [shard 0:main] init - starting cf cache hit rate calculator
INFO  2024-10-03 21:45:07,082 [shard 0:main] init - starting view update backlog broker
INFO  2024-10-03 21:45:07,082 [shard 0:main] init - allow replaying hints
INFO  2024-10-03 21:45:07,082 [shard 0:main] init - Launching generate_mv_updates for non system tables
INFO  2024-10-03 21:45:07,089 [shard 0:comp] compaction - [Compact system_schema.columns c1105b30-81d0-11ef-8abf-da41d75a1898] Compacted 2 sstables to [/var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/me-3gk2_1of6_2fnv423yjtdd9k56l4-big-Data.db:level=0]. 24kB to 19kB (~78% of original) in 557ms = 44kB/s. ~256 total partitions merged to 5.
INFO  2024-10-03 21:45:07,110 [shard 0:comp] compaction - [Compact system.truncated c17c3c60-81d0-11ef-8abf-da41d75a1898] Compacting [/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gk2_1of4_1espc23yjtdd9k56l4-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gk2_1of4_0td3k23yjtdd9k56l4-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-1-big-Data.db:level=0:origin=memtable]
INFO  2024-10-03 21:45:07,141 [shard 0:stmt] cql_server_controller - Enabling encrypted CQL connections between client and server
INFO  2024-10-03 21:45:07,141 [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted, non-shard-aware)
INFO  2024-10-03 21:45:07,141 [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 0.0.0.0:19042 (unencrypted, shard-aware)
INFO  2024-10-03 21:45:07,157 [shard 0:main] init - Shutting down local storage
INFO  2024-10-03 21:45:07,158 [shard 0:main] storage_service - Stop transport: starts
INFO  2024-10-03 21:45:07,158 [shard 0:main] migration_manager - stopping migration service
INFO  2024-10-03 21:45:07,158 [shard 0:main] storage_service - Shutting down native transport server
INFO  2024-10-03 21:45:07,158 [shard 0:main] storage_service - Shutting down native transport server was successful
INFO  2024-10-03 21:45:07,158 [shard 0:main] storage_service - Stop transport: shutdown rpc and cql server done
INFO  2024-10-03 21:45:07,158 [shard 0:main] gossip - My status = NORMAL
INFO  2024-10-03 21:45:07,158 [shard 0:main] gossip - Announcing shutdown
INFO  2024-10-03 21:45:07,174 [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC manager-dc that has 1 endpoints
INFO  2024-10-03 21:45:07,181 [shard 0: gms] load_balancer - Examining DC manager-dc (shuffle=false, balancing=true)
INFO  2024-10-03 21:45:07,182 [shard 0: gms] load_balancer - Node bfe2bd6d-c6e0-4890-81c7-d105da4ccc3e: rack=manager-rack avg_load=0, tablets=0, shards=1, state=normal
INFO  2024-10-03 21:45:07,182 [shard 0: gms] load_balancer - Prepared 0 migrations in DC manager-dc
INFO  2024-10-03 21:45:07,182 [shard 0: gms] load_balancer - Prepared 0 migration plans, out of which there were 0 tablet migration(s) and 0 resize decision(s)
INFO  2024-10-03 21:45:07,231 [shard 0: gms] load_balancer - Examining DC manager-dc (shuffle=false, balancing=true)
INFO  2024-10-03 21:45:07,231 [shard 0: gms] load_balancer - Node bfe2bd6d-c6e0-4890-81c7-d105da4ccc3e: rack=manager-rack avg_load=0, tablets=0, shards=1, state=normal
INFO  2024-10-03 21:45:07,231 [shard 0: gms] load_balancer - Prepared 0 migrations in DC manager-dc
INFO  2024-10-03 21:45:07,231 [shard 0: gms] load_balancer - Prepared 0 migration plans, out of which there were 0 tablet migration(s) and 0 resize decision(s)
INFO  2024-10-03 21:45:07,351 [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gk2_1of7_0vpz423yjtdd9k56l4-big-Filter.db: resizing bitset from 488 bytes to 16 bytes. sstable origin: compaction
INFO  2024-10-03 21:45:07,732 [shard 0:comp] compaction - [Compact system.truncated c17c3c60-81d0-11ef-8abf-da41d75a1898] Compacted 3 sstables to [/var/lib/scylla/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gk2_1of7_0vpz423yjtdd9k56l4-big-Data.db:level=0]. 17kB to 5863 bytes (~34% of original) in 464ms = 36kB/s. ~384 total partitions merged to 7.
INFO  2024-10-03 21:45:07,735 [shard 0:comp] compaction - [Compact system_schema.tables c1db9a70-81d0-11ef-8abf-da41d75a1898] Compacting [/var/lib/scylla/data/system_schema/tables-afddfb9dbc1e30688056eed6c302ba09/me-3gk2_1of7_1dieo23yjtdd9k56l4-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system_schema/tables-afddfb9dbc1e30688056eed6c302ba09/me-3gk2_1oes_5pbz42dopo3jg29oml-big-Data.db:level=0:origin=compaction]
WARN  2024-10-03 21:45:07,970 [shard 0: gms] gossip - === Gossip round FAIL: seastar::gate_closed_exception (gate closed)
INFO  2024-10-03 21:45:07,970 [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system_schema/tables-afddfb9dbc1e30688056eed6c302ba09/me-3gk2_1of7_4fgqo23yjtdd9k56l4-big-Filter.db: resizing bitset from 328 bytes to 16 bytes. sstable origin: compaction
INFO  2024-10-03 21:45:08,199 [shard 0:comp] compaction - [Compact system_schema.tables c1db9a70-81d0-11ef-8abf-da41d75a1898] Compacted 2 sstables to [/var/lib/scylla/data/system_schema/tables-afddfb9dbc1e30688056eed6c302ba09/me-3gk2_1of7_4fgqo23yjtdd9k56l4-big-Data.db:level=0]. 26kB to 14kB (~53% of original) in 392ms = 68kB/s. ~256 total partitions merged to 5.
INFO  2024-10-03 21:45:08,200 [shard 0:comp] compaction - [Compact system.scylla_local c2228e80-81d0-11ef-8abf-da41d75a1898] Compacting [/var/lib/scylla/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gk2_1of7_2lnww23yjtdd9k56l4-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gk2_1oew_3e65c2dopo3jg29oml-big-Data.db:level=0:origin=memtable]
INFO  2024-10-03 21:45:08,358 [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gk2_1of8_185i823yjtdd9k56l4-big-Filter.db: resizing bitset from 328 bytes to 16 bytes. sstable origin: compaction
INFO  2024-10-03 21:45:08,609 [shard 0:comp] compaction - [Compact system.scylla_local c2228e80-81d0-11ef-8abf-da41d75a1898] Compacted 2 sstables to [/var/lib/scylla/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gk2_1of8_185i823yjtdd9k56l4-big-Data.db:level=0]. 12kB to 6522 bytes (~51% of original) in 323ms = 39kB/s. ~256 total partitions merged to 6.
INFO  2024-10-03 21:45:08,613 [shard 0:comp] compaction - [Compact system.local c2616c40-81d0-11ef-8abf-da41d75a1898] Compacting [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/me-1-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/me-3gk2_1oew_22dgw2dopo3jg29oml-big-Data.db:level=0:origin=memtable]
INFO  2024-10-03 21:45:08,798 [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/me-3gk2_1of8_3ovy823yjtdd9k56l4-big-Filter.db: resizing bitset from 328 bytes to 8 bytes. sstable origin: compaction
WARN  2024-10-03 21:45:08,977 [shard 0: gms] gossip - === Gossip round FAIL: seastar::gate_closed_exception (gate closed)
INFO  2024-10-03 21:45:09,036 [shard 0:comp] compaction - [Compact system.local c2616c40-81d0-11ef-8abf-da41d75a1898] Compacted 2 sstables to [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/me-3gk2_1of8_3ovy823yjtdd9k56l4-big-Data.db:level=0]. 18kB to 12kB (~66% of original) in 349ms = 53kB/s. ~256 total partitions merged to 1.
INFO  2024-10-03 21:45:09,036 [shard 0:comp] compaction - [Compact system.group0_history c2a21ec0-81d0-11ef-8abf-da41d75a1898] Compacting [/var/lib/scylla/data/system/group0_history-027e42f5683a3ed7b404a0100762063c/me-3gk2_1of7_25sxc23yjtdd9k56l4-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/group0_history-027e42f5683a3ed7b404a0100762063c/me-3gk2_1oet_5owjk2dopo3jg29oml-big-Data.db:level=0:origin=compaction]
INFO  2024-10-03 21:45:09,163 [shard 0:main] gossip - Disable and wait for gossip loop started
INFO  2024-10-03 21:45:09,256 [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system/group0_history-027e42f5683a3ed7b404a0100762063c/me-3gk2_1of9_0a2nk23yjtdd9k56l4-big-Filter.db: resizing bitset from 328 bytes to 8 bytes. sstable origin: compaction
INFO  2024-10-03 21:45:09,531 [shard 0:comp] compaction - [Compact system.group0_history c2a21ec0-81d0-11ef-8abf-da41d75a1898] Compacted 2 sstables to [/var/lib/scylla/data/system/group0_history-027e42f5683a3ed7b404a0100762063c/me-3gk2_1of9_0a2nk23yjtdd9k56l4-big-Data.db:level=0]. 11kB to 6343 bytes (~53% of original) in 412ms = 28kB/s. ~256 total partitions merged to 1.
INFO  2024-10-03 21:45:09,535 [shard 0:comp] compaction - [Compact system.raft c2ee42f0-81d0-11ef-8abf-da41d75a1898] Compacting [/var/lib/scylla/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gk2_1oew_3squ82dopo3jg29oml-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gk2_1of7_3aiz423yjtdd9k56l4-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gk2_1oet_2sits2dopo3jg29oml-big-Data.db:level=0:origin=compaction]
INFO  2024-10-03 21:45:09,751 [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gk2_1of9_37yds23yjtdd9k56l4-big-Filter.db: resizing bitset from 488 bytes to 8 bytes. sstable origin: compaction
INFO  2024-10-03 21:45:09,975 [shard 0:main] gossip - Gossip is now stopped
INFO  2024-10-03 21:45:09,975 [shard 0:main] storage_service - Stop transport: stop_gossiping done
INFO  2024-10-03 21:45:09,975 [shard 0:main] messaging_service - Shutting down nontls server
INFO  2024-10-03 21:45:09,975 [shard 0:main] messaging_service - Shutting down tls server
INFO  2024-10-03 21:45:09,975 [shard 0:main] messaging_service - Shutting down tls server - Done
INFO  2024-10-03 21:45:09,975 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,975 [shard 0:main] messaging_service - Stopping client for address: 10.99.155.32:0
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Shutting down nontls server - Done
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopping client for address: 10.99.155.32:0 - Done
INFO  2024-10-03 21:45:09,976 [shard 0:main] messaging_service - Stopped clients
INFO  2024-10-03 21:45:09,976 [shard 0:main] storage_service - Stop transport: shutdown messaging_service done
INFO  2024-10-03 21:45:09,976 [shard 0:main] storage_service - Stop transport: shutdown stream_manager done
INFO  2024-10-03 21:45:09,976 [shard 0:main] storage_service - Stop transport: done
INFO  2024-10-03 21:45:09,976 [shard 0:main] tracing - Asked to shut down
INFO  2024-10-03 21:45:09,976 [shard 0:main] tracing - Tracing is down
INFO  2024-10-03 21:45:09,976 [shard 0:main] batchlog_manager - Asked to drain
INFO  2024-10-03 21:45:09,976 [shard 0:main] batchlog_manager - Drained
INFO  2024-10-03 21:45:09,976 [shard 0:main] view - Draining view builder
INFO  2024-10-03 21:45:09,976 [shard 0:main] compaction_manager - Asked to drain
INFO  2024-10-03 21:45:09,976 [shard 0:main] compaction_manager - Stopping 1 tasks for 1 ongoing compactions due to drain
INFO  2024-10-03 21:45:10,030 [shard 0:comp] compaction - [Compact system.raft c2ee42f0-81d0-11ef-8abf-da41d75a1898] Compacted 3 sstables to [/var/lib/scylla/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gk2_1of9_37yds23yjtdd9k56l4-big-Data.db:level=0]. 97kB to 85kB (~88% of original) in 394ms = 246kB/s. ~384 total partitions merged to 1.
INFO  2024-10-03 21:45:10,034 [shard 0:main] compaction_manager - Drained
INFO  2024-10-03 21:45:10,034 [shard 0:main] database - Flushing non-system tables
INFO  2024-10-03 21:45:10,040 [shard 0:main] database - Flushed non-system tables
INFO  2024-10-03 21:45:10,040 [shard 0:main] database - Flushing system tables
INFO  2024-10-03 21:45:11,966 [shard 0:main] database - Flushed system tables
INFO  2024-10-03 21:45:12,845 [shard 0:main] init - Shutting down local storage was successful
INFO  2024-10-03 21:45:12,845 [shard 0:main] init - Shutting down view builder API
INFO  2024-10-03 21:45:12,845 [shard 0:main] init - Shutting down view builder API was successful
INFO  2024-10-03 21:45:12,845 [shard 0:main] init - Shutting down hinted handoff API
INFO  2024-10-03 21:45:12,845 [shard 0:main] init - Shutting down hinted handoff API was successful
INFO  2024-10-03 21:45:12,845 [shard 0:main] init - Shutting down view update backlog broker
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down view update backlog broker was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down cf cache hit rate calculator
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down cf cache hit rate calculator was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down load meter API
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down load meter API was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down load meter
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down load meter was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down batchlog manager
INFO  2024-10-03 21:45:12,846 [shard 0:main] batchlog_manager - Asked to stop
INFO  2024-10-03 21:45:12,846 [shard 0:main] batchlog_manager - Stopped
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down batchlog manager was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down service level controller subscription
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down service level controller subscription was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down authorization cache api
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down authorization cache api was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down auth service
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down auth service was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down tracing
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down tracing was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down group 0 usage in local storage
INFO  2024-10-03 21:45:12,846 [shard 0:strm] raft_topology - raft_state_monitor_fiber aborted with seastar::abort_requested_exception (abort requested)
INFO  2024-10-03 21:45:12,846 [shard 0:strm] raft_topology - cleanup fiber aborted
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down group 0 usage in local storage was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down storage service uninit address map
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down storage service uninit address map was successful
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down group 0 service
INFO  2024-10-03 21:45:12,846 [shard 0:main] init - Shutting down group 0 service was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down system distributed keyspace was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down Raft API
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down Raft API was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down sstables format listener
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down sstables format listener was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down storage service API
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down storage service API was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down query processor remote part
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down query processor remote part was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down storage_service
INFO  2024-10-03 21:45:12,848 [shard 0:main] storage_service - Stopped node_ops_abort_thread
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down storage_service was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down topology_state_machine
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down topology_state_machine was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down Raft
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down Raft was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down migration manager
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down migration manager was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down mapreduce service
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down mapreduce service was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down token metadata API
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down token metadata API was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down tablet allocator
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down tablet allocator was successful
INFO  2024-10-03 21:45:12,848 [shard 0:main] init - Shutting down direct_failure_detector
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down direct_failure_detector was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down fd_pinger
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down fd_pinger was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down raft_address_map
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down raft_address_map was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down gossiper API
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down gossiper API was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down gossiper
INFO  2024-10-03 21:45:12,849 [shard 0:main] gossip - gossip is already stopped
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down gossiper was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down messaging service API
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down messaging service API was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down messaging service
INFO  2024-10-03 21:45:12,849 [shard 0:main] messaging_service - Stopping nontls server
INFO  2024-10-03 21:45:12,849 [shard 0:main] messaging_service - Stopping nontls server - Done
INFO  2024-10-03 21:45:12,849 [shard 0:main] messaging_service - Stopping tls server
INFO  2024-10-03 21:45:12,849 [shard 0:main] messaging_service - Stopping tls server - Done
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down messaging service was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down service level controller
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down service level controller was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down config API
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down config API was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down tracing instance
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down tracing instance was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down lifecycle notifier
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down lifecycle notifier was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down storage proxy API
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down storage proxy API was successful
INFO  2024-10-03 21:45:12,849 [shard 0:main] init - Shutting down database
INFO  2024-10-03 21:45:12,849 [shard 0:main] compaction_manager - Asked to drain
INFO  2024-10-03 21:45:12,849 [shard 0:main] compaction_manager - Drained
INFO  2024-10-03 21:45:12,853 [shard 0:main] large_data - Waiting for 0 background handlers
INFO  2024-10-03 21:45:12,854 [shard 0:main] database - Shutting down commitlog
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Shutting down commitlog complete
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Shutting down schema commitlog
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Shutting down schema commitlog complete
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Shutting down system dirty memory manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Shutting down dirty memory manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Shutting down memtable controller
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Closing user sstables manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Closing system sstables manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Stopping querier cache
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Stopping concurrency semaphores
INFO  2024-10-03 21:45:12,855 [shard 0:main] database - Joining memtable update action
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down database was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down lang manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down lang manager was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down sstables storage manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down sstables storage manager was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down compaction_manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] compaction_manager - Asked to stop
INFO  2024-10-03 21:45:12,855 [shard 0:main] compaction_manager - Stopped
INFO  2024-10-03 21:45:12,855 [shard 0:main] task_manager - Stopping module compaction
INFO  2024-10-03 21:45:12,855 [shard 0:main] task_manager - Unregistered module compaction
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down compaction_manager was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down task manager API
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down task manager API was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down task_manager
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down task_manager was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down service_memory_limiter
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down service_memory_limiter was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down sst_dir_semaphore
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down sst_dir_semaphore was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down migration manager notifier
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down migration manager notifier was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down prometheus API server
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down prometheus API server was successful
INFO  2024-10-03 21:45:12,855 [shard 0:main] init - Shutting down API server
INFO  2024-10-03 21:45:12,856 [shard 0:main] init - Shutting down API server was successful
INFO  2024-10-03 21:45:12,856 [shard 0:main] init - Shutting down sighup
INFO  2024-10-03 21:45:12,856 [shard 0:main] init - Shutting down sighup was successful
INFO  2024-10-03 21:45:12,856 [shard 0:main] init - Shutting down configurables
INFO  2024-10-03 21:45:12,856 [shard 0:main] init - Shutting down configurables was successful
ERROR 2024-10-03 21:45:12,856 [shard 0:main] init - Startup failed: std::bad_alloc (std::bad_alloc)
2024-10-03 21:45:12,898 WARN exited: scylla (exit status 1; not expected)
2024-10-03 21:45:13,904 INFO spawned: 'scylla' with pid 130
Error in GnuTLS initialization: Error while performing self checks.
Scylla version 6.1.1-0.20240814.8d90b817660a with build-id e5b6b7749483ec025db9402adc444c3d31a9ab8f starting ...
command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --network-stack posix --developer-mode=1 --smp 1 --overprovisioned --listen-address 0.0.0.0 --rpc-address 0.0.0.0 --seed-provider-parameters seeds=10.99.155.32 --broadcast-address 10.99.155.32 --broadcast-rpc-address 10.99.155.32 --alternator-address 0.0.0.0 --blocked-reactor-notify-ms 999999999 --prometheus-address=0.0.0.0"
pid: 130
parsed command line options: [log-to-syslog, (positional) 0, log-to-stdout, (positional) 1, network-stack, (positional) posix, developer-mode: 1, smp, (positional) 1, overprovisioned, listen-address: 0.0.0.0, rpc-address: 0.0.0.0, seed-provider-parameters: seeds=10.99.155.32, broadcast-address: 10.99.155.32, broadcast-rpc-address: 10.99.155.32, alternator-address: 0.0.0.0, blocked-reactor-notify-ms, (positional) 999999999, prometheus-address: 0.0.0.0]
INFO  2024-10-03 21:45:14,439 seastar - Reactor backend: epoll
INFO  2024-10-03 21:45:14,440 seastar - Perf-based stall detector creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.
WARN  2024-10-03 21:45:14,441 seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
INFO  2024-10-03 21:45:14,442 [shard 0:main] seastar - updated: blocked-reactor-notify-ms=36000000000
INFO  2024-10-03 21:45:14,446 [shard 0:main] init - installing SIGHUP handler
INFO  2024-10-03 21:45:14,450 [shard 0:main] init - Scylla version 6.1.1-0.20240814.8d90b817660a with build-id e5b6b7749483ec025db9402adc444c3d31a9ab8f starting ...

WARN  2024-10-03 21:45:14,450 [shard 0:main] init - Only 476 MiB per shard; this is below the recommended minimum of 1 GiB/shard; continuing since running in developer mode
WARN  2024-10-03 21:45:14,450 [shard 0:main] init - I/O Scheduler is not properly configured! This is a non-supported setup, and performance is expected to be unpredictably bad.
 Reason found: none of --io-properties and --io-properties-file are set.
To properly configure the I/O Scheduler, run the scylla_io_setup utility shipped with Scylla.

INFO  2024-10-03 21:45:14,451 [shard 0:main] init - starting API server
INFO  2024-10-03 21:45:14,452 [shard 0:main] init - starting prometheus API server
INFO  2024-10-03 21:45:14,452 [shard 0:main] init - creating snitch
WARN  2024-10-03 21:45:14,452 [shard 0:main] snitch_logger - Not gossiping INADDR_ANY as internal IP
INFO  2024-10-03 21:45:14,454 [shard 0:main] init - starting tokens manager
INFO  2024-10-03 21:45:14,454 [shard 0:main] init - starting effective_replication_map factory
INFO  2024-10-03 21:45:14,454 [shard 0:main] init - starting migration manager notifier
INFO  2024-10-03 21:45:14,454 [shard 0:main] init - starting per-shard database core
INFO  2024-10-03 21:45:14,454 [shard 0:main] init - creating and verifying directories
INFO  2024-10-03 21:45:14,853 [shard 0:main] init - starting compaction_manager
INFO  2024-10-03 21:45:14,853 [shard 0:main] task_manager - Registered module compaction
INFO  2024-10-03 21:45:14,854 [shard 0:main] compaction_manager - Set unlimited compaction bandwidth
INFO  2024-10-03 21:45:14,854 [shard 0:main] init - starting database
INFO  2024-10-03 21:45:14,857 [shard 0:main] seastar - updated: blocked-reactor-notify-ms=999999999
INFO  2024-10-03 21:45:14,857 [shard 0:main] init - starting storage proxy
INFO  2024-10-03 21:45:14,866 [shard 0:main] init - starting query processor
INFO  2024-10-03 21:45:14,867 [shard 0:main] init - starting lifecycle notifier
INFO  2024-10-03 21:45:14,867 [shard 0:main] init - creating tracing
INFO  2024-10-03 21:45:14,867 [shard 0:main] init - Scylla API server listening on 127.0.0.1:10000 ...
INFO  2024-10-03 21:45:14,867 [shard 0:main] init - starting system keyspace
INFO  2024-10-03 21:45:14,867 [shard 0:main] init - loading system sstables
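One detail from the restart above that may matter: the node reports only 476 MiB per shard, below the recommended 1 GiB/shard, right before startup ultimately fails with std::bad_alloc. When I say I adjusted the resource limits and requests, I mean changes of this shape in the rack spec — a sketch only; the 2Gi figure is illustrative, not a documented minimum:

```yaml
scylla:
  racks:
    - name: manager-rack
      members: 1
      resources:
        limits:
          cpu: 1
          memory: 2Gi   # illustrative bump above the 1 GiB/shard minimum
        requests:
          cpu: 1
          memory: 2Gi
```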

What did you expect to happen?

I expected all the containers to be running in the pod.

How can we reproduce it (as minimally and precisely as possible)?

Scylla Manager chart values (applied as in the sketch after this block):

# Allows to override Scylla Manager name showing up in recommended k8s labels
nameOverride: ""
# Allows to override names used in Scylla Manager k8s objects.
fullnameOverride: ""
# Allows to customize Scylla Manager image
image:
  repository: scylladb
  pullPolicy: IfNotPresent
  tag: 3.3.3@sha256:b7b342bf0a8bd1e2374b733a3d40e43504e75ef1b9c21fe85c21e08bd08d47e0
  # tag: 3.3.0@sha256:e8c5b62c9330f91dfca24f109b033df78113d3ffaac306edf6d3c4346e1fa0bf
# Allows to customize Scylla Manager Controller image
controllerImage:
  repository: scylladb
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""
# Scylla Manager log level, allowed values are: error, warn, info, debug, trace
logLevel: debug
# Resources allocated to Scylla Manager pods
resources:
  requests:
    cpu: 10m
    memory: 1000Mi
# Resources allocated to Scylla Manager Controller pods
controllerResources:
  requests:
    cpu: 10m
    memory: 1000Mi
# Node selector for Scylla Manager pod
nodeSelector: {}
# Tolerations for Scylla Manager pod
tolerations: []
# Affinity for Scylla Manager pod
affinity: {}
## SecurityContext holds pod-level security attributes
securityContext: {}
# Node selector for Scylla Manager Controller pod
controllerNodeSelector: {}
# Tolerations for Scylla Manager Controller pod
controllerTolerations: []
# Affinity for Scylla Manager Controller pod
controllerAffinity: {}
## ControllerSecurityContext holds pod-level security attributes
controllerSecurityContext: {}
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""
controllerServiceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""
scylla:
  developerMode: true
  scyllaImage:
    tag: 6.1.1
  agentImage:
    tag: 3.3.3@sha256:40e31739e8fb1d48af87abaeaa8ee29f71607964daa8434fe2526dfc6f665920
    # tag: 3.3.0@sha256:dc2684f51e961d88da5a8eac2d9f165cb7a960cbf91f67f49332e7224317c96b
  datacenter: manager-dc
  racks:
    - name: manager-rack
      members: 1
      storage:
        capacity: 15Gi
        storageClassName: ceph-block
      resources:
        limits:
          cpu: 1
          memory: 1000Mi
        requests:
          cpu: 1
          memory: 1000Mi
# Whether to create Prometheus ServiceMonitor
serviceMonitor:
  create: true
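These values go to the scylla-manager chart via -f, as in the install sketch above. To watch the failing container I use commands like these — the pod name is a placeholder, and --previous pulls the logs of the crashed scylladb container:

```console
$ kubectl -n scylla-manager get pods -w
$ kubectl -n scylla-manager logs <scylladb-pod> -c scylla --previous
```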

Scylla Operator version

Scylla Operator chart values:

# Operator replicas use leader election. Setting to 1 will disable pdb creation
# and won't be HA; creations or updates of Scylla CRs will fail during operator
# upgrades or disruptions
replicas: 2

# Allows to customize Scylla Operator image
image:
  repository: scylladb
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""

# Scylla Operator log level, 0-9 (higher number means more detailed logs)
logLevel: 9
# Resources allocated to Scylla Operator pods
resources:
  requests:
    cpu: 100m
    memory: 20Mi
# Node selector for Scylla Operator pods
nodeSelector: {}

# Tolerations for Scylla Operator pods
tolerations: []

# Affinity for Scylla Operator pods
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: scylla-operator
              app.kubernetes.io/instance: scylla-operator
          topologyKey: kubernetes.io/hostname

webhook:
  # Specifies whether a self signed certificate should be created using cert-manager
  createSelfSignedCertificate: true
  # Name of a secret containing custom certificate
  # If not set and createSelfSignedCertificate is true, a name is generated using fullname
  certificateSecretName: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

## SecurityContext holds pod-level security attributes and common container settings.
securityContext: {}

# Replicas for Webhook Server. Setting to 1 will disable pdb creation and
# won't be HA; it won't react during operator upgrades or disruptions.
webhookServerReplicas: 2

# Resources allocated to Webhook Server pods
webhookServerResources:
  requests:
    cpu: 10m
    memory: 20Mi

# Node selector for Webhook Server pods
webhookServerNodeSelector: {}

# Tolerations for Webhook Server pods
webhookServerTolerations: []

# Affinity for Webhook Server pods
webhookServerAffinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: webhook-server
              app.kubernetes.io/instance: webhook-server
          topologyKey: kubernetes.io/hostname

Kubernetes platform name and version

```console
$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3+rke2r1
```

Kubernetes platform info:

Please attach the must-gather archive.

scylla-operator-must-gather.log

Anything else we need to know?

No response

tnozicka commented 1 month ago

> Please attach the must-gather archive.
>
> TODO

The issue template that asked for the must-gather explicitly states that it's required, as does https://operator.docs.scylladb.com/v1.14/support/overview.html#gather-data-about-your-cluster.

Please supply the must-gather archive if you want someone to look at the issue.
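For reference, the linked page comes down to running the operator's must-gather subcommand and attaching the entire directory it produces, not just the logs — a sketch only; the exact invocation and output path are in the docs:

```console
$ scylla-operator must-gather
$ tar -czf scylla-operator-must-gather.tar.gz <must-gather-output-directory>
```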

kevinlmadison commented 1 month ago

@tnozicka Thank you, and sorry for the delay! I've attached the archive in the original post.

rzetelskik commented 3 weeks ago

@kevinlmadison you only attached the logs from the must-gather command; what we need is the actual archive it generated.