scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

[DTEST][DEBUG]: Rebootstrap gets stuck with tablets enabled in test bootstrap_test.TestBootstrap.test_cluster_become_unavailable_when_kill_node_during_bootstrap #20218

Open aleksbykov opened 3 months ago

aleksbykov commented 3 months ago

Scylla version 6.0.3-0.20240808.a56f7ce21ad4 with build-id 66a8b676ee9e21374a3b46538c930879e592260f

Dtests that failed:

It looks like the node that was rebootstrapped gets stuck after:

INFO  2024-08-11 17:07:47,912 [shard 0:main] raft_group0 - Disabling migration_manager schema pulls because Raft is enabled and we're bootstrapping.
INFO  2024-08-11 17:07:47,912 [shard 0:strm] messaging_service - Starting Messaging Service on address 127.0.77.4 port 7000
INFO  2024-08-11 17:07:47,913 [shard 0:strm] storage_service - entering STARTING mode
INFO  2024-08-11 17:07:47,913 [shard 0:strm] storage_service - Loading persisted ring state
INFO  2024-08-11 17:07:47,916 [shard 1:comp] compaction - [Compact system.truncated 3bcb7e00-5804-11ef-92bd-dbeb4664306b] Compacted 3 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gil_1bkz_4ybr428c2cr2qgq4bv-big-Data.db:level=0]. 222kB to 74kB (~33% of original) in 88ms = 2MB/s. ~384 total partitions merged to 6.
INFO  2024-08-11 17:07:47,920 [shard 1:comp] compaction - [Compact system_distributed.cdc_streams_descriptions_v2 3bdc6df0-5804-11ef-92bd-dbeb4664306b] Compacting [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system_distributed/cdc_streams_descriptions_v2-0bf73fd765b236b085e5658131d5df36/me-3gil_1bjl_2viuo2lcx96ebda9wi-big-Data.db:level=0:origin=repair,/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system_distributed/cdc_streams_descriptions_v2-0bf73fd765b236b085e5658131d5df36/me-3gil_1bjt_57jjk2lcx96ebda9wi-big-Data.db:level=0:origin=memtable]
INFO  2024-08-11 17:07:47,970 [shard 0:strm] storage_service - initial_contact_nodes={127.0.77.1, 127.0.77.3, 127.0.77.2}, loaded_endpoints=[0230a3fb-f0f0-4abf-b597-b485a0c29bb2, 44c69c66-2ea2-4642-b17b-5288d2f91adc, d6ef4a07-e03d-47fa-8286-b38a431da945], loaded_peer_features=3
INFO  2024-08-11 17:07:47,970 [shard 0:strm] storage_service - peer=127.0.77.2, supported_features=AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,GROUP0_SCHEMA_VERSIONING,HINTED_HANDOFF_SEPARATE_CONNECTION,HOST_ID_BASED_HINTED_HANDOFF,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,RANGE_TOMBSTONE_AND_DEAD_ROWS_DETECTION,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLETS,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
INFO  2024-08-11 17:07:47,970 [shard 0:strm] storage_service - peer=127.0.77.3, supported_features=AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,GROUP0_SCHEMA_VERSIONING,HINTED_HANDOFF_SEPARATE_CONNECTION,HOST_ID_BASED_HINTED_HANDOFF,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,RANGE_TOMBSTONE_AND_DEAD_ROWS_DETECTION,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLETS,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
INFO  2024-08-11 17:07:47,970 [shard 0:strm] storage_service - peer=127.0.77.1, supported_features=AGGREGATE_STORAGE_OPTIONS,ALTERNATOR_TTL,CDC,CDC_GENERATIONS_V2,COLLECTION_INDEXING,COMPUTED_COLUMNS,CORRECT_COUNTER_ORDER,CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX,CORRECT_NON_COMPOUND_RANGE_TOMBSTONES,CORRECT_STATIC_COMPACT_IN_MC,COUNTERS,DIGEST_FOR_NULL_VALUES,DIGEST_INSENSITIVE_TO_EXPIRY,DIGEST_MULTIPARTITION_READ,EMPTY_REPLICA_MUTATION_PAGES,EMPTY_REPLICA_PAGES,GROUP0_SCHEMA_VERSIONING,HINTED_HANDOFF_SEPARATE_CONNECTION,HOST_ID_BASED_HINTED_HANDOFF,INDEXES,LARGE_COLLECTION_DETECTION,LARGE_PARTITIONS,LA_SSTABLE_FORMAT,LWT,MATERIALIZED_VIEWS,MC_SSTABLE_FORMAT,MD_SSTABLE_FORMAT,ME_SSTABLE_FORMAT,NONFROZEN_UDTS,PARALLELIZED_AGGREGATION,PER_TABLE_CACHING,PER_TABLE_PARTITIONERS,RANGE_SCAN_DATA_VARIANT,RANGE_TOMBSTONES,RANGE_TOMBSTONE_AND_DEAD_ROWS_DETECTION,ROLES,ROW_LEVEL_REPAIR,SCHEMA_COMMITLOG,SCHEMA_TABLES_V3,SECONDARY_INDEXES_ON_STATIC_COLUMNS,SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT,STREAM_WITH_RPC_STREAM,SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES,SUPPORTS_RAFT_CLUSTER_MANAGEMENT,TABLETS,TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,TOMBSTONE_GC_OPTIONS,TRUNCATION_TABLE,TYPED_ERRORS_IN_READ_RPC,UDA,UDA_NATIVE_PARALLELIZED_AGGREGATION,UNBOUNDED_RANGE_TOMBSTONES,UUID_SSTABLE_IDENTIFIERS,VIEW_VIRTUAL_COLUMNS,WRITE_FAILURE_REPLY,XXHASH
INFO  2024-08-11 17:07:48,078 [shard 1:comp] compaction - [Compact system_distributed.cdc_streams_descriptions_v2 3bdc6df0-5804-11ef-92bd-dbeb4664306b] Compacted 2 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system_distributed/cdc_streams_descriptions_v2-0bf73fd765b236b085e5658131d5df36/me-3gil_1bkz_5hm7428c2cr2qgq4bv-big-Data.db:level=0]. 74kB to 107kB (~144% of original) in 125ms = 593kB/s. ~256 total partitions merged to 2.
INFO  2024-08-11 17:07:48,256 [shard 0:comp] compaction - [Compact system.cdc_generations_v3 3bd84f40-5804-11ef-baf5-dbec4664306b] Compacted 2 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/cdc_generations_v3-c697df3f55393ef5b003843be7ba1422/me-3gil_1bkz_5drb42ucpvjf1okqqz-big-Data.db:level=0]. 109kB to 82kB (~75% of original) in 342ms = 320kB/s. ~256 total partitions merged to 1.
INFO  2024-08-11 17:07:48,258 [shard 0:comp] compaction - [Compact system.compaction_history 3c102820-5804-11ef-baf5-dbec4664306b] Compacting [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/me-3gil_1bjt_5losw2lr0fx7773a8y-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/me-3gil_1bhn_3dj002lr0fx7773a8y-big-Data.db:level=0:origin=compaction]
INFO  2024-08-11 17:07:48,389 [shard 0:comp] compaction - [Compact system.compaction_history 3c102820-5804-11ef-baf5-dbec4664306b] Compacted 2 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/me-3gil_1bl0_1jq682ucpvjf1okqqz-big-Data.db:level=0]. 109kB to 74kB (~67% of original) in 126ms = 871kB/s. ~256 total partitions merged to 8.
INFO  2024-08-11 17:07:48,390 [shard 0:comp] compaction - [Compact system.scylla_local 3c244c60-5804-11ef-baf5-dbec4664306b] Compacting [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-2-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gil_1bjt_5g46o2lr0fx7773a8y-big-Data.db:level=0:origin=memtable]
INFO  2024-08-11 17:07:48,561 [shard 0:comp] compaction - [Compact system.scylla_local 3c244c60-5804-11ef-baf5-dbec4664306b] Compacted 2 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gil_1bl0_2c0ow2ucpvjf1okqqz-big-Data.db:level=0]. 11kB to 6834 bytes (~57% of original) in 120ms = 98kB/s. ~256 total partitions merged to 6.
INFO  2024-08-11 17:07:48,563 [shard 0:comp] compaction - [Compact system.peers 3c3eb230-5804-11ef-baf5-dbec4664306b] Compacting [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-3gil_1bjt_57bts2lr0fx7773a8y-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-3gil_1bhm_0xv4w2lr0fx7773a8y-big-Data.db:level=0:origin=compaction]
INFO  2024-08-11 17:07:48,661 [shard 0:comp] compaction - [Compact system.peers 3c3eb230-5804-11ef-baf5-dbec4664306b] Compacted 2 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-3gil_1bl0_3dba82ucpvjf1okqqz-big-Data.db:level=0]. 29kB to 24kB (~82% of original) in 48ms = 619kB/s. ~256 total partitions merged to 3.
INFO  2024-08-11 17:07:48,663 [shard 0:comp] compaction - [Compact system.truncated 3c4dcd60-5804-11ef-baf5-dbec4664306b] Compacting [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gil_1bky_4ugv42ucpvjf1okqqz-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gil_1bky_4tea82ucpvjf1okqqz-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-6-big-Data.db:level=0:origin=compaction]
INFO  2024-08-11 17:07:48,757 [shard 0:comp] compaction - [Compact system.truncated 3c4dcd60-5804-11ef-baf5-dbec4664306b] Compacted 3 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system/truncated-38c19fd0fb863310a4b70d0cc66628aa/me-3gil_1bl0_3yj682ucpvjf1okqqz-big-Data.db:level=0]. 16kB to 5973 bytes (~35% of original) in 54ms = 310kB/s. ~384 total partitions merged to 1.
INFO  2024-08-11 17:07:48,759 [shard 0:comp] compaction - [Compact system_distributed.cdc_streams_descriptions_v2 3c5c7360-5804-11ef-baf5-dbec4664306b] Compacting [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system_distributed/cdc_streams_descriptions_v2-0bf73fd765b236b085e5658131d5df36/me-3gil_1bji_18kxs2lr0fx7773a8y-big-Data.db:level=0:origin=repair,/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system_distributed/cdc_streams_descriptions_v2-0bf73fd765b236b085e5658131d5df36/me-3gil_1bjk_0cfj42lr0fx7773a8y-big-Data.db:level=0:origin=repair]
INFO  2024-08-11 17:07:48,955 [shard 0:comp] compaction - [Compact system_distributed.cdc_streams_descriptions_v2 3c5c7360-5804-11ef-baf5-dbec4664306b] Compacted 2 sstables to [/jenkins/workspace/scylla-6.0/dtest-debug/scylla/.dtest/dtest-3kwu0y5q/test/node4/data/system_distributed/cdc_streams_descriptions_v2-0bf73fd765b236b085e5658131d5df36/me-3gil_1bl0_4j3ww2ucpvjf1okqqz-big-Data.db:level=0]. 53kB to 55kB (~103% of original) in 161ms = 334kB/s. ~256 total partitions merged to 2.
INFO  2024-08-11 17:38:02,181 [shard 0:main] compaction_manager - Asked to stop

An error message was also found on node 4:

failed on teardown with "AssertionError: Unexpected errors found:
node4: 1 errors
ERROR 2024-08-11 17:07:02,273 [shard 0: gms] raft_topology - raft_topology_cmd stream_ranges failed with: seastar::abort_requested_exception (abort requested)"

This could be expected, because the operation was terminated.
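
The hang shows up in the node4 log as a long silence between the last compaction line (17:07:48,955) and "Asked to stop" (17:38:02,181). A minimal triage sketch for spotting such a gap, assuming log lines start with a level followed by a `YYYY-MM-DD HH:MM:SS,mmm` timestamp as in the excerpt above. The helper name `largest_gap` is hypothetical and not part of dtest:

```python
import re
from datetime import datetime

# Timestamp at the start of a log line, e.g.
# "INFO  2024-08-11 17:07:48,955 [shard 0:comp] ..."
_TS = re.compile(r"^\w+\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the longest silence.

    Lines without a leading timestamp are skipped. Returns None when fewer
    than two timestamped lines are present.
    """
    stamped = []
    for line in lines:
        m = _TS.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            stamped.append((ts, line))

    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best


if __name__ == "__main__":
    # Two lines abbreviated from the log excerpt above.
    sample = [
        "INFO  2024-08-11 17:07:48,955 [shard 0:comp] compaction - (abbreviated)",
        "INFO  2024-08-11 17:38:02,181 [shard 0:main] compaction_manager - Asked to stop",
    ]
    print(largest_gap(sample))
```

On the two lines above this reports a gap of roughly 30 minutes, which is the window in which the node made no visible progress.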

aleksbykov commented 3 months ago

I think it is a test issue.


        if tablets_enabled:
            new_node.start(wait_other_notice=True)
            stress_thread = write_in_background(node1, duration_seconds=5)
            results = stress_thread.result()
            assert_cs_success(results)
            logger.debug(format_cs_output(results))

With tablets enabled, the test tries to rebootstrap the node without removing its data. Because the previous bootstrap was aborted, the cluster has banned the node with that IP and host ID, so it cannot be bootstrapped again with the same host ID.
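
A minimal sketch of the kind of cleanup that appears to be missing before the restart, assuming the node keeps its on-disk state in subdirectories of its node directory. The function name `wipe_node_state` and the directory names in `STATE_DIRS` are hypothetical; this uses only the Python standard library and is not the dtest or ccm API:

```python
import shutil
from pathlib import Path

# Hypothetical subdirectory names; the real layout depends on how the
# test harness creates the node directory.
STATE_DIRS = ("data", "commitlogs", "hints", "view_hints")


def wipe_node_state(node_dir, state_dirs=STATE_DIRS):
    """Empty each state directory under node_dir, keeping the directories.

    Returns the names of the directories that were found and emptied.
    """
    emptied = []
    for name in state_dirs:
        d = Path(node_dir) / name
        if not d.is_dir():
            continue
        for entry in d.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # remove a whole subtree
            else:
                entry.unlink()  # remove a file or symlink
        emptied.append(name)
    return emptied
```

With the old state removed, the node would presumably start with a fresh host ID instead of the banned one. Whether the test should wipe the data or take a different path in tablets mode is for the test owners to decide.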

mykaul commented 3 months ago

@bhalevy ^^^

BTW, this is also happening on dtest-release, no? https://jenkins.scylladb.com/job/scylla-6.0/job/dtest-release/10/testReport/

kbr-scylla commented 3 months ago

With tablets enabled, the test tries to rebootstrap the node without removing its data. Because the previous bootstrap was aborted, the cluster has banned the node with that IP and host ID, so it cannot be bootstrapped again with the same host ID.

Yup. Thanks @aleksbykov.

Looks like the test was broken in scylladb/scylla-dtest@bb71da5bd. It also has weird history after that: scylladb/scylla-dtest@49aa179b27be3178a30594da2d507f0622c605da, scylladb/scylla-dtest@93468e62fe9f8e3a6f3946e84940c09b1c3111c5

@yarongilor @bhalevy has this test ever worked in tablets mode?

yarongilor commented 3 months ago

With tablets enabled, the test tries to rebootstrap the node without removing its data. Because the previous bootstrap was aborted, the cluster has banned the node with that IP and host ID, so it cannot be bootstrapped again with the same host ID.

Yup. Thanks @aleksbykov.

Looks like the test was broken in scylladb/scylla-dtest@bb71da5bd. It also has weird history after that: scylladb/scylla-dtest@49aa179, scylladb/scylla-dtest@93468e6

@yarongilor @bhalevy has this test ever worked in tablets mode?

@kbr-scylla, yes, it passes regularly, for example in https://jenkins.scylladb.com/job/scylla-master/job/tablets/job/dtest-release-with-tablets/66/testReport/bootstrap_test/TestBootstrap/.

kostja commented 3 months ago

Seen again in https://jenkins.scylladb.com/job/scylla-6.0/job/dtest-debug/24/testReport/bootstrap_test/TestBootstrap/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split012___test_cluster_become_unavailable_when_kill_node_during_bootstrap_gracefully_/

mykaul commented 1 month ago

@pehala, @bhalevy: it's unclear to me whether this is a test issue, and what the next step is here.