scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.57k stars 1.29k forks source link

Loading sstables that do not match the schema causes - boot_strapper - Error during bootstrap: streaming::stream_exc eption (Stream failed) #6646

Open amoskong opened 4 years ago

amoskong commented 4 years ago

Installation details Scylla version (or git commit hash): 4.1.rc2-0.20200609.c6ee86b512f Cluster size: 4 OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-09310b5117d81331a

DB log: https://cloudius-jenkins-test.s3.amazonaws.com/ca3f1548-ba99-44d6-8cf6-bf56bcd5e765/20200613_014948/db-cluster-ca3f1548.zip

Description

In Longevity-200gb-48h of 4.1.rc2, the first decommission nemesis succeeded, but the second decommission nemesis failed for that the new node failed to bootstrap for streaming error.

Note: we just re-disabled repair-based ops (enable_repair_based_node_ops) in 4.1

Test info

Grafana snapshots

Key test log:

### First decommission nemesis succeeded

< t:2020-06-11 19:35:12,303 f:nemesis.py      l:661  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.LimitedChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method nodetool_decommission
< t:2020-06-11 21:36:56,914 f:nemesis.py      l:671  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.LimitedChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method nodetool_decommission

### Second decommission nemesis failed.

< t:2020-06-12 03:28:40,798 f:nemesis.py      l:661  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.LimitedChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method nodetool_decommission

2020-06-12T16:12:11+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !INFO    | scylla: [shard 0] stream_session - [Stream #78b66470-acc7-11ea-9edb-000000000008] Received failed complete message, peer=10.0.45.208
2020-06-12T16:12:11+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] stream_session - [Stream #78b66470-acc7-11ea-9edb-000000000008] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.45.208}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
2020-06-12T16:12:11+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap with 10.0.45.208 for keyspace=keyspace1 failed, took 0.049 seconds: streaming::stream_exception (Stream failed)

2020-06-12T16:26:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed, took 859 seconds, nr_ranges_remaining=1040
2020-06-12T16:26:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed to stream. Will retry in 60 seconds ...

2020-06-12T16:39:38+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] boot_strapper - Error during bootstrap: streaming::stream_exc
eption (Stream failed)
2020-06-12T16:39:38+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !INFO    | scylla: [shard 0] init - Shutting down gossiping
2020-06-12T16:39:38+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !INFO    | scylla: [shard 0] gossip - My status = BOOT
2020-06-12T16:39:38+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] gossip - No local state or state is in silent shutdown, not a
nnouncing shutdown

2020-06-12T16:39:38+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !INFO    | scylla: [shard 4] compaction_manager - Stopped
2020-06-12T16:39:43+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 2] storage_proxy - Failed to apply mutation from 10.0.219.111#2: seastar::gate_closed_exception (gate closed)

2020-06-12 22:12:17.863: (DisruptionEvent Severity.ERROR): type=end name=Decommission node=Node longevity-200gb-48h-verify-limited--db-node-ca3f1548-4 [34.243.88.21 | 10.0.185.204] (seed: False) duration=67416 error=[Node longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 [54.246.66.231 | 10.0.90.97] (seed: False)] NodeSetupFailed: RetryError[Wait for: db_up: timeout - 21600 seconds - expired]

/CC @roydahan @slivne @asias

amoskong commented 4 years ago

Nemesis info

06-11 01:19:11,896  >>>>>>>>>>>>>Started random_disrupt_method modify_table
06-11 01:19:17,959  <<<<<<<<<<<<<Finished random_disrupt_method modify_table
06-11 01:50:46,368  >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-11 01:54:54,546  <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-11 02:26:25,224  >>>>>>>>>>>>>Started random_disrupt_method nodetool_drain
06-11 02:28:55,533  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_drain
06-11 03:00:31,413  >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 03:02:00,169  <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 03:33:31,665  >>>>>>>>>>>>>Started random_disrupt_method abort_repair
06-11 03:56:07,437  <<<<<<<<<<<<<Finished random_disrupt_method abort_repair
06-11 04:27:45,742  >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 04:29:30,202  <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 05:01:01,832  >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-11 05:06:31,745  <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-11 05:38:09,383  >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 05:39:47,653  <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 06:11:22,618  >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-11 06:15:01,885  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-11 06:46:34,879  >>>>>>>>>>>>>Started random_disrupt_method show_toppartitions
06-11 06:47:06,880  <<<<<<<<<<<<<Finished random_disrupt_method show_toppartitions
06-11 07:18:36,483  >>>>>>>>>>>>>Started random_disrupt_method nodetool_cleanup
06-11 08:04:38,241  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_cleanup
06-11 08:36:04,854  >>>>>>>>>>>>>Started random_disrupt_method stop_start_scylla_server
06-11 08:38:36,563  <<<<<<<<<<<<<Finished random_disrupt_method stop_start_scylla_server
06-11 09:10:08,790  >>>>>>>>>>>>>Started random_disrupt_method no_corrupt_repair
06-11 09:25:53,969  <<<<<<<<<<<<<Finished random_disrupt_method no_corrupt_repair
06-11 09:57:26,032  >>>>>>>>>>>>>Started random_disrupt_method mgmt_backup
06-11 13:13:02,947  <<<<<<<<<<<<<Finished random_disrupt_method mgmt_backup
06-11 13:44:29,534  >>>>>>>>>>>>>Started random_disrupt_method mgmt_repair_cli
06-11 14:19:10,992  <<<<<<<<<<<<<Finished random_disrupt_method mgmt_repair_cli
06-11 14:50:40,494  >>>>>>>>>>>>>Started random_disrupt_method soft_reboot_node
06-11 14:55:14,545  <<<<<<<<<<<<<Finished random_disrupt_method soft_reboot_node
06-11 15:26:40,317  >>>>>>>>>>>>>Started random_disrupt_method abort_repair
06-11 15:47:32,064  <<<<<<<<<<<<<Finished random_disrupt_method abort_repair
06-11 16:19:00,632  >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-11 16:23:34,105  <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-11 16:55:04,388  >>>>>>>>>>>>>Started random_disrupt_method no_corrupt_repair
06-11 17:15:50,709  <<<<<<<<<<<<<Finished random_disrupt_method no_corrupt_repair
06-11 17:47:20,624  >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-11 17:51:53,885  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-11 18:23:23,489  >>>>>>>>>>>>>Started random_disrupt_method major_compaction
06-11 18:31:59,768  <<<<<<<<<<<<<Finished random_disrupt_method major_compaction
06-11 19:03:30,295  >>>>>>>>>>>>>Started random_disrupt_method modify_table
06-11 19:03:41,134  <<<<<<<<<<<<<Finished random_disrupt_method modify_table
06-11 19:35:12,303  >>>>>>>>>>>>>Started random_disrupt_method nodetool_decommission
06-11 21:36:56,914  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_decommission
06-11 22:08:27,797  >>>>>>>>>>>>>Started random_disrupt_method major_compaction
06-11 22:17:43,776  <<<<<<<<<<<<<Finished random_disrupt_method major_compaction
06-11 22:49:10,254  >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 22:51:04,066  <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 23:22:34,026  >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-11 23:26:12,411  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-11 23:57:44,178  >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-12 00:02:36,192  <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-12 00:34:12,040  >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-12 00:38:42,110  <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-12 01:10:17,706  >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-12 01:13:59,530  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-12 01:45:34,879  >>>>>>>>>>>>>Started random_disrupt_method truncate
06-12 01:47:39,563  <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-12 02:19:15,945  >>>>>>>>>>>>>Started random_disrupt_method nodetool_refresh

### refresh nemesis loaded the sstables, but select query failed for known SCT (test) issue

06-12 02:52:23,325  >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-12 02:57:05,000  <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-12 03:28:40,798  >>>>>>>>>>>>>Started random_disrupt_method nodetool_decommission

### New added node (db-node-8) failed to bootstrap for stream_exception

2020-06-12T16:12:11+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !INFO    | scylla: [shard 0] stream_session - [Stream #78b66470-acc7-11ea-9edb-000000000008] Received failed complete message, peer=10.0.45.208
2020-06-12T16:12:11+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] stream_session - [Stream #78b66470-acc7-11ea-9edb-000000000008] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.45.208}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
2020-06-12T16:12:11+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap with 10.0.45.208 for keyspace=keyspace1 failed, took 0.049 seconds: streaming::stream_exception (Stream failed)

2020-06-12T16:26:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed, took 859 seconds, nr_ranges_remaining=1040
2020-06-12T16:26:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed to stream. Will retry in 60 seconds ...

2020-06-12T16:39:38+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] boot_strapper - Error during bootstrap: streaming::stream_exc
eption (Stream failed)

06-12 22:43:41,035  >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-12 22:49:44,436  <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-12 23:21:07,319  >>>>>>>>>>>>>Started random_disrupt_method mgmt_repair_cli
amoskong commented 4 years ago

Hi @asias

There is known problem, in refresh nemesis, we load some sstable files to keyspace1.standard1 (RF=3), but the data only exists in one db nodes, we need to repair all the db nodes one by one.

Related issues:

Currently we didn't repair all db nodes in refresh nemesis. In Longevity-200gb-48h of 4.1.rc2, there is a refresh nemesis before the failed decommission nemesis. Need to verify if it's related.

roydahan commented 4 years ago

probably invalid issue. the 200gb longevity has one column while the refresh try to load schema with 5 columns.

amoskong commented 4 years ago

probably invalid issue.

It is still valid, maybe low priority.

If the schema (column number) doesn't match, the refresh will fail, but it should not effect the cluster, the stream_exception should not be caused in decommission or bootstrap of new node.

the 200gb longevity has one column while the refresh try to load schema with 5 columns.

I will verify manually.

asias commented 4 years ago

I looked at the log. Bootstrap and repair failed on similar issues "C1 missing in current schema in sstable files". Do we load wrong sstables in the tests?

Repair:

longevity-200gb-48h-verify-limited--db-node-ca3f1548-5/messages.log:2020-06-13T01:47:34+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-5 !INFO    | scylla: [shard 1] repair - Got error in row level repair: std::runtime_error (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)

Bootstrap:


2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !INFO    | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Executing streaming plan for Bootstrap-keyspace1-index-0 with peers=10.0.90.97, slave
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] stream_transfer_task: Fail to send to 10.0.90.97:0: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Failed to send: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming error occurred, peer=10.0.90.97
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.90.97}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
amoskong commented 4 years ago

I looked at the log. Bootstrap and repair failed on similar issues "C1 missing in current schema in sstable files". Do we load wrong sstables in the tests?

Yes.

Expected columns: 'C0 C1 C2 C3 C4' Actually columns: 'C0'

The backup sstable files have different schema with the cluster.

Repair:

longevity-200gb-48h-verify-limited--db-node-ca3f1548-5/messages.log:2020-06-13T01:47:34+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-5 !INFO    | scylla: [shard 1] repair - Got error in row level repair: std::runtime_error (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)

Bootstrap:


2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !INFO    | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Executing streaming plan for Bootstrap-keyspace1-index-0 with peers=10.0.90.97, slave
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] stream_transfer_task: Fail to send to 10.0.90.97:0: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Failed to send: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming error occurred, peer=10.0.90.97
2020-06-12T16:27:28+00:00  longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.90.97}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
slivne commented 4 years ago

(not urgent)