Open amoskong opened 4 years ago
06-11 01:19:11,896 >>>>>>>>>>>>>Started random_disrupt_method modify_table
06-11 01:19:17,959 <<<<<<<<<<<<<Finished random_disrupt_method modify_table
06-11 01:50:46,368 >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-11 01:54:54,546 <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-11 02:26:25,224 >>>>>>>>>>>>>Started random_disrupt_method nodetool_drain
06-11 02:28:55,533 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_drain
06-11 03:00:31,413 >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 03:02:00,169 <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 03:33:31,665 >>>>>>>>>>>>>Started random_disrupt_method abort_repair
06-11 03:56:07,437 <<<<<<<<<<<<<Finished random_disrupt_method abort_repair
06-11 04:27:45,742 >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 04:29:30,202 <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 05:01:01,832 >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-11 05:06:31,745 <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-11 05:38:09,383 >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 05:39:47,653 <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 06:11:22,618 >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-11 06:15:01,885 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-11 06:46:34,879 >>>>>>>>>>>>>Started random_disrupt_method show_toppartitions
06-11 06:47:06,880 <<<<<<<<<<<<<Finished random_disrupt_method show_toppartitions
06-11 07:18:36,483 >>>>>>>>>>>>>Started random_disrupt_method nodetool_cleanup
06-11 08:04:38,241 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_cleanup
06-11 08:36:04,854 >>>>>>>>>>>>>Started random_disrupt_method stop_start_scylla_server
06-11 08:38:36,563 <<<<<<<<<<<<<Finished random_disrupt_method stop_start_scylla_server
06-11 09:10:08,790 >>>>>>>>>>>>>Started random_disrupt_method no_corrupt_repair
06-11 09:25:53,969 <<<<<<<<<<<<<Finished random_disrupt_method no_corrupt_repair
06-11 09:57:26,032 >>>>>>>>>>>>>Started random_disrupt_method mgmt_backup
06-11 13:13:02,947 <<<<<<<<<<<<<Finished random_disrupt_method mgmt_backup
06-11 13:44:29,534 >>>>>>>>>>>>>Started random_disrupt_method mgmt_repair_cli
06-11 14:19:10,992 <<<<<<<<<<<<<Finished random_disrupt_method mgmt_repair_cli
06-11 14:50:40,494 >>>>>>>>>>>>>Started random_disrupt_method soft_reboot_node
06-11 14:55:14,545 <<<<<<<<<<<<<Finished random_disrupt_method soft_reboot_node
06-11 15:26:40,317 >>>>>>>>>>>>>Started random_disrupt_method abort_repair
06-11 15:47:32,064 <<<<<<<<<<<<<Finished random_disrupt_method abort_repair
06-11 16:19:00,632 >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-11 16:23:34,105 <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-11 16:55:04,388 >>>>>>>>>>>>>Started random_disrupt_method no_corrupt_repair
06-11 17:15:50,709 <<<<<<<<<<<<<Finished random_disrupt_method no_corrupt_repair
06-11 17:47:20,624 >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-11 17:51:53,885 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-11 18:23:23,489 >>>>>>>>>>>>>Started random_disrupt_method major_compaction
06-11 18:31:59,768 <<<<<<<<<<<<<Finished random_disrupt_method major_compaction
06-11 19:03:30,295 >>>>>>>>>>>>>Started random_disrupt_method modify_table
06-11 19:03:41,134 <<<<<<<<<<<<<Finished random_disrupt_method modify_table
06-11 19:35:12,303 >>>>>>>>>>>>>Started random_disrupt_method nodetool_decommission
06-11 21:36:56,914 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_decommission
06-11 22:08:27,797 >>>>>>>>>>>>>Started random_disrupt_method major_compaction
06-11 22:17:43,776 <<<<<<<<<<<<<Finished random_disrupt_method major_compaction
06-11 22:49:10,254 >>>>>>>>>>>>>Started random_disrupt_method truncate
06-11 22:51:04,066 <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-11 23:22:34,026 >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-11 23:26:12,411 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-11 23:57:44,178 >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-12 00:02:36,192 <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-12 00:34:12,040 >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-12 00:38:42,110 <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-12 01:10:17,706 >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-12 01:13:59,530 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-12 01:45:34,879 >>>>>>>>>>>>>Started random_disrupt_method truncate
06-12 01:47:39,563 <<<<<<<<<<<<<Finished random_disrupt_method truncate
06-12 02:19:15,945 >>>>>>>>>>>>>Started random_disrupt_method nodetool_refresh
### refresh nemesis loaded the sstables, but select query failed for known SCT (test) issue
06-12 02:52:23,325 >>>>>>>>>>>>>Started random_disrupt_method hard_reboot_node
06-12 02:57:05,000 <<<<<<<<<<<<<Finished random_disrupt_method hard_reboot_node
06-12 03:28:40,798 >>>>>>>>>>>>>Started random_disrupt_method nodetool_decommission
### New added node (db-node-8) failed to bootstrap for stream_exception
2020-06-12T16:12:11+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !INFO | scylla: [shard 0] stream_session - [Stream #78b66470-acc7-11ea-9edb-000000000008] Received failed complete message, peer=10.0.45.208
2020-06-12T16:12:11+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] stream_session - [Stream #78b66470-acc7-11ea-9edb-000000000008] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.45.208}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
2020-06-12T16:12:11+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap with 10.0.45.208 for keyspace=keyspace1 failed, took 0.049 seconds: streaming::stream_exception (Stream failed)
2020-06-12T16:26:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed, took 859 seconds, nr_ranges_remaining=1040
2020-06-12T16:26:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] range_streamer - Bootstrap failed to stream. Will retry in 60 seconds ...
2020-06-12T16:39:38+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-8 !WARNING | scylla: [shard 0] boot_strapper - Error during bootstrap: streaming::stream_exc
eption (Stream failed)
06-12 22:43:41,035 >>>>>>>>>>>>>Started random_disrupt_method nodetool_enospc
06-12 22:49:44,436 <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
06-12 23:21:07,319 >>>>>>>>>>>>>Started random_disrupt_method mgmt_repair_cli
Hi @asias
There is known problem, in refresh nemesis, we load some sstable files to keyspace1.standard1 (RF=3), but the data only exists in one db nodes, we need to repair all the db nodes one by one.
Related issues:
Currently we didn't repair all db nodes in refresh nemesis. In Longevity-200gb-48h of 4.1.rc2, there is a refresh nemesis before the failed decommission nemesis. Need to verify if it's related.
probably invalid issue. the 200gb longevity has one column while the refresh try to load schema with 5 columns.
probably invalid issue.
It is still valid, maybe low priority.
If the schema (column number) doesn't match, the refresh will fail, but it should not effect the cluster, the stream_exception should not be caused in decommission or bootstrap of new node.
the 200gb longevity has one column while the refresh try to load schema with 5 columns.
I will verify manually.
I looked at the log. Bootstrap and repair failed on similar issues "C1 missing in current schema in sstable files". Do we load wrong sstables in the tests?
Repair:
longevity-200gb-48h-verify-limited--db-node-ca3f1548-5/messages.log:2020-06-13T01:47:34+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-5 !INFO | scylla: [shard 1] repair - Got error in row level repair: std::runtime_error (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
Bootstrap:
2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !INFO | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Executing streaming plan for Bootstrap-keyspace1-index-0 with peers=10.0.90.97, slave
2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] stream_transfer_task: Fail to send to 10.0.90.97:0: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Failed to send: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming error occurred, peer=10.0.90.97
2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.90.97}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
I looked at the log. Bootstrap and repair failed on similar issues "C1 missing in current schema in sstable files". Do we load wrong sstables in the tests?
Yes.
Expected columns: 'C0 C1 C2 C3 C4' Actually columns: 'C0'
The backup sstable files have different schema with the cluster.
Repair:
longevity-200gb-48h-verify-limited--db-node-ca3f1548-5/messages.log:2020-06-13T01:47:34+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-5 !INFO | scylla: [shard 1] repair - Got error in row level repair: std::runtime_error (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db)
Bootstrap:
2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !INFO | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Executing streaming plan for Bootstrap-keyspace1-index-0 with peers=10.0.90.97, slave 2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] stream_transfer_task: Fail to send to 10.0.90.97:0: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db) 2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Failed to send: sstables::malformed_sstable_exception (Column C1 missing in current schema in sstable /var/lib/scylla/data/keyspace1/standard1-999fca40ab7911eab3c2000000000008/mc-359942-big-Data.db) 2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming error occurred, peer=10.0.90.97 2020-06-12T16:27:28+00:00 longevity-200gb-48h-verify-limited--db-node-ca3f1548-2 !WARNING | scylla: [shard 0] stream_session - [Stream #9b84ff50-acc9-11ea-9edb-000000000008] Streaming plan for Bootstrap-keyspace1-index-0 failed, peers={10.0.90.97}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
(not urgent)
Installation details Scylla version (or git commit hash): 4.1.rc2-0.20200609.c6ee86b512f Cluster size: 4 OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-09310b5117d81331a
DB log: https://cloudius-jenkins-test.s3.amazonaws.com/ca3f1548-ba99-44d6-8cf6-bf56bcd5e765/20200613_014948/db-cluster-ca3f1548.zip
Description
In Longevity-200gb-48h of 4.1.rc2, the first decommission nemesis succeeded, but the second decommission nemesis failed for that the new node failed to bootstrap for streaming error.
Note: we just re-disabled repair-based ops (
enable_repair_based_node_ops
) in 4.1Test info
Grafana snapshots
Key test log:
/CC @roydahan @slivne @asias