scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

scylla-server.service didn't finish within 500 seconds after upgrade from 4.6.rc1 to 4.6.rc2 #9895

Closed fgelcer closed 1 year ago

fgelcer commented 2 years ago

Test details

Test: upgrade_test.UpgradeTest.test_rolling_upgrade
Build number: 11
Backend: aws: eu-west-1
Kernel version: 5.4.0-1035-aws
Test-id: 1d370df0-01d4-49c5-9853-2435d442877d
Start time: 2022-01-06 13:04:44
End time: 2022-01-06 17:38:47
Started by user: beni.peled
Cassandra-stress uses the shard-aware driver

System under test

ScyllaDB version: 4.6.rc1-0.20211208.542394c82 with build-id 96acb395cfbc08ffdc30ac34e1333b99d93b83d7 (ami-026af3fe59e2ee4ba)
Target ScyllaDB repo: http://downloads.scylladb.com/unstable/scylla/branch-4.6/deb/unified/2022-01-02T09:25:53Z/scylladb-4.6/scylla.list
Instance type: i3.2xlarge
Number of ScyllaDB nodes: 4

It happened during Step 5 (Upgrade rest of the nodes). That means one node had already been fully upgraded, one node had been upgraded and then rolled back to the previous version, and then we started to upgrade node-3.

After node-3 was upgraded (from 4.6.rc1 to 4.6.rc2), it started reshaping and did not finish within the 500 seconds SCT waits for a node to start:

2022-01-06 17:38:46.080: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=97aab677-ba7b-4499-ab59-edd44d9248c3, source=UpgradeTest.test_rolling_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 644, in test_rolling_upgrade
    self.upgrade_node(self.db_cluster.node_to_upgrade)
  File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 51, in inner
    func_result = func(self, *args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 239, in upgrade_node
    node.start_scylla_server(verify_up_timeout=500)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2360, in start_scylla_server
    self.start_service(service_name='scylla-server', timeout=timeout)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2349, in start_service
    self._service_cmd(service_name=service_name, cmd='start', timeout=timeout, ignore_status=ignore_status)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2346, in _service_cmd
    self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 64, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
    result = connection.run(**command_kwargs)
  File "<decorator-gen-3>", line 2, in run
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 30, in opens
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 723, in run
    return self._run(self._remote_runner(), command, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/invoke/context.py", line 102, in _run
    return runner.run(command, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/invoke/runners.py", line 380, in run
    return self._run_body(command, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/invoke/runners.py", line 442, in _run_body
    return self.make_promise() if self._asynchronous else self._finish()
  File "/usr/local/lib/python3.10/site-packages/invoke/runners.py", line 507, in _finish
    raise CommandTimedOut(result, timeout=timeout)
invoke.exceptions.CommandTimedOut: Command did not complete within 500 seconds!

Command: 'sudo systemctl start scylla-server.service'

Stdout:

Stderr:

Logs can be found here: db logs, sct logs
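The 500-second limit comes from SCT's `start_scylla_server(verify_up_timeout=500)`, which ultimately runs `sudo systemctl start scylla-server.service` with a command timeout; since `systemctl start` blocks until the unit reports ready, a long boot-time reshape surfaces as a command timeout. A minimal local sketch of the same pattern (hypothetical helper, using Python's `subprocess` timeout rather than SCT's remoter):

```python
import subprocess

def start_service(cmd, timeout):
    """Run a service-start command, raising if it doesn't finish in time.

    Mirrors the failure mode in the traceback above: the start command
    itself hangs while the node reshapes, so the wrapper times out.
    """
    try:
        subprocess.run(cmd, shell=True, check=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(
            f"Command did not complete within {timeout} seconds: {cmd!r}"
        ) from exc

# A command that outlives its timeout reproduces the error:
# start_service("sleep 2", timeout=0.5)  # raises RuntimeError
```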

roydahan commented 2 years ago

How long did it take the reshape to finish eventually? Check the db logs.

fgelcer commented 2 years ago

How long did it take the reshape to finish eventually? Check the db logs.

It did not finish: once the test was marked as "failed", all nodes were scheduled for termination...

but here, reshape started:

Jan 06 17:30:17 rolling-upgrade-4-6-centos-db-node-1d370df0-3 scylla[202646]:  [shard 0] sstable_directory - Table ks1.table1_scylla_cdc_log with compaction strategy TimeWindowCompactionStrategy found SSTables that need reshape. Starting reshape process

and this is the last message in the node's log:

Jan 06 17:46:38 rolling-upgrade-4-6-centos-db-node-1d370df0-3 scylla[202646]:  [shard 0] compaction - [Reshape ks1.table1_scylla_cdc_log 996d5420-6f18-11ec-a455-20079ac1cb61] Reshaping [/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4472-big-Data.db:level=0:origin=reshape,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3936-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3928-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3952-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3960-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3968-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3984-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3976-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4008-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4016-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3992-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4000-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4024-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4040-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4032-big-Data.db:level
=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4080-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4048-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4072-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4056-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4064-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4096-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4088-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4104-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4112-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4136-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4144-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4120-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4128-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4176-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4152-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4160-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1
_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4168-big-Data.db:level=0:origin=repair]

Meaning it took more than 16 minutes and still did not finish (16 minutes is already 960 seconds, well past the 500-second timeout)...
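For reference, the elapsed time between the two log timestamps quoted above can be checked directly:

```python
from datetime import datetime

# Timestamps from the node's log: reshape start and the last reshape message.
start = datetime.fromisoformat("2022-01-06 17:30:17")
last = datetime.fromisoformat("2022-01-06 17:46:38")

elapsed = (last - start).total_seconds()
print(elapsed)  # 981.0 seconds: over 16 minutes, nearly double the 500 s limit
```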

slivne commented 2 years ago

@roydahan / @fgelcer is this a regression (e.g. did we ever before have reshape start on upgrade path in 4.6 ?)

@bhalevy / @raphaelsc - why is reshape started - no one changed smp count / smb_bit (AFAIK we do not) - we can't have reshape start running when people do patch level upgrades

roydahan commented 2 years ago

@roydahan / @fgelcer is this a regression (e.g. did we ever before have reshape start on upgrade path in 4.6 ?)

  • Did we change the smp count / smb_bits used? No, but it's not resharding, it's reshaping. So maybe changes related to the compaction strategy (if any) caused it. Do we have anything in the logs that tells us what the reshaping is doing and why?

@bhalevy / @raphaelsc - why is reshape started - no one changed smp count / smb_bit (AFAIK we do not) - we can't have reshape start running when people do patch level upgrades

bhalevy commented 2 years ago

@bhalevy / @raphaelsc - why is reshape started - no one changed smp count / smb_bit (AFAIK we do not) - we can't have reshape start running when people do patch level upgrades

@slivne, we already had this discussion about reshape. Unlike resharding, which has to happen (if the smp count changed), reshape on startup is not mandatory. It might indicate regular compaction running behind before the node was restarted, or possibly a change in the compaction strategy type or configuration (e.g. min_threshold or sstable_size_in_mb). There could be a reason to wait for it, or to just continue starting up and resume regular compaction (maybe with a relatively high number of shares reflecting how out-of-shape the sstables are vs. the desired state, and this should hold true at any time, not only after restart).

Bottom line: you're asking to change the default behavior. Rather than always starting reshape (and allowing it to be stopped via the API), you're suggesting performing reshape only if configured to do so via scylla.yaml or a command-line option.

I'm not against that, but I want the definition to be clear and us to all agree on it before we start implementing anything.
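The behavior change under discussion can be sketched as a boot-time decision gated by configuration. This is purely illustrative: the flag name `reshape_on_boot` and the function are hypothetical, not actual Scylla options or APIs.

```python
# Hypothetical sketch of the boot-time decision being debated above.
def handle_out_of_shape_sstables(sstables, config):
    """Decide what to do when out-of-shape sstables are found on boot."""
    needs_reshape = [s for s in sstables if s["out_of_shape"]]
    if not needs_reshape:
        return "boot"
    if config.get("reshape_on_boot", False):
        # Current default: block boot until reshape completes.
        return "reshape-then-boot"
    # Proposed default: boot immediately and let regular/off-strategy
    # compaction catch up asynchronously.
    return "boot-and-compact-async"
```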

bhalevy commented 2 years ago

As for this issue, we need to understand whether there's a bug in Scylla causing reshape after upgrade to take a long time, or whether it is the expected behavior in 4.6 given the workload before the upgrade took place.

raphaelsc commented 2 years ago

I can see that the SSTables being reshaped on boot originated from repair. Since commit a4053dbb7217, data segregation is postponed to the off-strategy phase. So the issue here is that the test shut down the instance before it had a chance to reshape the repaired data. To fix this problem, we should allow the next instance to start off-strategy compaction on those files, which is asynchronous, rather than forcing reshape on them, which blocks boot until the operation is completed.

raphaelsc commented 2 years ago

@bhalevy If you agree on my analysis, I will cook a patch to fix the problem.

bhalevy commented 2 years ago

Yes, triggering off-strategy compaction after restart is something we already agreed to do.

We can restore the maintenance sstable set when populating the table, based on the sstables' origin (detecting repair/streaming), and then kickstart off-strategy compaction.

In addition, we talked about an API to trigger off-strategy compaction at will, so that the manager or the test can call it after repair and wait for compaction to finish before moving on to the next nemesis.

That said, we need to resume off-strategy compaction if the node is restarted or crashes during repair anyhow.
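The sstable `origin` field logged earlier ("repair", "compaction", "reshape", ...) is what would drive this. A rough sketch of splitting sstables into the main set and a maintenance set when repopulating a table, under the stated assumption (illustrative only, not Scylla's actual data structures):

```python
# Sstables produced by repair/streaming go to a maintenance set, to be
# integrated later by off-strategy compaction instead of blocking boot.
MAINTENANCE_ORIGINS = {"repair", "streaming"}

def split_sets(sstables):
    """Partition sstables by their recorded origin."""
    main, maintenance = [], []
    for sst in sstables:
        target = maintenance if sst["origin"] in MAINTENANCE_ORIGINS else main
        target.append(sst)
    return main, maintenance

main_set, maintenance_set = split_sets([
    {"name": "md-4472-big-Data.db", "origin": "reshape"},
    {"name": "md-3936-big-Data.db", "origin": "repair"},
])
# main_set keeps the reshape-origin file; maintenance_set gets the repair one.
```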

raphaelsc commented 2 years ago

Yes, triggering off-strategy compaction after restart is something we already agreed to do.

We can restore the maintenance sstable set when populating the table, based on the sstables' origin (detecting repair/streaming), and then kickstart off-strategy compaction.

In addition, we talked about an API to trigger off-strategy compaction at will, so that the manager or the test can call it after repair and wait for compaction to finish before moving on to the next nemesis.

That said, we need to resume off-strategy compaction if the node is restarted or crashes during repair anyhow.

+1. thanks

raphaelsc commented 2 years ago

Patchset fixing this problem was sent to mailing list

slivne commented 2 years ago

@roydahan / @fgelcer do we run repair at all in this scenario ?

(In the documentation there is no request to run repair during an upgrade: https://docs.scylladb.com/upgrade/upgrade-opensource/upgrade-guide-from-4.5-to-4.6/upgrade-guide-from-4.5-to-4.6-ubuntu-20-04/. I am not asking to change the scenario; I am trying to validate that this is not a consequence of something we haven't thought of, e.g. repair-based operations running when they are not supposed to.)

fgelcer commented 2 years ago

@roydahan / @fgelcer do we run repair at all in this scenario ?

(In the documentation there is no request to run repair during an upgrade: https://docs.scylladb.com/upgrade/upgrade-opensource/upgrade-guide-from-4.5-to-4.6/upgrade-guide-from-4.5-to-4.6-ubuntu-20-04/. I am not asking to change the scenario; I am trying to validate that this is not a consequence of something we haven't thought of, e.g. repair-based operations running when they are not supposed to.)

@slivne, in the failure described in this issue the node got stuck after the packages were upgraded, during the node.start_scylla_server(verify_up_timeout=500) command. In any case, we do not run repair after that (and anyway, it timed out before the service was up and running).

slivne commented 2 years ago

Even though this is patched and merged, I still don't understand what's going on.

Based on @fgelcer, there is no repair in the scenario.

What tags an sstable's origin as repair?

fgelcer commented 2 years ago

Even though this is patched and merged, I still don't understand what's going on.

Based on @fgelcer, there is no repair in the scenario.

  • @fgelcer I can't find the logs with sstables source being repaired - where are the logs ?

Because during the startup of that node there is no repair running... I will download the logs and post some more info about it here.

  • @bhalevy / @raphaelsc I do not understand how we have sstables marked as source from repair if no one ran repair ?

What tags an sstable's origin as repair?

ShlomiBalalis commented 2 years ago

@slivne, I had this error appear in a 1TB longevity run.

Scylla version: 4.6.rc2-0.20220102.e8a1cfb6f with build-id 5d7b96e39c909424e8224207a162fc2c82b67214

The node in question (176.34.80.35 | 10.0.3.236) had a decommission running, which we stopped by rebooting the node. After the reboot finished, SCT waited for the node to start up, only to time out, since the reshaping process during startup took 12 minutes: 2022-01-13T21:51:25+00:00 longevity-tls-1tb-7d-4-6-db-node-382188e0-2 ! INFO | [shard 0] database - Reshaped 202GB in 724.47 seconds, 278MB/s

For comparison, this is a reshaping that took place earlier in the run on a different node, while resharding: 2022-01-11T05:29:37+00:00 longevity-tls-1tb-7d-4-6-db-node-382188e0-4 ! INFO | [shard 0] database - Reshaped 832GB in 1109.01 seconds, 750MB/s Here, the reshaping was nearly three times as fast.
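The quoted rates are consistent with the logged sizes and durations, assuming decimal units (GB = 1e9 bytes, MB = 1e6 bytes):

```python
# Verify the throughput figures quoted from the logs.
def rate_mb_s(size_gb, seconds):
    return size_gb * 1e9 / seconds / 1e6

slow = rate_mb_s(202, 724.47)    # node-2 reshape after reboot
fast = rate_mb_s(832, 1109.01)   # earlier reshape on node-4, while resharding

print(round(slow))          # 279 (the log truncates to 278 MB/s)
print(round(fast))          # 750 MB/s
print(round(fast / slow, 1))  # 2.7 -- "nearly three times as fast"
```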

Logs:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                          Log links for testrun with test id 382188e0-fc19-494c-886f-6fd253fbc651                                                                                                          |
+-----------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Date            | Log type       | Link                                                                                                                                                                                                                                                   |
+-----------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 20220113_215220 | grafana        | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_215220/grafana-screenshot-longevity-1tb-7days-test-scylla-per-server-metrics-nemesis-20220113_215505-longevity-tls-1tb-7d-4-6-monitor-node-382188e0-1.png |
| 20220113_215220 | grafana        | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_215220/grafana-screenshot-overview-20220113_215220-longevity-tls-1tb-7d-4-6-monitor-node-382188e0-1.png                                                   |
| 20220113_220721 | critical       | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/critical-382188e0.log.tar.gz                                                                                                                       |
| 20220113_220721 | db-cluster     | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/db-cluster-382188e0.tar.gz                                                                                                                         |
| 20220113_220721 | debug          | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/debug-382188e0.log.tar.gz                                                                                                                          |
| 20220113_220721 | email_data     | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/email_data-382188e0.json.tar.gz                                                                                                                    |
| 20220113_220721 | error          | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/error-382188e0.log.tar.gz                                                                                                                          |
| 20220113_220721 | event          | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/events-382188e0.log.tar.gz                                                                                                                         |
| 20220113_220721 | left_processes | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/left_processes-382188e0.log.tar.gz                                                                                                                 |
| 20220113_220721 | loader-set     | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/loader-set-382188e0.tar.gz                                                                                                                         |
| 20220113_220721 | monitor-set    | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/monitor-set-382188e0.tar.gz                                                                                                                        |
| 20220113_220721 | normal         | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/normal-382188e0.log.tar.gz                                                                                                                         |
| 20220113_220721 | output         | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/output-382188e0.log.tar.gz                                                                                                                         |
| 20220113_220721 | event          | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/raw_events-382188e0.log.tar.gz                                                                                                                     |
| 20220113_220721 | sct            | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/sct-382188e0.log.tar.gz                                                                                                                            |
| 20220113_220721 | summary        | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/summary-382188e0.log.tar.gz                                                                                                                        |
| 20220113_220721 | warning        | https://cloudius-jenkins-test.s3.amazonaws.com/382188e0-fc19-494c-886f-6fd253fbc651/20220113_220721/warning-382188e0.log.tar.gz                                                                                                                        |
+-----------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

slivne commented 2 years ago

@fgelcer

Jan 06 17:46:38 rolling-upgrade-4-6-centos-db-node-1d370df0-3 scylla[202646]:  [shard 0] compaction - [Reshape ks1.table1_scylla_cdc_log 996d5420-6f18-11ec-a455-20079ac1cb61] Reshaping [/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4472-big-Data.db:level=0:origin=reshape,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3936-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3928-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3952-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3960-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3968-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3984-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3976-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4008-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4016-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-3992-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4000-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4024-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4040-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4032-big-Data.db:level
=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4080-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4048-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4072-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4056-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4064-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4096-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4088-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4104-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4112-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4136-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4144-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4120-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4128-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4176-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4152-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4160-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/ks1/table1
_scylla_cdc_log-7ee5e0406ef511ecadae186adf292732/md-4168-big-Data.db:level=0:origin=repair]

you have

....md-4168-big-Data.db:level=0:origin=repair

I don't understand what created those files if there is no repair. Where are the logs for that node? I can't find them.

bhalevy commented 2 years ago

@slivne the log messages says "origin=repair"

slivne commented 2 years ago

The issue was opened for upgrades. Based on Fabio, there is no repair run during an upgrade, so how did we get an sstable marked with origin=repair?

In an upgrade case there should be no reshaping. I want to understand how that happened.

raphaelsc commented 2 years ago

@fgelcer is the test responsible for generating all files? or does it perform a refresh on files pulled from an external source?

raphaelsc commented 2 years ago

$ cat rolling-upgrade-4-6-centos-db-node-1d370df0-1/messages.log | grep "starting user"
2022-01-06T12:58:27+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace system_auth, repair id [id=1, uuid=1fd0640f-a97f-4b11-a954-ef2fd13c29c1], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:19:41+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace ks_no_range_ghost_test, repair id [id=1, uuid=f215625d-3761-461b-a0fd-27506c679e37], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:19:42+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace keyspace_complex, repair id [id=2, uuid=f652f3aa-bb1a-4a23-8e92-34ff3799af42], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:20:11+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace keyspace1, repair id [id=3, uuid=1148ac9d-fc4f-46f5-8a20-99cdfe5662e0], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:20:23+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace system_traces, repair id [id=4, uuid=07437c40-9e98-4348-bfce-e7645b079eab], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:20:25+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace ks1, repair id [id=5, uuid=79488a8e-634f-4c96-a136-396b0516a82e], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:24:44+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace keyspace_fill_db_data, repair id [id=6, uuid=86a9f0d4-b7c0-42d2-a041-388e195b34ac], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:27:20+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace system_auth, repair id [id=7, uuid=e1a108ca-95c5-44aa-8758-6412604caff0], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:27:24+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace keyspace_entire_test, repair id [id=8, uuid=4b767006-8fa5-4192-80db-561d4f18f621], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}
2022-01-06T17:27:44+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 !    INFO |  [shard 0] repair - starting user-requested repair for keyspace system_distributed_everywhere, repair id [id=9, uuid=cb0230b0-fa9c-4f03-86eb-c93ab3192b5c], options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { parallelism -> parallel}, { primaryRange -> false}}

Then off-strategy kicks in 5 minutes later, as expected, on sstables produced by repair.

Off-strategy for keyspace ks1, starting 5 minutes after the user-requested repair on it:

2022-01-06T17:29:43+00:00 rolling-upgrade-4-6-centos-db-node-1d370df0-1 ! INFO | [shard 6] table - Starting off-strategy compaction for ks1.table1_scylla_cdc_log, 695 candidates were found

@slivne @bhalevy FYI

roydahan commented 2 years ago

In the test there is a repair in step 4:

step = 'Step4 - Verify data during mixed cluster mode '
self.log.info(step)
self.fill_and_verify_db_data('after rollback the second node')
self.log.info('Repair the first upgraded Node')
self.db_cluster.nodes[indexes[0]].run_nodetool(sub_cmd='repair')

fgelcer commented 2 years ago

I'm seeing similar behavior, but this time not in a rolling upgrade: it is in the 1TB job (Scylla version 4.6.rc5-0.20220203.5694ec189 with build-id f5d85bf5abe6d2f9fd3487e2469ce1c34304cc14).

The test starts with the RollingConfigChangeInternodeCompression nemesis, where we set internode compression to all and then restart each node. The 1st node to be restarted was node-1 (no problems there), then node-5. Node-5 took at least 10 minutes to reshape, while we only wait 500 seconds, so the nemesis timed out and the test failed because the cluster health check did not pass (node-5 was still DN when the nemesis ended).
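The 500-second wait the nemesis runs into boils down to a poll-until-active loop with a deadline. A minimal, generic sketch of what `verify_up_timeout=500` amounts to (the names and injection points are illustrative, not SCT's actual code; in the real harness the probe would check something like `systemctl is-active scylla-server` and CQL availability):

```python
import time

def wait_until(probe, timeout_s=500, poll_s=1,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True or the deadline passes.
    `clock` and `sleep` are injectable only to make the sketch testable."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(poll_s)
    return False
```

A reshape that takes 10+ minutes simply outlives this deadline, regardless of how the probe is implemented.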

Here are some parts of the log (before we restarted the service, it was compacting):

Feb 11 13:28:26 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[972]:  [shard 1] compaction - [Compact keyspace1.standard1 7e65d980-8b3e-11ec-8258-d641bcb4dda1] Compacting [/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-952281-big-Data.db:le
vel=3:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-952267-big-Data.db:level=3:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963649-big-Data.db:level=2:origin=compaction]
Feb 11 13:28:31 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 systemd[1]: Stopping Scylla JMX...
Feb 11 13:28:32 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 systemd[1]: scylla-jmx.service: Main process exited, code=exited, status=143/n/a
Feb 11 13:28:32 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 systemd[1]: scylla-jmx.service: Failed with result 'exit-code'.
Feb 11 13:28:32 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 systemd[1]: Stopped Scylla JMX.
Feb 11 13:28:32 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[972]:  [shard 0] compaction_manager - Asked to stop

Then the service started:

Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Populating Keyspace keyspace1
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Keyspace keyspace1: Reading CF standard1 id=4a92c960-8814-11ec-8884-67f28954f76d version=77f477e2-c0fa-39fe-8171-18709b4aff8b
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000000968885.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001168466.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001156610.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000000966101.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000000964370.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001165481.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001168372.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001153621.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001167058.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001159000.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001172660.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001148627.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001117849.sstable", removing
Feb 11 13:30:05 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Found temporary sstable directory: "/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/0000000001163035.sstable", removing
Feb 11 13:30:10 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 3] LeveledManifest - Turns out that level 1 is not disjoint, found 47 overlapping SSTables, so compacting everything on behalf of keyspace1.standard1
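The "level 1 is not disjoint" message above refers to the LCS invariant that, within any level above L0, SSTable key ranges must not overlap. A simplified sketch of such a check, with ranges modelled as ordered `(first_key, last_key)` pairs (purely illustrative; Scylla compares tokens/partition keys):

```python
def count_overlaps(ranges):
    """Count adjacent overlapping (first_key, last_key) ranges after
    sorting by first key; a disjoint level yields 0."""
    s = sorted(ranges)
    return sum(1 for (_, prev_last), (nxt_first, _) in zip(s, s[1:])
               if nxt_first <= prev_last)
```

When this count is non-zero for a level, as in the log (47 overlapping SSTables), the strategy falls back to compacting everything together.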

Then we got the 1st reshape message (it was too large to paste here, so this is only the first part of it):

Feb 11 13:30:10 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 3] sstable_directory - Table keyspace1.standard1 with compaction strategy LeveledCompactionStrategy found SSTables that need reshape. Starting reshape process
Feb 11 13:30:10 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 3] compaction - [Reshape keyspace1.standard1 bc8f3da0-8b3e-11ec-bf75-ee876a599907] Reshaping [/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964407-big-Data.db:
level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963595-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964505-big-Data.db:level=1:origin=compaction,/var
/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965695-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965009-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/sta
ndard1-4a92c960881411ec888467f28954f76d/md-960543-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964029-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f
28954f76d/md-960501-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965247-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964057-big-Data.db:
level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963455-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965681-big-Data.db:level=1:origin=compaction,/var
/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964519-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-955349-big-Data.db:level=2:origin=compaction,/var/lib/scylla/data/keyspace1/sta
ndard1-4a92c960881411ec888467f28954f76d/md-965233-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965261-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f
28954f76d/md-963469-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963511-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-959675-big-Data.d
b:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965275-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965093-big-Data.db:level=1:origin=compaction,/v
ar/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964715-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965219-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/s
tandard1-4a92c960881411ec888467f28954f76d/md-965205-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965191-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec88846
7f28954f76d/md-964533-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965177-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-955405-big-Data
.db:level=2:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964897-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963539-big-Data.db:level=1:origin=compaction,
/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964673-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964603-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/s
tandard1-4a92c960881411ec888467f28954f76d/md-965569-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963637-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec88846
7f28954f76d/md-960585-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-966017-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965331-big-Data
.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965387-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965513-big-Data.db:level=1:origin=compaction,
/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965653-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965373-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1
/standard1-4a92c960881411ec888467f28954f76d/md-965065-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965639-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888
467f28954f76d/md-965429-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-959955-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965135-big-Da
ta.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965877-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960333-big-Data.db:level=1:origin=compactio
n,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965667-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-955447-big-Data.db:level=2:origin=compaction,/var/lib/scylla/data/keyspac
e1/standard1-4a92c960881411ec888467f28954f76d/md-964085-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964351-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec8
88467f28954f76d/md-965415-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965443-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965289-big-
Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964183-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964645-big-Data.db:level=1:origin=compact
ion,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965541-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965303-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keysp
ace1/standard1-4a92c960881411ec888467f28954f76d/md-965107-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964155-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411e
c888467f28954f76d/md-965317-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964589-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964421-bi
g-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964197-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963973-big-Data.db:level=1:origin=compa
ction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965625-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963623-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/key
space1/standard1-4a92c960881411ec888467f28954f76d/md-964169-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965863-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c96088141
1ec888467f28954f76d/md-966059-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960431-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964211-bi
g-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964001-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964435-big-Data.db:level=1:origin=compa
ction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960347-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964617-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/key
space1/standard1-4a92c960881411ec888467f28954f76d/md-966003-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963525-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c96088141
1ec888467f28954f76d/md-965457-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965527-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964575-
big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960459-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960557-big-Data.db:level=1:origin=com
paction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965737-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960613-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/k
eyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965555-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964547-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881
411ec888467f28954f76d/md-965765-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964561-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-96404
3-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960375-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-966073-big-Data.db:level=1:origin=c
ompaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964701-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-965975-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data
/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-963483-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-964813-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c9608
81411ec888467f28954f76d/md-964225-big-Data.db:level=1:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-960473-big-Data.db:level=1:origin=compaction, (...)

Then the next message in the log was another reshape message, this one reporting the result (again, too large to paste completely):

Feb 11 13:40:53 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 3] compaction - [Reshape keyspace1.standard1 bc8f3da0-8b3e-11ec-bf75-ee876a599907] Reshaped 886 sstables to [/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172
685-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172699-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172713-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1
-4a92c960881411ec888467f28954f76d/md-1172727-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172741-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172755-big-Data.db:level=3
,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172769-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172783-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467
f28954f76d/md-1172797-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172811-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172825-big-Data.db:level=3,/var/lib/scylla/data/k
eyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172839-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172853-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172867-b
ig-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172881-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172895-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92
c960881411ec888467f28954f76d/md-1172909-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172923-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172937-big-Data.db:level=3,/var
/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172951-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172965-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f2895
4f76d/md-1172979-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1172993-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173007-big-Data.db:level=3,/var/lib/scylla/data/keyspa
ce1/standard1-4a92c960881411ec888467f28954f76d/md-1173021-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173035-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173049-big-Da
ta.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173063-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173077-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c9608
81411ec888467f28954f76d/md-1173091-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173105-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173119-big-Data.db:level=3,/var/lib/
scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173133-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173147-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d
/md-1173161-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173175-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173189-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/s
tandard1-4a92c960881411ec888467f28954f76d/md-1173203-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173217-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173231-big-Data.db
:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173245-big-Data.db:level=3,/var/lib/scylla/data/keyspace1/standard1-4a92c960881411ec888467f28954f76d/md-1173259-big-Data.db:level=3, (...)

Then reshape was done and the node resumed its startup:

Feb 11 13:40:54 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] database - Reshaped 150GB in 643.44 seconds, 234MB/s
Feb 11 13:40:54 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] init - starting view update generator
Feb 11 13:40:54 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] init - setting up system keyspace
Feb 11 13:40:54 longevity-tls-1tb-7d-4-6-db-node-a9d099b4-5 scylla[380785]:  [shard 0] init - starting commit log
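As a sanity check, the logged throughput is consistent with the logged size and duration, assuming decimal units (which is how Scylla reports sizes; the exact rounding in the log line is an assumption):

```python
# "Reshaped 150GB in 643.44 seconds, 234MB/s" from the log above.
reshaped_bytes = 150e9          # 150 GB in decimal units (assumption)
duration_s = 643.44
throughput_mb_s = reshaped_bytes / duration_s / 1e6
print(round(throughput_mb_s))   # ~233, consistent with the logged 234MB/s
```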

@raphaelsc, was the fix you mentioned above backported to 4.6?

Here are the logs, in case something else is of interest: sct.log db logs

asias commented 1 year ago

@raphaelsc I saw

commit b6828e899ae214d8571464ec121f237069c1c4f1
Merge: c727360eca a144d30162
Author: Botond Dénes <bdenes@scylladb.com>
Date:   Fri Jan 14 14:05:09 2022 +0200

    Merge "Postpone reshape of SSTables created by repair" from Raphael

    "
    SSTables created by repair will potentially not conform to the
    compaction strategy
    layout goal. If node shuts down before off-strategy has a chance to
    reshape those files, node will be forced to reshape them on restart.
    That
    causes unexpected downtime. Turns out we can skip reshape of those files
    on boot, and allow them to be reshaped after node becomes online, as if
    the node never went down. Those files will go through same procedure as
    files created by repair-based ops. They will be placed in maintenance
    set,
    and be reshaped iteratively until ready for integration into the main
    set.
    "

    Fixes #9895.

    tests: UNIT(dev).

    * 'postpone_reshape_on_repair_originated_files' of https://github.com/raphaelsc/scylla:
      distributed_loader: postpone reshape of repair-originated sstables
      sstables: Introduce filter for sstable_directory::reshape
      table: add fast path when offstrategy is not needed
      sstables: add constant for repair origin

Can we close this issue now?

mykaul commented 1 year ago

Closing, please re-open if needed. (And I'm not seeing any real need to backport anything?)