scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

During rebuild there is a drop of operations #9060

Open slivne opened 3 years ago

slivne commented 3 years ago

A similar issue occurred in the longevity-multi-keyspaces-60h-test running on branch 4.5.rc5: a Cassandra-stress exception was raised during rebuild (CorruptThenRebuild nemesis).

2021-07-15 15:41:42.689: (DisruptionEvent Severity.NORMAL): type=CorruptThenRebuild subtype=start target_node=Node longevity-1000-keyspaces-4-5-db-node-3a9de768-4 [13.53.38.98 | 10.0.0.10] (seed: False) duration=None

After Scylla restart (stop and start):

2021-07-15 16:07:54.000: (DatabaseLogEvent Severity.NORMAL): type=STOP regex=Stopping Scylla Server line_number=910267 node=Node longevity-1000-keyspaces-4-5-db-node-3a9de768-4 [13.53.38.98 | 10.0.0.10] (seed: False)
2021-07-15T16:07:54+00:00  longevity-1000-keyspaces-4-5-db-node-3a9de768-4 !INFO    | systemd[1]: Stopping Scylla Server...
2021-07-15 16:09:31.000: (DatabaseLogEvent Severity.NORMAL): type=BOOT regex=Starting Scylla Server line_number=910944 node=Node longevity-1000-keyspaces-4-5-db-node-3a9de768-4 [13.53.38.98 | 10.0.0.10] (seed: False)
2021-07-15T16:09:31+00:00  longevity-1000-keyspaces-4-5-db-node-3a9de768-4 !INFO    | systemd[1]: Starting Scylla Server...

A few sstables were deleted. Rebuild was started at 2021-07-15 16:11:40:

2021-07-15 16:11:40.534: (NodetoolEvent Severity.NORMAL): type=rebuild subtype=start node=longevity-1000-keyspaces-4-5-db-node-3a9de768-4

"Failed to apply mutation" errors (seastar::named_semaphore_timed_out) started at 2021-07-15 18:07:33:

< t:2021-07-15 18:07:33+00:00  longevity-1000-keyspaces-4-5-db-node-3a9de768-4 !WARNING | scylla:  [shard 10] storage_proxy - Failed to apply mutation from 10.0.1.146#10: seastar::named_semaphore_timed_out (Semaphore timed out: smp_service_group#2 0->10 semaphore)
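The semaphore in that message guards the bounded cross-shard messaging queue: a mutation handed to another shard must first get a queue slot, and if the destination shard is saturated (here, plausibly by rebuild traffic) the wait times out. The following is a toy model only, in Python rather than Seastar C++; the class and function names are invented for the analogy and are not Scylla code.

```python
# Toy model of a bounded cross-shard queue with a timed semaphore wait.
# Illustrative sketch only -- Seastar's smp_service_group is C++ and
# shard-per-core; all names here are hypothetical.
import asyncio


class NamedSemaphoreTimedOut(Exception):
    pass


async def apply_mutation(sem: asyncio.Semaphore, timeout: float) -> str:
    # A writer must acquire a slot before handing the mutation to the
    # destination shard; if the shard is saturated, the wait times out.
    try:
        await asyncio.wait_for(sem.acquire(), timeout)
    except asyncio.TimeoutError:
        raise NamedSemaphoreTimedOut("Semaphore timed out: smp_service_group")
    try:
        await asyncio.sleep(0)  # stand-in for actually applying the mutation
        return "applied"
    finally:
        sem.release()


async def main() -> tuple:
    sem = asyncio.Semaphore(1)  # queue with a single slot
    await sem.acquire()         # saturate the queue (e.g. rebuild traffic)
    try:
        await apply_mutation(sem, timeout=0.01)
        first = "applied"
    except NamedSemaphoreTimedOut:
        first = "timed out"     # the failure mode seen in the log above
    sem.release()               # load drains
    second = await apply_mutation(sem, timeout=0.01)
    return first, second


print(asyncio.run(main()))  # ('timed out', 'applied')
```

The point of the model: the write itself is cheap, but under saturation the foreground write never gets a queue slot within its deadline, so it fails even though the node is up.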

Cassandra-stress failures (failed to write/read the keys) started at 2021-07-15 23:51:52:

< t:2021-07-15 23:51:52,922 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG > java.io.IOException: Operation x10 on key(s) [36394f4c313538503131]: Error executing: (WriteFailureException): Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
< t:2021-07-15 23:52:37,350 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > java.io.IOException: Operation x10 on key(s) [36394f4c313538503131]: Error executing: (WriteFailureException): Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
< t:2021-07-16 00:10:46,628 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG > java.io.IOException: Operation x10 on key(s) [365033374e3636333630]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
< t:2021-07-16 00:10:53,433 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout durinjava.io.IOException: Operation x10 on key(s) [365033374e3636333630]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
< t:2021-07-16 00:12:54,263 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG > java.io.IOException: Operation x10 on key(s) [50354c343631374f3331]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
< t:2021-07-16 00:12:55,664 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > java.io.IOException: Operation x10 on key(s) [50354c343631374f3331]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
< t:2021-07-16 00:49:38,067 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG > java.io.IOException: Operation x10 on key(s) [3532373938334b4c3631]: Error executing: (WriteFailureException): Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
< t:2021-07-16 00:49:38,485 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > com.datastax.driver.core.exceptions.ReadTimejava.io.IOException: Operation x10 on key(s) [3532373938334b4c3631]: Error executing: (WriteFailureException): Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
< t:2021-07-16 00:50:25,817 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG > java.io.IOException: Operation x10 on key(s) [4f3530394e3533373430]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
< t:2021-07-16 00:50:37,092 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica respondjava.io.IOException: Operation x10 on key(s) [4f3530394e3533373430]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
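For context on the "2 responses were required but only 1 replica responded" wording: with RF=3, QUORUM requires floor(RF/2) + 1 = 2 replica responses, so while one replica is busy rebuilding (or overloaded), losing a single additional response is enough to fail the request. A minimal sketch of that arithmetic:

```python
# Quorum arithmetic for the errors above (standard Cassandra/Scylla rule).
def quorum(rf: int) -> int:
    """Number of replica responses QUORUM requires for replication factor rf."""
    return rf // 2 + 1


def quorum_met(rf: int, responses: int) -> bool:
    return responses >= quorum(rf)


assert quorum(3) == 2
assert quorum_met(3, 2)      # healthy case: two replicas answered
assert not quorum_met(3, 1)  # the logged case: only 1 of the 2 required
```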

All of the missed keys were later found on all 3 nodes (RF=3).
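A note on how the keys can be checked: the key(s) in the cassandra-stress errors are hex-encoded ASCII, so decoding them yields the literal partition keys that can be queried on each replica. A small helper (the function name is mine, not part of cassandra-stress):

```python
# cassandra-stress reports keys as hex-encoded ASCII; decode to get the
# literal partition key.
def decode_cs_key(hex_key: str) -> str:
    return bytes.fromhex(hex_key).decode("ascii")


print(decode_cs_key("36394f4c313538503131"))  # 69OL158P11
print(decode_cs_key("365033374e3636333630"))  # 6P37N66360
```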

[Screenshots from 2021-07-18: 15-24-05, 15-24-25, 15-25-02, 15-33-02]

The node marked in green is the one running the rebuild.

Live monitor

Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.5.rc5-0.20210711.7f96719c5
Cluster size: 6 nodes (i3.8xlarge)
Scylla running with shards number (live nodes):
longevity-1000-keyspaces-4-5-db-node-3a9de768-1 (13.51.167.63 | 10.0.2.120): 30 shards
longevity-1000-keyspaces-4-5-db-node-3a9de768-2 (13.51.69.18 | 10.0.3.247): 30 shards
longevity-1000-keyspaces-4-5-db-node-3a9de768-3 (13.48.71.98 | 10.0.3.238): 30 shards
longevity-1000-keyspaces-4-5-db-node-3a9de768-4 (13.53.38.98 | 10.0.0.10): 30 shards
longevity-1000-keyspaces-4-5-db-node-3a9de768-5 (13.51.193.145 | 10.0.1.242): 30 shards
longevity-1000-keyspaces-4-5-db-node-3a9de768-6 (13.48.45.37 | 10.0.1.146): 30 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-09a138d84af5dd9e9 (aws: eu-north-1)

Test: longevity-multi-keyspaces-60h-test
Test name: longevity_test.LongevityTest.test_batch_custom_time
Test config file(s):

Restore Monitor Stack command: $ hydra investigate show-monitor 3a9de768-c86f-4b82-aafd-b0f9d35085dd
Show all stored logs command: $ hydra investigate show-logs 3a9de768-c86f-4b82-aafd-b0f9d35085dd

Test id: 3a9de768-c86f-4b82-aafd-b0f9d35085dd

Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/3a9de768-c86f-4b82-aafd-b0f9d35085dd/20210717_033618/db-cluster-3a9de768.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/3a9de768-c86f-4b82-aafd-b0f9d35085dd/20210717_033618/loader-set-3a9de768.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/3a9de768-c86f-4b82-aafd-b0f9d35085dd/20210717_033618/monitor-set-3a9de768.tar.gz
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/3a9de768-c86f-4b82-aafd-b0f9d35085dd/20210717_033618/sct-runner-3a9de768.tar.gz

Jenkins job URL

slivne commented 3 years ago

@juliayakovlev please confirm whether this is a regression in rebuild between 4.4 and 4.5, for the exact same workload under the rebuild nemesis.