scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Cassandra-stress got WriteTimeoutException and ReadTimeoutException while altering table with new columns #9102

Open aleksbykov opened 3 years ago

aleksbykov commented 3 years ago

Installation details
Scylla version (or git commit hash): 4.6.dev-0.20210720.dcd05f77b with build-id 544ca11fed80ea34a94eaf292290c0920faec3d0
Cluster size: 6 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-090bdceda0999b246 (eu-north-1)

Test: longevity-50gb-3days-test
Test config file: longevity-50GB-3days-authorization-and-tls-ssl.yaml

Issue description
The following c-s commands are used to generate the dataset and workload during the test:
prepare_write_cmd:

stress_cmd:

The stress commands had been running for about a day when the AddDropColumn nemesis added new columns to the table with the query: 'ALTER TABLE standard1 ADD ( YZYZVOI2RR map<timestamp,varchar>, NEWOPXEEHQ double );'

After that, while the new schema was being synced between nodes, node5 started reporting many errors and warnings:

DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 7] storage_proxy - Failed to apply mutation from 10.0.0.215#7: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.184#5: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 10] storage_proxy - Failed to apply mutation from 10.0.3.184#10: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.1.60#3: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla: message repeated 7 times: [  [shard 3] storage_proxy - Failed to apply mutation from 10.0.1.60#3: seastar::semaphore_timed_out (Semaphore timedout)]
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 4] storage_proxy - Failed to apply mutation from 10.0.1.216#4: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 7] storage_proxy - Failed to apply mutation from 10.0.3.184#7: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !ERR     | scylla:  [shard 11] storage_proxy - exception during mutation write to 10.0.2.104: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 7] storage_proxy - Failed to apply mutation from 10.0.3.184#7: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 4] storage_proxy - Failed to apply mutation from 10.0.1.216#4: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !WARNING | scylla:  [shard 7] storage_proxy - Failed to apply mutation from 10.0.3.184#7: seastar::semaphore_timed_out (Semaphore timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !ERR     | scylla:  [shard 8] view - Error applying view update to 10.0.2.104 (view: mview.users_by_last_name, base token: 5879805386273412502, view token: 1606062046388061453): seastar::timed_out_error (timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !ERR     | scylla:  [shard 8] view - Error applying view update to 10.0.2.104 (view: mview.users_by_last_name, base token: 5902354463894454694, view token: 9217204721083211027): seastar::timed_out_error (timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !ERR     | scylla:  [shard 8] view - Error applying view update to 10.0.2.104 (view: mview.users_by_last_name, base token: -3050749084584198888, view token: 7843344290595203264): seastar::timed_out_error (timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !ERR     | scylla:  [shard 8] view - Error applying view update to 10.0.2.104 (view: mview.users_by_last_name, base token: -2050790747959044127, view token: -213525958756543256): seastar::timed_out_error (timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !ERR     | scylla:  [shard 8] view - Error applying view update to 10.0.2.104 (view: mview.users_by_first_name, base token: -2321254850967599089, view token: -7378783356736540154): seastar::timed_out_error (timedout)
DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 !ERR     | scylla:  [shard 8] view - Error applying view update to 10.0.2.104 (view: mview.users_by_first_name, base token: 5330377005461511645, view token: 7996731367867137508): seastar::timed_out_error (timedout)
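For triage, log lines like the ones above can be tallied by severity and component to see which subsystem dominates. This is a minimal sketch that assumes only the line format visible in this excerpt; the function name `tally` and the regular expression are illustrative and are not part of SCT or Scylla.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the excerpt above, e.g.
# "... !WARNING | scylla:  [shard 7] storage_proxy - Failed to apply ..."
# Syslog "message repeated N times" lines do not match and are skipped.
LINE_RE = re.compile(
    r"!(?P<level>\w+)\s*\|\s*scylla:\s+\[shard (?P<shard>\d+)\]\s+(?P<component>\w+) - "
)


def tally(lines):
    """Count matching log lines per (level, component) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("level"), match.group("component"))] += 1
    return counts


# Two lines copied from the excerpt above (shortened after the message start).
sample = [
    "DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 "
    "!WARNING | scylla:  [shard 7] storage_proxy - Failed to apply mutation from 10.0.0.215#7",
    "DEBUG > 2021-07-23T23:28:52+00:00  longevity-tls-50gb-3d-master-db-node-1703d316-5 "
    "!ERR     | scylla:  [shard 8] view - Error applying view update to 10.0.2.104",
]

for (level, component), count in sorted(tally(sample).items()):
    print(level, component, count)
```

Feeding the full node log into `tally` instead of `sample` would give per-component counts for the whole incident window.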

The c-s read command then got the following errors:

< t:2021-07-23 23:29:09,555 f:base.py         l:222  c:RemoteCmdRunner      p:DEBUG > com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
< t:2021-07-23 23:29:09,559 f:base.py         l:222  c:RemoteCmdRunner      p:DEBUG > java.io.IOException: Operation x10 on key(s) [373138504d354c313130]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
< t:2021-07-23 23:29:09,559 f:base.py         l:222  c:RemoteCmdRunner      p:DEBUG > com.datastax.driver.core.exceptions.WriteFailureException: Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)

After that, the c-s command terminated.
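The "2 responses were required" in the errors above follows from how QUORUM is computed: floor(RF / 2) + 1 replicas must respond. The replication factor is not stated in this report, so the value of 3 used below is an assumption (an RF of 2 would also give a quorum of 2). A small sketch of the arithmetic:

```python
def quorum(replication_factor: int) -> int:
    """Number of replica responses needed for consistency level QUORUM."""
    return replication_factor // 2 + 1


rf = 3  # assumption: the keyspace replication factor is not given in the report
required = quorum(rf)
responded = 1  # from the error text: "only 1 replica responded"

print(f"RF={rf}: QUORUM needs {required}, got {responded}, "
      f"short by {required - responded}")
```

Under that assumption, a single slow or timing-out replica in addition to the one that failed is enough to fail the request at QUORUM.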

Live monitor: http://13.53.214.97:3000/d/m2j9TKZnk/longevity-50gb-3days-scylla-per-server-metrics-nemesis-master?orgId=1&from=1626985117320&to=1627283091954&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All&var-sct_tags=DisruptionEvent&var-sct_tags=CoreDumpEvent

Nodes:

longevity-tls-50gb-3d-master-db-node-1703d316-1      | eu-north-1a | 13.51.171.222 
longevity-tls-50gb-3d-master-db-node-1703d316-3      | eu-north-1a | 13.51.47.27   
longevity-tls-50gb-3d-master-db-node-1703d316-5      | eu-north-1a | 13.51.55.209  
longevity-tls-50gb-3d-master-db-node-1703d316-6      | eu-north-1a | 13.48.105.245 
longevity-tls-50gb-3d-master-db-node-1703d316-8      | eu-north-1a | 13.48.67.127  
longevity-tls-50gb-3d-master-db-node-1703d316-9      | eu-north-1a | 13.49.226.123 

Restore Monitor Stack command: $ hydra investigate show-monitor 1703d316-469f-477b-91ef-dcc3cd3268d7
Show all stored logs command: $ hydra investigate show-logs 1703d316-469f-477b-91ef-dcc3cd3268d7

Test id: 1703d316-469f-477b-91ef-dcc3cd3268d7

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/1703d316-469f-477b-91ef-dcc3cd3268d7/20210723_233850/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20210723_234143-longevity-tls-50gb-3d-master-monitor-node-1703d316-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/1703d316-469f-477b-91ef-dcc3cd3268d7/20210723_233850/grafana-screenshot-overview-20210723_233850-longevity-tls-50gb-3d-master-monitor-node-1703d316-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/1703d316-469f-477b-91ef-dcc3cd3268d7/20210726_042434/db-cluster-1703d316.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/1703d316-469f-477b-91ef-dcc3cd3268d7/20210726_042434/loader-set-1703d316.tar.gz

Jenkins job URL

slivne commented 3 years ago

@roydahan add items

roydahan commented 3 years ago

Similar to, and maybe a duplicate of, #9054, #9053, and #8969.