scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.49k stars 1.28k forks source link

500k errors of paxos - CAS: mutation_write_timeout_exception accompanied by semaphore timeouts #8715

Closed bentsi closed 3 years ago

bentsi commented 3 years ago

Installation details Kernel version: 5.4.0-1035-aws Scylla version (or git commit hash): 4.6.dev-0.20210525.6144656b2 with build-id bd05ced7ce11303a84d0123e35079c5ebe69bf08 Cluster size: 4 nodes (i3.2xlarge) Scylla running with shards number (live nodes):

Test: longevity-lwt-3h-test Test name: longevity_lwt_test.LWTLongevityTest.test_lwt_longevity Test config file(s):

Issue description

During the test run got 500k errors. Errors started to appear immediately after the stress workload began.

2021-05-26 07:17:05.000: (DatabaseLogEvent Severity.WARNING): type=SYSTEM_PAXOS_TIMEOUT regex=.*mutation_write_*|.*Operation timed out for system.paxos.*|.*Operation failed for system.paxos.* line_number=366900 node=Node longevity-lwt-3h-master-db-node-af943a81-4 [13.49.69.31 | 10.0.3.142] (seed: False)
2021-05-26T07:17:05+00:00  longevity-lwt-3h-master-db-node-af943a81-4 !ERR     | scylla:  [shard 3] paxos - CAS[188114] prune: failed exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-26 07:17:05.000: (DatabaseLogEvent Severity.WARNING): type=SYSTEM_PAXOS_TIMEOUT regex=.*mutation_write_*|.*Operation timed out for system.paxos.*|.*Operation failed for system.paxos.* line_number=366901 node=Node longevity-lwt-3h-master-db-node-af943a81-4 [13.49.69.31 | 10.0.3.142] (seed: False)
2021-05-26T07:17:05+00:00  longevity-lwt-3h-master-db-node-af943a81-4 !INFO    | scylla:  [shard 4] rpc - client 10.0.3.249:50236 msg_id 503619:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-05-26 07:17:05.000: (DatabaseLogEvent Severity.WARNING): type=SYSTEM_PAXOS_TIMEOUT regex=.*mutation_write_*|.*Operation timed out for system.paxos.*|.*Operation failed for system.paxos.* line_number=366902 node=Node longevity-lwt-3h-master-db-node-af943a81-4 [13.49.69.31 | 10.0.3.142] (seed: False)
2021-05-26T07:17:05+00:00  longevity-lwt-3h-master-db-node-af943a81-4 !ERR     | scylla:  [shard 4] paxos - CAS[187838] prune: failed exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-26 07:17:05.000: (DatabaseLogEvent Severity.WARNING): type=SYSTEM_PAXOS_TIMEOUT regex=.*mutation_write_*|.*Operation timed out for system.paxos.*|.*Operation failed for system.paxos.* line_number=366903 node=Node longevity-lwt-3h-master-db-node-af943a81-4 [13.49.69.31 | 10.0.3.142] (seed: False)
2021-05-26T07:17:05+00:00  longevity-lwt-3h-master-db-node-af943a81-4 !INFO    | scylla:  [shard 1] rpc - client 10.0.1.171:65137 msg_id 518597:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-05-26 07:17:10.000: (DatabaseLogEvent Severity.WARNING): type=SEMAPHORE_TIME_OUT regex=semaphore_timed_out line_number=366912 node=Node longevity-lwt-3h-master-db-node-af943a81-4 [13.49.69.31 | 10.0.3.142] (seed: False)

Sempahore timeouts

2021-05-26T07:17:10+00:00  longevity-lwt-3h-master-db-node-af943a81-4 !WARNING | scylla:  [shard 3] storage_proxy - Failed to apply mutation from 10.0.3.249#3: seastar::named_semaphore_timed_out (Semaphore timed out: smp_service_group#2 0->3 semaphore)
21-05-26 07:17:09.000: (DatabaseLogEvent Severity.WARNING): type=SEMAPHORE_TIME_OUT regex=semaphore_timed_out line_number=204369 node=Node longevity-lwt-3h-master-db-node-af943a81-1 [13.51.157.252 | 10.0.3.249] (seed: True)
021-05-26T07:17:09+00:00  longevity-lwt-3h-master-db-node-af943a81-1 !WARNING | scylla:  [shard 2] storage_proxy - Failed to apply mutation from 10.0.0.27#2: seastar::named_semaphore_timed_out (Semaphore timed out: smp_service_group#2 5->2 semaphore)
2021-05-26 07:17:11.000: (DatabaseLogEvent Severity.WARNING): type=SEMAPHORE_TIME_OUT regex=semaphore_timed_out line_number=204370 node=Node longevity-lwt-3h-master-db-node-af943a81-1 [13.51.157.252 | 10.0.3.249] (seed: True)
2021-05-26T07:17:11+00:00  longevity-lwt-3h-master-db-node-af943a81-1 !WARNING | scylla:  [shard 1] storage_proxy - Failed to apply mutation from 10.0.3.142#1: seastar::named_semaphore_timed_out (Semaphore timed out: smp_service_group#2 2->1 semaphore)

Restore Monitor Stack command: $ hydra investigate show-monitor af943a81-6174-4f9d-a093-57b959cfce8e Show all stored logs command: $ hydra investigate show-logs af943a81-6174-4f9d-a093-57b959cfce8e

Test id: af943a81-6174-4f9d-a093-57b959cfce8e

Logs: grafana - https://cloudius-jenkins-test.s3.amazonaws.com/af943a81-6174-4f9d-a093-57b959cfce8e/20210526_071740/grafana-screenshot-longevity-lwt-3h-test-scylla-per-server-metrics-nemesis-20210526_072149-longevity-lwt-3h-master-monitor-node-af943a81-1.png db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/af943a81-6174-4f9d-a093-57b959cfce8e/20210526_072554/db-cluster-af943a81.zip loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/af943a81-6174-4f9d-a093-57b959cfce8e/20210526_072554/loader-set-af943a81.zip monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/af943a81-6174-4f9d-a093-57b959cfce8e/20210526_072554/monitor-set-af943a81.zip sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/af943a81-6174-4f9d-a093-57b959cfce8e/20210526_072554/sct-runner-af943a81.zip

Jenkins job URL

roydahan commented 3 years ago

Probably same as: #6155 , #8270 , #8706