scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.59k stars 1.29k forks source link

[Alternator] - ERROR site.ycsb.db.DynamoDBClient -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable_no_lwt.usertable_no_lwt - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null) #10525

Open fgelcer opened 2 years ago

fgelcer commented 2 years ago

i'm seeing it on alternator longevity for 4.5.6:

Installation details Kernel version: 5.4.0-1035-aws Scylla version (or git commit hash): 4.5.6-0.20220424.85d3fe744 Cluster size: 6 nodes (i3.4xlarge) Scylla running with shards number (live nodes): alternator-3h-4-5-db-node-6ef7edc2-1 (16.16.66.22 | 10.0.0.60): 14 shards alternator-3h-4-5-db-node-6ef7edc2-2 (13.51.13.163 | 10.0.2.56): 14 shards alternator-3h-4-5-db-node-6ef7edc2-4 (16.171.65.166 | 10.0.2.84): 14 shards alternator-3h-4-5-db-node-6ef7edc2-5 (13.48.70.197 | 10.0.2.128): 14 shards alternator-3h-4-5-db-node-6ef7edc2-6 (13.48.104.229 | 10.0.0.192): 14 shards alternator-3h-4-5-db-node-6ef7edc2-7 (16.170.237.120 | 10.0.3.51): 14 shards alternator-3h-4-5-db-node-6ef7edc2-8 (13.49.72.158 | 10.0.1.220): 14 shards Scylla running with shards number (terminated nodes): alternator-3h-4-5-db-node-6ef7edc2-3 (13.48.26.23 | 10.0.1.69): 14 shards OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-05a25f88ebc3136e6 (aws: eu-north-1)

Test: longevity-alternator-3h-test Test name: longevity_test.LongevityTest.test_custom_time Test config file(s):

Issue description

2022-04-27 07:21:01.158: (YcsbStressEvent Severity.ERROR): type=error node=Node alternator-3h-4-5-loader-node-6ef7edc2-1 [13.49.76.180 | 10.0.0.179] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p table=usertable_no_lwt -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:
2845926 [Thread-13] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable_no_lwt.usertable_no_lwt - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null)
2022-04-27 07:21:04.290: (YcsbStressEvent Severity.ERROR): type=error node=Node alternator-3h-4-5-loader-node-6ef7edc2-1 [13.49.76.180 | 10.0.0.179] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p table=usertable_no_lwt -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:
2849059 [Thread-11] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::mutation_write_failure_exception (Operation failed for alternator_usertable_no_lwt.usertable_no_lwt - received 1 responses and 1 failures from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null)
2022-04-27 07:21:04.995: (YcsbStressEvent Severity.ERROR): type=error node=Node alternator-3h-4-5-loader-node-6ef7edc2-4 [13.49.70.32 | 10.0.3.50] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:
2859810 [Thread-5] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null)
2022-04-27 07:21:05.420: (YcsbStressEvent Severity.ERROR): type=error node=Node alternator-3h-4-5-loader-node-6ef7edc2-1 [13.49.76.180 | 10.0.0.179] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p table=usertable_no_lwt -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:
2850190 [Thread-6] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable_no_lwt.usertable_no_lwt - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null)
2022-04-27 07:21:07.101: (YcsbStressEvent Severity.ERROR): type=error node=Node alternator-3h-4-5-loader-node-6ef7edc2-4 [13.49.70.32 | 10.0.3.50] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:
2861918 [Thread-2] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null)
2022-04-27 07:21:09.031: (YcsbStressEvent Severity.ERROR): type=error node=Node alternator-3h-4-5-loader-node-6ef7edc2-1 [13.49.76.180 | 10.0.0.179] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p table=usertable_no_lwt -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:
2853799 [Thread-7] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable_no_lwt.usertable_no_lwt - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null)

Restore Monitor Stack command: $ hydra investigate show-monitor 6ef7edc2-475d-4ede-ae31-4a8a54dd1a72 Restore monitor on AWS instance using Jenkins job Show all stored logs command: $ hydra investigate show-logs 6ef7edc2-475d-4ede-ae31-4a8a54dd1a72

Test id: 6ef7edc2-475d-4ede-ae31-4a8a54dd1a72

Logs: grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_093744/grafana-screenshot-alternator-20220427_094238-alternator-3h-4-5-monitor-node-6ef7edc2-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_093744/grafana-screenshot-alternator-20220427_094238-alternator-3h-4-5-monitor-node-6ef7edc2-1.png) grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_093744/grafana-screenshot-longevity-alternator-3h-test-scylla-per-server-metrics-nemesis-20220427_094222-alternator-3h-4-5-monitor-node-6ef7edc2-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_093744/grafana-screenshot-longevity-alternator-3h-test-scylla-per-server-metrics-nemesis-20220427_094222-alternator-3h-4-5-monitor-node-6ef7edc2-1.png) db-cluster - [https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/db-cluster-6ef7edc2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/db-cluster-6ef7edc2.tar.gz) loader-set - [https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/loader-set-6ef7edc2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/loader-set-6ef7edc2.tar.gz) monitor-set - [https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/monitor-set-6ef7edc2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/monitor-set-6ef7edc2.tar.gz) sct-runner - [https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/sct-runner-6ef7edc2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6ef7edc2-475d-4ede-ae31-4a8a54dd1a72/20220427_094808/sct-runner-6ef7edc2.tar.gz)

Jenkins job URL

nyh commented 2 years ago

Is this the same as issue #9812? And/or #10257?

fgelcer commented 2 years ago

Is this the same as issue #9812? And/or #10257?

could be, but @slivne asked to split into a new issue here

fgelcer commented 2 years ago

for reference, this job, running with 4.5.5 (ami-0276e2bbc13093d95 on eu-north-1) passed without similar issue. also, there are previous 4.5 builds that passed...

yarongilor commented 2 years ago

Reproduced in 5.0 :

Installation details

Kernel Version: 5.13.0-1021-aws Scylla version (or git commit hash): 5.0~rc3-20220406.f92622e0d with build-id 2b79c4744216b294fdbd2f277940044c899156ea Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-07835983d717b1ea3 (aws: eu-west-1)

Test: longevity-alternator-200gb-48h-test Test id: d918f269-fa5e-4057-b74f-88062e9d5d0e Test name: scylla-5.0/longevity/longevity-alternator-200gb-48h-test Test config file(s):

Issue description

>>>>>>> Happenned during a nemesis of disrupt_no_corrupt_repair alternator-48h-5-0-db-node-d918f269-6 Succeeded 2022-05-03 08:38:06 2022-05-03 09:34:31

Got YCSB errors of:

2022-05-03 09:05:03.622: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=fe4fd1ed-ec5e-427d-8a2b-0c29bd8989fa: type=error node=Node alternator-48h-5-0-loader-node-d918f269-1 [34.251.53.215 | 10.0.1.211] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
31230561 [Thread-4] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

2022-05-03 09:05:07.542: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=68350e77-ea80-4e3e-8511-17b753e44ee1: type=error node=Node alternator-48h-5-0-loader-node-d918f269-3 [34.245.62.224 | 10.0.2.148] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
31212917 [Thread-10] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

2022-05-03 09:05:12.079: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=72c25dc0-f3e7-4937-83c6-e0a8f98ae939: type=error node=Node alternator-48h-5-0-loader-node-d918f269-2 [3.248.217.20 | 10.0.0.25] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
31228768 [Thread-2] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

<<<<<<<

Logs:

Jenkins job URL

slivne commented 2 years ago

@fgelcer as far as I recall you said this did not happen in scylla version < 4.5.5 and only started happening on 4.5.6 - is that correct ?

I don't see why any patch in 4.5.5.. would cause this

[shlomi@shlomi-scylladb-com scylla]$ git shortlog scylla-4.5.5..
Avi Kivity (5):
      Update seastar submodule (pidof command not installed)
      Update seastar submodule (logger deadlock with large messages)
      transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT
      Update tools/java submodule (bad IPv6 addresses in nodetool)
      Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes

Benny Halevy (1):
      table: clear: serialize with ongoing flush

Calle Wilund (1):
      cdc: Ensure columns removed from log table are registered as dropped

Eliran Sinvani (2):
      cql3 statements: Change dependency test API to express better it's purpose
      prepared_statements: Invalidate batch statement too

Juliusz Stasiewicz (1):
      CQL: Replace assert by exception on invalid auth opcode

Piotr Sarna (1):
      cql3: fix qualifying restrictions with IN for indexing

Raphael S. Carvalho (1):
      compaction: LCS: don't write to disengaged optional on compaction completion

Tomasz Grabiec (2):
      utils/chunked_vector: Fix sigsegv during reserve()
      loading_cache: Make invalidation take immediate effect

Vlad Zolotarov (1):
      loading_shared_values/loading_cache: get rid of iterators interface and return value_ptr from find(...) instead

Yaron Kaikov (1):
      release: prepare for 4.5.6

If the answer is yes - this reproduces on 4.5.6 and not on earlier releases - lets bisect - since its not clear what caused this change

fgelcer commented 2 years ago

@fgelcer as far as I recall you said this did not happen in scylla version < 4.5.5 and only started happening on 4.5.6 - is that correct ?

when you asked, i searched for such failure on 4.5 results (prior to 4.5.6) and didn't see such failures

KnifeyMoloko commented 2 years ago

Reproduced in enterprise-2022.1 rc4:

Installation details

Kernel Version: 5.13.0-1022-aws Scylla version (or git commit hash): 2022.1~rc4-20220510.0007e0625 with build-id 41d9ba21416afba85744e3e5308d1f2e5403ce65 Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-06f239bc4ddf227d3 (aws: us-east-1)

Test: longevity-alternator-3h-test Test id: d64b2ff7-9e5b-4c78-975c-ea92c3811753 Test name: enterprise-2022.1/longevity/longevity-alternator-3h-test Test config file(s):

Issue description

>>>>>>> These errors happened in between our nemesis, after ToggleCdcFeaturePropertiesOnTable nemesis, during cluster health validation.

2022-05-11 02:03:27.274: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=5c7ed44e-f423-4ed3-b23e-93d800cca2eb: type=error node=Node alternator-3h-2022-1-loader-node-d64b2ff7-1 [44.204.210.209 | 10.0.0.124] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p table=usertable_no_lwt -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:

1448182 [Thread-6] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable_no_lwt.usertable_no_lwt - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)
2022-05-11 02:03:31.043: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=6dcd5221-b759-48d7-b60f-0aa303cca4d7: type=error node=Node alternator-3h-2022-1-loader-node-d64b2ff7-4 [3.234.206.152 | 10.0.3.216] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:

1463410 [Thread-21] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

<<<<<<<

Logs:

Jenkins job URL

yarongilor commented 2 years ago

Reproduced on same above test, in RC7.

Installation details

Kernel Version: 5.13.0-1025-aws Scylla version (or git commit hash): 5.0~rc7-20220602.41a00c744 with build-id 12671ee67c8f9a992f1e64f5647ff883e5f495e0 Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-09ad25bdd94f417a5 (aws: eu-north-1)

Test: longevity-alternator-200gb-48h-test Test id: 3253fd7f-ddb9-4af7-8d9e-f8b2e9f69fd7 Test name: scylla-5.0/longevity/longevity-alternator-200gb-48h-test Test config file(s):

Issue description

>>>>>>>


2022-06-09 19:47:05.753: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=4426b336-8e3e-4024-a1cf-cc31ed4e226b: type=error node=Node alternator-48h-5-0-loader-node-3253fd7f-4 [13.49.224.238 | 10.0.0.146] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
20431291 [Thread-19] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.0.1.146] failed: Connection refused (Connection refused)

2022-06-09 20:57:14.691: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=c8846e42-399f-4582-9103-fdefdaa2b69a: type=error node=Node alternator-48h-5-0-loader-node-3253fd7f-3 [13.53.198.225 | 10.0.2.138] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
24651292 [Thread-7] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

2022-06-09 20:57:16.463: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=21588754-e2fc-4af4-995e-6e0101992f52: type=error node=Node alternator-48h-5-0-loader-node-3253fd7f-2 [13.48.48.229 | 10.0.2.55] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
24664018 [Thread-14] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

2022-06-09 20:57:18.583: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=447550ef-cb40-476e-913d-6b12abac051a: type=error node=Node alternator-48h-5-0-loader-node-3253fd7f-1 [13.51.157.248 | 10.0.2.228] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
24677292 [Thread-12] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

<<<<<<<

Logs:

Jenkins job URL

yarongilor commented 1 year ago

Issue description

reproduced a similar error (after an add-remove-dc nemesis):

2023-02-09 14:24:17.714: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=627fc2d9-36b8-4c72-a8c9-8ce1667e00d6: type=error node=Node alternator-3h-5-2-loader-node-6bb4b3d6-1 [3.92.42.94 | 10.12.2.210] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=6990500 -p fieldcount=10 -p fieldlength=512 -p operationcount=200200300 -p dataintegrity=true -p table=usertable_no_lwt -p maxexecutiontime=10800 -s  -P /tmp/dynamodb.properties
errors:
5259371 [Thread-2] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable_no_lwt.usertable_no_lwt - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)
WARNING (100)

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Kernel Version: 5.15.0-1028-aws Scylla version (or git commit hash): 5.2.0~rc1-20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528

Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-08975a99197206651 (aws: us-east-1)

Test: longevity-alternator-3h-test Test id: 6bb4b3d6-fa67-4123-a5c9-d87340dd6392 Test name: scylla-5.2/longevity/longevity-alternator-3h-test Test config file(s):

Logs and commands - Restore Monitor Stack command: `$ hydra investigate show-monitor 6bb4b3d6-fa67-4123-a5c9-d87340dd6392` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=6bb4b3d6-fa67-4123-a5c9-d87340dd6392) - Show all stored logs command: `$ hydra investigate show-logs 6bb4b3d6-fa67-4123-a5c9-d87340dd6392` ## Logs: - **db-cluster-6bb4b3d6.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/db-cluster-6bb4b3d6.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/db-cluster-6bb4b3d6.tar.gz) - **sct-runner-6bb4b3d6.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/sct-runner-6bb4b3d6.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/sct-runner-6bb4b3d6.tar.gz) - **monitor-set-6bb4b3d6.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/monitor-set-6bb4b3d6.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/monitor-set-6bb4b3d6.tar.gz) - **loader-set-6bb4b3d6.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/loader-set-6bb4b3d6.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/loader-set-6bb4b3d6.tar.gz) - **parallel-timelines-report-6bb4b3d6.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/parallel-timelines-report-6bb4b3d6.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6bb4b3d6-fa67-4123-a5c9-d87340dd6392/20230209_161435/parallel-timelines-report-6bb4b3d6.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-5.2/job/longevity/job/longevity-alternator-3h-test/6/)
mykaul commented 1 year ago

I wonder how do we know that we do not overload the cluster? @yarongilor ?

yarongilor commented 1 year ago
  • hydra investigate show-logs 6bb4b3d6-fa67-4123-a5c9-d87340dd6392

@mykaul , the cluster is around 50% loaded at this time, according to Grafana. As can be seen with more details in this snapshot: https://snapshots.raintank.io/dashboard/snapshot/NqS3FJ9fYaUh5TueLBOHxoHP1GoOLQCE

nyh commented 1 year ago

I know it's easy to mark this failure "Alternator" and assign it to me because the reads are done via the Alternator API, but Alternator reads translate to simple CQL CL=2 reads. Do we have exactly the same issue for CQL reads? If we do, let's close this one - it doesn't help to complicate this issue by yet another subsystem (Alternator) if it can be reproduced without it.

@yarongilor said the cluster is only 50% loaded at the time so the "obvious" guess like some machines being overloaded by other things (repair, hints, compactions, materialized views, or who knows what) don't work.

So maybe the replica read timeout is real, and caused by one of the nemesises? For example, if we kill a replica after read was already started from it, we can get a timeout, and it is justified? Is speculative_retry enabled in this test? Maybe for some reason it isn't working well? If we have a bug in speculative retry and this issue happens when nodes were going down, then it can explain what we see.

Another wild guess (with absolutely no evidence to back it up) that maybe the cluster isn't CPU-bound but IO bound, and is at 100% IO load. I don't know how to check that.... If we reach 100% IO load, reads may be very slow and time out because of issues like https://github.com/scylladb/scylladb/issues/8873 or a lot of other things.

P.S. How can I see in the metrics the exact time points where you suspect there was a problem? Do we know if something of interest happened right before this time - maybe some nemesis added a new node or killed a node or deleted sstables or rebooted a node or I don't know what?

As usual, it would have been nice if there was some attempt to reproduce this issue with fewer and fewer nemeses to rule out some of the subsystems. We can always blame Alternator/Repair/Compaction/MV/SI/RBNO/etc. for this issue, but if we can reproduce it without some of these subsystems, we will know it is not to blame.

mykaul commented 1 year ago

Without reproduction, I suggest closing it. It's been ~half a year since it was run, we need to be more responsive (and I'll make sure we are) with issues.

nyh commented 1 year ago

How is the "half a year" relevant? QA keeps posting comments about this issue still happening. What will we gain if we close this issue and they open a new one, with less historic context?

If you have ideas how to be more responsive, I'm all ears. I tried to explain in my comment today why I have no idea how to proceed with this issue without dropping everything else I do and starting to learn what this test was doing when it experienced the read timeouts. Is this really what you want me to do? And even then, I'm 99% sure it has nothing to do with Alternator (but I'd like to see confirmation or denial from QA - whether the same issue is seen also in a CQL workload).

nyh commented 1 year ago

Oh, I see that @yarongilor posted today, but his comment was indeed based on a half-year old run. I agree then - if this hasn't reproduced in half a year, let's close it.

fgelcer commented 1 year ago

@yarongilor , check last runs of such test if they reproduced the issue, and if not, close it

aleksbykov commented 5 months ago

Looks like issue was reproduced with

Packages

Scylla version: 6.0.0~rc1-20240530.b25dd2696f6a with build-id 944ecf73fdff430d96c5c36992b54afa46a25c2f

Kernel Version: 5.15.0-1062-aws

Issue description

After serial of nemesis: node cleanup and then major compaction, ycsb command failed with same error:

2024-06-04 13:55:41.673: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=f7194fa6-8fdd-48eb-866a-1dad7a293b01: type=error node=Node alternator-ttl-4-loaders-no-lwt-sis-loader-node-8135673d-1 [3.252.193.136 | 10.4.0.208]
stress_cmd=bin/ycsb load dynamodb -P workloads/workloadc -threads 13 -p recordcount=8589934401 -p fieldcount=2 -p fieldlength=16 -p insertstart=4294967200 -p insertcount=2147483600  -p table=usertable_no_lwt -p dynamodb.ttlKey=ttl -p dynamodb.ttlDuration=43200 -s  -P /tmp/dynamodb.properties -p maxexecutiontime=180600
errors:
96111926 [Thread-5] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::unavailable_exception (Cannot achieve consistency level for cl LOCAL_QUORUM. Requires 2, alive 1) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

Monitoring:

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0c402c606e66f6f77 (aws: undefined_region)

Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test Test id: 8135673d-9fc5-40c5-aa1e-615fc9996d3b Test name: scylla-6.0/alternator/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test Test config file(s):

Logs and commands - Restore Monitor Stack command: `$ hydra investigate show-monitor 8135673d-9fc5-40c5-aa1e-615fc9996d3b` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=8135673d-9fc5-40c5-aa1e-615fc9996d3b) - Show all stored logs command: `$ hydra investigate show-logs 8135673d-9fc5-40c5-aa1e-615fc9996d3b` ## Logs: - **db-cluster-8135673d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/db-cluster-8135673d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/db-cluster-8135673d.tar.gz) - **output-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/output-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/output-8135673d.log.tar.gz) - **debug-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/debug-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/debug-8135673d.log.tar.gz) - **events-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/events-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/events-8135673d.log.tar.gz) - **normal-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/normal-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/normal-8135673d.log.tar.gz) - **argus-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/argus-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/argus-8135673d.log.tar.gz) - **raw_events-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/raw_events-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/raw_events-8135673d.log.tar.gz) - **critical-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/critical-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/critical-8135673d.log.tar.gz) - **warning-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/warning-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/warning-8135673d.log.tar.gz) - **summary-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/summary-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/summary-8135673d.log.tar.gz) - **error-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/error-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/error-8135673d.log.tar.gz) - **sct-8135673d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/sct-8135673d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/sct-8135673d.log.tar.gz) - **loader-set-8135673d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/loader-set-8135673d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/loader-set-8135673d.tar.gz) - **monitor-set-8135673d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/monitor-set-8135673d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/8135673d-9fc5-40c5-aa1e-615fc9996d3b/20240604_134535/monitor-set-8135673d.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-6.0/job/alternator/job/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test/6/) [Argus](https://argus.scylladb.com/test/84828afc-9858-4086-a8ea-a8b9981bc93c/runs?additionalRuns[]=8135673d-9fc5-40c5-aa1e-615fc9996d3b)
yarongilor commented 2 months ago

reproduced in 2024.2. perhaps it requires also a deeper look into grafana.

YcsbStressEvent
ERROR
disrupt_run_unique_sequence
2024-09-10 08:26:09.490
not-set
Node alternator-48h-2024-2-loader-node-e91c7bc4-3 [54.247.17.97 | 10.4.3.37]
2024-09-10 08:26:09.490: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=ac64d07b-39fa-46a6-93b0-fb30fc85ae70: type=error node=Node alternator-48h-2024-2-loader-node-e91c7bc4-3 [54.247.17.97 | 10.4.3.37]
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
59978049 [Thread-4] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::read_timeout_exception (Operation timed out for alternator_usertable.usertable - received only 1 responses from 2 CL=LOCAL_QUORUM.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null; Proxy: null)

Packages

Scylla version: 2024.2.0~rc2-20240904.4c26004e5311 with build-id a8549197de3c826053f88ddfd045b365b9cd8692

Kernel Version: 5.15.0-1068-aws

Issue description

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0555cb82c50d0d5f1 (aws: undefined_region)

Test: longevity-alternator-200gb-48h-test Test id: e91c7bc4-6235-4649-9874-627c46f1c8f8 Test name: enterprise-2024.2/alternator/longevity-alternator-200gb-48h-test Test method: longevity_test.LongevityTest.test_custom_time Test config file(s):

Logs and commands - Restore Monitor Stack command: `$ hydra investigate show-monitor e91c7bc4-6235-4649-9874-627c46f1c8f8` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=e91c7bc4-6235-4649-9874-627c46f1c8f8) - Show all stored logs command: `$ hydra investigate show-logs e91c7bc4-6235-4649-9874-627c46f1c8f8` ## Logs: - **db-cluster-e91c7bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/db-cluster-e91c7bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/db-cluster-e91c7bc4.tar.gz) - **sct-runner-events-e91c7bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/sct-runner-events-e91c7bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/sct-runner-events-e91c7bc4.tar.gz) - **sct-e91c7bc4.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/sct-e91c7bc4.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/sct-e91c7bc4.log.tar.gz) - **loader-set-e91c7bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/loader-set-e91c7bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/loader-set-e91c7bc4.tar.gz) - **monitor-set-e91c7bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/monitor-set-e91c7bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e91c7bc4-6235-4649-9874-627c46f1c8f8/20240911_090203/monitor-set-e91c7bc4.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.2/job/alternator/job/longevity-alternator-200gb-48h-test/3/) [Argus](https://argus.scylladb.com/test/0526c1eb-7dd8-42d3-a759-a899ffedca5e/runs?additionalRuns[]=e91c7bc4-6235-4649-9874-627c46f1c8f8)