mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.) after upgraded node rollback

yarongilor commented 3 years ago

Installation details Scylla version (or git commit hash): base version: 4.3.0-0.20210110.000585522 target version: 4.5.dev-0.20210120.4d581f1bb Cluster size: 3 OS (RHEL/CentOS/Ubuntu/AWS AMI): http://downloads.scylladb.com/unstable/scylla/master/rpm/centos/2021-01-20T16:32:50Z/scylla

Test: rolling-upgrade-alternator-test Test name: rolling-upgrade-alternator-test

Issue description

====================================

during rolling-upgrade test with alternator, node-1 (10.142.0.78) was upgraded, then started rollback. then node-3 got errors of:

< t:2021-01-21 11:20:13,531 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:19:50+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 5] storage_proxy - exception during mutation write to 10.142.0.78: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)

< t:2021-01-21 11:18:38,575 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:18:24+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 4] storage_proxy - Exception when communicating with 10.142.0.78, to read from system.paxos: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)

====================================

errors log:

< t:2021-01-21 11:20:04,624 f:cluster.py      l:1391 c:sdcm.cluster         p:DEBUG > 2021-01-21T11:19:50+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 5] storage_proxy - exception during mutation write to 10.142.0.78: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
< t:2021-01-21 11:20:13,469 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:19:49+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 5] storage_proxy - exception during mutation write to 10.142.0.78: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
< t:2021-01-21 11:20:13,474 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:19:49+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 4] storage_proxy - exception during mutation write to 10.142.0.78: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
< t:2021-01-21 11:20:13,526 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:19:50+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 5] storage_proxy - exception during mutation write to 10.142.0.78: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
< t:2021-01-21 11:20:13,531 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:19:50+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 5] storage_proxy - exception during mutation write to 10.142.0.78: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)

< t:2021-01-21 11:18:38,575 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:18:24+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 4] storage_proxy - Exception when communicating with 10.142.0.78, to read from system.paxos: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)

< t:2021-01-21 11:18:38,575 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-21T11:18:24+00:00  rolling-upgrade-master-centos-db-node-217536dc-0-3 !ERR     | scylla: [shard 4] storage_proxy - Exception when communicating with 10.142.0.78, to read from system.paxos: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)

then shortly after that ycsb stress also failed with:

493764 [Thread-68] INFO  com.amazonaws.http.AmazonHttpClient  -Unable to execute HTTP request: alternator:8080 failed to respond
org.apache.http.NoHttpResponseException: alternator:8080 failed to respond

and:

2021-01-21 11:19:51.672: (YcsbStressEvent Severity.ERROR): type=error node=Node rolling-upgrade-master-centos-loader-node-217536dc-0-1 [34.74.132.199 | 10.142.0.90] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloada -threads 100 -p recordcount=1005010 -p readproportion=0 -p updateproportion=1 -p scanproportion=0 -p insertproportion=0 -p requestdistribution=uniform -p fieldcount=10 -p fieldlength=512 -p operationcount=1005010 -p table=usertable -s  -P /tmp/dynamodb.properties -p maxexecutiontime=22200
errors:

1765609 [Thread-32] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.services.dynamodbv2.model.InternalServerErrorException: Internal server error: exceptions::mutation_write_failure_exception (Operation failed for alternator_usertable.usertable - received 1 responses and 1 failures from 2 CL=LOCAL_SERIAL.) (Service: AmazonDynamoDBv2; Status Code: 500; Error Code: InternalServerError; Request ID: null)

Restore Monitor Stack command: $ hydra investigate show-monitor 217536dc-085c-4513-824d-177fa44a1fa8 Show all stored logs command: $ hydra investigate show-logs 217536dc-085c-4513-824d-177fa44a1fa8

Test id: 217536dc-085c-4513-824d-177fa44a1fa8

Logs: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/db-cluster-217536dc.zip loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/loader-set-217536dc.zip monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/monitor-set-217536dc.zip sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/sct-runner-217536dc.zip

Jenkins job URL

slivne commented 3 years ago

Are we dojng lwt operations in the upgrade test ?

yarongilor commented 3 years ago

Are we dojng lwt operations in the upgrade test ?

no LWT ops during upgrade test.

roydahan commented 3 years ago

It's a rolling upgrade with alternator, so it has LWT.

slivne commented 3 years ago

@gleb-cloudius / @kostja

During a rollback operation - we have the following instructions

" Restore all tables of system and system_schema from previous snapshot, 4.3 uses a different set of system tables. " https://docs.scylladb.com/upgrade/upgrade-opensource/upgrade-guide-from-4.2-to-4.3/upgrade-guide-from-4.2-to-4.3-debian-10/#restore-system-tables"

Does that actually work for system.paxos - I think we may have a mistake in the rollback procedure in which we have removed info from system.paxos (and this should never be done).Can you please comment on this.
Can this rollback cause a heavy load on the rolled back node trying to catch up in any way ?

kostja commented 3 years ago

There are no changes in system.paxos metadata between 4.2 and 4.3. Thus rolling back the contents of system tables will not change table metadata, only change the table data. system.paxos contains incomplete paxos rounds. E.g. it may contain the state of the Alternator operations which timed out. Rolling them back will remove that state, and this (especially on a single node) would mean these operations didn't start. So to sum up, I think it's fine to remove the contents of system.paxos, as long as the desired effect - remove the partial state of incomplete Alternator operations during upgrade - is intended.

kostja commented 3 years ago

@slivne @yarongilor I hope this helps, please let me know if not.

aleksbykov commented 3 years ago

Issue reproduced during next job: https://jenkins.scylladb.com/view/nexts/job/scylla-4.4/job/rolling-upgrade/job/rolling-upgrade-alternator-test/42/

After upgrade first node db-node1 from Scylla version 4.3.2-0.20210301.5cdc1fa66 to Scylla version 4.4.2-0.20210520.93457807b with build-id 74e3e6d9dc6bf73c6ed405321e811dbe8717220b and starting scylla 4.4.2, there are a lot of warnings and errors:

2021-05-24T17:44:55+00:00  rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.145#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-24T17:44:55+00:00  rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.145#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-24T17:44:55+00:00  rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.145#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-24T17:44:55+00:00  rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.149#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)

This messages continue to appear during whole tests. All nodes were upgraded successfully db logs: https://cloudius-jenkins-test.s3.amazonaws.com/33fd4096-fff6-43d1-8ea6-d81cfacb80b2/20210524_184017/db-cluster-33fd4096.zip

scylladb / scylladb

mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.) after upgraded node rollback #7948