Open yarongilor opened 3 years ago
Are we doing LWT operations in the upgrade test?
No LWT ops during the upgrade test.
It's a rolling upgrade with alternator, so it has LWT.
@gleb-cloudius / @kostja
During a rollback operation - we have the following instructions
"Restore all tables of system and system_schema from previous snapshot, 4.3 uses a different set of system tables." https://docs.scylladb.com/upgrade/upgrade-opensource/upgrade-guide-from-4.2-to-4.3/upgrade-guide-from-4.2-to-4.3-debian-10/#restore-system-tables
Does that actually work for system.paxos? I think we may have a mistake in the rollback procedure, in which we remove info from system.paxos (and this should never be done). Can you please comment on this?
Can this rollback cause a heavy load on the rolled-back node as it tries to catch up in any way?
There are no changes in system.paxos metadata between 4.2 and 4.3. Thus rolling back the contents of system tables will not change table metadata, only change the table data. system.paxos contains incomplete paxos rounds. E.g. it may contain the state of the Alternator operations which timed out. Rolling them back will remove that state, and this (especially on a single node) would mean these operations didn't start. So to sum up, I think it's fine to remove the contents of system.paxos, as long as the desired effect - remove the partial state of incomplete Alternator operations during upgrade - is intended.
@slivne @yarongilor I hope this helps, please let me know if not.
Issue reproduced during next job: https://jenkins.scylladb.com/view/nexts/job/scylla-4.4/job/rolling-upgrade/job/rolling-upgrade-alternator-test/42/
After upgrading the first node, db-node1, from Scylla version 4.3.2-0.20210301.5cdc1fa66 to Scylla version 4.4.2-0.20210520.93457807b with build-id 74e3e6d9dc6bf73c6ed405321e811dbe8717220b and starting Scylla 4.4.2, there are a lot of warnings and errors:
2021-05-24T17:44:55+00:00 rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.145#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-24T17:44:55+00:00 rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.145#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-24T17:44:55+00:00 rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.145#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-05-24T17:44:55+00:00 rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.142.0.149#4: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
These messages continue to appear throughout the whole test. All nodes were upgraded successfully.
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/33fd4096-fff6-43d1-8ea6-d81cfacb80b2/20210524_184017/db-cluster-33fd4096.zip
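The warnings quoted above can be tallied per sending node to see whether the timeouts are concentrated on one peer. This is a minimal sketch, assuming the syslog line format shown in this comment; the regular expression and the `count_write_timeouts` helper are illustrative and not part of SCT or Scylla.

```python
import re
from collections import Counter

# Assumption: the pattern mirrors the storage_proxy warning lines quoted in this issue.
TIMEOUT_RE = re.compile(
    r"storage_proxy - Failed to apply mutation from "
    r"(?P<peer>\d+\.\d+\.\d+\.\d+)#(?P<shard>\d+): "
    r"exceptions::mutation_write_timeout_exception "
    r"\(Operation timed out for (?P<table>[\w.]+) "
)


def count_write_timeouts(lines):
    """Count mutation write timeouts per (sending node, table)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("peer"), match.group("table"))] += 1
    return counts


# Sample input built from the log lines quoted above, plus one unrelated line.
PREFIX = (
    "2021-05-24T17:44:55+00:00 rolling-upgrade-4-4-centos-db-node-33fd4096-0-1 "
    "!WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from "
)
SUFFIX = (
    ": exceptions::mutation_write_timeout_exception (Operation timed out for "
    "system.paxos - received only 0 responses from 1 CL=ONE.)"
)
SAMPLE_LINES = [
    PREFIX + "10.142.0.145#4" + SUFFIX,
    PREFIX + "10.142.0.145#4" + SUFFIX,
    PREFIX + "10.142.0.149#4" + SUFFIX,
    "some unrelated log line",
]

if __name__ == "__main__":
    for (peer, table), n in count_write_timeouts(SAMPLE_LINES).most_common():
        print(f"{peer} {table} {n}")
```

Feeding the function the lines of a node's system log from the db-cluster archive would give the same breakdown for the full run.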
Installation details
Scylla version (or git commit hash): base version: 4.3.0-0.20210110.000585522 target version: 4.5.dev-0.20210120.4d581f1bb
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): http://downloads.scylladb.com/unstable/scylla/master/rpm/centos/2021-01-20T16:32:50Z/scylla
Test: rolling-upgrade-alternator-test
Test name: rolling-upgrade-alternator-test
Issue description
====================================
During the rolling-upgrade test with Alternator, node-1 (10.142.0.78) was upgraded and then started rollback. Then node-3 got the following errors:
====================================
errors log:
Then, shortly after that, the ycsb stress also failed with:
and:
Restore Monitor Stack command:
$ hydra investigate show-monitor 217536dc-085c-4513-824d-177fa44a1fa8
Show all stored logs command:
$ hydra investigate show-logs 217536dc-085c-4513-824d-177fa44a1fa8
Test id:
217536dc-085c-4513-824d-177fa44a1fa8
Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/db-cluster-217536dc.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/loader-set-217536dc.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/monitor-set-217536dc.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/217536dc-085c-4513-824d-177fa44a1fa8/20210121_122213/sct-runner-217536dc.zip
Jenkins job URL