scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Misleading error message on failing manager test `test_restore_backup_with_task` #6647

Open karol-kokoszka opened 1 year ago

karol-kokoszka commented 1 year ago

Description

The error output is misleading when test_restore_backup_with_task fails. This test performs a restore and waits until the restore is done, with the timeout set to 1000 seconds. If the timeout is exceeded, the whole suite should be aborted and no further tests should be scheduled.

Right now, even though the assertion that checks whether the restore task completed timed out, another test case was started. That test case failed because it couldn't schedule a repair: there was already an ongoing repair on the node. The ongoing repair had been triggered by the failed test_restore_backup_with_task.

Steps to Reproduce

Run https://jenkins.scylladb.com/view/scylla-manager/job/manager-3.2/job/centos-sanity-test.

More details: https://jenkins.scylladb.com/view/scylla-manager/job/manager-3.2/job/centos-sanity-test/12/consoleFull#2095715862fcc21424-66d2-4bd8-8e0d-9746405e5b16

Expected behavior: Whenever test_restore_backup_with_task fails, the run should hard stop and report the error in this test case.

Actual behavior: test_restore_backup_with_task failed on the task status assertion, but another test case was started (test_repair_multiple_keyspace_types). test_repair_multiple_keyspace_types then failed, as expected, because it tried to run a repair while the repair job started by test_restore_backup_with_task was still ongoing.

fgelcer commented 1 year ago

I think it is intentional that the test doesn't abort the whole suite, but perhaps we need to work out how to stop the task that has timed out.
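
One possible shape for that, as a rough sketch only: poll the task until a deadline and, if the deadline passes, stop the task before reporting the failure, so the next test case doesn't find a job still running. `FakeTask`, its `refresh`/`stop` methods, the status strings and `wait_for_done_or_stop` are all hypothetical names for illustration, not the SCT or Scylla Manager API.

```python
import time


class FakeTask:
    """Hypothetical stand-in for a manager task handle (not the SCT API)."""

    def __init__(self, polls_until_done):
        self._polls_left = polls_until_done
        self.status = "RUNNING"

    def refresh(self):
        # Pretend to query the task; it finishes after a fixed number of polls.
        if self.status != "RUNNING":
            return
        self._polls_left -= 1
        if self._polls_left <= 0:
            self.status = "DONE"

    def stop(self):
        # Only a task that is still running can be stopped.
        if self.status == "RUNNING":
            self.status = "STOPPED"


def wait_for_done_or_stop(task, timeout, poll_interval=0.01):
    """Return True if the task finished in time; otherwise stop it and return False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task.refresh()
        if task.status == "DONE":
            return True
        time.sleep(poll_interval)
    # Timed out: stop the task so it cannot interfere with later test cases.
    task.stop()
    return False
```

The test could then assert on the return value, e.g. `assert wait_for_done_or_stop(task, timeout=1000)`, and the failure would still be reported in test_restore_backup_with_task while the node is left without a running job.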