scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Terminate failed repair jobs #3806

Open Michal-Leszczynski opened 2 months ago

Michal-Leszczynski commented 2 months ago

Right now we don't terminate failed repair jobs by default. They might have failed only because of a timeout on our side and in fact still be running. This causes two problems:

karol-kokoszka commented 2 months ago

Grooming notes

The goal is to call the Scylla API to kill a repair job whose status check timed out, to ensure that the job is no longer being handled by the Scylla server.
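A minimal Go sketch of that flow, under stated assumptions: `RepairClient`, `WaitRepair`, `KillRepair`, `waitOrKill`, and `stuckClient` are hypothetical names used for illustration and are not the actual Scylla Manager client or Scylla REST API. It only shows the idea of killing the job when the status wait hits its deadline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// RepairClient is a hypothetical abstraction over the Scylla API.
// The method names are assumptions, not the real Scylla Manager client.
type RepairClient interface {
	// WaitRepair blocks until the repair job finishes or ctx is done.
	WaitRepair(ctx context.Context, jobID int) error
	// KillRepair asks the node to terminate the repair job.
	KillRepair(ctx context.Context, jobID int) error
}

// waitOrKill waits up to statusTimeout for the job to finish. If the wait
// hits the deadline, the job may still be running on the node, so it is
// killed explicitly. Other errors are returned unchanged.
func waitOrKill(ctx context.Context, c RepairClient, jobID int, statusTimeout time.Duration) error {
	waitCtx, cancel := context.WithTimeout(ctx, statusTimeout)
	defer cancel()

	err := c.WaitRepair(waitCtx, jobID)
	if err == nil || !errors.Is(err, context.DeadlineExceeded) {
		return err
	}
	// Use the parent ctx here: waitCtx has already expired.
	if killErr := c.KillRepair(ctx, jobID); killErr != nil {
		return fmt.Errorf("kill repair job %d after status timeout: %w", jobID, killErr)
	}
	return fmt.Errorf("repair job %d timed out and was killed: %w", jobID, err)
}

// stuckClient simulates a job that never reports completion.
type stuckClient struct{ killed bool }

func (s *stuckClient) WaitRepair(ctx context.Context, jobID int) error {
	<-ctx.Done() // the job never finishes on its own
	return ctx.Err()
}

func (s *stuckClient) KillRepair(ctx context.Context, jobID int) error {
	s.killed = true
	return nil
}

// runStuckJob runs waitOrKill against a stuck job and reports whether
// the job was killed, together with the returned error.
func runStuckJob(statusTimeout time.Duration) (bool, error) {
	c := &stuckClient{}
	err := waitOrKill(context.Background(), c, 1, statusTimeout)
	return c.killed, err
}

func main() {
	killed, err := runStuckJob(50 * time.Millisecond)
	fmt.Println("killed:", killed, "err:", err)
}
```

Note that the kill call reuses the parent context, so if the parent itself has expired the kill will fail too; a real implementation might prefer a fresh, short-lived context for the cleanup call.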

The timeout for the repair status check is currently set to 30 minutes. @asias Any idea what the best timeout for waiting on the repair job status would be?

We need an integration test covering this scenario:

  1. Time out the repair job.
  2. Assert that the job is terminated and no longer running on the Scylla server.
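The two steps above could be exercised roughly as follows. This is only a sketch against an in-process fake built with `net/http/httptest`, not a real integration test against Scylla: the `fakeNode` type and the `/status` and `/terminate` paths are hypothetical and do not mirror the real Scylla REST API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// fakeNode stands in for a Scylla node. The endpoint paths are hypothetical.
type fakeNode struct {
	running atomic.Bool // true while the simulated repair job is running
}

func (n *fakeNode) handler() http.Handler {
	mux := http.NewServeMux()
	// The status check hangs until the client gives up (bounded at 1s),
	// simulating a job whose status is not reported in time.
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done():
		case <-time.After(time.Second):
		}
	})
	// Terminating marks the simulated job as no longer running.
	mux.HandleFunc("/terminate", func(w http.ResponseWriter, r *http.Request) {
		n.running.Store(false)
	})
	return mux
}

// runScenario returns (timedOut, stillRunning, err).
func runScenario(statusTimeout time.Duration) (bool, bool, error) {
	node := &fakeNode{}
	node.running.Store(true)
	srv := httptest.NewServer(node.handler())
	defer srv.Close()

	// Step 1: time out the status check.
	ctx, cancel := context.WithTimeout(context.Background(), statusTimeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL+"/status", nil)
	if err != nil {
		return false, node.running.Load(), err
	}
	resp, err := srv.Client().Do(req)
	if err == nil {
		resp.Body.Close()
		return false, node.running.Load(), nil
	}
	if !errors.Is(err, context.DeadlineExceeded) {
		return false, node.running.Load(), err
	}

	// The status check timed out, so terminate the job on the node.
	resp, err = srv.Client().Post(srv.URL+"/terminate", "", nil)
	if err != nil {
		return true, node.running.Load(), err
	}
	resp.Body.Close()

	// Step 2: report whether the job is still running on the node.
	return true, node.running.Load(), nil
}

func main() {
	timedOut, stillRunning, err := runScenario(50 * time.Millisecond)
	fmt.Println("timedOut:", timedOut, "stillRunning:", stillRunning, "err:", err)
}
```

A real integration test would replace the fake with a Scylla node and verify through the node's own API that the job is gone.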

This may require making the timeout value configurable via YAML or other configuration.
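One possible shape for such a setting, sketched in Go. The `RepairConfig` type, its field, and the `status_timeout` YAML key are assumptions made for illustration and are not taken from the Scylla Manager configuration. The default mirrors the 30-minute value mentioned above, and decoding from a YAML file is left out.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RepairConfig is a hypothetical config section. The field and YAML key
// names are assumptions, not existing Scylla Manager options.
type RepairConfig struct {
	// StatusTimeout bounds how long to wait for a repair job status
	// before the job is treated as timed out.
	StatusTimeout time.Duration `yaml:"status_timeout"`
}

// DefaultRepairConfig returns the current 30-minute timeout as the default.
func DefaultRepairConfig() RepairConfig {
	return RepairConfig{StatusTimeout: 30 * time.Minute}
}

// Validate rejects non-positive timeouts, which would make every
// status check fail immediately.
func (c RepairConfig) Validate() error {
	if c.StatusTimeout <= 0 {
		return errors.New("status_timeout must be positive")
	}
	return nil
}

func main() {
	cfg := DefaultRepairConfig()
	fmt.Println("status timeout:", cfg.StatusTimeout, "valid:", cfg.Validate() == nil)
}
```

Making the value configurable would also let the integration test use a very short timeout instead of waiting for the production default.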

The issue describes just a corner case.