scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Scheduler error checking is broken when task returns context errors #3884

Open Michal-Leszczynski opened 2 weeks ago

Michal-Leszczynski commented 2 weeks ago

Discovered in https://github.com/scylladb/scylla-enterprise/issues/4285: the scheduler decides whether a task ended with an error, was paused, or ran out of its maintenance window by matching the returned error:

func statusFromError(err error) Status {
    switch {
    case err == nil:
        return StatusDone
    case errors.Is(err, context.Canceled):
        return StatusStopped
    case errors.Is(err, context.DeadlineExceeded):
        return StatusWaiting
    default:
        return StatusError
    }
}

This means that if a task ends with the following error:

"get repair target: create repair plan: calculate max host intensity: 172.19.96.244: get total memory: context deadline exceeded"

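SM would mistake it for the task running out of its maintenance window, because errors.Is matches the context.DeadlineExceeded wrapped at the end of the error chain, no matter which context produced it. This results not only in an incorrect task status (StatusWaiting instead of StatusError), but also in incorrect rescheduling of the task.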
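A minimal, self-contained sketch of the misclassification. It is not SM code: the Status string values and the helpers getTotalMemory, runTask, and demoStatus are hypothetical stand-ins, and only statusFromError mirrors the function quoted above. The task runs under a context with no deadline at all, yet an internal per-call timeout is enough to produce StatusWaiting.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Status values are placeholders; the real SM constants may differ.
type Status string

const (
	StatusDone    Status = "DONE"
	StatusStopped Status = "STOPPED"
	StatusWaiting Status = "WAITING"
	StatusError   Status = "ERROR"
)

// statusFromError mirrors the function quoted in the issue.
func statusFromError(err error) Status {
	switch {
	case err == nil:
		return StatusDone
	case errors.Is(err, context.Canceled):
		return StatusStopped
	case errors.Is(err, context.DeadlineExceeded):
		return StatusWaiting
	default:
		return StatusError
	}
}

// getTotalMemory is a hypothetical call that applies its own short timeout,
// independent of any maintenance window, and then times out.
func getTotalMemory(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, time.Millisecond)
	defer cancel()
	<-ctx.Done() // simulate an unresponsive host
	return fmt.Errorf("get total memory: %w", ctx.Err())
}

// runTask is a hypothetical task body that wraps the error on its way up.
func runTask(ctx context.Context) error {
	if err := getTotalMemory(ctx); err != nil {
		return fmt.Errorf("get repair target: %w", err)
	}
	return nil
}

// demoStatus runs the task under a context that has no deadline,
// i.e. the maintenance window is still open.
func demoStatus() Status {
	return statusFromError(runTask(context.Background()))
}

func main() {
	fmt.Println(demoStatus()) // prints "WAITING" although the task simply failed
}
```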

To fix that, SM shouldn't rely on generic context errors (set via context.WithDeadline); it should check only for SM-specific errors (set via context.WithDeadlineCause).
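One possible shape of that fix, sketched under assumptions rather than taken from SM: errOutOfWindow is a hypothetical sentinel, the Status values are placeholders, and statusFromError is given an extra ctx parameter so it can consult context.Cause. Note that with context.WithDeadlineCause (Go 1.21+), ctx.Err() still returns context.DeadlineExceeded; the custom cause is only visible through context.Cause(ctx), so the check has to look there.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errOutOfWindow is a hypothetical SM-specific sentinel error.
var errOutOfWindow = errors.New("out of maintenance window")

// Status values are placeholders; the real SM constants may differ.
type Status string

const (
	StatusDone    Status = "DONE"
	StatusStopped Status = "STOPPED"
	StatusWaiting Status = "WAITING"
	StatusError   Status = "ERROR"
)

// statusFromError reports StatusWaiting only when the scheduler's own
// context expired with the SM-specific cause.
func statusFromError(ctx context.Context, err error) Status {
	cause := context.Cause(ctx) // nil while ctx is still live
	switch {
	case err == nil:
		return StatusDone
	case errors.Is(err, context.Canceled):
		return StatusStopped
	case errors.Is(err, context.DeadlineExceeded) && errors.Is(cause, errOutOfWindow):
		return StatusWaiting
	default:
		return StatusError
	}
}

// internalTimeout: the window is open for another hour, but an inner
// call with its own timeout fails.
func internalTimeout() Status {
	ctx, cancel := context.WithDeadlineCause(
		context.Background(), time.Now().Add(time.Hour), errOutOfWindow)
	defer cancel()
	inner, cancelInner := context.WithTimeout(ctx, time.Millisecond)
	defer cancelInner()
	<-inner.Done()
	err := fmt.Errorf("get total memory: %w", inner.Err())
	return statusFromError(ctx, err)
}

// windowClosed: the scheduler's deadline itself expires.
func windowClosed() Status {
	ctx, cancel := context.WithDeadlineCause(
		context.Background(), time.Now().Add(time.Millisecond), errOutOfWindow)
	defer cancel()
	<-ctx.Done()
	err := fmt.Errorf("repair: %w", ctx.Err())
	return statusFromError(ctx, err)
}

func main() {
	fmt.Println(internalTimeout()) // prints "ERROR"
	fmt.Println(windowClosed())    // prints "WAITING"
}
```

The same idea could presumably be applied to the context.Canceled branch with context.WithCancelCause, so that only an SM-initiated stop maps to StatusStopped; that part is left unchanged here.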

cc: @karol-kokoszka