Support Asynchronous Successive Halving Algorithm (ASHA) in Katib

andreyvelich commented 2 years ago

/kind feature

In the AutoML and Training summit on 2021-07-16 we had a couple of requests to support ASHA in Katib. This item is also in our 2021 ROADMAP. Let's use this issue to discuss how we can implement it.

/cc @gaocegege @johnugeorge @jbottum

andreyvelich commented 2 years ago

/help

google-oss-robot commented 2 years ago

@andreyvelich: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubeflow/katib/issues/1582):

> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 2 years ago

/lifecycle frozen

forsaken628 commented 1 month ago

Would github.com/c-bata/goptuna/successivehalving work for this? https://github.com/forsaken628/katib/blob/f85ecab6cc196d90281e8289e43263e707b6212c/pkg/suggestion/v1beta1/goptuna/converter.go#L99-L127

andreyvelich commented 1 month ago

Thank you for pointing this out, @forsaken628! Do we know how this algorithm works in Goptuna? Do we need to change the scheduling policy for Trials, e.g. stop them during execution as we do with EarlyStopping algorithms? cc @c-bata

c-bata commented 1 month ago

The original algorithm described in the ASHA paper requires stopping the trial evaluation. However, this restriction makes the implementation complex and can potentially waste computational resources, so the ASHA algorithm in Goptuna (and Optuna) has been modified accordingly. I would say it's not so difficult to use Goptuna's ASHA algorithm from Katib.

Here is some example code that uses Goptuna's ASHA algorithm.

// See the full example code at https://github.com/c-bata/goptuna/blob/main/_examples/gorgonia_iris/main.go
func main() {
    storage := rdb.NewStorage(db)
    // Successive halving pruner with a reduction factor of 3.
    pruner, _ := successivehalving.NewPruner(
        successivehalving.OptionSetReductionFactor(3))
    study, err := goptuna.CreateStudy(
        "gorgonia-iris",
        goptuna.StudyOptionStorage(storage),
        goptuna.StudyOptionSampler(tpe.NewSampler()),
        goptuna.StudyOptionPruner(pruner),
        goptuna.StudyOptionDirection(goptuna.StudyDirectionMaximize),
    )
    err = study.Optimize(objective, 200)
    // ...
}

func objective(trial goptuna.Trial) (float64, error) {
    // ...
    for i := 1; i <= 10000; i++ {
        if err = solver.Step(model); err != nil {
            return 0, err
        }
        acc = accuracy(predicted.Data().([]float64), Y.Value().Data().([]float64))

        // Report an intermediate value and check whether this trial should be pruned or not.
        if err := trial.ShouldPrune(i, acc); err != nil {
            return 0, err // goptuna.ErrTrialPruned may be returned
        }
    }
    return ...
}

More detailed information about our modified ASHA algorithm is available in the Optuna paper: https://arxiv.org/abs/1907.10902
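To make the pruning rule more concrete, here is a minimal, self-contained sketch of an asynchronous successive halving check. It is only an illustration written for this discussion, not Goptuna's or Optuna's actual implementation: the `rungs` type, its fields, and the `minResource` parameter are hypothetical names, and the rule shown (keep a trial only if it is within the top 1/eta of values seen so far at a rung, and always keep at least one) is a simplified assumption about how such a pruner can behave.

```go
package main

import (
	"fmt"
	"sort"
)

// rungs records, for each rung index, the intermediate values reported by
// every trial that has reached that rung so far (higher is better).
// Hypothetical type for illustration; assumes minResource >= 1 and eta >= 2.
type rungs struct {
	minResource int // number of steps before the first rung
	eta         int // reduction factor
	values      map[int][]float64
}

func newRungs(minResource, eta int) *rungs {
	return &rungs{minResource: minResource, eta: eta, values: map[int][]float64{}}
}

// rungIndex returns k if step == minResource * eta^k, and -1 otherwise.
func (r *rungs) rungIndex(step int) int {
	s := r.minResource
	for k := 0; s <= step; k++ {
		if s == step {
			return k
		}
		s *= r.eta
	}
	return -1
}

// shouldPrune records value at the given step and reports whether the trial
// falls outside the top 1/eta of the values seen at that rung so far.
// The decision is made immediately, without waiting for other trials.
func (r *rungs) shouldPrune(step int, value float64) bool {
	k := r.rungIndex(step)
	if k < 0 {
		return false // not a rung boundary: keep running
	}
	r.values[k] = append(r.values[k], value)

	sorted := append([]float64(nil), r.values[k]...)
	sort.Sort(sort.Reverse(sort.Float64Slice(sorted)))

	keep := len(sorted) / r.eta
	if keep < 1 {
		keep = 1 // always let at least one trial continue
	}
	return value < sorted[keep-1]
}

func main() {
	r := newRungs(1, 3)
	// Four trials report their accuracy at step 1, one after another.
	for _, acc := range []float64{0.50, 0.70, 0.60, 0.90} {
		fmt.Println(acc, r.shouldPrune(1, acc))
	}
}
```

Because each decision uses only the values already recorded at the rung, no trial has to be paused and resumed later, which is the property that makes this style of pruner easier to drive from an external scheduler.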

andreyvelich commented 1 month ago

This is great, thanks for sharing @c-bata! How can we implement the trial.ShouldPrune API if we don't report intermediate Trial values to the Suggestion service?

forsaken628 commented 1 month ago

The EarlyStopping service provides ShouldPrune, and the metrics collector checks whether a trial needs to be terminated by calling EarlyStopping.ShouldPrune. WDYT?

andreyvelich commented 1 month ago

What parameters do you need to provide for the ShouldPrune API? How can you synchronize data between the algorithm service and the early stopping service?

Do we need to implement the EarlyStopping service methods in the ASHA algorithm service?
