Thank you for this information @romeokienzler, I think it's very exciting.
We definitely should investigate it and try to adopt it in Katib. What do you think @gaocegege @johnugeorge ?
My colleague @ezioliao is working on our internal AutoML system and he is interested in this issue.
With pleasure. PBT is implemented in our own AutoML system and I'm willing to solve it.
That would be great! Thank you @ezioliao, let us know if you need any help.
@ezioliao Do you have the time and resources to work on this in 2021? We can include PBT support in the 2021 Roadmap.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
/help
@andreyvelich: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
/cc @richardsliu
@andreyvelich Hello, I am very interested in this proposal. May I ask when it will be launched and how I can participate?
@ezioliao
@hongjunGu2019 Please see #1833
Thanks
One enhancement I would propose is the addition of a parameter that a user can set on the training task to filter out invalid permutations of parameter values from each generation's trials, given rules like minimum_skip_connection_depth must be < maximum_skip_connection_depth, so that planned trials that would throw ValueErrors are filtered out before they are executed. This would make more elaborate algorithms practical to train with the training op.
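To illustrate, here is a rough sketch of the kind of rule I have in mind; the function and the way parameters are handled below are purely illustrative, not an existing Katib API:

# Hypothetical pre-screening rule (illustrative only, not an existing Katib feature).
def is_valid_assignment(params):
    # Rule from above: the minimum must stay below the maximum.
    return params["minimum_skip_connection_depth"] < params["maximum_skip_connection_depth"]

suggested = [
    {"minimum_skip_connection_depth": 3, "maximum_skip_connection_depth": 7},
    {"minimum_skip_connection_depth": 8, "maximum_skip_connection_depth": 2},  # would throw a ValueError downstream
]

# Only assignments that pass the rule would ever be scheduled as trials.
planned_trials = [p for p in suggested if is_valid_assignment(p)]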
@david-thrower We have an interface to add validation checks per suggestion algorithm. Ref: https://github.com/kubeflow/katib/pull/1924
@johnugeorge, I really appreciate you pointing that out, especially at this time of night in your time zone. If I am ever in town, I owe you a coffee drink. Sorry I missed that recent change. I'm glad to see this. I may get more sleep this week than I thought. I will look into it. To clarify, this removes suggestions rather than throwing exceptions, right?
ValidateAlgorithmSettings is a common interface which can be implemented for any suggestion algorithm. If ValidateAlgorithmSettings fails, the Katib controller changes the Suggestion and Experiment status to failed.
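Roughly, a suggestion service implements it like the sketch below. This is a sketch from memory, assuming the gRPC handler shape used by Katib's Python suggestion services; exact import paths and message/field names may differ between versions:

import grpc
# Assumed import path, as used inside the Katib repo; adjust for your version.
from pkg.apis.manager.v1beta1.python import api_pb2

class MySuggestionService:
    def ValidateAlgorithmSettings(self, request, context):
        # Inspect the algorithm settings sent with the Experiment and reject invalid ones.
        # If this check fails, the controller marks the Suggestion and Experiment as failed.
        for setting in request.experiment.spec.algorithm.algorithm_settings:
            if setting.name == "population_size" and int(setting.value) < 2:
                context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
                context.set_details("population_size must be >= 2")
                return api_pb2.ValidateAlgorithmSettingsReply()
        return api_pb2.ValidateAlgorithmSettingsReply()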
@johnugeorge I see. This is where the problem I am faced with lies. When I have a massive parameter space where as many as 1/4 of the permutations of values are invalid, and I may be running 1000+ trials on a distributed cluster with IPUs, TPUs, or A100s, we will get an error status from 250 of the 1000 trials, which requires me to set a very large maxFailedTrialCount to keep the run from failing out. If I do that, then I have to ignore any other unforeseen errors that may arise (errors that should prompt me to abort the run and de-provision the cluster until I have the issue debugged...).

What I had in mind was more of a pre-screening: either precluding these foreseen invalid suggestions from being included in the oracles / metadata to begin with, or assigning them a separate status that isn't seen by Katib / Training as an error, so that the trial does not count towards maxFailedTrialCount, as it is a foreseen invalid trial. These invalid trials are inherently going to clutter the suggestions created by any algorithm that has a permutation / mutation / random selection step. I hope my clarification makes sense.
I see. But what is the real reason for this issue? Is it that the user configures an invalid parameter range for an algorithm?
Example:
minimum_skip_connections: [1:10] maximum_skip_connections: [1:10]
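To put a number on how much of that grid is invalid, a quick enumeration:

# Count valid vs. total combinations for the example ranges above.
pairs = [(lo, hi) for lo in range(1, 11) for hi in range(1, 11)]
valid = [(lo, hi) for lo, hi in pairs if lo < hi]
print(len(valid), "of", len(pairs))  # -> 45 of 100 combinations are valid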
There are both valid and invalid permutations of these in this range (any permutation where minimum... < maximum... is valid; any other is invalid). Any single-range restriction I could set to preclude invalid permutations (e.g. setting minimum to [1:5] and maximum to [6:10]) would eliminate many if not most valid permutations (e.g. minimum = 7, maximum = 9 in this case). The NAS I am developing has many min/max pairs in its params, and each trial intrinsically needs them because of an ensemble-like setup that needs both the minimum and the maximum to deduce the optimal solution(s) within each of many ranges to be separately studied. There may be an exponential number of model architecture parameter + traditional hyperparameter permutations to sample from within the same narrow param range, so there's no practical way to make the trainer take a single straight number in lieu of the range... The only other workaround is to (within the train function) make the valid permutations of the pairs a list and do this:
import numpy as np

# hp is the tuner's hyperparameter object (hp.Choice as in KerasTuner).
# options[i][0] is the minimum, options[i][1] is the maximum.
options = np.random.randint(1, 11, size=(100, 2))            # random candidate (min, max) pairs in [1, 10]
options = options[np.less(options[:, 0], options[:, 1]), :]  # keep only pairs where min < max
# options now looks like:
# [[1, 2],
#  [1, 3], ...,
#  [9, 10]]
i = hp.Choice("min_and_max_skip_connections", list(range(len(options))))  # tune the index of the pair
min_skip_conn = options[i][0]
max_skip_conn = options[i][1]
# ... on to the code that parses the model from these params and numerous others ...
The problem with the approach above (using the engineered fused parameter "min_and_max_skip_connections") is that a [Bayesian | Hyperband | Genetic] algo can't extract as strong a mathematical meaning about which values are likely optimal for the 2 individual parameters, or predict where to best sample from moving forward, as it could if it were sampling the 2 individual parameters separately (with the invalid options dropped and not influencing it). A human can't easily either. If I look at a rectangular-coordinates plot of the engineered single fused parameter, I can't see a meaningful pattern of which values for these 2 individual parameters are optimal, or deduce a range, without looking back and forth between a printout of the list and whichever index number is saturated on the plot.
One unrealistic workaround would be to just hard-code the trainer function to return infinity for the loss on invalid trials, but this would create a huge problem for all strategies other than grid and random search: if the optimal solution is (max_skip_connections = 9, min_skip_connections = 7), and in its first iteration / generation the algorithm tried a nearby invalid combination such as (max_skip_connections = 7, min_skip_connections = 9), the infinite loss would lead any "smart" algo like population based, hyperband, etc. to drop the nearby space, including the true optimum, unless it ran an inordinately large number of random trials in the first iteration in which several nearby successful trials weighed against the failed one... Which defeats the purpose of this step of the study, as the purpose is to quickly find promising ranges to be investigated in more detail / eliminate dead ones, without great computational expense...
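Concretely, that penalty workaround would amount to something like the following; train_and_evaluate is a stand-in for the real training routine:

def train_and_evaluate(params):
    # Stand-in for the real training routine; returns a validation loss.
    return 0.0

def objective(params):
    # Penalize invalid combinations with an "infinite" loss instead of raising,
    # so the trial technically succeeds but poisons the surrounding search space.
    if params["min_skip_connections"] >= params["max_skip_connections"]:
        return float("inf")
    return train_and_evaluate(params)

print(objective({"min_skip_connections": 9, "max_skip_connections": 7}))  # inf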
I hope the need for this is making sense now. It is a very complex issue, taking a verbose explanation to articulate, but nonetheless a real problem that is encumbering a lot of experiments, and I have seen similar issues raised on the repos of various tuners... It is motivating me to write my own tuner, which I hope not to need to do, but I have accepted that this may be the de facto option, unless I just fall back on setting the maxFailedTrialCount to a large number...
Yes, it makes sense. The only option that I can think of is to add these checks within the algorithm and skip the suggestion if parameters are invalid. But as you said, it should not be considered as failed. So we will need a new status to indicate that the trial was skipped.
We will take this up in the next WG meeting.
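For discussion, a rough sketch of what such a check inside a suggestion algorithm could look like; nothing below is existing Katib API, and "skipped" is the hypothetical new status:

import random

def next_suggestion(sampler, is_valid, max_attempts=100):
    # Validate each candidate inside the algorithm and resample, so invalid
    # assignments are never emitted as trials.
    for _ in range(max_attempts):
        candidate = sampler()
        if is_valid(candidate):
            return candidate, "created"
    # Instead of failing the Experiment, report the attempt with the proposed
    # "skipped" status so it does not count toward maxFailedTrialCount.
    return None, "skipped"

suggestion, status = next_suggestion(
    sampler=lambda: {"min_skip": random.randint(1, 10), "max_skip": random.randint(1, 10)},
    is_valid=lambda p: p["min_skip"] < p["max_skip"],
)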
@johnugeorge, I appreciate the attention to the issue. This is, no doubt, a weird issue, but nonetheless I can foresee it becoming a more common requirement as more complex algorithms become "production material".
The initial version of PBT was implemented by @a9p! You can find the docs here: https://www.kubeflow.org/docs/components/katib/experiment/#population-based-training-pbt. Thank you @a9p for the great contribution!
/kind feature
Support for PBT (Population Based Training)
Ray Tune recently came out with support for PBT, and DeepMind showed exceptional performance with it. Are we considering supporting PBT in Katib as well?
https://arxiv.org/abs/1711.09846 https://deepmind.com/blog/article/population-based-training-neural-networks
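For context, PBT trains a population of workers in parallel and periodically replaces poorly performing workers with copies of better ones, then perturbs their hyperparameters. Below is a minimal, framework-agnostic sketch of that exploit/explore step; all field names are illustrative, not any library's API:

import copy
import random

def pbt_step(population, perturb=0.2):
    # population: list of dicts with "score", "weights", and "hparams".
    population.sort(key=lambda w: w["score"], reverse=True)
    cutoff = max(1, len(population) // 4)
    top, bottom = population[:cutoff], population[-cutoff:]
    for worker in bottom:
        source = random.choice(top)
        # Exploit: copy weights and hyperparameters from a stronger worker.
        worker["weights"] = copy.deepcopy(source["weights"])
        worker["hparams"] = dict(source["hparams"])
        # Explore: perturb each continuous hyperparameter up or down.
        for name, value in worker["hparams"].items():
            worker["hparams"][name] = value * random.choice([1 - perturb, 1 + perturb])
    return population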