openml-labs / gama

An automated machine learning tool aimed to facilitate AutoML research.
https://openml-labs.github.io/gama/master/
Apache License 2.0
92 stars 29 forks source link

[Feature Proposal] Bayesian Optimisation Implementation #202

Closed simonprovost closed 12 months ago

simonprovost commented 1 year ago

Hi @PGijsbers! Hope you are doing okay!

In reference to #201, I would like to initiate this new Github issue with a more concrete proposal for a new search algorithm, namely Bayesian Optimisation, that must adheres to the GAMA design philosophy.

Prior to delving into the proposal in detail, and in accordance with Issue #201, I would like to point out that I tested and researched SMAC. Initially, we resolved some issues with the Authors through a pull request. Now, SMAC with a CASH problem is operating flawlessly. The final step in the interim is to implement this out-of-the-box framework via GAMA.

Therefore, to address the issue, on top-of-all I would like to propose a solution that not only integrates SMAC in a way that adheres to the standard basic guidelines outlined in the SMAC's README but also blends it harmoniously with GAMA's framework, ensuring we uphold GAMA's multiprocessing procedures. This approach is chosen for three critical reasons. First, multiprocessing, as seen in GAMA's async evaluator, stands as a pivotal aspect of GAMA's process. Second, preserving this functionality ensures that any future enhancements to GAMA would only necessitate minimal alterations to any currently and soon implemented search algorithms. Third, the optimize call, as detailed in the SMAC's README, executes its operation internally, leaving us without direct control. By opting for one of SMAC's flexible interfaces (proposed below), we can effectively manage the task without spawning multiple SMAC instances or inhibiting our implementation from being multiprocessing.

Consequently, retrieve below a preview, step-by-step implementation to ensure we are in compliance at the very least with what's below:

  1. BO File Creation: This entails the very-basic creation of a new BO file similar to the pattern content in the GAMA's Random Search algorithm.

  2. Dynamic Defaults Modification: The objective here is that through the Dynamic Default method, we translate/convert GAMA's search space into SMAC's automatically. Converting the search space from GAMA to SMAC should not present many difficulties. Nonetheless, we anticipate potential complexities when handling GAMA's search space param check component, also known as conditions in (SMAC) ConfigSpace. Therefore, this translation due to varying conditions in ConfigSpace and GAMA's param verification may present an obstacle. In order to provide flexibility, rather than overcomplicating the transformation's process, I propose enabling users to alter the search space via a parameter during the initialisation of the BO search algorithm. I.e, If there are no conditional relationships between hyperparameters and the user has not overridden the ConfigSpace search space, a translation from GAMA to SMAC (ConfigSpace) would be performed automatically. Nevertheless, if the user desires a more complex search space (e.g, conditions between hyperparameters), an override of this one could be transmitted to the BO's search algorithm instantiation, but will be kept internally for SMAC's framework to continue functioning flawlessly; GAMA has no need for the ConfigSpace. As a result, this strategy creates a balance between user control and algorithmic effectiveness.

  3. Search Process Decomposition: The search process implementation will involve several steps:

_Note that here will consider utilising the Ask&Tell SMAC interface, which allows us to request a candidate setting based on an internal Bayesian optimisation process, and then evaluate ourselves the candidate in order to communicate the results back to the SMAC created instance. Further example of this interface will be available at the bottom of this post._

  1. Instantiate Necessary Objects: The first step involves easily instantiating all the objects needed for SMAC's interface to function correctly.

  2. Loop with SMAC.ask() Call: We enter a loop, while (max_evaluations is None) or (len(output) < max_evaluations):, similar to what RandomSearch or ASHA of GAMA are doing. Within this loop, we first put all the future = operations.wait_next(async_) # and IF conditions to append the result in case future is none calls, but most importantly we run the SMAC.ask(), which provides the next candidate setting for evaluation.

  3. Bayesian Optimisation Evaluation Function: Following this, we need to invoke an async_.submit() procedure such as what RandomSearch is doing. As first arg, will be put an internal function to the BO's search algorithm, named bayesian_optimisation_evaluate using same parameter design as how evaluate_pipeline is (here and here). Our internal function should performs three core tasks:

    • Primarily, construct a pipeline from the candidate setting that SMAC.ask() provided.
    • Secondly, call GAMA's operations.evaluate (with the acquired parameters from the bayesian_optimisation_evaluate) with the just-created individual pipeline, and let GAMA do the evaluation and all the stuff it internally proceeds with.
    • Lastly, we acquire the results from the operations.evaluate(potentially those specifics will be helpful), and finally updates SMAC's instantiation using SMAC.tell(). This ensures that for future SMAC.ask() calls, the algorithm proposes the most efficient individual based on Bayesian optimisation principles.

On the other hand, on second arg of the async_.submit(), instead of operations.individual() that for exemple RandomSearch() does, we simply put the acquired Configuration from SMAC.ask() for bayesian_optimisation_evaluate to do its dedicated job as aforementioned.

As a result, the new implementation will adhere to GAMA's design, utilise SMAC's Ask & Tell interface, and if more than one core is available and specified by the user, the feature will execute in a single SMAC's instance but will be multi-processed thanks to GAMA's dedicated component, so the SMAC.ask() will be called as multiple as the GAMA allows and the SMAC.tell() will updated as much as GAMA will ask it to do. Therefore, the BO's search algorithm will be multi process but by GAMA async evaluator object and not SMAC itself, plus will save all individuals and results for GAMA post-processing execution without much difference than what RandomSearch() or ASHA() do.

While this serves as an initial proof-of-concept implementation, I thought about it a lot before writing all that, but note that more will be refined over the summer to incorporate meta-learning into SMAC at the initialisation and hopefully, a pull request will be submitted for review by mid/end of the summer. I.e, I first need to make something work with my design then I will take time to generalise it well for GAMA's community.

Request for Feedback

I would immensely value feedback and direction regarding this implementation workplan. Ie., Is there anything am I missing in regards to GAMA's internal process? any possible steps I should review further? I am not attempting to get you to adhere to the SMAC's API process because you do not know about it. Rather, I am focusing on my approach to implementing this and whether or not we at least addressed all of the GAMA's points.

Additional References

To assist in understanding the concepts and frameworks referenced, here are some helpful resources:

Thank you so much for your time. I know I am asking a lot of the GAMA team these days, but to make sure we are all on the same page, I am trying to do as well as you did for the GAMA community so that in the future the community can have a plethora of options that are all designed in the same way 💪

Wishing you a wonderful Sunday evening! Best wishes,

simonprovost commented 1 year ago

EDIT 1.0: I will provide a more detailed update later; for now, it is time to take the win for today. I am not entirely certain that everything is functioning properly, but I am currently, at the time of writing this down, able to perform an 8-core multi-processed SMAC BO run with the GAMA async component, and it appears to function quite well.

There is a lot much more work to be done, and on various aspects, but I am now confident that a POC is feasible; within a week, I will be able to elaborate whether I have a fully workable POC. Therefore, if this POC seems to be promising, over the summer we would have a fully generalised BO's search algorithm for the GAMA's community.

In the interim, it would be preferable to await , if you planned answering, the more in-depth update later in the day-or-so; In a nutshell, I have taken a slightly different route from time to time than the one(s) proposed above, but the overall plan remains kind of the same anyway 💪

Cheers,

PGijsbers commented 1 year ago

Just dropping a message to let you know I am currently working night and day on the automl benchmark because we've got a deadline coming up. It might take a little longer than usual for me to respond, but I think I should be able to do so sometime next week at the latest.

simonprovost commented 1 year ago

Just dropping a message to let you know I am currently working night and day on the automl benchmark because we've got a deadline coming up. It might take a little longer than usual for me to respond, but I think I should be able to do so sometime next week at the latest.

Indeed ! Thank you for informing me and I wish you and your team the best of luck with this deadline. I will keep this thread as current as feasible till then so that you have all the tools-in-hands to (dis-)approve / guide.

Cheers Pieter 👌

PGijsbers commented 1 year ago

Would also appreciate it if you remove/mark text which is no longer relevant (when things change) 👍

simonprovost commented 1 year ago

As it has been a couple of days since I first released this proposal, and due to the fact that as you may be aware, I worked things on and updated the mentioned path proposed a little, I advise against responding to this issue until I have provided an update. I'd recommend you guys wait for my go-update signal so I do not squander your valuable time. If we could have paused a Github issue, I would have done so.

Cheers for your understanding,

simonprovost commented 1 year ago

Re @PGijsbers,

It is now my turn to meet a deadline for my first year of Ph.D! The upcoming days, if not weeks, will be quite hectic for me. Should we 'close' this issue for the time being so that none of your Kanban issues are affected and I reopen everything as soon as I can?

In the meantime, I have a workable BO search optimiser operating under GAMA's design, so things are looking positive. Nonetheless, I will pose a few queries/dilemmas as soon as possible.

Cheers.

PGijsbers commented 1 year ago

No, it's fine to leave it open. I'll close it when I answer it (early next week). When you get back to it, feel free to re-open if you have follow-up questions.

PGijsbers commented 1 year ago

Sorry for the delayed reply. Your proposal seems very reasonable.

  1. I do not object to the search algorithm itself choosing a different method of parallelization (e.g., using SMAC's). This can be advantageous if the underlying software can make better use of the parallelization (there are some BO methods designed with parallelization in mind, I don't know whether SMAC supports those). The important thing is that the provided constraints are observed. The default AsyncEvaluator is mainly there to provide an easy out-of-the-box solution.

  2. Param checks are temporarily disabled and I am not 100% sure if I will turn them on again. The current implementation adds complexity with very little benefit: the failures it prevents are failures that are super quick. Feel free to aspect that part of the search space definition, SMAC/BO should pick up on the failed configurations anyway.

  3. I have been wanting to have a ConfigSpace definition of the search space instead of my custom format. I would be very happy with a PR which converts the current search spaces to ConfigSpace ones so long as the current algorithms remain functional (probably mostly requires updates to how individuals are generated). You then wouldn't have to do the conversion specific for BO. Should you pursue this route, please set up a PR independently of any changes for BO. i.e., set up a PR that only changes the search space definitions/parsing/individual generation. The changes are probably substantial enough on their own, which makes reviewing hard enough already without adding more features :)

    Cheers

simonprovost commented 12 months ago

Hello @PGijsbers,

I hope this finds you well. I hope the paper submission with your team was successful.

  1. SMAC Parallelisation and GAMA Integration

I value your opinion regarding the possible benefits of utilising the parallelisation method from whichever Bayesian Optimisation (BO) library we choose. However, my current stance on this topic has changed slightly.

As you correctly noted, the default AsyncEvaluator provides a straightforward, out-of-the-box solution. I am an ardent supporter of this strategy. I believe this component is essential for fusing GAMA with any future implementation of a new search algorithm, e.g., in this current thread's case, BO.

Considering the long-term application and potential future enhancements of the AsyncEvaluator led to this conclusion. Any such enhancements would be advantageous for the implementation of (BO's)/the new search algorithm. Alternatively, if BO's/the new internal framework's multi-threading component were to be removed or malfunction, we would be forced to manually fork and modify the code or, in the worst case scenario, wait for the authors to fix the issue - which is far from ideal.

In contrast, with GAMA's multi-threading component (i.e., AsyncEvaluator), we have a clear understanding of where to investigate or place debugging statements if issues arise, while simultaneously reaping the benefits of any enhancements in the future.

In conclusion, I want to reassure you that my proof-of-concept (POC) implementation of this strategy, i.e. using SMAC's Ask&Tell Interface in conjunction with AsyncEvaluator, is operational and appears to be functioning flawlessly (both on OSX and Linux at the very least).

  1. Param Check Implementation Complexity

I concur with your assessment of the current implementation's increased complexity and negligible benefits. The ability of the system to prevent typically rapid failures does not appear to outweigh the expense of maintaining such complexity. Regarding the aspect of the param check, I will share my thoughts in the following point (transition to configspace).

  1. Transition to ConfigSpace

I strongly support your proposal to transition the search space from a custom format (current) to a ConfigSpace definition. I am happy to undertake this PR. Despite the fact that the PR will necessitate a few modifications to how individuals are generated, the overall process I reckon should remain largely unchanged, with the exception of the use of ConfigSpace instead of your current custom format.

Therefore, as you pointed out, this modification will allow me to eliminate the custom transformation from the BO's search algorithm, greatly simplifying matters.

In conclusion, I also concur that these changes should be split into multiple PRs for you and your team to handle them in a simpler manner. The initial pull request will concentrate on migrating to ConfigSpace and removing the param check as ConfigSpace would allow so. The second PR will implement the BO's search algorithm based on SMAC.

Could you double confirm this plan?


To conclude this lengthy response, I have a working POC of BO with SMAC integrated within GAMA and GAMA's multi-processing component as of today, also happily allowing me to complete my Ph.D. first year a few days ago. In addition, my supervisor and I have agreed that during the summer I will concentrate on improving the Python aspect of my work (i.e., among a tonnes of things, submitting the aforementioned two pull requests) and return to further research-oriented tasks in the fall.

Keeping this in mind, let us consider this issue to be a completed feature proposal. We appear to have reached a consensus on the underlying logic, so the next step will be to examine the practical implications, which I hope to begin soon 🙏

Thank you for your support with this Pieter ! I would appreciate it if you could confirm everything twice to make sure we are in-line 🫡

Wishing you a great week!!!

PGijsbers commented 12 months ago

The initial pull request will concentrate on migrating to ConfigSpace and removing the param check as ConfigSpace would allow so. The second PR will implement the BO's search algorithm based on SMAC.

Sounds good to me! We may need to have a look at how to exactly integrate the BO PR, I think something like optional dependencies might probably be best since I understand SMAC is not always easy to install across all systems (mostly referring to Windows here). That way users can make of BO search if they have the requisite dependencies installed, but otherwise can still use GAMA without. Otherwise I think we should be good to go.

during the summer I will...

Just a heads up that I will be out of office from the end of July through the end of August. I may or may not choose to work on (some of) my open source projects, but I will not make any commitment for that time period. Otherwise I will be happy to have a look once I get back in office.

PGijsbers commented 12 months ago

Oh and as far as PRs go, you could even split removing the "check_params" logic into a separate PR. Smaller is better!

simonprovost commented 12 months ago

Sounds amazing! Have a wonderful annual leave, Pieter, and if not before, I will see you in September 👌

I will manage the PRs properly, do not worry :)

Cheers!