ConfigSpace Technology Integration for Enhanced GAMA Configuration and Management 🥇

Hi @PGijsbers

I am excited to announce that I have made progress on the #208 feature proposal 🎉 The custom configuration space setup improvement, i.e., by utilising ConfigSpace.

I am happy to say that GAMA has been successfully updated to utilise ConfigSpace's technology. While most of the discussions can be found in issue #208, I will list some points that need your attention below ⬇️

In the meantime, it took me a considerable amount of time to create this pull request as I was not familiar with the complete internal system architecture beforehand. Please bear with me for any modifications, but I believe it holds promise. I am open to discussing any aspects, including the reasons and methods behind the changes, as well as any further adjustments that may be necessary 👌

It is important to consider the consequences of introducing a new technology. Therefore, I would like to bring your attention to a few key changes and requests, as follow:

The configuration files, including the classification and regression py files, have been modified to utilise a ConfigSpace object. This object is responsible for accessing the appropriate ConfigSpace configuration for classifiers and preprocessors in the case of classification, or regressors and preprocessors in the case of regression. I have made the decision to divide them (estimators/preprocessors) for the purpose of enhancing comprehensibility and manageability. This is particularly important now that it has become easier to remove the preprocessors of a specific configuration (classification || regression). However, I am in full agreement that we currently possess more comprehensive files. However, the comprehensibility of the content and the quality of the documentation make it effortless for users to grasp concepts such as adding, modifying, and more. I hope you will be in agreement with this, happy to discuss worthiness of this addition 🫡. Next, I would like to bring your attention to the following: Considering that ConfigSpace excels at handling constraints such as forbidden clauses, it offers significant advantages for GAMA in the short term (e.g. handling #188 again). However, I am uncertain about the correctness of the previous custom configuration parameter checks. Therefore, I would like to draw your attention in this direction, hence, letting me know about any additional enhancements to specific algorithms that require constrained hyperparameter values, such as LinearSVC to start with. @HadiKutabi feel free to give us recomendation as to what should be constrained. A starting point could be there.
During the process of adapting the internal system, I removed a selection of components while incorporating others to align with the ConfigSpace technology. What frightened me the most is the evolutionary programming aspect. Henceforth, I kindly request your attention regarding the following matter: please ensure that the mutations align with your intended design while executing them on EA. Even if I did include numerous print statements throughout the codebase to ensure that each mutation was doing what it was supposed to do, as well as that the testing worked. This in-depth examination of you will allow us to confirm that the mutations are working as intended. I would like to avoid any potential misunderstanding from my part, here.
Lastly, I wish to discuss the subject of testing. Once more, I implemented modifications and verified that every test executed smoothly without any issues. Nevertheless, I wish to direct your focus towards the test configuration content itself. In order to facilitate testing, a simpler search space was demonstrated. However, in comparison to the initial classification or regression search spaces, the sole noteworthy modification pertains to the number of estimators utilised e.g, by ensemble-based classifiers such as Random Forest. It has been mentioned that the configuration space has been "simplified" for testing purposes in order to increase its lightweightness. Nonetheless, its dimensions appear to be quite substantial, or at the very least, similar to the originals (classification/regression). Would not it be better to decrease the quantity of classifiers and preprocessors to, say, three for each? I would greatly appreciate any insights in this direction.

What contributions have been made

✅ Improving GAMA's system to support ConfigSpace search space instead of the existing custom one
✅ Classification/Regression/TestConfiguration successfully converted with ConfigSpace
✅ Testing passes smoothly
✅ Informative commits
✅ Contribution to the documentation of the new configuration space, ConfigSpace: (How to modify, add, etc)

I would greatly appreciate knowing when you would have the time to address this, I must say, highly significant pull request, particularly considering the recent contribution of Auto-Sklearn in creating a "toolkit" for developing an Auto-ML system. While others are also trying, I genuinely believe that GAMA could greatly benefit from this pull request, both to keep being in the "competition" but also for the community that already is using it ☀️

For those who are new to this PR and are not familiar with any other discussion (#191, #198, #201, #202) that we had via the the Github issues section of GAMA with @PGijsbers. Furthermore, among the points discussed in #208, this PR's enhancement will also result in the seamless integration of SMAC for Bayesian optimisation (as discussed further in #202), which will introduce an additional search algorithm to GAMA and greatly increase its flexibility, ideally, increase its community's involvement 💪

Cheers,

Thanks for the PR. I will have a closer look later, but just had a superficial glance now. I can not give you a timeline on the review, unfortunately. I have a hard deadline on some changes for an in-person workshop in January, and I am also being transferred to a new project. This PR is large and affects every part of GAMA, so it's something I really want to take time for, which unfortunately means it's likely I won't get to it until later in January.

It is all fine, @PGijsbers! I will wait for the time needed, continue with my Ph.D implementations on the side, and believe that this current PR is great and promising; I am sure we will reach an agreement and minor-quality (!= minor-quantity, possibly major-quantity though) corrections will be requested, so I will keep doing so on my end 👌

Meanwhile, I wish you the best of luck with both the workshop and the new project! I hope GAMA, on the other hand, does not become "deprecated" in your lab! There are great things that could come out of it, and I will recommend GAMA for any new AutoML project around my faculty, for sure !

It would be greatly appreciated if you could also label the PR here so that no one gets confused! (By the way, I will make one more commit so the tests can run!) 💪

Wishing you a lovely winter time, Cheers,

Meanwhile, I wish you the best of luck with both the workshop and the new project! I hope GAMA, on the other hand, does not become "deprecated" in your lab! There are great things that could come out of it, and I will recommend GAMA for any new AutoML project around my faculty, for sure !

Thank you! For full transparency, I would like to share the following with you. Recently, the AutoML Toolkit has come to our attention. Perhaps you are already familiar with it, but if not it's worth a closer look. Its aims align with GAMA's in many ways, so we are planning to look into whether it makes sense to collaborate with the project.

It is even possible that, if there is enough overlap, we might prefer to "absorb" GAMA into AMLTK. For example, by re-implementing GAMA with AMLTK components, extending AMLTK as necessary to do so. The benefits are that AMLTK currently is under active development and has an outlook for continued support. This significantly reduces the maintenance work for GAMA and it could automatically benefit from new research in the AMLTK ecosystem. By focusing community efforts behind one project, I think everyone would come out ahead, and it is more likely we can provide support for the (hopefully few) GAMA-specific components.

Of course, this all depends on a closer look at AMLTK - does it really achieve what it sets out to do? does it do that in a way that aligns with our goals? These are questions we need to answer before making any decision on this matter. And even if we would decide to go that route and you disagree with this decision, we could look into how to structure the future of GAMA. For example, perhaps you could maintain a fork which we could refer to, or we could consider adding you as a maintainer of this project and set up a second project for the "amltk-based GAMA".

I hope to provide more clarity on this in January. In the mean time, I would also really value your input in the matter. As someone now intimately familiar with GAMA, if you find the time to have a look at AMLTK I would love to hear your thoughts on the project - and what you think the right path forward is.

I also want to make clear that regardless of our decision in this matter, I am committed to continuing to provide support for the work that is already planned and discussed to the best of my ability.

Cheers

ps. Feel free to reach out over e-mail on this matter (you can find my e-mail on this page).

Sorry for the lengthy response ! I tried to be as thorough as possible in order to avoid being vague or confused! I do not believe that proper answers to any questions should be provided soon; we could continue in January! So here we go for my responses

Thank you! For full transparency, I would like to share the following with you. Recently, the AutoML Toolkit has come to our attention. Perhaps you are already familiar with it, but if not it's worth a closer look. Its aims align with GAMA's in many ways, so we are planning to look into whether it makes sense to collaborate with the project.

Thank you for being very transparent, @PGijsbers; I predicted that somehow before opening this PR, and it actually makes sense to me. Yes, I am aware of AMLTK; I wish I could have attended the autumn school last week or so to participate in the hands-on activities with them. Meanwhile, I remain convinced that GAMA has something to offer aside. The flexible way to provide your new system, while still being easy to understand overall, or at the very least, the learning curve not being too steep, allows us AutoML researchers to do stuff with it, plus, in my case, I need a flexible system because I am going to apply it not to conventional classification tasks but in slightly different ways, so I will need a fork and do my changes. AMLTK is a fantastic toolkit, I have no doubts about that, but while it appears to be an easy and flexible API, as well as being highly powerful, I am not sure if it is possible to supply your own variable work, if this makes any sense, s.t varying as in varying in comparison to standard machine learning tasks. The learning curve for their API may be steeper compared to GAMA if the existing system (AMLTK) does not allow the researcher to perform what he/she is searching for, while GAMA is still simple to understand if they wish to pursue this route.

However, for your lab, I believe it is a great significant step to look into anyway, which is a little sad because I would have hoped that more than one person would actively contribute to GAMA, but in the meantime, I do believe that both of these systems have something to offer everyone, and while AMLTK may be the first great system to look into for researchers, I also believe GAMA may be one for other, allowing the "easy" system to be modifiable by the researchers in case AMLTK would not have allowed them to do so.

Nonetheless, I have not yet tried AMLTK, so I could be completely wrong, and possibly GAMA should be deleted from GItHub, but for the time being I am going to stick with it because I have put too much effort into designing my system around it. See further below.

It is even possible that, if there is enough overlap, we might prefer to "absorb" GAMA into AMLTK. For example, by re-implementing GAMA with AMLTK components, extending AMLTK as necessary to do so. The benefits are that AMLTK currently is under active development and has an outlook for continued support. This significantly reduces the maintenance work for GAMA and it could automatically benefit from new research in the AMLTK ecosystem. By focusing community efforts behind one project, I think everyone would come out ahead, and it is more likely we can provide support for the (hopefully few) GAMA-specific components.

The benefits are that AMLTK currently is under active development and has an outlook for continued support. This significantly reduces the maintenance work for GAMA and it could automatically benefit from new research in the AMLTK ecosystem. – This is something I wholeheartedly agree with. I do not work with you on a daily basis, but I can feel that you appear to be quite busy these days, and you may not be able to manage everything at the same time, so perhaps incorporating AMLTK as per GAMA would be a fantastic investigation, help for your lab, and the community around OpenML, indeed! Allowing the community to profit from both. Meanwhile, this is quite blurry for the both of us, whether AMLTK and GAMA could be "joint" at this point; Therefore, keep me updated on how you believe you would handle it, or what "additional" features GAMA would come up with for AMLTK – And vice-versa. In such scenario, I will consider whether I could offer my assistance if it would benefit my Ph.D :)

Of course, this all depends on a closer look at AMLTK - does it really achieve what it sets out to do? does it do that in a way that aligns with our goals? These are questions we need to answer before making any decision on this matter. And even if we would decide to go that route and you disagree with this decision, we could look into how to structure the future of GAMA. For example, perhaps you could maintain a fork which we could refer to, or we could consider adding you as a maintainer of this project and set up a second project for the "amltk-based GAMA".

These are questions we need to answer before making any decision on this matter. And even if we would decide to go that route and you disagree with this decision – I just wanted to stress that I completely understand the move here and appreciate the honesty. Furthermore, I consider myself a "nobody" in this community thus far, therefore, I have no say in whether or not we should keep GAMA to be honest. Now, I would say I am content to be the possible old GAMA's maintainer indeed, cheers for the propal :) However, once you have determined how GAMA will benefit AMLTK and vice versa, we then should assess if maintaining GAMA will be a "benefit" or not, and whether being the principal contributor to the new project would not be more advantegeous while archiving the old GAMA. I hope that makes sense.

I hope to provide more clarity on this in January. In the mean time, I would also really value your input in the matter. As someone now intimately familiar with GAMA, if you find the time to have a look at AMLTK I would love to hear your thoughts on the project - and what you think the right path forward is.

Pieter, I appreciate much all this :) For the time being, I do not have the time to investigate AMLTK because I have my professor on my back to start experiments, and I have not finished implementing the system yet.. so I reckon I will have to do some try-of-AMLTK in the same time as you in January, I believe 👌 Nonetheless, I believe that opening a new Github Issue and tagging me in it when you begin the investigation would be worthwhile - I will give my feedbacks/visions/views, count on it! However, it may also encourage others to investigate.

So, for the time being, I will stick with GAMA's current stable version and do my PRs around it, all for the sake of GAMA and to support my Ph.D. If AMLTK-based GAMA proves to be worthwhile, count me in for next year! If I do not see, on the other hand, a benefit in it for my present endeavours, and that keeping alive the old GAMA for specific purposes, you can count on me to be maintaining it, but we will have to find a few justifications as to why keeping it would be valuable for newcomers compared to AMLTK-based GAMA, otherwise I will keep the fork local. I am confident that we will find answers to all of the current unanswered questions, we just need time ⏳

I also want to make clear that regardless of our decision in this matter, I am committed to continuing to provide support for the work that is already planned and discussed to the best of my ability.

Yes, indeed! ConfigSpace will assist GAMA in any form, old or new. Pieter, I am glad to hear all of it. I will send an email after finishing my present task of building my system around GAMA to discuss possible AMLTK-based GAMA with all of you, most probably not in 2023 though.. but as I mentioned before, starting an issue for that would be excellent! In any case, I will save the email because it is a nice one to have in my pocket for summer internships 🇳🇱

Wishing you a great Christmas time to you and the entire lab ! 🎄

Cheers

–––––––––

PS: I have no clues why the workflows are still erratic. Let us look into that after your review in January, shall we?

Thanks Simon, and happy holidays to you!

PS: I have no clues why the workflows are still erratic. Let us look into that after your review in January, shall we?

seems reasonable to me :)

Hi Simon,

I started reviewing the code. There are several suggested changes to simplify the code and make it more readable. I haven't dug into the changes for genetic programming yet.

Amazing ! cheers @PGijsbers 🎉 and Happy new year !

I will look it up later because I am about to have a system ready for my experiments, and then as soon as the experiments are started – I will cover all of this up. Feel free to ask me questions if they are urgent, otherwise wait for me to cover the reviews.

Cheers,

Thanks, Simon. Happy new year to you too. There's not going to be anything urgent here. Though I have a busy schedule until mid-February (part of which is vacation 🎉), so delays here might result in me not being able to process the whole backlog before then. On my end, that is not an issue.

Finished my first look, and my first impression is very positive :) while there are definitely some things that I'd like to see changed to improve readability and maintainability, overall it seems like it's a very focused change and only parts that I expected to be affected are - which is a good sign!

However, it's still a lot to grok at once, especially since I hadn't used config space before. I want to do another round after the changes are processed, because I hope that also helps me better understand the changes to genetic programming classes and operators.

Finished my first look, and my first impression is very positive :) while there are definitely some things that I'd like to see changed to improve readability and maintainability, overall it seems like it's a very focused change and only parts that I expected to be affected are - which is a good sign!

However, it's still a lot to grok at once, especially since I hadn't used config space before. I want to do another round after the changes are processed, because I hope that also helps me better understand the changes to genetic programming classes and operators.

Wonderful, @PGijsbers! I appreciate your time 🙏 I will definitely answer everything thoroughly and make the necessary changes in consequences. I will put them in FIXUP mode before rebasing them back to their origins as soon as you accept them. Furthermore, I would like to back up your wish towards a few rounds of reviews for this PR so that you can better understand it overall (i.e., especially ConfigSpace), in order to help you out for the next few PRs!

I wish you a wonderful February holiday and hope I am able to finish by the end of it.

Thank you for the positive feedback:)

Cheers

I am not sure how github deals with fixup. Feel free to use it, but you can also just use regular commits as we use a squash and merge when merging (and so, the specific commit history in the feature branch doesn't matter much).

BRIEF UDPATE:

I intend to start working back on all this from the end of this month onwards. I have an important deadline for the 20th and had no time to look into those but will definitely afterward.

Cheers for your patience!

Simon

Don't worry about it! I am not really working on GAMA right now, just making sure the broken installs are fixed.

openml-labs / gama

ConfigSpace Technology Integration for Enhanced GAMA Configuration and Management 🥇 #210