[Implementation] Developing a Custom Auto-ML System Variant with GAMA: Questions & Concerns

simonprovost commented 1 year ago

Dear @PGijsbers and the rest of the authors,

I have finally completed some Ph.D.-related tasks (writing a mini-thesis, building a Python library, etc.) and am now returning to more serious matters: building an Auto-ML system variant on top of your GAMA framework. This question-based issue follows from the tremendous support I received in issue #191, so please refer to it if anything here is unclear 🫡

I have investigated all of the routes you suggested. Before getting to them, I have to say that GAMA's flexible, generic implementation is pretty much a gold mine. Some documentation and some type annotations are missing, but the architecture and the surrounding processes are otherwise very elegant! This gives me more hope that I will be able to build my Auto-ML variant on GAMA. As suggested, I looked into how to configure a new search space, which makes perfect sense now. Using an empty list to refer to a parameter shared with every other algorithm that declares a hyperparameter of the same name is a fantastic feature! I also looked into adding a new metric and reworking a pipeline's evaluation.
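For readers unfamiliar with the shared-hyperparameter convention mentioned above, here is a minimal sketch. The dictionary layout is modelled on GAMA's default classification configuration as I understand it; the `resolve` helper is purely illustrative and not part of GAMA, which performs the equivalent lookup internally when it builds its primitive set.

```python
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# String keys declare shared hyperparameters: one pool of values per name.
# Class keys declare algorithms; an empty list means "use the shared pool".
search_space = {
    "alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0],
    "fit_prior": [True, False],
    GaussianNB: {},
    BernoulliNB: {"alpha": [], "fit_prior": []},
    MultinomialNB: {"alpha": [], "fit_prior": []},
}


def resolve(space, algorithm):
    """Illustrative helper (not part of GAMA): expand shared references."""
    return {
        name: (space[name] if values == [] else values)
        for name, values in space[algorithm].items()
    }


# Both naive Bayes variants end up drawing "alpha" from the same pool.
resolved = resolve(search_space, BernoulliNB)
print(resolved["alpha"])
```

A dictionary like this is what gets handed to the GAMA estimator as its search space; the exact constructor argument name depends on the GAMA version, so check the signature of the release you use.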

Following this, I spent a full day digging deeper and also learned what an Individual is. I was initially confused because it lives in the part of GAMA devoted to "genetic programming", so I assumed it was tied solely to the evolutionary algorithm. That was a mistake on my part: it is a highly adaptable, generic way of representing individuals, built on genetic programming operators/primitives, and it applies to any situation in which individuals are required, which is also very elegant. I also discovered the OperatorSet and a number of other things. I then investigated how the Automated Imbalanced Machine Learning (paper) authors implemented their system and what modifications they made to GAMA to make it work as they intended. FYI, I expect my variant to need a comparable degree of change, so I will follow a comparable path. A few concerns have come up along the way, which I list below:

I have developed three algorithms that scikit-learn does not provide. To adhere to the scikit-learn API they derive from ClassifierMixin, but indirectly: they use a custom classifier mixin that I implemented to perform additional checks for our purposes, and that custom mixin itself inherits from ClassifierMixin. Consequently, I believe that this will still be detected as a classifier mixin, correct? I believe I am referring to what's here (the same applies to TransformerMixin, though).
I have also implemented a custom scikit-learn pipeline that inherits from the original but performs an additional task internally for each step. It is not a big deal, just a few lines of code, but it is necessary. I noticed that the Automated Imbalanced Machine Learning branch replaced the scikit-learn pipeline with the imblearn-based pipeline. As long as mine functions like the scikit-learn pipeline (i.e., inherits from its base class), there should be no cause for concern, correct?
In #191, I mentioned that the search space may need to change when a particular algorithm is selected. This actually happens while an individual is being generated, not during fitting; my mistake earlier. In our design, a pipeline has a minimum of two steps, and the first one opens some doors while closing others. I would like to implement that on top of your design, which is based on the random creation of individuals. Consequently, if I wish to customise the configuration of an individual, I believe I need only create additional functions in that script and invoke them when creating the OperatorSet(). Do you concur? If so, since this happens in the GAMA base class and I want to avoid interfering with other use cases, I think it is worthwhile to create a new GamaClassifier-like class, similar to what the Clustering branch did, but for my purpose, thus tricking the creation of the operator's set in the initialisation of this class with my new function to create an individual, would you agree?
We can now create an individual at our convenience, and the pipeline used in the framework is tailored to our needs (i.e., our custom scikit-learn pipeline). If I wanted to add the extra-needed parameter to the initialisation of this pipeline's build, would you alter that somewhere such as here or here, or somewhere else too? In addition, where I mentioned the necessary adjustments, I will pass the extra parameter down recursively so that it can be accessed at those specific "here" locations. Note that I did not cover the export of a champion model to code, as I plan to deal with that subject at a later time.
Finally, in our use case, basic encoding is not a priority because our data are already clean enough for computation. Is there a way to disable the basic_encoding step, or is it strictly necessary, and if so, why? If no such option exists, I would be happy to add one, if you agree it is a good idea.

In conclusion, I'd like to ask if there's anything else I should be aware of, or any other area you'd suggest I explore?

Please excuse the length of this issue. I felt it essential to provide plenty of detail so that your response can be as informed and targeted as possible, avoiding back-and-forth questioning, if that makes sense 🧐. With a Ph.D. deadline approaching, it is also of paramount importance that I secure a viable proof of concept for my system as early as possible, so I have asked as much as I can in order to have all the tools in hand to get it done.

Lastly, I would like to mention that your support will not only help me develop a variant, along with paper(s) citing GAMA: I also intend to write a brief article (probably on Medium) to guide newcomers in using GAMA to develop their own Auto-ML system variant. I believe the advanced section of the documentation is excellent and sufficient for some use cases, but for something like Imbalanced Auto-ML or my current use case, things are a bit more complex, and I believe an article would be helpful (unless you object). If everything goes as smoothly as I anticipate, I will start working on that during the summer, and potentially on some PRs for improvements I spot along the way 💪

Wishing you a lovely weekend in the Netherlands! Feel free to ask me any question; I'll respond as promptly as possible! Best wishes,

PGijsbers commented 1 year ago

correct? I believe I am referring to what's here (the same applies to TransformerMixin, though).

Yes, that should work. As you might have noticed, it's a bit finicky if the algorithm is both Transformer and Classifier, in which case it only gets picked up as the first one. Should be easy to adapt the parser to handle this case, but it's something that did not come up (yet).
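A small sketch of why a mixin that inherits from ClassifierMixin is still recognised, and of the caveat about estimators that are both a transformer and a classifier. The class names and the `detect_role` function are hypothetical; `detect_role` is only a simplified stand-in for a parser that tests one role after another, and the order of the checks is an assumption.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin


class CheckedClassifierMixin(ClassifierMixin):
    """Hypothetical custom mixin that adds an extra validation hook."""

    def check_target(self, y):
        if len(y) == 0:
            raise ValueError("y must not be empty")
        return y


class MyClassifier(CheckedClassifierMixin, BaseEstimator):
    """Classifier that only reaches ClassifierMixin through the custom mixin."""

    def fit(self, X, y):
        self.check_target(y)
        return self


class Hybrid(ClassifierMixin, TransformerMixin, BaseEstimator):
    """Estimator that is both a classifier and a transformer."""


def detect_role(cls):
    """Simplified stand-in: the first matching check wins."""
    if issubclass(cls, TransformerMixin):
        return "transformer"
    if issubclass(cls, ClassifierMixin):
        return "classifier"
    return "unknown"


print(detect_role(MyClassifier))  # issubclass follows the inheritance chain
print(detect_role(Hybrid))        # only the first matching role is reported
```

Because `issubclass` walks the whole inheritance chain, the indirection through a custom mixin does not matter. A hybrid estimator, however, is reported under whichever role is tested first.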

there should be no cause for concern, correct?

I can't readily think of a reason it shouldn't work. Perhaps just give it a go :)
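As an illustration of the kind of subclass being discussed, the sketch below extends scikit-learn's `Pipeline` with a small amount of per-step work and otherwise defers to the parent class. `RecordingPipeline` and its `visited_steps_` attribute are hypothetical names; whether GAMA needs further changes to use such a class is not something this sketch can confirm.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingPipeline(Pipeline):
    """Hypothetical Pipeline subclass that does extra work for each step."""

    def fit(self, X, y=None, **fit_params):
        # Extra per-step task; here we only record the step names in order.
        self.visited_steps_ = [name for name, _ in self.steps]
        # Everything else is delegated to the stock scikit-learn behaviour.
        return super().fit(X, y, **fit_params)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

pipe = RecordingPipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Since the subclass still passes `isinstance(pipe, Pipeline)` and keeps the same `fit`/`predict` interface, code written against the stock pipeline should treat it the same way.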

I believe I need only create additional functions in that script and invoke them when creating the OperatorSet().

I think that's a good starting point.
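To make the "first step opens some doors while closing others" idea concrete, here is a library-free sketch of a conditional creation function. All step names are hypothetical, and the function returns plain names rather than GAMA Individual objects; in GAMA, an equivalent function would be the one supplied when the OperatorSet is created.

```python
import random

# Hypothetical two-step search space: the choice made for step one decides
# which candidates remain available for step two.
FIRST_STEP = ["reduce_features", "keep_features"]
SECOND_STEP = {
    "reduce_features": ["linear_model", "naive_bayes"],
    "keep_features": ["random_forest", "gradient_boosting", "linear_model"],
}


def create_conditional_pipeline(rng):
    """Sample a (first, second) pair in which the second respects the first."""
    first = rng.choice(FIRST_STEP)
    second = rng.choice(SECOND_STEP[first])
    return first, second


rng = random.Random(0)  # seeded so repeated runs draw the same samples
samples = [create_conditional_pipeline(rng) for _ in range(100)]
```

Sampling the second step from a pool keyed by the first guarantees that no invalid combination is ever generated, so nothing has to be filtered out at evaluation time.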

tricking the creation of the operator's set in the initialisation of this class with my new function to create an individual, would you agree?

Yes, it probably makes sense to write a subclass that overwrites only those things that are different and go from there. After it's clear what the changes are, it's easier to review if it's reasonable to refactor out inheritance and handle it by other means.

If I wanted to add the extra-needed parameter...

I think you found the relevant places considering you are not looking at exporting the models (right now).

Is there a way to disable the basic_encoding step, or is it strictly necessary, and if so, why?

Not right now, but there isn't a real reason it's required either (as long as the generated pipelines work on the data, everything should be OK). Refactoring it so it is an optional step is OK with me. In fact, we started redesigning the way GAMA handles input data a while ago (https://github.com/openml-labs/gama/pull/169), but I don't have the time to finish that right now.
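One possible shape for such a refactor is an opt-out flag, sketched below without any GAMA code. The `apply_encoding` parameter and both functions are hypothetical, and the toy encoder merely stands in for whatever the real encoding step does.

```python
def basic_encode(rows):
    """Toy stand-in for an encoding step: map string values to integer codes."""
    codes = {}
    encoded = []
    for row in rows:
        encoded.append([
            # The default is evaluated before insertion, so new values get
            # the next unused code and repeated values reuse their code.
            codes.setdefault(value, len(codes)) if isinstance(value, str) else value
            for value in row
        ])
    return encoded


def prepare(rows, apply_encoding=True):
    """Return the data to search on; the hypothetical flag skips encoding."""
    return basic_encode(rows) if apply_encoding else rows


data = [["a", 1], ["b", 2], ["a", 3]]
encoded = prepare(data)
untouched = prepare(data, apply_encoding=False)
```

Keeping the default at `True` would preserve current behaviour for existing users, while already-clean data can bypass the step entirely.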

An article is always welcome :) but if things are missing from the documentation itself, adding it there too is also much appreciated 🙏 Good luck!

simonprovost commented 1 year ago

@PGijsbers Many thanks for validating this strategy! I greatly appreciate it, and everything has been noted, including the request for more documentation alongside the article (which will most likely happen during the summer). Feel free to close this issue so that you do not have too many open. If you would like me to report any problems I encounter with the approach described above in this thread, keep it open; otherwise, I will open a new issue (and try to be more succinct now that the design path has been validated on your end, although I still need to try it out to see what comes up).

Have a wonderful week 👌

PGijsbers commented 1 year ago

I'm closing this issue, but you can open it again if you have follow-up questions. Have a great week yourself :)

openml-labs / gama

[Implementation] Developing a Custom Auto-ML System Variant with GAMA: Questions & Concerns #198