openml-labs / gama

An automated machine learning tool aimed to facilitate AutoML research.
https://openml-labs.github.io/gama/master/
Apache License 2.0
93 stars 30 forks

[FEATURE PROPOSAL] Integration of ConfigSpace for Search Space Creation in GAMA: Challenges and Proposals #208

Closed simonprovost closed 12 months ago

simonprovost commented 1 year ago

Hi @PGijsbers,

I hope you are doing well. Since our last conversation about the potential implementation of SMAC3 and other features for GAMA, I have been able to spend a solid week working on a variety of tasks around it. On a less essential note, I also attended an AI summer school in Slovenia last month, where I met Prabhant from your team. It is funny how small the world is 😊

Concerning this GitHub issue: this proposal discusses replacing GAMA's custom search space construction with ConfigSpace. We briefly agreed that ConfigSpace's flexibility, particularly with regard to conditional parameters and configuration sampling, is its greatest asset. Given that GAMA's support for creating custom search spaces is still in its infancy, we believed GAMA stood to benefit from this development.

Therefore, we discussed revising GAMA's genetic programming behaviours to support ConfigSpace sample generation. The objective is to sample a new configuration via ConfigSpace and construct a GAMA individual from that sample. As a result, GAMA's design structure would be preserved without significant modifications. I currently have a preliminary working draft that interfaces with ConfigSpace, is compatible with GAMA, and works with all search optimisers. The preliminary results are encouraging. However, a few obstacles have emerged:

  1. Classifier/Regressor naming conventions in ConfigSpace: Typically, ConfigSpace users declare the list of classifiers as a categorical hyperparameter named classifier, with each classifier's name as a string option. I use this hyperparameter from the generated sample (i.e., config_space_generated_sample["classifier"]) to construct a scikit-learn estimator. The string-to-estimator conversion is not the issue; the issue is that we are essentially hardcoding this ConfigSpace hyperparameter's name, thereby forcing GAMA's users to call it classifier. To reduce that source of confusion, I propose storing all such "hard-coded" ConfigSpace values as global variables in the search space file, e.g. the classification.py search space file for classification. A global such as classifier_hyperparameter_name = "classifier" would give users greater flexibility and GAMA clearer guidance on how to create individuals. Note that while I will refine the nomenclature, my primary objective is to confirm that you agree with using global variables in the search space definition to help GAMA locate the regressor/classifier hyperparameters in the ConfigSpace-generated sample. What are your opinions on the matter? Do you have a better recommendation?

  2. Constraints in ConfigSpace manipulation: For instance, when store_pipelines is set and self._x.shape[0] * self._x.shape[1] > 6_000_000, KNN is removed from the search space due to memory constraints when saving models. This is simply not possible with the latest ConfigSpace API. I tried writing a function to perform this kind of post-hoc search space manipulation with ConfigSpace, but it turned out to be more complex than I expected. Similarly, classifiers incompatible with certain metrics are eliminated via the any(metric.requires_probabilities for metric in self._metrics) check; again, modifying a ConfigSpace after creation is currently a challenge. Lastly, a similar constraint applies when store_pipelines is set and self._x.shape[1] > 50, which shrinks the search space if PolynomialFeatures is present but the dataset contains over 50 features. Given these issues, and the fact that ConfigSpace does not let us manipulate the space easily, I suggest retaining these checks: we could issue a warning to users for the latter two cases, and for the KNN scenario perhaps ask users to exclude it manually via a raised error. This could be a temporary measure until ConfigSpace offers a solution. Note that the ConfigSpace team is aware of this feature request but has not yet put it on a roadmap; it may be days or weeks, but it may also be years.

  3. Hyperparameter naming overlaps: Some techniques in GAMA's current search space share hyperparameter names, such as threshold or alpha, yet with varying ranges. ConfigSpace, however, does not permit two identically named hyperparameters. Renaming alpha to alpha_<estimator> is a potential solution I have seen around, and the workaround the ConfigSpace authors themselves suggest. Before assigning these names to the terminals of a particular technique, we could execute a custom function that outputs the requisite name (i.e., alpha instead of alpha_SelectFwe) so it is processed correctly. This function would be invoked while constructing a GAMA individual after ConfigSpace sample generation, and it too would be a "global" in the search space file, which any GAMA community member could adapt to their convenience. Does this strategy align with your reasoning, or do you envision an alternative solution?
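As a side note, the fallback described in point 2 (a warning for the feature-count cases, a raised error for KNN) could be sketched roughly as follows. Everything here is hypothetical: the function name check_search_space, its parameters, and the thresholds are illustrative, not GAMA's actual API.

```python
import warnings

# Hypothetical sketch of the retained checks from point 2, run once before
# search starts. All names here are illustrative, not GAMA's actual API.
def check_search_space(x_shape, store_pipelines, component_choices):
    n_cells = x_shape[0] * x_shape[1]
    if store_pipelines and n_cells > 6_000_000 and "KNeighborsClassifier" in component_choices:
        # The ConfigSpace cannot be shrunk after creation, so ask the user
        # to exclude KNN manually instead.
        raise ValueError(
            "Dataset too large to store KNN pipelines; please exclude "
            "KNeighborsClassifier from the search space."
        )
    if store_pipelines and x_shape[1] > 50 and "PolynomialFeatures" in component_choices:
        warnings.warn("PolynomialFeatures with over 50 features may be very expensive.")
```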

I am optimistic about the progress, despite the complexities to navigate. I believe the remaining changes, none of which raise major issues, can be handled during the pull request phase. I greatly value your time and perseverance with all this, @PGijsbers 🙏

Let me know if you have any questions or require clarifications at any time any day! Best wishes,

simonprovost commented 1 year ago

Hi @PGijsbers ,

I was delighted to meet you and your OpenML team at the AutoML conference! I imagine you are quite busy with many other things at the moment. However, I would appreciate a preliminary go-ahead before I open a pull request, in case you fully disagree with the proposed adjustments to GAMA. I have a couple of weeks before returning to my AutoML work, so I would love to continue this work if it makes sense.

Overall, what I propose seems quite flexible and generalisable to any use case. It may not be as straightforward as the existing search space creation, yet it is not overly complex either. While this is an investment, I truly believe GAMA gains a lot of potential with ConfigSpace, by introducing "constraints" to name just one benefit, as well as the likelihood that ConfigSpace will continue to be maintained by its authors 👌

Waiting for your call, Cheers,

PGijsbers commented 1 year ago

Likewise! I had hoped to talk with you more but couldn't find you anymore on Friday. Anyway, sorry for the delay in my response.

  1. I am not sure I understand what you are proposing. Could you provide me a link to the relevant configspace documentation? I can't find it. And possibly a simple example of what a search space definition would look like (with maybe only one preprocessing and one classifier)?

  2. Perhaps shrinking the search space can effectively be achieved through ConfigSpace.add_forbidden_clause? Otherwise we can store the configuration in a dictionary, manipulate it as needed, and then transform it into a ConfigSpace only once we start search.

  3. I would also be okay not having the shared hyperparameters after the redesign. In that case, it might be easier to just always postfix hyperparameters with the algorithm name. One would only need to update the Terminal to dynamically extract its correct "output" name.

    Hope that helps!

simonprovost commented 12 months ago

I appreciate your response! It helped a great deal. Please let me know your thoughts on the below whenever you have the opportunity. I appreciate even more the wish to have talked more. Unfortunately, I had to leave Berlin early on Friday to attend a meeting with a collaborator. I had hoped to chat more too. Next time 😊

In the meantime, don't worry about the delay; I'll wait as long as needed.

I am not sure I understand what you are proposing. Could you provide me a link to the relevant configspace documentation? I can't find it. And possibly a simple example of what a search space definition would look like (with maybe only one preprocessing and one classifier)?

The primary objective is to maximise the system's adaptability while reducing potential user errors. In GAMA's present configuration, for instance, one of the GamaClassifier class's steps verifies the presence of a predict_proba method on each classifier in the search space if the metric of interest requires it; if unavailable, the classifier is excluded from the search space for obvious reasons. With ConfigSpace, the list of classifiers for a given search space can be seen as a categorical hyperparameter whose choices are the classifiers' names, with an additional internal mechanism converting the classifier "string" into a scikit-learn estimator. What matters here is that, to access this hyperparameter from the search space instance, e.g. clf_config, one must know its name (internally, clf_config.get_hyperparameter("which_name").choices). At the outset, I contemplated using explicit names such as "classifier" and "preprocessor" to define and retrieve the essential hyperparameters.

Nonetheless, to adhere more closely to the principle of flexibility and the overarching concept of GAMA, I suggest storing these ConfigSpace hyperparameter names (classifiers/regressors/preprocessors, etc.) as global variables. The classifier and preprocessor names would initially be set to these default values, but users would have the flexibility to adjust them to their preferences. GAMA's internals would in any event use the global variable values instead of hardcoded strings, reducing the likelihood of misinterpretation on the user's side.

Here is a concise example of a search space configuration that encompasses both classifiers and preprocessors, first as plain ConfigSpace code and then with global variables differentiating the classifier hyperparameter from the preprocessor one:

import ConfigSpace as cs
import ConfigSpace.hyperparameters as csh

config_space = cs.ConfigurationSpace()

# Classifiers
classifier_choices = ['DecisionTreeClassifier', 'KNeighborsClassifier']
classifier = csh.CategoricalHyperparameter('classifier', choices=classifier_choices)
config_space.add_hyperparameter(classifier)
max_depth = csh.UniformIntegerHyperparameter('max_depth', 1, 10)
config_space.add_hyperparameter(max_depth)
config_space.add_condition(cs.EqualsCondition(max_depth, classifier, 'DecisionTreeClassifier'))

# Preprocessors
preprocessors = ['MinMaxScaler', 'StandardScaler']
preprocessor = csh.CategoricalHyperparameter('preprocessor', choices=preprocessors)
config_space.add_hyperparameter(preprocessor)

# Resulting in:
print(config_space.get_hyperparameter("classifier").choices)  # ('DecisionTreeClassifier', 'KNeighborsClassifier')

With the proposed change:

import ConfigSpace as cs
import ConfigSpace.hyperparameters as csh

config_space = cs.ConfigurationSpace()

CLASSIFIER_NAME = "classifier"  # Global variable for classifier hyperparameter name to be changed if needed
PREPROCESSOR_NAME = "preprocessor"  # Global variable for preprocessor hyperparameter name to be changed if needed

# Classifiers
classifier_choices = ['DecisionTreeClassifier', 'KNeighborsClassifier']
classifier = csh.CategoricalHyperparameter(CLASSIFIER_NAME, choices=classifier_choices)
config_space.add_hyperparameter(classifier)
max_depth = csh.UniformIntegerHyperparameter('max_depth', 1, 10)
config_space.add_hyperparameter(max_depth)
config_space.add_condition(cs.EqualsCondition(max_depth, classifier, 'DecisionTreeClassifier'))

# Preprocessors
preprocessors = ['MinMaxScaler', 'StandardScaler']
preprocessor = csh.CategoricalHyperparameter(PREPROCESSOR_NAME, choices=preprocessors)
config_space.add_hyperparameter(preprocessor)

# More flexibly: if users want to rename their classifier hyperparameter, they only change the global variable:
print(config_space.get_hyperparameter(CLASSIFIER_NAME).choices)  # ('DecisionTreeClassifier', 'KNeighborsClassifier')

Consequently, with this change users can easily edit the global variables (e.g., CLASSIFIER_NAME) if they wish to use different naming conventions, keeping GAMA consistent and adaptable to a variety of requirements. I believe this example illustrates the proposed modification more clearly. I would value your views on this strategy, as well as any alternative suggestions. That said, I concede that I do not foresee many situations in which a user would want to name these hyperparameters anything other than "classifier" and "preprocessor", even though the globals enhance flexibility, so I will await your final call and adhere to it to the letter, no worries! 🫡

Perhaps shrinking the search space can effectively be achieved through ConfigSpace.add_forbidden_clause?

Pretty cool idea, indeed! I did not think of it, even though I had read the documentation in its entirety. I'll give this a shot. Theoretically the forbidden option will still be in the search space but can never be chosen, which is clever. If it works, I'll apply it to the situations where it is actually needed. Thank you for this one!

I would also be okay not having the shared hyperparameters after the redesign. In that case, it might be easier to just always postfix hyperparameters with the algorithm name. One would only need to update the Terminal to dynamically extract its correct "output" name.

I understand and appreciate the simplicity that postfixing hyperparameters with the algorithm name would bring. There are potential hazards to consider with this straightforward approach, however. For example, if a user adds a custom algorithm whose hyperparameter names already contain an underscore, won't this result in naming conflicts or confusion? E.g., GAMA's system would cut at the postfix even though that suffix is actually part of the hyperparameter's name.

In addition, while this solution simplifies the handling of naming conventions, it may inadvertently impose a restriction: users would be required to postfix all hyperparameters, including those unique to a specific algorithm whose names are not shared, which could result in superfluous overhead.

Thus, to ensure both flexibility and usability, I propose the following: we could refine the search space construction by incorporating a function dedicated to hyperparameter naming, with an easy-to-modify default provided. This function would operate as follows: if a hyperparameter name matches a case in the renaming function, the function determines the correct name and returns it to GAMA for correct internal usage. This also allows users to customise naming conventions according to their needs.

Here is a concise example. Say you want two distinct ranges for max depth in DecisionTree and RandomForest; you need to differentiate them in the ConfigSpace, so you naturally use a postfix. The naming function then removes the postfix, but only for these two names, because they matched. No other postfixes are touched, even though they may exist in the user's search space. Only these two are handled as intended:

print(handle_hyperparameter_name("max_depth_DecisionTree"))  # Output: max_depth
print(handle_hyperparameter_name("max_depth_RandomForest"))  # Output: max_depth
print(handle_hyperparameter_name("other_hyperparameter_algorithm1"))  # Output: no match, so other_hyperparameter_algorithm1 is returned unchanged
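A hypothetical default definition consistent with the outputs above could look like this; the rule table and function body are mine, purely to make the behaviour concrete.

```python
# Hypothetical default implementation: only names matching a known renaming
# rule are rewritten; everything else is returned unchanged.
RENAMING_RULES = {
    "max_depth_DecisionTree": "max_depth",
    "max_depth_RandomForest": "max_depth",
}

def handle_hyperparameter_name(name: str) -> str:
    return RENAMING_RULES.get(name, name)

print(handle_hyperparameter_name("max_depth_DecisionTree"))  # max_depth
print(handle_hyperparameter_name("other_hyperparameter_algorithm1"))  # other_hyperparameter_algorithm1
```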

Note: SMAC employs a similar practice in their training procedure. They enable users to define their own training functions, giving the procedure a personalised touch.

I appreciate your time, Pieter. Feel free to ask me further questions. Note that I am being picky here, and you have the final say anyway, so just let me know; I would appreciate a brief justification so I can better understand 👌

Cheers,

PGijsbers commented 12 months ago

I suggest storing these ConfigSpace hyperparameter names (classifiers/regressors/preprocessors, etc.) as global variables.

Having the names stored in variables instead of "magic strings" is a good practice. At first glance, maybe the meta dictionary can be used for this? See docs. If that doesn't work, then I wouldn't make the variables globally scoped, but having it scoped under gama.configuration should be enough (from gama.configuration import CLASSIFIER, PREPROCESSOR).

won't this result in nomenclature conflicts or confusion?

I would rather say you can split on the last _, in which case it becomes a problem if the estimator name has underscores (and by class naming conventions it normally should not). But it's still valid that you probably don't want to embed that assumption in the code. I believe scikit-learn uses two underscores for that reason, which we could also adopt (e.g., max_depth__decisiontreeclassifier).
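The double-underscore convention can be parsed by splitting on the last "__", as in this sketch; split_param is a made-up helper name, not part of GAMA or scikit-learn.

```python
def split_param(name: str):
    # Split on the LAST double underscore, so single underscores inside the
    # hyperparameter name (e.g. max_depth) survive intact.
    param, _, algorithm = name.rpartition("__")
    return param, algorithm

print(split_param("max_depth__decisiontreeclassifier"))  # ('max_depth', 'decisiontreeclassifier')
```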

simonprovost commented 12 months ago

Hello @PGijsbers

Amazing! I concur with every point made.

Thanks for pointing out the meta dictionary; I'll definitely give it a shot. Same for the two underscores. I'll work on all of this and submit the pull request, hoping I won't encounter other substantial changes that require your feedback.

Now everything should be fine. See you on the pull request side, Pieter, and thanks again - I appreciate it very much 👌

Have a wonderful week!