openml-labs / gama

An automated machine learning tool aimed to facilitate AutoML research.
https://openml-labs.github.io/gama/master/
Apache License 2.0

Implement the SuperVectorizer and dirty_cat's encoders to the search space #169

Open LilianBoulard opened 1 year ago

LilianBoulard commented 1 year ago

This PR adds dirty_cat's encoders (currently SimilarityEncoder, GapEncoder and MinHashEncoder) to GAMA's search space by way of the SuperVectorizer.

Adding the dirty_cat encoders lets GAMA handle dirty categorical features in tabular data.

The SuperVectorizer provides a simplified interface to sklearn's ColumnTransformer and makes it possible to mix and match different encoding techniques.
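
For illustration, here is a minimal sketch of what mixing encoders through the SuperVectorizer can look like. It assumes dirty_cat 0.3 or later, scikit-learn and pandas are installed. The helper names `build_vectorizer` and `demo`, the toy data, the choice of encoders and the threshold values are all assumptions made for this example; this is not the search space proposed in this PR.

```python
def build_vectorizer(cardinality_threshold=40):
    """Return a SuperVectorizer that mixes two encoding techniques.

    Hypothetical helper for illustration only; it is not part of GAMA.
    """
    # Imported lazily so this module can be loaded without the optional
    # dependencies installed. Assumes dirty_cat >= 0.3 and scikit-learn.
    from dirty_cat import MinHashEncoder, SuperVectorizer
    from sklearn.preprocessing import OneHotEncoder

    return SuperVectorizer(
        # Categorical columns are routed to one of the two encoders below
        # depending on how many unique values they contain.
        cardinality_threshold=cardinality_threshold,
        low_card_cat_transformer=OneHotEncoder(handle_unknown="ignore"),
        high_card_cat_transformer=MinHashEncoder(n_components=10),
    )


def demo():
    """Encode a tiny made-up table with one dirty categorical column."""
    import pandas as pd

    df = pd.DataFrame(
        {
            "department": ["police", "fire", "police", "fire"],
            "job_title": [
                "Police Officer III",
                "Firefighter/Rescuer",
                "police officer 3",
                "Fire fighter - rescuer",
            ],
            "salary": [72000.0, 65000.0, 70500.0, 66200.0],
        }
    )
    # A low threshold so the toy 'job_title' column counts as high cardinality.
    vectorizer = build_vectorizer(cardinality_threshold=3)
    return vectorizer.fit_transform(df)
```

Calling `demo()` returns the encoded feature matrix. Swapping `MinHashEncoder` for `GapEncoder` or `SimilarityEncoder` in `build_vectorizer` is the kind of choice a search space could expose.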

This PR depends on features introduced in dirty_cat 0.3. At the time of writing (August 2022), that version has not been released yet.

TODO:

PGijsbers commented 1 year ago

Please give me a ping here as soon as dirty_cat 0.3 is released :)

LilianBoulard commented 1 year ago

Hi Pieter, dirty_cat 0.3 is out!

PGijsbers commented 1 year ago

I've enabled CI now, and I'll try to take a closer look over this week and the next. I will probably do the 22.0.0 release without this PR (I was planning to release today or tomorrow, since the current PyPI package is broken due to updated dependencies), so ignore the message about adding things to the changelog; I'll do that later when preparing for 22.1.0.

PGijsbers commented 1 year ago

Ah, it looks like the unit tests which used pre-defined individuals are broken now (to be expected). I am not entirely sure how I want to fix that - that will depend on whether or not we want to allow for the old behavior to be used as an alternative, and that would depend on a small benchmark. So I don't think there's much you can do right now as far as improving the tests/code.

Running some additional experiments to define a sensible default search space, as noted in the OP, should be possible and is appreciated :)