Open LilianBoulard opened 1 year ago
Please give me a ping here as soon as dirty_cat 0.3 is released :)
Hi Pieter, dirty_cat 0.3 is out!
I have enabled CI now, and I'll try to have a closer look over this week and the next. I will probably do the 22.0.0 release without this PR (I was planning to do that today or tomorrow, as the current PyPI package is broken due to updated dependencies), so ignore the message about adding things to the changelog; I'll do that later when preparing for 22.1.0.
Ah, it looks like the unit tests which used pre-defined individuals are broken now (to be expected). I am not entirely sure how I want to fix that yet: it depends on whether or not we want to keep the old behavior available as an alternative, which in turn depends on a small benchmark. So I don't think there's much you can do right now as far as improving the tests/code.
Running some additional experiments to define a sensible default search space, as noted in the OP, should be possible and is appreciated :)
This PR aims to add dirty_cat's encoders (currently SimilarityEncoder, GapEncoder and MinHashEncoder) to GAMA's search space via the SuperVectorizer.
The point of adding the dirty_cat encoders is for GAMA to be able to handle dirty categorical features in tabular data.
The SuperVectorizer provides a simplified interface to sklearn's ColumnTransformer and allows mixing and matching different encoding techniques.
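For readers unfamiliar with the idea behind encoding dirty categories, here is a minimal, standard-library-only sketch of n-gram similarity encoding. It is not dirty_cat's implementation and does not use its API; the function names (`ngrams`, `ngram_similarity`, `similarity_encode`) and the example categories are hypothetical and only illustrate why near-duplicate spellings end up with similar encodings.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings too short to yield any n-gram are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its vector of similarities to the prototype categories."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical data: two "clean" categories and three dirty spellings.
prototypes = ["police officer", "firefighter"]
dirty_values = ["Police Oficer", "fire fighter", "firefighter"]
encoded = similarity_encode(dirty_values, prototypes)

for value, row in zip(dirty_values, encoded):
    print(value, [round(x, 2) for x in row])
```

Unlike one-hot encoding, which would treat "Police Oficer" and "police officer" as unrelated categories, each misspelled value here gets a high score for the prototype it most resembles.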
Running the code in this PR requires features implemented in dirty_cat 0.3. However, at the time of writing (August 2022), this version is not out yet.
TODO: