Open jeromedockes opened 3 months ago
Some examples of the kind of cleaning the TableVectorizer does:
>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> skrubber = TableVectorizer(
...     high_cardinality_transformer="passthrough",
...     low_cardinality_transformer="passthrough",
...     datetime_transformer="passthrough",
...     numeric_transformer="passthrough",
...     specific_transformers=(),
... )
>>> df = pd.DataFrame({
...     'a': ['2020-01-02', '2020-01-03'],
...     'b': ['2.2', 'nan'],
...     'c': [1.5, pd.NA],
...     'd': [True, False],
...     'e': pd.Series([4.5, 'a'], dtype='category'),
... })
>>> df
            a    b     c      d    e
0  2020-01-02  2.2   1.5   True  4.5
1  2020-01-03  nan  <NA>  False    a
>>> df.dtypes
a      object
b      object
c      object
d        bool
e    category
dtype: object
>>> df['e'].cat.categories
Index([4.5, 'a'], dtype='object')
>>> skrubbed = skrubber.fit_transform(df)
>>> skrubbed
           a    b    c    d    e
0 2020-01-02  2.2  1.5  1.0  4.5
1 2020-01-03  NaN  NaN  0.0    a
>>> skrubbed.dtypes
a    datetime64[ns]
b           float32
c           float32
d           float32
e          category
dtype: object
>>> skrubbed['e'].cat.categories
Index(['4.5', 'a'], dtype='object')
I like the name "Skrubber"
Problem Description
Sometimes we may want to apply the preprocessing/cleaning steps of the TableVectorizer (parsing datetimes, handling pandas extension dtypes, etc.) while handling the actual encoding in separate pipeline steps. This will probably become more relevant when the Recipe (or whatever its name will be) is introduced: we can use it to build exactly the pipeline we want, but we would still like to apply the default cleaning done by the TableVectorizer.
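To make the "clean first, encode later" split concrete, here is a rough pandas-only sketch. It is not skrub code and does not reflect how the TableVectorizer is implemented: `clean_table` and `encode_table` are hypothetical names, and the fixed `%Y-%m-%d` date format is an assumption made to keep the example short.

```python
import pandas as pd


def _clean_column(col: pd.Series) -> pd.Series:
    """Hypothetical per-column cleaning; a stand-in, not skrub's logic."""
    if isinstance(col.dtype, pd.CategoricalDtype):
        # Make every category a string so the column has a uniform type.
        return col.cat.rename_categories([str(c) for c in col.cat.categories])
    if pd.api.types.is_bool_dtype(col) or pd.api.types.is_numeric_dtype(col):
        return col.astype("float32")
    try:
        # Strings that all parse as numbers become float32.
        return pd.to_numeric(col, errors="raise").astype("float32")
    except (ValueError, TypeError):
        pass
    try:
        # Assumption: dates are written as YYYY-MM-DD.
        return pd.to_datetime(col, format="%Y-%m-%d", errors="raise")
    except (ValueError, TypeError):
        return col  # leave anything else untouched


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Cleaning-only step: fix dtypes, do not encode anything."""
    cleaned = {name: _clean_column(col) for name, col in df.items()}
    return pd.DataFrame(cleaned, index=df.index)


def encode_table(df: pd.DataFrame) -> pd.DataFrame:
    """Separate encoding step that relies on the cleaned dtypes."""
    parts = []
    for name, col in df.items():
        if pd.api.types.is_datetime64_any_dtype(col):
            parts.append(pd.DataFrame({
                f"{name}_year": col.dt.year,
                f"{name}_month": col.dt.month,
                f"{name}_day": col.dt.day,
            }))
        elif isinstance(col.dtype, pd.CategoricalDtype):
            parts.append(pd.get_dummies(col, prefix=name, dtype="float32"))
        else:
            parts.append(col.to_frame())
    return pd.concat(parts, axis=1)


raw = pd.DataFrame({
    "a": ["2020-01-02", "2020-01-03"],
    "b": ["2.2", None],
    "d": [True, False],
    "e": pd.Series([4.5, "a"], dtype="category"),
})
cleaned = clean_table(raw)       # dtypes fixed, values otherwise unchanged
encoded = encode_table(cleaned)  # encoding chosen independently of cleaning
```

The point of the split is that `encode_table` can be swapped for any other encoding without touching the cleaning logic, which is what a cleaning-only shorthand would give us.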
If this sounds like a plausible use case, maybe we could have a shorthand for
maybe
Feature Description
...