sfu-db / dataprep

Open-source, low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

Feature Proposal: clean_ml functionality in the clean module #488

Closed qidanrui closed 3 years ago

qidanrui commented 3 years ago

Summary

Implement a clean_ml() function that transforms an arbitrary tabular dataset into a format suitable for a typical ML application.

Design-level Explanation

Proposed function signature for clean_ml():

```python
from typing import List, Optional, Union

import dask.dataframe as dd
import pandas as pd


def clean_ml(
    df: Union[pd.DataFrame, dd.DataFrame],
    cat_cols: List[str],
    num_cols: List[str],
    cat_imputation: str = "constant",  # or "most_frequent", "drop"
    fill_val: str = "missing_value",  # user-specified fill value for the "constant" mode of cat_imputation
    num_imputation: str = "mean",  # or "median", "most_frequent", "drop"
    cat_encoding: str = "one_hot",  # or "no_encoding"
    variance_threshold: bool = False,  # or True
    variance: float = 0.0,  # variance cutoff used when variance_threshold = True
    num_scaling: str = "standardize",  # or "minmax", "normalize", "no_scaling"
    balancing: str = "no_balancing",  # or "weight"
    class_weight: Optional[List[float]] = None,  # class weights when balancing = "weight"
    feature_preprocessor: str = "PCA",  # or "truncatedSVD", "select_percentile", "no_preprocessing"
    max_nominal_values: int = 30,  # drop columns with more than 30 unique categorical values
    max_repeated_value_percent: float = 70,  # the maximum percent a single value may repeat in a column
    max_unique_integers_percent: float = 99,  # the maximum percent of unique values in an integer column
    include_operators: Optional[List[str]] = None,
    exclude_operators: Optional[List[str]] = None,
) -> pd.DataFrame:
    """
    Transform an arbitrary tabular dataset into a format suitable for a typical ML application.

    Parameters
    ----------
    df
        Pandas or Dask DataFrame.
    cat_cols
        Categorical columns.
    num_cols
        Numerical columns.
    cat_imputation
        The imputation mode for categorical columns.
        If "constant", all missing values are filled with `fill_val`.
        If "most_frequent", all missing values are filled with the most frequent value.
        If "drop", all categorical columns containing missing values are dropped.
    fill_val
        The fill value used when cat_imputation = "constant".
    num_imputation
        The imputation mode for numerical columns.
        If "mean", all missing values are filled with the mean value.
        If "median", all missing values are filled with the median value.
        If "most_frequent", all missing values are filled with the most frequent value.
        If "drop", all numerical columns containing missing values are dropped.
    cat_encoding
        The encoding mode for categorical columns.
        If "one_hot", apply one-hot encoding.
        If "no_encoding", nothing is done.
    variance_threshold
        If True, drop numerical columns with variance less than `variance`.
    variance
        The variance cutoff used when variance_threshold = True.
    num_scaling
        The scaling mode for numerical columns.
        If "standardize", standardize all numerical columns.
        If "minmax", apply min-max scaling to all numerical columns.
        If "normalize", normalize all numerical columns.
        If "no_scaling", nothing is done.
    balancing
        The class-balancing mode for classification datasets.
        If "no_balancing", nothing is done.
        If "weight", a different weight is assigned to each class.
    class_weight
        Class weights used when `balancing = "weight"`.
    feature_preprocessor
        The feature-engineering operator applied to all columns.
    max_nominal_values
        Drop columns with more than `max_nominal_values` unique categorical values.
    max_repeated_value_percent
        Drop columns where a single value repeats in more than `max_repeated_value_percent` percent of the rows.
    max_unique_integers_percent
        Drop integer columns where more than `max_unique_integers_percent` percent of the values are unique.
    include_operators
        Operators included for `clean_ml`, e.g. "one_hot", "standardize".
    exclude_operators
        Operators excluded for `clean_ml`, e.g. "one_hot", "standardize".
    """
```

Implementation-level Explanation

The implementation is based on the pipeline proposed in [Auto-sklearn](https://automl.github.io/auto-sklearn/master/). Auto-sklearn handles categorical columns and numerical columns with different pipelines, and our implementation employs the same idea. However, the initial version of clean_ml() does not support type recognition; instead, we give users more control over which columns to clean, so users must specify the categorical and numerical columns via cat_cols and num_cols.
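To make the two-pipeline idea concrete, here is a minimal scikit-learn sketch of the same pattern (an illustration of the design only, not the dataprep implementation; the column names are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Categorical columns: constant-fill imputation followed by one-hot encoding.
cat_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="constant", fill_value="missing_value")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Numerical columns: mean imputation followed by standardization.
num_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]
)

# Route each column group through its own pipeline, as Auto-sklearn does.
preprocessor = ColumnTransformer(
    [
        ("cat", cat_pipeline, ["city"]),
        ("num", num_pipeline, ["age", "income"]),
    ]
)

df = pd.DataFrame(
    {"city": ["a", "b", None], "age": [1.0, None, 3.0], "income": [10.0, 20.0, None]}
)
features = preprocessor.fit_transform(df)
```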

The initial version of clean_ml() supports the general transformation process, including imputation, encoding, scaling, and feature engineering. We also support some special components, such as balancing imbalanced data; this component is placed after the scaling of numerical columns. For each component, we provide a fixed set of operators for the user to choose from (sketched below). The reason is that at this initial stage we should build a framework for the whole pipeline, and we assume the user chooses one operator per component. The framework is easy to automate, and its parts can later be replaced by an automatic data-preprocessing stage and a future dataprep.feature_engineering subpackage.
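As a sketch, such fixed operator sets could live in a simple registry mapping each component to its allowed operators. The dict below only restates the choices from the signature above; the structure itself is a hypothetical illustration, not part of the proposal:

```python
# Hypothetical registry: component name -> operators a user may choose from.
OPERATORS = {
    "cat_imputation": ["constant", "most_frequent", "drop"],
    "num_imputation": ["mean", "median", "most_frequent", "drop"],
    "cat_encoding": ["one_hot", "no_encoding"],
    "num_scaling": ["standardize", "minmax", "normalize", "no_scaling"],
    "balancing": ["no_balancing", "weight"],
    "feature_preprocessor": ["PCA", "truncatedSVD", "select_percentile", "no_preprocessing"],
}
```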

Here's a simplified explanation of the process of this function, in terms of the components above:

1. Impute missing values in categorical columns (cat_imputation / fill_val) and in numerical columns (num_imputation).
2. Encode categorical columns (cat_encoding); scale numerical columns (num_scaling) and, if variance_threshold is set, drop numerical columns with variance below `variance`.
3. Balance classes for classification datasets (balancing / class_weight); this step runs after the scaling of numerical columns.
4. Apply the feature preprocessor (feature_preprocessor) to all columns.

Columns that exceed max_nominal_values, max_repeated_value_percent, or max_unique_integers_percent are dropped rather than transformed.

It should be noted that if the user passes include_operators and a specified operator of some component is not in include_operators, an error is reported. In the same way, if the user passes exclude_operators and a specified operator of some component is in exclude_operators, an error is reported. If the user passes neither include_operators nor exclude_operators, all operators are included. A minimal sketch of this validation rule follows.
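The sketch below assumes a hypothetical helper name; `validate_operators` is not part of the proposal itself:

```python
from typing import Dict, List, Optional


def validate_operators(
    chosen: Dict[str, str],  # component name -> operator picked by the user
    include_operators: Optional[List[str]] = None,
    exclude_operators: Optional[List[str]] = None,
) -> None:
    """Raise an error when a chosen operator violates the include/exclude lists."""
    for component, operator in chosen.items():
        if include_operators is not None and operator not in include_operators:
            raise ValueError(f"{operator!r} for {component!r} is not in include_operators")
        if exclude_operators is not None and operator in exclude_operators:
            raise ValueError(f"{operator!r} for {component!r} is in exclude_operators")
```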

Rationale and Alternatives

This design considers the clean_ml() function as a comprehensive ML preparation pipeline.

Future Possibilities

Implementation-level Actions

Additional Tasks

yxie66 commented 3 years ago

Hi Danrui, it's a great issue, thank you very much. I wonder whether it's possible to automate the detection process for categorical/numeric attributes within the input DataFrame? While it may cause errors, it can be an option for those "lazy" users.

qidanrui commented 3 years ago

> Hi Danrui, it's a great issue, thank you very much. I wonder whether it's possible to automate the detection process for categorical/numeric attributes within the input DataFrame? While it may cause errors, it can be an option for those "lazy" users.

Sure, I think firstly we can employ simple type inference on the DataFrame~ A rough sketch of such a default is below.
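For example, a dtype-based default could look like the following (the helper name and the reuse of max_nominal_values as a uniqueness cutoff are assumptions for illustration, not existing dataprep API):

```python
import pandas as pd


def infer_column_types(df: pd.DataFrame, max_nominal_values: int = 30):
    """Guess categorical vs. numerical columns from pandas dtypes."""
    cat_cols, num_cols = [], []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            num_cols.append(col)
        elif df[col].nunique(dropna=True) <= max_nominal_values:
            cat_cols.append(col)
    return cat_cols, num_cols
```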