sfu-db / dataprep

Open-source, low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

Feature Proposal: clean_ml functionality in the clean module #488

Closed qidanrui closed 3 years ago

qidanrui commented 3 years ago

Summary

Implement a clean_ml() function that transforms an arbitrary tabular dataset into a format suitable for a typical ML application.

Design-level Explanation

Proposed function signature for clean_ml():

```python
from typing import List, Optional, Union

import dask.dataframe as dd
import pandas as pd


def clean_ml(
    df: Union[pd.DataFrame, dd.DataFrame],
    cat_cols: List[str],
    num_cols: List[str],
    cat_imputation: str = "constant",  # or "most_frequent", "drop"
    fill_val: str = "missing_value",  # user-specified fill value for the "constant" mode of cat_imputation
    num_imputation: str = "mean",  # or "median", "most_frequent", "drop"
    cat_encoding: str = "one_hot",  # or "no_encoding"
    variance_threshold: bool = False,  # or True
    variance: float = 0.0,  # variance cutoff used when variance_threshold = True
    num_scaling: str = "standardize",  # or "minmax", "normalize", "no_scaling"
    balancing: str = "no_balancing",  # or "weight"
    class_weight: Optional[List[float]] = None,  # class weights when balancing = "weight"
    feature_preprocessor: str = "PCA",  # or "truncatedSVD", "select_percentile", "no_preprocessing"
    max_nominal_values: int = 30,  # drop columns with more than 30 unique categorical values
    max_repeated_value_percent: float = 70,  # the maximum percent a single value may repeat in a column
    max_unique_integers_percent: float = 99,  # the maximum percent of unique values in an integer column
    include_operators: Optional[List[str]] = None,
    exclude_operators: Optional[List[str]] = None,
) -> pd.DataFrame:
    """
    Transform an arbitrary tabular dataset into a format suitable for a typical ML application.

    Parameters
    ----------
    df
        Pandas or Dask DataFrame.
    cat_cols
        Categorical columns.
    num_cols
        Numerical columns.
    cat_imputation
        The imputation mode for categorical columns.
        If "constant", all missing values are filled with `fill_val`.
        If "most_frequent", all missing values are filled with the most frequent value.
        If "drop", all categorical columns containing missing values are dropped.
    fill_val
        The fill value used when cat_imputation = "constant".
    num_imputation
        The imputation mode for numerical columns.
        If "mean", all missing values are filled with the mean value.
        If "median", all missing values are filled with the median value.
        If "most_frequent", all missing values are filled with the most frequent value.
        If "drop", all numerical columns containing missing values are dropped.
    cat_encoding
        The encoding mode for categorical columns.
        If "one_hot", apply one-hot encoding.
        If "no_encoding", nothing is done.
    variance_threshold
        If True, drop numerical columns with variance less than `variance`.
    variance
        The variance cutoff used when variance_threshold = True.
    num_scaling
        The scaling mode for numerical columns.
        If "standardize", standardize all numerical columns.
        If "minmax", apply min-max scaling to all numerical columns.
        If "normalize", normalize all numerical columns.
        If "no_scaling", nothing is done.
    balancing
        The class-balancing mode for classification datasets.
        If "no_balancing", nothing is done.
        If "weight", a different weight is assigned to each class.
    class_weight
        Class weights used when `balancing = "weight"`.
    feature_preprocessor
        The feature-engineering operator applied to all columns.
    max_nominal_values
        Drop columns with more than `max_nominal_values` unique categorical values.
    max_repeated_value_percent
        Drop columns where a single value repeats in more than `max_repeated_value_percent` percent of the rows.
    max_unique_integers_percent
        Drop integer columns where more than `max_unique_integers_percent` percent of the values are unique.
    include_operators
        Operators included for `clean_ml`, e.g. "one_hot", "standardize".
    exclude_operators
        Operators excluded for `clean_ml`, e.g. "one_hot", "standardize".
    """
```

Implementation-level Explanation

The implementation is based on the pipeline proposed in [Auto-sklearn](https://automl.github.io/auto-sklearn/master/). Auto-sklearn handles categorical columns and numerical columns with different pipelines, and our implementation employs the same idea. However, the initial version of clean_ml() does not support type recognition; instead, we give users more control over which columns to clean, so users must specify the categorical and numerical columns via cat_cols and num_cols.
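To make the two-pipeline idea concrete, here is a minimal scikit-learn sketch of the same pattern (an illustration of the design only, not the dataprep implementation; the column names are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Categorical columns: constant-fill imputation followed by one-hot encoding.
cat_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="constant", fill_value="missing_value")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Numerical columns: mean imputation followed by standardization.
num_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]
)

# Route each column group through its own pipeline, as Auto-sklearn does.
preprocessor = ColumnTransformer(
    [
        ("cat", cat_pipeline, ["city"]),
        ("num", num_pipeline, ["age", "income"]),
    ]
)

df = pd.DataFrame(
    {"city": ["a", "b", None], "age": [1.0, None, 3.0], "income": [10.0, 20.0, None]}
)
features = preprocessor.fit_transform(df)
```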

The initial version of clean_ml() supports the general transformation process, including imputation, encoding, scaling, and feature engineering. We also support some special components, such as balancing imbalanced data; this component is placed after the scaling of numerical columns. For each component, we provide a fixed set of operators for the user to choose from (sketched below). The reason is that at this initial stage we should build a framework for the whole pipeline, and we assume the user chooses one operator per component. The framework is easy to automate, and its parts can later be replaced by an automatic data-preprocessing stage and a future dataprep.feature_engineering subpackage.
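As a sketch, such fixed operator sets could live in a simple registry mapping each component to its allowed operators. The dict below only restates the choices from the signature above; the structure itself is a hypothetical illustration, not part of the proposal:

```python
# Hypothetical registry: component name -> operators a user may choose from.
OPERATORS = {
    "cat_imputation": ["constant", "most_frequent", "drop"],
    "num_imputation": ["mean", "median", "most_frequent", "drop"],
    "cat_encoding": ["one_hot", "no_encoding"],
    "num_scaling": ["standardize", "minmax", "normalize", "no_scaling"],
    "balancing": ["no_balancing", "weight"],
    "feature_preprocessor": ["PCA", "truncatedSVD", "select_percentile", "no_preprocessing"],
}
```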

Here's a simplified explanation of the process of this function, in terms of the components above:

1. Impute missing values in categorical columns (cat_imputation / fill_val) and in numerical columns (num_imputation).
2. Encode categorical columns (cat_encoding); scale numerical columns (num_scaling) and, if variance_threshold is set, drop numerical columns with variance below `variance`.
3. Balance classes for classification datasets (balancing / class_weight); this step runs after the scaling of numerical columns.
4. Apply the feature preprocessor (feature_preprocessor) to all columns.

Columns that exceed max_nominal_values, max_repeated_value_percent, or max_unique_integers_percent are dropped rather than transformed.

It should be noted that if the user passes include_operators and a specified operator of some component is not in include_operators, an error is reported. In the same way, if the user passes exclude_operators and a specified operator of some component is in exclude_operators, an error is reported. If the user passes neither include_operators nor exclude_operators, all operators are included. A minimal sketch of this validation rule follows.
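The sketch below assumes a hypothetical helper name; `validate_operators` is not part of the proposal itself:

```python
from typing import Dict, List, Optional


def validate_operators(
    chosen: Dict[str, str],  # component name -> operator picked by the user
    include_operators: Optional[List[str]] = None,
    exclude_operators: Optional[List[str]] = None,
) -> None:
    """Raise an error when a chosen operator violates the include/exclude lists."""
    for component, operator in chosen.items():
        if include_operators is not None and operator not in include_operators:
            raise ValueError(f"{operator!r} for {component!r} is not in include_operators")
        if exclude_operators is not None and operator in exclude_operators:
            raise ValueError(f"{operator!r} for {component!r} is in exclude_operators")
```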

Rationale and Alternatives

This design considers the clean_ml() function as a comprehensive ML preparation pipeline.

Future Possibilities

Implementation-level Actions

Additional Tasks

yxie66 commented 3 years ago

Hi Danrui, it's a great issue, thank you very much. I wonder whether it's possible to automate the detection process for categorical/numeric attributes within the input DataFrame? While it may cause errors, it can be an option for those "lazy" users.

qidanrui commented 3 years ago

> Hi Danrui, it's a great issue, thank you very much. I wonder whether it's possible to automate the detection process for categorical/numeric attributes within the input DataFrame? While it may cause errors, it can be an option for those "lazy" users.

Sure, I think firstly we can employ simple type inference on the DataFrame~ A rough sketch of such a default is below.
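For example, a dtype-based default could look like the following (the helper name and the reuse of max_nominal_values as a uniqueness cutoff are assumptions for illustration, not existing dataprep API):

```python
import pandas as pd


def infer_column_types(df: pd.DataFrame, max_nominal_values: int = 30):
    """Guess categorical vs. numerical columns from pandas dtypes."""
    cat_cols, num_cols = [], []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            num_cols.append(col)
        elif df[col].nunique(dropna=True) <= max_nominal_values:
            cat_cols.append(col)
    return cat_cols, num_cols
```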