Closed qidanrui closed 3 years ago
Hi Danrui, this is a great issue, thank you very much. I wonder whether it's possible to automate the detection of categorical/numeric attributes within the input DataFrame? While it may cause errors, it could be an option for "lazy" users.
Sure, I think as a first step we can employ simple type inference using the DataFrame~
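For illustration, here is a minimal sketch of what dtype-based inference could look like. It assumes pandas, and the helper name `infer_column_types` is hypothetical, not part of dataprep. It only looks at each column's dtype, so it is a rough starting point rather than a reliable detector.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def infer_column_types(df):
    """Split columns into (cat_cols, num_cols) using dtype information only.

    Hypothetical helper for illustration; not part of dataprep.
    """
    cat_cols, num_cols = [], []
    for col in df.columns:
        # pandas counts booleans as numeric, so treat them as categorical here.
        if is_numeric_dtype(df[col]) and not is_bool_dtype(df[col]):
            num_cols.append(col)
        else:
            cat_cols.append(col)
    return cat_cols, num_cols


df = pd.DataFrame(
    {
        "age": [31, 45, None],
        "city": ["NYC", "LA", "NYC"],
        "income": [1.5, 2.5, 3.5],
        "member": [True, False, True],
    }
)
cat_cols, num_cols = infer_column_types(df)
print(cat_cols, num_cols)
```

As noted above, this kind of guess can be wrong: an integer-coded category (for example a zip code or a class id) would land in `num_cols`, so letting the user override the result still seems necessary.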
Summary
Implement the `clean_ml()` function to transform an arbitrary tabular dataset into a format that's suitable for a typical ML application.
Design-level Explanation Actions
Design-level Explanation
Proposed function signature for `clean_ml()`:
Implementation-level Explanation
The implementation is based on the pipeline proposed in Auto-sklearn. [Auto-sklearn](https://automl.github.io/auto-sklearn/master/) treats categorical columns and numerical columns with different pipelines, and our implementation adopts the same idea. However, the initial version of `clean_ml()` does not support type recognition; instead, we give users more control over which columns they want to clean. Users should therefore specify the categorical columns and numerical columns with `cat_cols` and `num_cols`.
The initial version of `clean_ml()` supports a general transformation process including imputation, encoding, scaling and feature engineering. We also support some special components, such as balancing imbalanced data; this component is applied after the scaling of numerical columns. For each component, we provide a fixed set of operators for the user to choose from. The reason is that at this initial stage we should build a framework for the whole pipeline, and we assume that the user chooses one operator for each component. The framework is easy to automate, and its components can easily be replaced by an automatic data-preprocessing part and the future dataprep.feature_engineering subpackage.
Here's a simplified explanation of the process of this function:
- `max_nominal_values` and `max_repeated_value_percent`
- `max_repeated_value_percent` and `max_unique_integers_percent`
- `variance_threshold`
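The idea of handling categorical and numerical columns in separate pipelines can be sketched roughly as follows. This is not the dataprep implementation: it is a pandas-only illustration that fixes one operator per component (most-frequent imputation plus one-hot encoding for categorical columns, mean imputation plus standardization for numerical columns) and skips feature engineering and balancing. The function names `clean_categorical`, `clean_numerical` and `clean_ml_sketch` are hypothetical; only `cat_cols` and `num_cols` come from the proposal.

```python
import pandas as pd


def clean_categorical(df, cat_cols):
    """Impute with the most frequent value, then one-hot encode."""
    cat = df[cat_cols].copy()
    for col in cat_cols:
        # Assumes every column has at least one non-missing value.
        cat[col] = cat[col].fillna(cat[col].mode().iloc[0])
    return pd.get_dummies(cat, columns=cat_cols, dtype=float)


def clean_numerical(df, num_cols):
    """Impute with the column mean, then standardize to zero mean, unit variance."""
    num = df[num_cols].astype(float)
    num = num.fillna(num.mean())
    # Replace a zero standard deviation with 1 so constant columns do not divide by zero.
    std = num.std(ddof=0).replace(0, 1.0)
    return (num - num.mean()) / std


def clean_ml_sketch(df, cat_cols, num_cols):
    """Run both pipelines independently and join the results column-wise."""
    return pd.concat(
        [clean_categorical(df, cat_cols), clean_numerical(df, num_cols)], axis=1
    )


df = pd.DataFrame(
    {
        "city": ["NYC", None, "NYC", "LA"],
        "age": [20.0, 30.0, None, 40.0],
    }
)
out = clean_ml_sketch(df, cat_cols=["city"], num_cols=["age"])
print(out)
```

Keeping the two pipelines as separate functions means that each component (imputer, encoder, scaler) could later be swapped for a different operator without touching the other pipeline.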
Note that if the user passes `include_operators` and an operator specified for any component is not in `include_operators`, an error is reported. Likewise, if the user passes `exclude_operators` and an operator specified for any component is in `exclude_operators`, an error is reported. If the user specifies neither `include_operators` nor `exclude_operators`, all operators are included.
Rationale and Alternatives
This design treats the `clean_ml()` function as a comprehensive ML preparation pipeline.
Prior Art
- Auto-sklearn: Python package for automatically selecting a machine learning pipeline for a specified dataset.
- datacleaner: An open-source data quality solution that supports many platforms, such as Spark.
- RapidMiner: A software platform for data science teams that unites data prep, machine learning, and predictive model deployment.
- PyCaret: Python package for automating machine learning workflows, including a data preprocessing part.
- vtreat: Python package that acts as a DataFrame processor/conditioner, preparing real-world data for supervised machine learning or predictive modeling in a statistically sound manner.
Future Possibilities
Implementation-level Actions
Additional Tasks