udellgroup / oboe

An AutoML pipeline selection system to quickly select a promising pipeline for a new dataset.
BSD 3-Clause "New" or "Revised" License

AutoLearner crashes when using a dataset with 'NaN' values #16

Closed sebastianpinedaar closed 2 years ago

sebastianpinedaar commented 2 years ago

Hi,

I wanted to use AutoLearner with method="Oboe" for the OpenML dataset=168868, but it fails because the dataset contains NaN values. Do you know why?

chengrunyang commented 2 years ago

Sorry for the late reply! Currently, AutoLearner with method="Oboe" fails if the dataset is not preprocessed, because Oboe was originally designed to select ML estimators only for preprocessed datasets. As stated at the beginning of Section 5 of the Oboe paper (https://arxiv.org/pdf/1808.03233.pdf):

"Since data pre-processing is not our focus, we preprocess all datasets in the same way: one-hot encode categorical features and then standardize all features to have zero mean and unit variance. These pre-processed datasets are used in all the experiments."

We use the pre_process function at https://github.com/udellgroup/oboe/blob/9c5b47d890bfa88ce4c67ee1450d89961d55fa9f/oboe/preprocessing.py#L12 to preprocess datasets. Given a 2D numpy.ndarray feature array X whose rows are data points, we call X_preprocessed, categorical = pre_process(X, categorical, impute=True, standardize=True, one_hot_encode=True) to handle a general dataset with missing entries, mixed-type features, and non-standardized features.

Could you try preprocessing the dataset with the above function, and then passing X_preprocessed together with the label array y to AutoLearner?
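To make the three options in that call concrete, here is a minimal, dependency-free sketch of what imputing, one-hot encoding, and standardizing do to a small dataset with missing entries. It is not oboe's implementation: the function name preprocess_sketch is hypothetical, the fill strategies (column mean for numeric features, most frequent value for categorical ones) are assumptions, and it works on plain Python lists rather than a numpy.ndarray. Unlike pre_process, it returns only the transformed rows and not the second value, categorical.

```python
import math
from statistics import mean, mode, pstdev


def preprocess_sketch(rows, categorical):
    """Impute, one-hot encode, then standardize a small row-wise dataset.

    rows: list of rows, each a list of floats; NaN marks a missing entry.
    categorical: list of bools, one per column (True = categorical).
    """
    n_cols = len(categorical)
    columns = [[row[j] for row in rows] for j in range(n_cols)]

    out_columns = []
    for col, is_cat in zip(columns, categorical):
        observed = [v for v in col if not math.isnan(v)]
        if is_cat:
            # Assumption: fill missing categories with the most frequent value.
            fill = mode(observed)
            filled = [fill if math.isnan(v) else v for v in col]
            # One-hot encode: one 0/1 indicator column per distinct category.
            for level in sorted(set(filled)):
                out_columns.append([1.0 if v == level else 0.0 for v in filled])
        else:
            # Assumption: fill missing numeric entries with the column mean.
            fill = mean(observed)
            out_columns.append([fill if math.isnan(v) else v for v in col])

    # Standardize every resulting column to zero mean and unit variance.
    standardized = []
    for col in out_columns:
        mu, sigma = mean(col), pstdev(col)
        # Constant columns are only centered, to avoid dividing by zero.
        standardized.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])

    # Transpose back to the row-wise layout.
    return [list(row) for row in zip(*standardized)]


nan = float("nan")
X = [[1.0, 0.0], [nan, 1.0], [3.0, nan], [2.0, 1.0]]
categorical = [False, True]  # second column is categorical
X_clean = preprocess_sketch(X, categorical)
```

After this step X_clean has no NaN entries and three columns (one numeric, two indicator columns), which is the kind of input the Oboe method expects. For real data, use the pre_process function linked above instead of this sketch.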

chengrunyang commented 2 years ago

Closing this issue for now. Feel free to reopen.