AutoLearner crashes when using a dataset with 'NaN' values

sebastianpinedaar commented 2 years ago

Hi,

I wanted to use AutoLearner with method="Oboe" on OpenML dataset 168868, but it crashes because the dataset contains NaN values. Do you know why?

chengrunyang commented 2 years ago

Sorry for my late reply! As of right now, AutoLearner with method="Oboe" will fail if the dataset has not been preprocessed. This is because Oboe was originally designed only to select ML estimators for datasets that are already preprocessed. According to the beginning of Section 5 of the Oboe paper (https://arxiv.org/pdf/1808.03233.pdf):

"Since data pre-processing is not our focus, we preprocess all datasets in the same way: one-hot encode categorical features and then standardize all features to have zero mean and unit variance. These pre-processed datasets are used in all the experiments."

We use the pre_process function at https://github.com/udellgroup/oboe/blob/9c5b47d890bfa88ce4c67ee1450d89961d55fa9f/oboe/preprocessing.py#L12 to preprocess datasets. Given a 2D numpy.ndarray feature array X whose rows are data points, we call X_preprocessed, categorical = pre_process(X, categorical, impute=True, standardize=True, one_hot_encode=True). This handles a general dataset with missing entries, mixed-type features, and non-standardized features.

Could you try preprocessing the dataset with the above function, and then passing X_preprocessed together with the label array y to the AutoLearner?
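For readers who want to see what the three steps (impute, one-hot encode, standardize) amount to, here is a minimal NumPy-only sketch. It is not oboe's pre_process implementation: the function name preprocess_sketch, the choice of mean imputation for numeric columns and most-frequent imputation for categorical columns, and the toy array are all assumptions made for illustration. It also assumes every column has at least one observed (non-NaN) value.

```python
import numpy as np


def preprocess_sketch(X, categorical):
    """Impute NaNs, one-hot encode categorical columns, standardize everything.

    Hypothetical stand-in for illustration only; not oboe's pre_process.
    X: 2D float array, rows are data points, NaN marks a missing entry.
    categorical: list of bools, one per column of X.
    """
    X = np.asarray(X, dtype=float)
    columns = []
    for j, is_cat in enumerate(categorical):
        col = X[:, j].copy()
        missing = np.isnan(col)
        observed = col[~missing]
        if is_cat:
            # Assumed strategy: fill with the most frequent observed category.
            values, counts = np.unique(observed, return_counts=True)
            col[missing] = values[np.argmax(counts)]
            # One indicator column per distinct category.
            for v in np.unique(col):
                columns.append((col == v).astype(float))
        else:
            # Assumed strategy: fill with the mean of the observed entries.
            col[missing] = observed.mean()
            columns.append(col)
    out = np.column_stack(columns)
    # Standardize every feature; constant columns end up as all zeros.
    std = out.std(axis=0)
    std[std == 0] = 1.0
    return (out - out.mean(axis=0)) / std


# Toy data: column 0 is numeric, column 1 is categorical, both have a NaN.
X = np.array([[1.0, 0.0],
              [np.nan, 1.0],
              [3.0, np.nan],
              [4.0, 1.0]])
X_clean = preprocess_sketch(X, categorical=[False, True])
```

After this step the array contains no NaN values, which is the property AutoLearner with method="Oboe" relies on. With oboe installed, the pre_process call quoted above would take the place of preprocess_sketch.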

chengrunyang commented 2 years ago

Closing this issue for now. Feel free to reopen.

udellgroup / oboe

AutoLearner crashes when using a dataset with 'NaN' values #16