More datasets than in the paper?

udellgroup / oboe

An AutoML pipeline selection system to quickly select a promising pipeline for a new dataset.

BSD 3-Clause "New" or "Revised" License


More datasets than in the paper? #15

Closed sebastianpinedaar closed 2 years ago

sebastianpinedaar commented 2 years ago

I was checking the tensor in the repository, "oboe/large_files/error_tensor_f16_compressed.npz", and noticed that it contains 551 datasets, while the [paper](https://people.ece.cornell.edu/cy/_papers/tensor_oboe.pdf) mentions only 215 for meta-training. Did you add more? Moreover, is it possible to get the meta-features of these 551 datasets? Or how do you compute the best initializations when meta-learning with Auto-sklearn?
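For anyone who wants to repeat this check, here is a minimal sketch of listing the arrays and shapes stored in an `.npz` archive without assuming any key names. The helper `summarize_npz` and the stand-in file `demo_tensor.npz` are hypothetical and only make the snippet runnable on its own. The array names and axis order inside the real `error_tensor_f16_compressed.npz` are not stated in this thread.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: shape} for every array stored in an .npz archive."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Stand-in archive so the sketch runs anywhere; the shape below is made up
# and is NOT the shape of the real error tensor.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_tensor.npz")
np.savez_compressed(demo_path, demo=np.zeros((6, 3, 4), dtype=np.float16))

shapes = summarize_npz(demo_path)
print(shapes)
```

Pointing `summarize_npz` at the real file in a local clone should show which axis has length 551.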

Thanks!

chengrunyang commented 2 years ago

Thanks for the question! Yes, I added more datasets, and collected the meta-training performance with 5-fold cross-validation (instead of 3-fold in the TensorOboe paper) to make the system more robust.

As for meta-features, we are using factors of matrix or tensor decomposition as the dataset embeddings (or data-driven meta-features) in Oboe and TensorOboe, if that's what you are asking about.
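To illustrate what "factors as dataset embeddings" can mean, here is a minimal sketch using a truncated SVD of a small synthetic error matrix. It is not the Oboe or TensorOboe code: the matrix values, its size, and the rank are all made up, and the real systems may use a different factorization and handle missing entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic error matrix: 8 datasets x 5 pipelines, built to have rank 2.
# All values are illustrative, not real meta-training data.
n_datasets, n_pipelines, rank = 8, 5, 2
E = rng.random((n_datasets, rank)) @ rng.random((rank, n_pipelines))

# Truncated SVD gives E ~= X @ Y, with one row of X per dataset.
U, s, Vt = np.linalg.svd(E, full_matrices=False)
X = U[:, :rank] * s[:rank]  # dataset embeddings, shape (8, 2)
Y = Vt[:rank, :]            # pipeline embeddings, shape (2, 5)

print(X.shape, Y.shape)
print(np.allclose(E, X @ Y))
```

Each row of `X` plays the role of a data-driven meta-feature vector: it is learned from observed pipeline errors rather than computed from dataset statistics.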

For the best initializations given by auto-sklearn, we just use the default implementation in the auto-sklearn code repository at that time (I believe it was v0.12.1).

sebastianpinedaar commented 2 years ago

Thank you for your answer!

Would it be possible to get the source of the new datasets? Are they from OpenML? If so, would it be possible to share the corresponding task IDs?

Thanks!

chengrunyang commented 2 years ago

The meta-training datasets are all from OpenML, and their IDs are stored in oboe/defaults/TensorOboe/training_index.pkl. You can read the file with:

import os
import pickle

# Path to the TensorOboe defaults directory in your local clone of the repository.
path = '......oboe/oboe/defaults/TensorOboe'

# Load the OpenML dataset IDs used for meta-training.
with open(os.path.join(path, 'training_index.pkl'), 'rb') as handle:
    IDs = pickle.load(handle)

I did not collect the meta-training data by following OpenML task IDs, though. I ran 5-fold stratified cross-validation (sklearn.model_selection.StratifiedKFold with random_state=0) to evaluate the pipelines assembled from the components listed in Table 2 of the paper.
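The evaluation protocol described above can be sketched roughly as follows. This is an illustration, not the actual collection script: the synthetic dataset, the example pipeline (scaler plus logistic regression), and the plain misclassification error are all assumptions. `shuffle=True` is also an assumption; recent scikit-learn versions raise an error when `random_state` is set while `shuffle` is `False`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one meta-training dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold stratified split; shuffle=True is assumed so that random_state applies.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in cv.split(X, y):
    # Hypothetical pipeline; the real ones are built from Table 2 of the paper.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(X[train_idx], y[train_idx])
    # Misclassification error on the held-out fold.
    fold_errors.append(1.0 - pipeline.score(X[test_idx], y[test_idx]))

mean_error = float(np.mean(fold_errors))
print(len(fold_errors), round(mean_error, 3))
```

Repeating this loop for every (dataset, pipeline) pair would give one cross-validated error per entry of the error tensor.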

Hope this information helps, and please feel free to ask further questions.

chengrunyang commented 2 years ago

Closing this issue for now. Feel free to reopen for anything.