rhiever / sklearn-benchmarks

A centralized repository to report scikit-learn model performance across a variety of parameter settings and data sets.
MIT License

function to easily get best sklearn models for each dataset #42

Open cod3licious opened 5 years ago

cod3licious commented 5 years ago

I think the pmlb dataset collection is really cool, but for benchmarking purposes it would be great if the best known results for every dataset were available, so that a new method can be benchmarked against the existing (sklearn) algorithms without having to redo the whole parameter selection for all of these models.

In that respect, it would be really great to have a function that takes the name of a dataset and returns the sklearn model that performed best on it, initialized with its best parameters, so that all I have to do is train and evaluate the model on the dataset to reproduce the benchmark results. That is, similar to fetch_data from pmlb, I could call a fetch_model function with the dataset name and get back the initialized model, on which I can then call fit and predict to get at least a reasonable baseline performance for each dataset.

skeller88 commented 4 years ago

Expanding on this idea, it would be powerful to see an analysis of which models performed best on which types of datasets, using the features of each dataset that are already being gathered here. People could then plug in the features of a new dataset they want to analyze and get a recommendation for which model to use.

I think a paper examining whether certain algorithms perform best on certain types of datasets would be really interesting. Taking that idea one step further, I would love to see some sort of standard for describing dataset features, so that researchers and data scientists can share which models performed best on their datasets along with the features of those datasets. Pooling that knowledge could be a form of global hyperparameter search. This research group seems well positioned to lead such an effort.

EDIT: I see that these dataset features are called "metafeatures", and that you have started doing some analysis on them. It would be really interesting if someone had time to finish that.
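A rough sketch of how the metafeature-based recommendation described above might work, assuming a simple nearest-neighbor lookup. The function names, the particular metafeatures chosen, and the idea of matching on standardized distance are all illustrative assumptions, not the analysis started in this repository.

```python
import numpy as np


def compute_metafeatures(X, y):
    """Describe a classification dataset with a few simple, illustrative metafeatures."""
    X = np.asarray(X)
    y = np.asarray(y)
    _, class_counts = np.unique(y, return_counts=True)
    return np.array([
        np.log10(X.shape[0]),                     # log number of samples
        np.log10(X.shape[1]),                     # log number of features
        float(len(class_counts)),                 # number of classes
        class_counts.min() / class_counts.max(),  # class balance (1.0 = balanced)
    ])


def recommend_model(metafeatures, known_metafeatures, known_best_models):
    """Return the best model recorded for the most similar known dataset.

    known_metafeatures: one row per previously benchmarked dataset.
    known_best_models: the best-performing model (name or object) for each row.
    """
    known = np.asarray(known_metafeatures, dtype=float)
    # Standardize each metafeature so that no single one dominates the distance.
    scale = known.std(axis=0)
    scale[scale == 0] = 1.0
    distances = np.linalg.norm((known - metafeatures) / scale, axis=1)
    return known_best_models[int(np.argmin(distances))]
```

Nearest-neighbor matching is only the simplest option here; a classifier trained on the pooled metafeatures to predict the winning model would be a natural next step once enough benchmark results are available.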

rhiever commented 4 years ago

I'd be open to merging this functionality into pmlb if someone will send a PR for it.

skeller88 commented 4 years ago

I'm not sure if I'll have time, but I will update here if so.