Open cod3licious opened 5 years ago
Expanding on this idea, it would be powerful to see an analysis of which models performed best on which types of datasets, using the dataset features that are already being gathered here. People could then plug in the features of a new dataset they want to analyze and be recommended a model to use.
I think it would make a really interesting paper to see whether certain algorithms perform best on certain types of datasets. Taking that idea one step further, I would love to see some sort of standard for describing dataset features, so that researchers and data scientists can share which models performed best on their datasets along with the features of those datasets. Pooling that knowledge would amount to a form of global hyperparameter search. This research group seems well positioned to lead such an effort.
EDIT: I see that these dataset features are called "metafeatures", and that you started doing some analysis on them. It would be really interesting if someone had time to finish that.
I'd be open to merging this functionality into pmlb if someone will send a PR for it.
I'm not sure if I'll have time but I will update if so.
I think the pmlb dataset compilation is really cool, but as far as benchmarking goes, it would be really great if the best possible results were known for every dataset, so that a new method could actually be benchmarked against the existing (sklearn) algorithms without having to redo the whole parameter selection etc. for all these models.
In that respect it would be really great if there were a function to which I could give the name of a dataset and get back the sklearn model that performed best on it, initialized with its best parameters, so that all I have to do is train and evaluate the model on the dataset to reproduce the benchmark results. I.e., similar to `fetch_data` from `pmlb`, it would be cool if I could call a `fetch_model` function with the dataset name and get back the initialized model, on which I can then call `fit` and `predict` to get at least some kind of reasonable baseline performance for each dataset.
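A rough sketch of what such a function might look like is below. `fetch_model` does not exist in pmlb; the function, the `BEST_MODELS` registry, and the `reproduce_baseline` helper are all hypothetical, and the registry entry is a placeholder rather than a measured best model or parameter set. Only `fetch_data(..., return_X_y=True)` and the sklearn calls are existing APIs.

```python
from importlib import import_module

# Hypothetical registry: dataset name -> (module path, class name, keyword arguments).
# The entry below is a placeholder, NOT a measured benchmark result.
BEST_MODELS = {
    "mushroom": (
        "sklearn.ensemble",
        "RandomForestClassifier",
        {"n_estimators": 100, "random_state": 0},
    ),
}


def fetch_model(dataset_name, registry=BEST_MODELS):
    """Return an unfitted model initialized with the stored parameters."""
    try:
        module_path, class_name, params = registry[dataset_name]
    except KeyError:
        raise KeyError(f"no stored model for dataset {dataset_name!r}") from None
    # Resolve the class lazily so the registry itself stays plain data.
    model_class = getattr(import_module(module_path), class_name)
    return model_class(**params)


def reproduce_baseline(dataset_name):
    """Train and score the stored model on a held-out split of the dataset."""
    # Imported here so that fetch_model has no hard dependency on pmlb.
    from pmlb import fetch_data
    from sklearn.model_selection import train_test_split

    X, y = fetch_data(dataset_name, return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = fetch_model(dataset_name)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)
```

Storing the registry as plain data (module path, class name, parameters) would also let it live in a JSON or YAML file next to the datasets, so results could be contributed without touching code.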