rhiever / sklearn-benchmarks

A centralized repository to report scikit-learn model performance across a variety of parameter settings and data sets.
MIT License

Auto category discovery #26

Closed harshnisar closed 8 years ago

harshnisar commented 8 years ago

Tried:

Both produced a lot of false positives. I've discarded Benford's law as a metric and reduced the maximum number of unique values to 0.001 times the number of rows in the dataset.

Ultimately I used something really hacky. Given that you've used sklearn's LabelEncoder for all your encoding, I can assume the encodings start at zero and end at N - 1, where N is the number of unique values in the column. So I simply check whether the column's unique values form the complete list of integers from 0 to N - 1. :P

There are still some false positives like age, but that's fine for now I guess.
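
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names `looks_label_encoded` and `_is_whole_number` and the parameter `max_unique_ratio` are hypothetical, and combining the 0.001 unique-value cap with the 0 to N - 1 check is an assumption. It uses plain Python so it runs without any dependencies.

```python
def _is_whole_number(value):
    """True for ints and integer-valued floats; bools and NaN are rejected."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    return isinstance(value, float) and value.is_integer()


def looks_label_encoded(values, max_unique_ratio=None):
    """Heuristic: does this column look like label-encoder output?

    Returns True when the unique values are exactly 0, 1, ..., N - 1,
    where N is the number of unique values. `max_unique_ratio` is an
    optional, assumed cap on N relative to the number of rows.
    """
    n_rows = len(values)
    if n_rows == 0:
        return False
    unique = set(values)
    if not all(_is_whole_number(v) for v in unique):
        return False
    n_unique = len(unique)
    # Optional cap, e.g. 0.001 means at most 1 unique value per 1000 rows.
    if max_unique_ratio is not None and n_unique > max_unique_ratio * n_rows:
        return False
    # Complete run of integers starting at zero, with no gaps.
    return {int(v) for v in unique} == set(range(n_unique))


encoded = [0, 1, 2, 1, 0]        # typical label-encoded column
sparse = [18, 25, 40]            # integers, but not a complete 0..N-1 run
ages = list(range(90))           # every age from 0 to 89 appears once

print(looks_label_encoded(encoded))                       # True
print(looks_label_encoded(sparse))                        # False
print(looks_label_encoded(ages))                          # True (false positive)
print(looks_label_encoded(ages, max_unique_ratio=0.001))  # False
```

The `ages` column shows why a numeric column such as age can slip through: if every integer from 0 up to the maximum happens to occur, the values are indistinguishable from encoded labels unless the unique-value cap rejects them.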

In the next two PRs (over this weekend) I am going to finish covering all the metafeatures commonly found elsewhere. No point waiting for an API. Also (assuming metafeatures help), whenever TPOT faces a new dataset, it will need to find the dataset's metafeatures first in order to recommend the starting population, so this had better be packaged with TPOT. I might be wrong about this.

You can delete #25 as this PR has the monkey_runner script too.

Number of categorical columns for the first few datasets.

image

harshnisar commented 8 years ago

Added two more classes of metafeatures: descriptive stats for the kurtosis and skew of all numerical variables.
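
As a rough illustration of what such metafeatures might look like, the sketch below computes the skew and excess kurtosis of each numerical column and then summarizes them across columns. Everything here is an assumption made for illustration: the function names, the choice of summary statistics (mean, standard deviation, min, max), and the use of population moments are not taken from the PR.

```python
import statistics


def skew_and_kurtosis(xs):
    """Population skewness and excess kurtosis of one numeric column.

    Returns None for empty or constant columns, where both are undefined.
    """
    n = len(xs)
    if n == 0:
        return None
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    if m2 == 0:
        return None
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def summarize(values):
    """Descriptive stats over per-column values (assumed set of stats)."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def skew_kurtosis_metafeatures(columns):
    """`columns` maps a column name to a list of numbers."""
    pairs = [skew_and_kurtosis(col) for col in columns.values()]
    pairs = [p for p in pairs if p is not None]  # skip undefined columns
    if not pairs:
        return {}
    return {
        "skew": summarize([s for s, _ in pairs]),
        "kurtosis": summarize([k for _, k in pairs]),
    }


data = {
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [0, 0, 0, 1],
    "constant": [7, 7, 7],  # ignored: skew and kurtosis are undefined
}
features = skew_kurtosis_metafeatures(data)
print(features["skew"])
print(features["kurtosis"])
```

In practice the per-column values would more likely come from `scipy.stats.skew` and `scipy.stats.kurtosis` or the pandas equivalents; the hand-rolled moments are used here only to keep the sketch dependency-free.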