Both approaches I tried had a lot of false positives.
I've discarded Benford's law as a metric and reduced the maximum number of unique values to 0.001 times the number of rows in the dataset.
Ultimately I used something really hacky.
Given you've used sklearn's LabelEncoder for all your encoding, I can assume the encodings start at zero and end at N - 1, where N is the number of unique values in the column. So I simply check whether the unique values of the column are exactly the integers 0 through N - 1. :P
There are still some false positives like age, but that's fine for now I guess.
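For reference, a minimal sketch of that check, assuming the column is available as a plain Python iterable. The name `looks_label_encoded` is hypothetical and is not the function used in this PR:

```python
def looks_label_encoded(values):
    """Return True if the distinct values are exactly 0, 1, ..., N - 1.

    Hypothetical helper illustrating the heuristic; not the PR's actual code.
    """
    unique = set(values)
    if not unique:
        return False
    # Integral floats compare and hash equal to ints in Python (2.0 == 2),
    # so a float column holding 0.0, 1.0, 2.0 still matches. NaN, strings
    # and fractional values never equal a member of the range, so they fail.
    return unique == set(range(len(unique)))


encoded = [0, 2, 1, 1, 0]        # label-encoded categories
prices = [0.5, 1.2, 3.0]         # continuous values
one_based = [1, 2, 3]            # does not start at zero
ages = [3, 0, 1, 2, 4, 2]        # dense small integers: a false positive

print(looks_label_encoded(encoded))
print(looks_label_encoded(prices))
print(looks_label_encoded(one_based))
print(looks_label_encoded(ages))
```

The last case shows why a column like age can still slip through: any numeric column whose values happen to cover 0 to N - 1 without gaps is indistinguishable from an encoded one under this test.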
In the next two PRs (over this weekend) I am going to finish covering all the metafeatures commonly found elsewhere. No point waiting for an API. Also (given that metafeatures help), whenever TPOT faces a new dataset, it will need to compute that dataset's metafeatures first in order to recommend the starting population, so this had better ship as part of TPOT. I might be wrong about this.
You can delete #25 as this PR has the monkey_runner script too.
Number of categorical columns for the first few datasets.