ianozsvald / beyond_correlation

Exploratory code to see if we can learn about feature relationships in a DataFrame using machine learning
MIT License
55 stars, 19 forks

Two bits of feedback from a user #16

Closed: ianozsvald closed this 5 years ago

ianozsvald commented 5 years ago

" Two things that possibly stuck out: How do I fix the random seed? And maybe have the categorical default to an empty set to make regression problems a bit cleaner." https://www.linkedin.com/feed/update/urn%3Ali%3Aactivity%3A6543945632275546112/?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A6543945632275546112%2C6544252539892703232%29

JesperDramsch commented 5 years ago

Hey Ian,

Default Classifier_Overrides

As of right now, it seems you have to pass an empty set explicitly:

classifier_overrides = set()
df_results = discover.discover(df, classifier_overrides)

That's at least what I took from playing around with it and looking at the examples. Having classifier_overrides default to set() would clean up the code a bit: it only applies to classifiers, where you'd have to declare it anyway, so making it the default would probably be a good choice.
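A minimal sketch of what such a default could look like. The `discover` function below is a hypothetical stand-in written for illustration, not the library's actual implementation, and the plain dict stands in for a DataFrame. It uses `None` as the sentinel, since a literal `set()` in the signature would be a single object shared across calls.

```python
def discover(df, classifier_overrides=None):
    """Hypothetical stand-in for discover.discover (not the library's code)."""
    # None is the sentinel: a mutable set() default would be shared between calls.
    if classifier_overrides is None:
        classifier_overrides = set()
    # Assumed behaviour for this sketch: columns named in classifier_overrides
    # are treated as classification targets, all others as regression targets.
    return {
        col: "classifier" if col in classifier_overrides else "regressor"
        for col in df
    }


# A plain dict of columns stands in for a DataFrame here.
columns = {"age": [22, 35, 58], "survived": [0, 1, 1]}

# Regression-only problems no longer need to mention the overrides at all.
regression_only = discover(columns)

# Classification targets are still declared explicitly.
mixed = discover(columns, classifier_overrides={"survived"})
```

With this shape, the two-line call from the examples collapses to a single `discover(df)` for pure regression problems.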

Reproducibility

Setting the random state improves reproducibility of findings; it might even be worth defaulting to a fixed value. I fell into a bit of a trap here: I wrote about a discover matrix, then got somewhat different results on re-running.
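To illustrate the reproducibility point, here is a small sketch using scikit-learn directly. It assumes a random forest estimator and synthetic data from `make_regression`; the helper name `cv_scores` is made up for this example and is not part of the library.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, seeded so the dataset itself is reproducible.
X, y = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=0)


def cv_scores(random_state):
    """Cross-validated scores for a forest built with the given seed."""
    est = RandomForestRegressor(n_estimators=10, random_state=random_state)
    return cross_val_score(est, X, y, cv=3)


# Two runs with the same integer seed produce the same scores.
seeded_first = cv_scores(random_state=0)
seeded_second = cv_scores(random_state=0)
same = np.allclose(seeded_first, seeded_second)
```

Passing `random_state=None` instead would draw a fresh seed each time, so repeated runs can give slightly different scores.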

JesperDramsch commented 5 years ago

Seen here but it seems I just misinterpreted the example, with None actually being fine.

JesperDramsch commented 5 years ago

#17 is to fix the random state in sklearn.

ianozsvald commented 5 years ago

Many thanks for the contribution, Jesper. I'm speaking at a conference this weekend (PyLondinium), so I'll get this reviewed next week. Thank you!

JesperDramsch commented 5 years ago

That's cool. Enjoy the conference!

ianozsvald commented 5 years ago

Cheers for the addition. Later (or you might, if you fancy), it would be sensible to pass in a kw_sklearn kwargs dict, with random_state as one of the possible parameters. That also opens up a can of worms around how to specify metrics (accuracy is a bit crap... maybe balanced_accuracy as a reasonable replacement?) and other options, so it might take a bit of thought. Cheers!
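One possible shape for that idea, as a rough sketch only. The function `score_classifier` and its signature are hypothetical and not part of the library; the sketch assumes a random forest classifier and relies on scikit-learn's built-in `"balanced_accuracy"` and `"accuracy"` scoring names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def score_classifier(X, y, kw_sklearn=None, scoring="balanced_accuracy"):
    """Hypothetical helper: forward kw_sklearn to the estimator, pick a metric."""
    # Copy so the caller's dict is never mutated; None means "use sklearn defaults".
    kw_sklearn = dict(kw_sklearn or {})
    est = RandomForestClassifier(**kw_sklearn)
    return cross_val_score(est, X, y, cv=3, scoring=scoring).mean()


# Imbalanced synthetic data, where plain accuracy can look flattering.
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0
)

kw = {"n_estimators": 10, "random_state": 0}
balanced = score_classifier(X, y, kw_sklearn=kw)
plain = score_classifier(X, y, kw_sklearn=kw, scoring="accuracy")
```

Keeping the metric as a separate argument rather than a key in kw_sklearn avoids mixing estimator parameters with scoring options, though that split is only one of several reasonable designs.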