ropensci / ozunconf17

Website for 2017 rOpenSci Ozunconf

http://ozunconf17.ropensci.org/

add entropy-based estimate of number of effective values to skimr package #9

Open rgayler opened 7 years ago

rgayler commented 7 years ago

skimr is an rOpenSci package implementing a pipeable approach to creating summary statistics.

Given a novel data set, I find it helpful to get a feel for whether a numeric variable is more like a continuous or a discrete variable. I calculate the information-theoretic entropy of the values, convert it to an effective number of levels (i.e. the number of equiprobable levels having the same entropy), and compare the number of effective levels to the number of unique values.
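A minimal sketch of that calculation, assuming Shannon entropy with natural logarithms, so that the effective number of levels is exp(H). The function name `effective_levels` is hypothetical and is not part of skimr.

```r
# Hypothetical helper (not a skimr function): entropy-based effective number of levels.
effective_levels <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  p <- as.numeric(table(x)) / length(x)  # empirical probability of each distinct value
  p <- p[p > 0]                          # guard against unused factor levels
  h <- -sum(p * log(p))                  # Shannon entropy in nats
  exp(h)                                 # number of equiprobable levels with the same entropy
}

# Four equally frequent values: effective levels match the unique count.
effective_levels(c(1, 2, 3, 4))

# Four unique values, but almost all of the mass is on one of them:
# the effective number of levels is close to 1.
skewed <- c(rep(0, 97), 1, 2, 3)
c(unique = length(unique(skewed)), effective = effective_levels(skewed))
```

Because the logarithm and the exponential use the same base, the result does not depend on whether entropy is measured in nats or bits.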

This can also be done for character variables, which gives an idea of the number of effective levels for variables with extremely skewed distributions (e.g. words, names).
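As an illustration for character data, here is a small sketch with a made-up word vector (the words and their counts are invented for the example). It uses only base R, with `table()` and `prop.table()` providing the empirical distribution.

```r
# Invented, heavily skewed word frequencies: 100 tokens, 8 distinct words.
words <- c(rep("the", 60), rep("of", 25), rep("and", 10),
           "entropy", "skimr", "ozunconf", "levels", "mass")

p <- as.numeric(prop.table(table(words)))  # relative frequency of each word
n_unique <- length(p)                      # number of distinct words
n_effective <- exp(-sum(p * log(p)))       # entropy-based effective number of words

c(unique = n_unique, effective = n_effective)
```

With these counts the effective number of words comes out at roughly 3, well below the 8 distinct words, because three words carry most of the probability mass.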

njtierney commented 7 years ago

This sounds interesting! Would this help to identify, say, whether a variable with the values

var <- c(91, 92, 93, 93.01, 94)

is more "discrete", since there is only one non-integer value in there?

Perhaps related: there is the readr::parse_guess() function, which helps identify whether something is an integer, double, date, etc.

rgayler commented 7 years ago

It's more to do with how the variable should be treated in modelling. Even if you are willing to treat a numeric variable as continuous, you may have the vast majority of probability mass on a small number of values, in which case it will have a small number of effective values, so it behaves more like a discrete-valued variable. This is more relevant when you don't know whether you should treat the data as continuous.
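To make the contrast concrete, here is a sketch on simulated data (the two variables are invented for the example, and `effective_levels` is a hypothetical helper rather than a skimr function). It compares the ratio of effective levels to unique values for a genuinely continuous variable and for one with most of its mass on a single value.

```r
set.seed(1)

# Hypothetical helper: exp(Shannon entropy) of the empirical distribution.
effective_levels <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  exp(-sum(p * log(p)))
}

# Ratio near 1 suggests continuous-like behaviour; near 0 suggests discrete-like.
effective_ratio <- function(x) effective_levels(x) / length(unique(x))

x_cont  <- runif(1000)                  # every value distinct
x_spiky <- c(rep(0, 900), runif(100))   # 90% of the mass on the value 0

c(continuous = effective_ratio(x_cont), spiky = effective_ratio(x_spiky))
```

Both variables are stored as doubles, yet the spiky one has about 100 unique values and only around 2 effective ones, which is the signal that its type alone does not reveal.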

Unlike scientific data, where you have generated and understand the data, I tend to get data where I have no idea how it has been encoded. So I can see whether a variable is an int, double, etc., but I can't see how real-world things are mapped to values of the variable.