mi2-warsaw / FSelectorRcpp

Rcpp (free of Java/Weka) implementation of FSelector entropy-based feature selection algorithms with a sparse matrix support
http://mi2-warsaw.github.io/FSelectorRcpp/

Add `nbins` argument to `information.gain()` #79

Closed: pat-s closed this 4 years ago

pat-s commented 5 years ago

This makes it possible for the user to choose the number of bins (`nbins`) used for equal-frequency discretization when `equal = TRUE`.

The number of bins can then optionally be tuned when `information_gain()` is used with other packages, e.g. mlr.

Other changes

Example

library(mlr)
#> Loading required package: ParamHelpers
library(FSelectorRcpp)

# generate data
task = dropFeatures(bh.task, "chas")

data = getTaskData(task)
x = data[getTaskFeatureNames(task)]
y = data[[getTaskTargetNames(task)]]

# equal = FALSE (default)
information_gain(x = x, y = y)
#> Warning in .information_gain.data.frame(x = x, y = y, type = type, equal
#> = equal, : Dependent variable is a numeric! It will be converted to
#> factor with simple factor(y). We do not discretize dependent variable
#> in FSelectorRcpp by default! You can choose equal frequency binning
#> discretization by setting equal argument to TRUE.
#>    attributes importance
#> 1        crim          0
#> 2          zn          0
#> 3       indus          0
#> 4         nox          0
#> 5          rm          0
#> 6         age          0
#> 7         dis          0
#> 8         rad          0
#> 9         tax          0
#> 10    ptratio          0
#> 11          b          0
#> 12      lstat          0

# equal = TRUE
information_gain(x = x, y = y, equal = TRUE)
#>    attributes importance
#> 1        crim  0.2433297
#> 2          zn  0.1259694
#> 3       indus  0.3493898
#> 4         nox  0.3575879
#> 5          rm  0.4123003
#> 6         age  0.2516128
#> 7         dis  0.1499913
#> 8         rad  0.1491553
#> 9         tax  0.2543226
#> 10    ptratio  0.2924772
#> 11          b  0.1011264
#> 12      lstat  0.5993705

# equal = TRUE, nbins = 10
information_gain(x = x, y = y, equal = TRUE, nbins = 10)
#>    attributes importance
#> 1        crim  0.2990401
#> 2          zn  0.1308077
#> 3       indus  0.3393358
#> 4         nox  0.3566715
#> 5          rm  0.3767511
#> 6         age  0.2052093
#> 7         dis  0.1916598
#> 8         rad  0.1700187
#> 9         tax  0.1988506
#> 10    ptratio  0.2356597
#> 11          b  0.1081561
#> 12      lstat  0.5917914

# equal = TRUE, nbins = 30
information_gain(x = x, y = y, equal = TRUE, nbins = 30)
#>    attributes importance
#> 1        crim  0.2534268
#> 2          zn  0.0000000
#> 3       indus  0.2637486
#> 4         nox  0.2436495
#> 5          rm  0.4489540
#> 6         age  0.2348189
#> 7         dis  0.2423434
#> 8         rad  0.0000000
#> 9         tax  0.2316402
#> 10    ptratio  0.2622159
#> 11          b  0.0000000
#> 12      lstat  0.6510120

Created on 2019-06-25 by the reprex package (v0.3.0)
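
To make the effect of `nbins` in the output above easier to follow, here is a rough base-R sketch of equal-frequency binning of a numeric dependent variable followed by an information-gain computation. It is not FSelectorRcpp's implementation: the helpers `equal_freq_bins()`, `entropy()` and `info_gain()` are hypothetical names introduced only for illustration, the feature is assumed to be discrete already, and the data are simulated.

```r
# Hypothetical helper (not part of FSelectorRcpp): split a numeric vector
# into at most `nbins` equal-frequency bins using empirical quantiles.
equal_freq_bins <- function(v, nbins = 5) {
  probs <- seq(0, 1, length.out = nbins + 1)
  # Ties can produce duplicated quantiles, so keep unique cut points only
  breaks <- unique(quantile(v, probs = probs, names = FALSE))
  if (length(breaks) < 2) {
    # Constant vector: everything falls into a single bin
    return(factor(rep(1L, length(v))))
  }
  cut(v, breaks = breaks, include.lowest = TRUE)
}

# Shannon entropy (natural log) of a discrete variable
entropy <- function(f) {
  p <- table(f) / length(f)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain written as H(x) + H(y) - H(x, y)
info_gain <- function(x, y) {
  entropy(x) + entropy(y) - entropy(interaction(x, y, drop = TRUE))
}

# Simulated data: a discrete feature and a numeric dependent variable
set.seed(1)
x <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))
y <- as.integer(x) + rnorm(200, sd = 0.5)

# The estimate depends on how finely the dependent variable is binned
for (nb in c(5, 10, 30)) {
  cat("nbins =", nb, "information gain =",
      info_gain(x, equal_freq_bins(y, nb)), "\n")
}
```

Because ties can collapse quantile cut points, the sketch may return fewer than `nbins` bins for variables with many repeated values; how the package itself handles that case is not shown here.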

MarcinKosinski commented 5 years ago

CC @zzawadz

MarcinKosinski commented 4 years ago

Hey @pat-s, I see an error on Travis, but it looks like a dependency issue. Let me restart the build and remove the unneeded dependencies to see if that helps.

MarcinKosinski commented 4 years ago

Thanks @pat-s for the update