mi2-warsaw / FSelectorRcpp

Rcpp (free of Java/Weka) implementation of FSelector entropy-based feature selection algorithms with a sparse matrix support
http://mi2-warsaw.github.io/FSelectorRcpp/
35 stars 15 forks source link

Does FSelectorRcpp produce the same results as FSelector? #51

Open larskotthoff opened 7 years ago

larskotthoff commented 7 years ago

Do you guys have any tests to check this? We're thinking of replacing FSelector with FSelectorRcpp in mlr, but we'd like to be sure that we remain reproducible.

@berndbischl

zzawadz commented 7 years ago

For some functions - yes - see e.g. https://github.com/mi2-warsaw/FSelectorRcpp/blob/master/tests/testthat/test-information_gain.R

This is part of the code:

test_that("Comparsion with FSelector", {
  expect_equal(information.gain(Species ~ ., data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris)$importance)

  expect_equal(gain.ratio(Species ~ ., data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris,
                                type = "gainratio")$importance)

  expect_equal(symmetrical.uncertainty(Species ~ .,
                                       data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris,
                                type = "symuncert")$importance)
})

For other functions please send us a list of functionalities which must be checked against FSelector, and then we will prepare required tests to convince you that everything is fine:)

larskotthoff commented 7 years ago

I'd love to see tests for all of the functions that users can call, ideally on a range of different inputs. Maybe using quickcheck (https://github.com/RevolutionAnalytics/quickcheck)?

larskotthoff commented 7 years ago

Oh and once I'm convinced I'm willing to officially deprecate FSelector in favour of FSelectorRcpp.

zzawadz commented 7 years ago

Ok. We will work on this.

Thanks!

MarcinKosinski commented 7 years ago

@zzawadz another amazing challenge for FSelectorRcpp : )

Maybe it'll be the easiest way to include FSelectorRcpp in the FSelector

zzawadz commented 7 years ago

@MarcinKosinski Good idea. We can replace functionalities (inner implementation) in FSelector step by step to reach the convergence. @larskotthoff What do you think?

larskotthoff commented 7 years ago

Sounds good. Pull requests welcome!

MarcinKosinski commented 7 years ago

So this can be closed - https://github.com/mi2-warsaw/FSelectorRcpp/issues/27 : ) @larskotthoff is aware of that we will suggest inner implementation

MarcinKosinski commented 7 years ago

Getting back to this thread. FSelectorRcpp will be available on CRAN again soon (removed because lack of informtion of C++ dependency) https://github.com/mi2-warsaw/FSelectorRcpp/issues/69

To enable FSelectorRcpp be a part of FSelector engine I think we could try substituting

FSelector:::information.gain.body() function with the FSelectorRcpp::information_gain(). We need to polish FSelectorRcpp edition to produce the same results as FSelector and also enable some another approaches to dealing with NAs and discretization of dependent variable.

2 tasks should be finished then

FSelector:::information.gain.body <- function(params, equal = TRUE) {
    FSelectorRcpp::information_gain(params, equal = equal)
}
RandomGuessR commented 5 years ago

Hi, I am struggling to get the same results from FSelectorRcpp and FSelector - posted under this issue: https://github.com/mlr-org/mlr/issues/1677#issuecomment-431234791. The results I get are actually very different, and the impact on an end model is large. Would appreciate your help if I am doing anything wrong. Thanks!

zzawadz commented 5 years ago

@RandomGuessR

FSelectorRcpp treats integer columns like factors, not numeric, and because of that, it does not discretize them before calculating the information gain. You need to cast the integers columns into numerics to get the same result:

See the code below:

library(FSelectorRcpp)
library(FSelector)
dt <- read.csv("~/Downloads/all/train.csv")

dt2 <- data.frame(
  yy = dt$target,
  X0deb4b6a8 = dt$X0deb4b6a8,
  X0deb4b6a8Numeric = as.numeric(dt$X0deb4b6a8)
)

information_gain(yy ~ ., dt2, equal = TRUE)
#          attributes  importance
# 1        X0deb4b6a8 0.001443917
# 2 X0deb4b6a8Numeric 0.000000000

information.gain(yy ~ ., dt2)
#                   attr_importance
# X0deb4b6a8                      0
# X0deb4b6a8Numeric               0
RandomGuessR commented 5 years ago

Thanks for helping with this so quickly! Might be good to document this difference somewhere in the package(s)

MarcinKosinski commented 5 years ago

Kudos for Zzawadz

pt., 19 paź 2018, 10:34 użytkownik RandomGuessR notifications@github.com napisał:

Thanks for helping with this so quickly!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mi2-warsaw/FSelectorRcpp/issues/51#issuecomment-431288122, or mute the thread https://github.com/notifications/unsubscribe-auth/AGdazkRXSO22dOfoPnz52ZiKuTYusGvSks5umY6kgaJpZM4MdpFY .

zzawadz commented 5 years ago

@RandomGuessR @MarcinKosinski

I found an inconsistent behavior in FSelectorRcpp:( The information_gain does not discretize integers, but discretize do this:( I consider this as a bug, and I'll fix this.

RandomGuessR commented 5 years ago

Thanks @zzawadz.

After changing the data from integer to numeric, FSelectorRcpp works like a treat; really happy with the performance. The RWeka-based implementation was too slow for most real-world practical purposes.

zzawadz commented 5 years ago

We (I should say me) decided that FSelectorRcpp will try to mimic the behavior of FSelector so that since 0.3.0 integers will be treated as numerics by default, not factors.