mi2-warsaw / FSelectorRcpp

Rcpp (free of Java/Weka) implementation of FSelector entropy-based feature selection algorithms with sparse matrix support
http://mi2-warsaw.github.io/FSelectorRcpp/

Enable FSelectorRcpp to deal with NAs in explanatory variables as RWeka::Discretize does #63

Open krzyslom opened 7 years ago

krzyslom commented 7 years ago
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

FSelector::information.gain(Species ~ ., iris_na)
FSelectorRcpp::information_gain(Species ~ ., iris_na)

This is related to issue #51.

MarcinKosinski commented 7 years ago

@krzyslom I think FSelectorRcpp completely removes rows with NAs. Can you provide a summary of FSelector's behaviour in this case?
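To make "completely removes rows with NAs" concrete, here is a minimal base-R sketch of listwise deletion on the example data from the first comment. It only illustrates the idea with `na.omit`; it is an assumption that this mirrors what FSelectorRcpp does internally, and the name `complete_rows` is made up for the illustration.

```r
# Same data as in the opening comment: one NA in an explanatory variable.
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

# Listwise deletion: drop every row that has an NA in any column.
# (Illustration only; not taken from the FSelectorRcpp sources.)
complete_rows <- na.omit(iris_na)

nrow(iris_na)        # 150
nrow(complete_rows)  # 149
```

Under this strategy the whole first observation is lost for all attributes, not only for Sepal.Length.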

Following FSelector:::information.gain.body -> FSelector:::discretize.all -> FSelector:::supervised.discretization, I see:

function (formula, data) 
{
    data = get.data.frame.from.formula(formula, data)
    complete = complete.cases(data[[1]])
    all.complete = all(complete)
    if (!all.complete) {
        new_data = data[complete, , drop = FALSE]
        result = Discretize(formula, data = new_data, na.action = na.pass)
        return(result)
    }
    else {
        return(Discretize(formula, data = data, na.action = na.pass))
    }
}
<environment: namespace:FSelector>

So FSelector removes only the rows where the dependent variable is NA. The only thing left to check is how FSelector (through its interface to RWeka::Discretize) deals with NAs in the explanatory variables.
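The filtering step in the function above can be reproduced without RWeka. The sketch below is a rough approximation: it uses `model.frame(..., na.action = na.pass)` as a stand-in for FSelector's internal `get.data.frame.from.formula`, on the assumption that the helper likewise puts the response in the first column. The names `mf`, `keep` and `kept` are only for illustration.

```r
iris_na <- iris
iris_na$Sepal.Length[1] <- NA   # NA in an explanatory variable
iris_na$Species[2] <- NA        # NA in the dependent variable

# Stand-in for get.data.frame.from.formula (assumption): the response
# ends up in the first column and NAs are passed through untouched.
mf <- model.frame(Species ~ ., data = iris_na, na.action = na.pass)

# Same test as in supervised.discretization: only the response is checked.
keep <- complete.cases(mf[[1]])
kept <- mf[keep, , drop = FALSE]

nrow(kept)                # 149: only the row with a missing Species is gone
anyNA(kept$Sepal.Length)  # TRUE: the NA in the explanatory variable survives
```

So the row with the missing Sepal.Length is still passed on to Discretize, which is why its NA handling is the open question.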

> RWeka::Discretize
An R interface to Weka class 'weka.filters.supervised.attribute.Discretize', which has
information

  An instance filter that discretizes a range of numeric attributes in the dataset
  into nominal attributes. Discretization is by Fayyad & Irani's MDL method (the
  default).

  For more information, see:

  Usama M. Fayyad, Keki B. Irani: Multi-interval discretization of continuousvalued
  attributes for classification learning. In: Thirteenth International Joint
  Conference on Articial Intelligence, 1022-1027, 1993.

  Igor Kononenko: On Biases in Estimating Multi-Valued Attributes. In: 14th
  International Joint Conference on Articial Intelligence, 1034-1040, 1995.

  BibTeX:

  @INPROCEEDINGS{Fayyad1993,
    publisher = {Morgan Kaufmann Publishers},
    year = {1993},
    pages = {1022-1027},
    author = {Usama M. Fayyad and Keki B. Irani},
    title = {Multi-interval discretization of continuousvalued attributes for
      classification learning},
    volume = {2},
    booktitle = {Thirteenth International Joint Conference on Articial Intelligence},
  }

  @INPROCEEDINGS{Kononenko1995,
    year = {1995},
    pages = {1034-1040},
    PS = {http://ai.fri.uni-lj.si/papers/kononenko95-ijcai.ps.gz},
    author = {Igor Kononenko},
    title = {On Biases in Estimating Multi-Valued Attributes},
    booktitle = {14th International Joint Conference on Articial Intelligence},
  }

Argument list:
  x(formula, data, subset, na.action, control = NULL)

Returns objects inheriting from classes:
  Discretize data.frame