sramirez / spark-infotheoretic-feature-selection

This package contains a generic implementation of greedy Information Theoretic Feature Selection (FS) methods. The implementation is based on the common theoretic framework presented by Gavin Brown. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided.
http://sci2s.ugr.es/BigData
Apache License 2.0

Info-Theoretic Framework requires positive values in range [0, 255] #10

Closed: michaelws92 closed this issue 6 years ago

michaelws92 commented 6 years ago

Does your algorithm not support double or float values? Do you have any suggestions if my data contains very large values, in the millions? Your algorithm does not support it.

sramirez commented 6 years ago

Hi Michael,

You can discretize your data with my package spark-MDLP. I have updated the README file to include the information you need:

> LabeledPoint data must be discretized as integer values in double representation, ranging from 0 to 255. By doing so, double values can be converted to bytes directly, which makes the overall selection process much more efficient (communication overhead is greatly reduced).

Please refer to the MDLP package if you need to discretize your dataset:

https://spark-packages.org/package/sramirez/spark-MDLP-discretization
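If spark-MDLP is not an option, a minimal sketch of manual equal-width binning into [0, 255] is shown below. It assumes an `RDD[LabeledPoint]` plus per-feature minima and maxima computed beforehand (for example with `Statistics.colStats`); the helper name `binTo255` and the variables `mins`/`maxs` are hypothetical and not part of this package.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: equal-width binning of every feature into 256 bins,
// producing integer values in [0, 255] stored as doubles, as the README requires.
// `mins` and `maxs` are the per-feature minima/maxima of the training data.
def binTo255(data: RDD[LabeledPoint],
             mins: Array[Double],
             maxs: Array[Double]): RDD[LabeledPoint] = {
  data.map { lp =>
    val binned = lp.features.toArray.zipWithIndex.map { case (v, i) =>
      val range = maxs(i) - mins(i)
      if (range == 0.0) 0.0                        // constant feature -> single bin
      else math.min(255.0, math.floor((v - mins(i)) / range * 256.0))
    }
    LabeledPoint(lp.label, Vectors.dense(binned))
  }
}
```

Note that MDLP usually produces better cut points than plain equal-width binning, so the spark-MDLP package above remains the recommended route; this sketch only illustrates the [0, 255] integer-valued-doubles requirement.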