hlin117 / mdlp-discretization

An implementation of the minimum description length principle (MDLP) binning algorithm by Usama Fayyad
BSD 3-Clause "New" or "Revised" License

Inconsistent results with R package mdlp #12


ashtonwebster commented 7 years ago

First, I want to say your work in this library looks great and it is perfect for what I am trying to use it for. So thank you :)

I am getting very different results from what you obtained here: https://gist.github.com/hlin117/dee443d98af5ac3413c5. When I print conv_X after running your example from the README, it looks like this:

[[  4.  18.   0.   0.]
 [  1.  11.   0.   0.]
 [  0.  13.   0.   0.]
 [  0.  13.   0.   0.]
 [  2.  19.   0.   0.]
 [  7.  21.   0.   0.]
 [  0.  17.   0.   0.]
...

Meanwhile, the R package and your previous results give categories in the range [1, 3]. I would feel more confident in these results if they matched the R package's.

I created a new environment and set versions for my packages to match the ones you specify in requirements.txt.

Any idea why I might be obtaining these inconsistent results?
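For reference, here is essentially what I am running (the iris example from the README):

# Roughly the README's iris example, which produced the output above:
from sklearn.datasets import load_iris
from mdlp.discretization import MDLP

iris = load_iris()
X, y = iris.data, iris.target

transformer = MDLP()
conv_X = transformer.fit_transform(X, y)
print(conv_X[:7])  # the rows shown above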

hlin117 commented 7 years ago

Hi Ashton! Thanks for your comments.

Hmm, it seems the gist I posted is quite old. It looks like I made the script when I was trying to get the package into scikit-learn, but as of a few months ago, this project has been discontinued.

7 seems to have some breaking changes, which have caused a unit test in this package to fail. (I need to get Travis working for this project sometime.)

With all this said, I'll certainly look into your issue when I have the time.

ashtonwebster commented 7 years ago

Thanks @hlin117. I am not sure if this is a related error, but the unit test is also failing for me (in fact, the output discretized values are floats instead of integers).

As a nasty hack, I ended up using rpy2 to simply call the R mdlp function. Perhaps when I am done with my current project I can help take a look. It's too bad this wasn't accepted into scikit-learn; IMO this would be a useful discretization method.
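Roughly, the hack looks like this (a sketch: it assumes R and the discretization package are installed; per that package's docs, mdlp() takes a data frame whose last column is the class label and returns cutp and Disc.data):

# Sketch of the rpy2 workaround. Assumes the R "discretization" package.
import pandas as pd
from sklearn.datasets import load_iris
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()  # convert pandas data frames at the R call boundary

discretization = importr("discretization")

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sl", "sw", "pl", "pw"])
df["class"] = iris.target  # mdlp() expects the class label in the last column

result = discretization.mdlp(df)
disc_data = result.rx2("Disc.data")  # discretized features (1-based bins)
cut_points = result.rx2("cutp")      # per-feature cut points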

ChingChuan-Chen commented 5 years ago

The inconsistency is caused by a difference in the implementations. The R code is here.

R:

for (i in 1:(n-1)){
    if(x[i+1]!=x[i]) {  # they check x not y
        ct <- (x[i]+x[i+1])/2
        wx <- which(x<=ct)
        wn <- length(wx)/n
        e1 <- wn*ent(y[wx])
        e2 <- (1-wn)*ent(y[-wx])
        val <- e1+e2
        if(val<entropy) {
            entropy <- val
            ci <- i
        }
    }
}

Yours:

for ind in range(start + 1, end):

    # I choose not to use a `min` function here for this optimization.
    if y[ind-1] == y[ind]:  # you check y instead of x
        continue

    # Find the partition entropy, and check if this entropy is minimum
    first_half = (<float> (ind - start)) / length \
                        * slice_entropy(y, start, ind)[0]
    second_half = (<float> (end - ind)) / length \
                         * slice_entropy(y, ind, end)[0]
    curr_entropy = first_half + second_half

    if prev_entropy > curr_entropy:
        prev_entropy = curr_entropy
        k = ind

I think that checking y is weird, because the algorithm is meant to discretize x.
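Here is a toy illustration (made-up data; 0-based indices, unlike the 1-based R code above) of how the two conditions pick different candidate cuts when x has tied values:

import numpy as np

x = np.array([1.0, 1.0, 2.0, 2.0, 3.0])  # sorted attribute values, with ties
y = np.array([0, 1, 1, 0, 0])            # class labels

# R condition: consider a cut between i and i+1 only where x changes.
r_candidates = [i for i in range(len(x) - 1) if x[i + 1] != x[i]]

# This package's condition: consider a cut only where y changes.
py_candidates = [i for i in range(len(x) - 1) if y[i + 1] != y[i]]

print(r_candidates)   # [1, 3] -- cuts fall only between distinct x values
print(py_candidates)  # [0, 2] -- index 0 would split the tied x == 1.0 group

The R condition never places a cut inside a run of tied x values; the y-based condition can.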

ChingChuan-Chen commented 5 years ago

The original paper also states that only boundary points of the attribute are considered; boundary points are defined in Section 2.2.
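For clarity, here is a sketch of that boundary-point condition as I read Section 2.2 (boundary_points is a hypothetical helper, not code from either implementation): a midpoint between two adjacent distinct values of x is a candidate cut only if the examples carrying those two values are not all of one class.

import numpy as np

def boundary_points(x, y):
    order = np.argsort(x, kind="mergesort")
    x, y = x[order], y[order]
    cuts = []
    for i in range(len(x) - 1):
        if x[i + 1] == x[i]:
            continue  # never cut inside a run of tied attribute values
        # Collect the classes of all examples sharing the two adjacent values;
        # ties can hide a class change, so a plain y[i] != y[i+1] test misses it.
        classes = set(y[x == x[i]]) | set(y[x == x[i + 1]])
        if len(classes) > 1:  # a class change occurs across this midpoint
            cuts.append((x[i] + x[i + 1]) / 2)
    return cuts

Note this is stricter than both implementations above: it checks x (like the R code) and still skips midpoints where no class change occurs (the intent of the y check).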