hlin117 / mdlp-discretization

An implementation of the minimum description length principal expert binning algorithm by Usama Fayyad
BSD 3-Clause "New" or "Revised" License
101 stars 53 forks source link

Use base2 log in entropy calculation #26

Open dodgejoel opened 6 years ago

dodgejoel commented 6 years ago

The scipy entropy function called supports a base argument that can be used. Of course, the original paper's proof applies when entropy is calculated in base 2 not base e.

hlin117 commented 6 years ago

I don't think this is necessary. The function should work just as well with log_e compared to log_2. Is there any advantage to using log_2 instead of log_e?

I avoided log_2 in the original code here because it was unnecessary, and made the code perhaps harder to read.

dodgejoel commented 6 years ago

I think the theoretical results on optimality of the discretization obtained with this method specifically depend on the entropy calculations ,which enter into the decision to accept or reject a given cut-point, being calculated with log_2. At some point the authors even specifically call this out as a requirement for the validity of their results: "Note that this theorem requires that the entropy Ent(S) be defined using logarithms to the base 2" on page 1026.

I don't see how using a different base in the logarithm impacts code readability in a substantial way. When you say that the function works "just as well with log_e compared to log_2" what exactly do you mean?

dodgejoel commented 6 years ago

Perhaps you are claiming that the cut points produced will always be the same? Is it formally true that it does not matter which base is used in this algorithm?

dodgejoel commented 6 years ago

any response here?