hlin117 / mdlp-discretization

An implementation of the minimum description length principle expert binning algorithm by Usama Fayyad
BSD 3-Clause "New" or "Revised" License

Optimize cut points: duplicate feature values with different labels #27

Open dodgejoel opened 6 years ago

dodgejoel commented 6 years ago

If there are multiple rows in the column being discretized that have the same value but different classes, then the discretization algorithm can produce different results depending on the order in which the rows appear in the dataset. This is obviously incorrect and should be fixed.
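To illustrate the order dependence described above, here is a minimal sketch (not the package's actual code) of index-based candidate cuts: with a stable sort, tied feature values keep their input order, so the same multiset of (value, label) pairs can yield different candidate cuts.

```python
import numpy as np

def naive_cut_candidates(col, y):
    """Index-based candidate cuts: a cut between sorted positions i and
    i+1 is considered whenever the label changes. This is the flawed
    view this issue is about, not the package's actual code."""
    order = np.argsort(col, kind="stable")  # stable: ties keep input order
    y_s = y[order]
    return [i for i in range(len(y_s) - 1) if y_s[i] != y_s[i + 1]]

# Same (value, label) pairs, but the two rows with value 2.0 and
# different labels are swapped between the two datasets.
col_a = np.array([1.0, 2.0, 2.0, 3.0]); y_a = np.array([0, 0, 1, 1])
col_b = np.array([1.0, 2.0, 2.0, 3.0]); y_b = np.array([0, 1, 0, 1])

print(naive_cut_candidates(col_a, y_a))  # → [1]
print(naive_cut_candidates(col_b, y_b))  # → [0, 1, 2]
```

The two calls disagree even though no threshold on the feature value can actually separate the two rows with value 2.0.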

hlin117 commented 6 years ago

I know what you mean, but this isn't a bug per se; it's just a feature of the algorithm. From what I know, there isn't a smart way to address this problem, which is why I have the random_state parameter here.

If you do have a suggestion, I'd rather you add it as another option to the algorithm than as a breaking change. I'm looking forward to what you might find!

dodgejoel commented 6 years ago

This can be fixed by ensuring that each cut point corresponds not only to a change in the y-value but also to a change in the col value. Otherwise, the entropy computations for the subsets on each side of a cut do not agree with the real distribution of the data.

The implementation operates on indices as if they correspond perfectly with cuts, but this correspondence is not actually possible when there are repeated col values. If each col value appears with only a single y-value, this effect is mitigated by throwing out indices where the y-value does not change; but if some col value appears with multiple different classes, the implementation is incorrect.
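A sketch of the fix being proposed, assuming a hypothetical helper (the name and grouping logic are illustrative, not the package's API): group equal feature values together, then place candidate thresholds only between distinct values, and only when the label composition differs across that boundary. The result is independent of row order.

```python
import numpy as np

def boundary_cut_candidates(col, y):
    """Candidate thresholds only between *distinct* feature values
    whose label compositions differ (illustrative sketch of the fix,
    not the package's actual code)."""
    order = np.argsort(col, kind="stable")
    col_s, y_s = col[order], y[order]
    # group equal values together and record the set of labels per group
    values, starts = np.unique(col_s, return_index=True)
    ends = list(starts[1:]) + [len(y_s)]
    groups = [set(y_s[s:e]) for s, e in zip(starts, ends)]
    cuts = []
    for i in range(len(values) - 1):
        # a cut between values[i] and values[i+1] is worth evaluating
        # unless both blocks are pure and carry the same single class
        if len(groups[i]) > 1 or len(groups[i + 1]) > 1 or groups[i] != groups[i + 1]:
            cuts.append((values[i] + values[i + 1]) / 2.0)
    return cuts

# Both row orders of the duplicated value 2.0 now give the same cuts.
print(boundary_cut_candidates(np.array([1.0, 2.0, 2.0, 3.0]),
                              np.array([0, 0, 1, 1])))  # → [1.5, 2.5]
print(boundary_cut_candidates(np.array([1.0, 2.0, 2.0, 3.0]),
                              np.array([0, 1, 0, 1])))  # → [1.5, 2.5]
```

Because cuts are enumerated over distinct values rather than row indices, rows sharing a value can never be split, and the entropy on each side of a cut matches the true distribution of the data.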

dodgejoel commented 6 years ago

Any response here?

hlin117 commented 6 years ago

> This can be fixed by ensuring that each cut point corresponds not only to a change in the y-value but also to a change in the col value. Otherwise, the entropy computations for the subsets on each side of a cut do not agree with the real distribution of the data.
>
> The implementation operates on indices as if they correspond perfectly with cuts, but this correspondence is not actually possible when there are repeated col values. If each col value appears with only a single y-value, this effect is mitigated by throwing out indices where the y-value does not change; but if some col value appears with multiple different classes, the implementation is incorrect.

Can you cite an implementation or paper where they make this happen?

From my glance at the code for the R analog of this package, they don't do what you describe.

dodgejoel commented 6 years ago

I agree that the R package does not appear to be as careful as I am suggesting we be here. I believe the original paper itself is a strong and clear enough citation for this suggestion. Reading the argument in the paper, I think it is clear that the cuts under consideration genuinely correspond to cuts determined by a threshold on the feature value, not merely to indices in the array holding the values.

If other implementations of this algorithm do not take the level of care I am suggesting, I think that is not a good reason to avoid making this change.

hlin117 commented 6 years ago

I genuinely respect that you're taking a great level of care with this, and I do agree that it's possible that the current code is not producing optimal cuts. (I've actually reached out to Huan Liu before with this very same problem.)

I think I implemented this fix in a previous version of this package, but removed it for reasons I can't remember. If you would like to implement this, then be my guest!

dodgejoel commented 6 years ago

I genuinely appreciate you engaging with my suggestions! I will implement this change and attempt to add some appropriate unit tests around the change.

Thanks!