Closed marcdhansenesi closed 1 year ago
Hi everyone,
I just checked the code of `fit` in `MDLPVectDiscretizer` and I agree with the proposed fix. Reading the code, one can see that the search routine in the `fit` method does not guarantee sorted cut points, because they are stored in a standard set, which does not keep its values in sorted order. The routine itself also does not add the cut points in increasing or decreasing order, so switching to a list would not help either. Another solution could be to use a sorted set in line 116 of mdlp_discretizer.py. That would make the code depend on sortedcontainers, which, on the other hand, is already declared as a dependency.
`cut_points_` is now sorted with the following fix: `self.cut_points_ = np.array(sorted(list(cut_points)))`. The second issue (#134) on MDLPDiscretizer will be considered soon.
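To illustrate the point about sets, here is a minimal sketch in plain Python. The cut point values and the variable names are made up for the example and are not taken from mdlp_discretizer.py; the actual fix additionally wraps the sorted result in `np.array`.

```python
# Hypothetical cut points, inserted in the order a recursive search might discover them.
found_in_search_order = [5.5, 1.5, 3.5, 0.25]

cut_points = set()
for value in found_in_search_order:
    cut_points.add(value)

# Iterating a set yields an arbitrary order, so this list may or may not be ascending.
as_stored = list(cut_points)

# sorted() accepts any iterable and always returns a new ascending list.
as_sorted = sorted(cut_points)

print(as_stored)
print(as_sorted)
```

Sorting once at the end of the search keeps the set semantics (no duplicate cut points) without needing a sorted container during the search.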
Describe the bug
In `MDLPDiscretizer`, the `cut_points_` parameter is not sorted. However, in the `MDLPDiscretizer.transform` function in mdlp_discretizer.py on line 239, `np.searchsorted` assumes the first parameter (`cut_points_`) is sorted in ascending order. Also, the X output from `transform` assigns integer encodings based on the unsorted order and, in the example below, ignores the last cut point. So I'm not sure in general how to do an inverse transform of the encoded values in X back to the appropriate bin ranges for display to our users (maybe drop any cut points that are not in monotonically increasing order?). Finally, `cut_points_` cannot be passed to the pandas `cut` function to create user-friendly bin descriptions, as it expects the bins to be in monotonically increasing order.
To Reproduce
Expected behavior
`cut_points_` assigned during `fit` would be in sorted order and `transform` would then apply those cut points without skipping any. `cut_points_` would also be suitable for passing to the pandas `cut` function.
Screenshots
N/A
Desktop (please complete the following information):
Additional context
I believe the fix only requires changing `fit` in `MDLPVectDiscretizer` to return `cut_points_` as a sorted array: `self.cut_points_ = np.array(sorted(list(cut_points)))` instead of `self.cut_points_ = np.array(list(cut_points))`.
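As a rough sketch of what the sorted array enables, the snippet below encodes values with `np.searchsorted` and maps each code back to a readable bin range. The cut points, the sample values and the `bin_label` helper are all hypothetical, and the sketch assumes the default `side="left"` behavior of `np.searchsorted`; it is not the library's own `transform` code.

```python
import numpy as np

# Hypothetical cut points in the order an unsorted set might yield them.
raw_cut_points = [5.5, 1.5, 3.5]

# The proposed fix: sort before converting to an array.
cut_points = np.array(sorted(raw_cut_points))

# Illustrative feature values, not taken from the issue.
x = np.array([0.0, 2.0, 4.0, 6.0])

# With ascending cut points, searchsorted returns the bin index of each value.
codes = np.searchsorted(cut_points, x)


def bin_label(code, cuts):
    """Describe bin `code` as an interval string (hypothetical helper)."""
    # Pad the cut points with infinities so every code has a lower and upper edge.
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    return f"({float(edges[code])}, {float(edges[code + 1])}]"


labels = [bin_label(code, cut_points) for code in codes]
print(codes.tolist())
print(labels)
```

Because every cut point now bounds exactly one pair of neighboring bins, each integer code maps to a single range, which is what an inverse transform for display needs. The same ascending edges should also satisfy the monotonicity requirement of the pandas `cut` function.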