scikit-mine / scikit-mine

scikit-mine : pattern mining in Python
https://scikit-mine.github.io/scikit-mine/
BSD 3-Clause "New" or "Revised" License
72 stars 10 forks source link

MDLPDiscretizer: cut_points_ is not sorted #129

Closed marcdhansenesi closed 1 year ago

marcdhansenesi commented 1 year ago

Describe the bug In MDLPDiscretizer, the cut_points_ parameter is not sorted. However, in the MDLPDiscretizer.transform function in mdlp_discretizer.py on line 239, np.searchsorted assumes the first parameter (cut_points_) is sorted in ascending order. Also, the X output from transform assigns integer encodings based on the unsorted order and in the example below, ignores the last cut point. So I'm not sure in general how to do an inverse transform of the encoded values in X back to the appropriate bin ranges for display to our users (maybe drop any cut points that are not in monotonically increasing order?) Finally, cut_points_ cannot be passed to the pandas cut function to create user-friendly bin descriptions, as it expects the bins to be in monotonically increasing order.

To Reproduce

import numpy as np
import pandas as pd
from skmine.preprocessing import MDLPDiscretizer
X = np.ndarray(shape=(14, 1), dtype=np.float64,
               buffer=np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 3.0, 3.0, 4.0, 4.0, 2.0, 2.0]))
y = np.array(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]), dtype=np.int32)
disc = MDLPDiscretizer()
disc.fit(X, y)
print(f"disc.cut_points_ = {disc.cut_points_}")
# disc.cut_points_ = {0: array([1.5, 2. , 3. , 3.5, 4. , 1. ])}
Xt = disc.transform(X).T
print(f"Xt = {Xt}")
# Xt = [[0 0 0 0 0 0 2 1 2 2 4 4 1 1]]
bins_for_pd= np.concatenate(([-np.inf], disc.cut_points_[0], [np.inf]), axis=None)
user_friendly_bins = pd.cut(np.ravel(X), bins_for_pd)
# ValueError: bins must increase monotonically.

Expected behavior cut_points_ assigned during fit would be in sorted order and transform would then apply those cut points without skipping any. cut_points_ would also be suitable for passing to pandas cut function.

# disc.cut_points_ = {0: array([1., 1.5, 2. , 3. , 3.5, 4. ])}
Xt = disc.transform(X).T
print(f"Xt = {Xt}")
# Xt = [[0 0 0 0 0 0 3 2 3 3 5 5 2 2]]
bins_for_pd= np.concatenate(([-np.inf], disc.cut_points_[0], [np.inf]), axis=None)
user_friendly_bins = pd.cut(np.ravel(X), bins_for_pd)
print(f"user_friendly_bins = {user_friendly_bins}")
# user_friendly_bins = [(-inf, 1.0], (-inf, 1.0], (-inf, 1.0], (-inf, 1.0], (-inf, 1.0], ..., (2.0, 3.0], (3.5, 4.0], (3.5, 4.0], (1.5, 2.0], (1.5, 2.0]]
# Length: 14
# Categories (7, interval[float64, right]): [(-inf, 1.0] < (1.0, 1.5] < (1.5, 2.0] < (2.0, 3.0] <
#                                            (3.0, 3.5] < (3.5, 4.0] < (4.0, inf]]

Screenshots N/A

Desktop (please complete the following information):

Additional context I believe the fix only requires changing fit in MDLPVectDiscretizer to return cut_points_ as a sorted array: self.cut_points_ = np.array(sorted(list(cut_points))) instead of self.cut_points_ = np.array(list(cut_points))

lgalarra commented 1 year ago

Hi everyone,

I just checked the code of fit in MDLPVectDiscretizer and I agree with the proposed fix. By reading the code, one can see that the search routine in the fit method does not guarantee sorted cut points because these are stored in a standard set, which does not guarantee sorted values (besides the routine itself does not guarantee that the cut points are added to the set in increasing or decreasing order, so a list is not a choice either). Another solution could be to use a sorted set in line 116 of mdlp_discretizer.py. That makes the code dependent on sortedcontainers, which on the other hand, is declared as a dependency.

thomasbtnfr commented 1 year ago

cutpoints is now sorted with the following fix: self.cut_points_ = np.array(sorted(list(cut_points)). The second issue (#134 ) on MDLPDiscretizer will be considered soon.