imoscovitz / wittgenstein

Ruleset covering algorithms for transparent machine learning
MIT License

Dropping samples unnecessarily (possible bug) #15

Open Veghit opened 3 years ago

Veghit commented 3 years ago

https://github.com/imoscovitz/wittgenstein/blob/5dbb2ecdbaee425d0bb547c6d8bdc73c919f35bd/wittgenstein/base.py#L127

The linked line drops duplicates even though it should be fine to have duplicate samples in the data frame. This leads to incorrect probability estimates down the road. Am I missing something in the function's logic here?
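To illustrate the concern with a toy pandas example (not the library's code): value-based deduplication silently shifts class frequencies whenever the data legitimately contains repeated samples.

import pandas as pd

# Two identical positive samples plus one negative sample
df = pd.DataFrame({'x': [1, 1, 2], 'y': [1, 1, 0]})

print(df['y'].mean())                    # 0.667 -- true positive rate
print(df.drop_duplicates()['y'].mean())  # 0.5   -- biased after dropping duplicates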

Arzik1987 commented 2 years ago

@Veghit, I assume the logic behind this operation was to avoid counting one example multiple times, since rules are not mutually exclusive. However, this line indeed causes incorrect behavior when a dataset contains duplicates. Here is an example:

from sklearn.datasets import make_classification
import wittgenstein as lw
import numpy as np

X, y = make_classification(random_state=2021)
ripper = lw.RIPPER(random_state=2021)
ripper.fit(X, y)
print(ripper.score(X, y))  # prints 0.89
print(ripper.score(np.vstack([X, X]), np.hstack([y, y])))  # duplicating every sample should also give 0.89, but returns 0.7

UPD. Having studied the downstream logic, I also see that the command is redundant. Removing it fixes the issue in the example above and does not break the overall logic, since predict() already works with a set of unique indices anyway.
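For reference, one way to keep the no-double-counting intent without collapsing genuine duplicates is to deduplicate by row index rather than by value. A minimal sketch (the frame names and the concat-based union are my assumptions about the covering step, not the library's actual code):

import pandas as pd

# Three samples; rows 0 and 1 are genuine duplicates of each other
df = pd.DataFrame({'a': [1, 1, 2]}, index=[0, 1, 2])

# Suppose two overlapping rules cover rows {0, 1} and {1, 2} respectively
rule1_covered = df.loc[[0, 1]]
rule2_covered = df.loc[[1, 2]]
union = pd.concat([rule1_covered, rule2_covered])  # row 1 appears twice

# Value-based dedup (the behavior on the linked line) also merges rows 0 and 1
print(len(union.drop_duplicates()))           # 2 -- undercounts

# Index-based dedup counts each covered row once but keeps duplicate samples
print(len(union[~union.index.duplicated()]))  # 3 -- correct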