online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License
4.89k stars 538 forks source link

kNN causes an error when the input has nominal data #1565

Closed e10e3 closed 1 week ago

e10e3 commented 1 week ago

Versions

River version: 0.21.1 Python version: 3.12.4 Operating system: macOS 14.5

Describe the bug

When a kNN model does a prediction, it computes the distance of its input with previous data points.\ If some of the feature are nominal (i.e. not numbers), kNN will cause an error because it tries to perform a subtraction on unsupported data types.

One would expect kNN to be resilient to nominal data, since there exist ways to derive a distance between non-numeric features. A simple one is to give a distance of 0 when they are equal and 1 if they are different.

Code to reproduce

from river import neighbors

m = neighbors.KNNClassifier()

m.learn_one({"a": "river"}, 0)
m.predict_one({"a": "river"})

Output

Traceback (most recent call last):
  File "/path/to/river/knn-strings.py", line 6, in <module>
    m.predict_one({"a": "river"})
  File "/path/to/river/river/base/classifier.py", line 66, in predict_one
    y_pred = self.predict_proba_one(x, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/knn_classifier.py", line 149, in predict_proba_one
    nearest = self._nn.search((x, None), n_neighbors=self.n_neighbors, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/ann/swinn.py", line 431, in search
    return self._linear_scan(item, n_neighbors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/ann/swinn.py", line 339, in _linear_scan
    points = [(p.item, self.dist_func(item, p.item)) for p in self]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/neighbors/base.py", line 33, in __call__
    return self.distance_function(a[0], b[0])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/utils/math.py", line 165, in minkowski_distance
    return sum((abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in {*a.keys(), *b.keys()}) ** (1 / p)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/river/river/utils/math.py", line 165, in <genexpr>
    return sum((abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in {*a.keys(), *b.keys()}) ** (1 / p)
                    ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'str' and 'str'
smastelini commented 1 week ago

Hi @e10e3, I see your point with the error. However, following the river convention, this type of situation should be dealt with on the application side. So, it is not a bug, but the intended behavior.

I am also against making kNN slow in cases where the data is numeric only.

Note that linear models, decisions trees and other models are also prone to fail if the user passes numeric and non-numeric data directly.

As possible solutions, the user can use a pipeline with One-Hot Encoding or supplying a custom distance metric as you propose in the related PR.

e10e3 commented 1 week ago

I agree with your point.

I guess I won't be the last (nor the first) to propose such change, is this convention explained in a document for future reference?

smastelini commented 1 week ago

Hi @e10e3, that is a good point. We have an entry on the River FAQ about input validation.