Closed e10e3 closed 1 week ago
Hi @e10e3, I see your point about the error. However, following the River convention, this type of situation should be dealt with on the application side. So it is not a bug, but the intended behavior.
I am also against making kNN slow in cases where the data is numeric only.
Note that linear models, decision trees and other models are also prone to fail if the user passes a mix of numeric and non-numeric data directly.
As possible solutions, the user can use a pipeline with One-Hot Encoding or supply a custom distance metric, as you propose in the related PR.
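To illustrate the first option, here is a minimal plain-Python sketch of what one-hot encoding does to a feature dict before a distance is computed. The helper names `one_hot` and `euclidean` are hypothetical and are not River's API; in practice the encoding would be done by a preprocessing step in the pipeline.

```python
from numbers import Number


def one_hot(x: dict) -> dict:
    """Replace each non-numeric feature with a 0/1 indicator feature.

    Hypothetical helper for illustration only, not River's encoder.
    """
    encoded = {}
    for name, value in x.items():
        if isinstance(value, Number):
            # Numeric features pass through unchanged.
            encoded[name] = value
        else:
            # "colour": "red" becomes "colour_red": 1.
            encoded[f"{name}_{value}"] = 1
    return encoded


def euclidean(a: dict, b: dict) -> float:
    """Euclidean distance over sparse dicts; missing keys count as 0."""
    keys = a.keys() | b.keys()
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys) ** 0.5


a = one_hot({"height": 1.0, "colour": "red"})
b = one_hot({"height": 4.0, "colour": "blue"})
d = euclidean(a, b)  # subtraction now only ever sees numbers
```

After encoding, every value is numeric, so a subtraction-based distance no longer raises an error, and a kNN that only handles numbers stays as fast as before.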
I agree with your point.
I guess I won't be the last (nor the first) to propose such a change. Is this convention explained in a document for future reference?
Hi @e10e3, that is a good point. We have an entry on the River FAQ about input validation.
Versions
River version: 0.21.1
Python version: 3.12.4
Operating system: macOS 14.5
Describe the bug
When a kNN model makes a prediction, it computes the distance between its input and previous data points. If some of the features are nominal (i.e. not numbers), kNN will raise an error because it tries to perform a subtraction on unsupported data types.
One would expect kNN to be resilient to nominal data, since there exist ways to derive a distance between non-numeric features. A simple one is to give a distance of 0 when they are equal and 1 if they are different.
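A minimal sketch of such a distance, assuming features are passed as dicts: numeric features contribute their absolute difference, and any other pair contributes 0 when equal and 1 otherwise. The name `mixed_distance` and the parameter `p` are illustrative and do not come from River.

```python
from numbers import Number


def mixed_distance(a: dict, b: dict, p: float = 2.0) -> float:
    """Minkowski-style distance that tolerates nominal features.

    Illustrative only: numeric pairs use |a - b|; every other pair
    (including a feature missing on one side) uses a 0/1 mismatch.
    """
    total = 0.0
    for key in a.keys() | b.keys():
        va, vb = a.get(key), b.get(key)
        if isinstance(va, Number) and isinstance(vb, Number):
            diff = abs(va - vb)
        else:
            diff = 0.0 if va == vb else 1.0
        total += diff ** p
    return total ** (1 / p)


same_colour = mixed_distance(
    {"height": 1.0, "colour": "red"},
    {"height": 4.0, "colour": "red"},
)
other_colour = mixed_distance(
    {"height": 1.0, "colour": "red"},
    {"height": 4.0, "colour": "blue"},
)
```

The type check on every feature is the extra per-comparison cost that would also be paid on purely numeric data.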
Code to reproduce
Output