Hi @jpfeil, thanks for reporting.
Indeed, the tree models are not designed to deal with NaN values. Since we use a naturally sparse input representation, i.e., dictionaries, we expect a missing feature value to simply be an absent key-value pair. The algorithms are robust to this situation.
Thanks, @smastelini. I was playing around with the frequency of NaNs, and it looks like the error is not thrown when the NaN frequency is below 0.4, but it is thrown for frequencies above that.
```python
from pprint import pprint
from random import choice

import numpy as np
from numpy.random import binomial

from river import datasets
from river.forest import ARFClassifier
from river.tree.splitter import HistogramSplitter

dataset = datasets.Phishing()
model = ARFClassifier(splitter=HistogramSplitter(), grace_period=1)
corrupt = None

print(dataset.n_samples)
for i, (x, y) in enumerate(dataset):
    if corrupt is None:
        corrupt = choice(list(x.keys()))
    if binomial(1, 0.3, 1)[0]:
        x[corrupt] = np.nan
        print(i)
        print(corrupt)
        print(x)
    y_pred = model.predict_one(x)
    model.learn_one(x, y)
```
Do you recommend I drop these features, or is there a potential fix for this? I don't necessarily know ahead of time which features may become NaN in an online setting, so this could be an issue for data streams that change over time.
Hi, @jpfeil. River follows the principle that it is easier to "ask for forgiveness than permission". This theme has already popped up in a couple of issues; that is why we don't do any data checks in the algorithms.
Ensuring that NaNs are skipped for every instance would imply checking every input feature, every time. If we made such a change in ARF, we would also need to extend it to every algorithm in River. This kind of checking adds a high cost to the algorithms' operation, so ideally it should be handled on the application side.
Thanks for clarifying, @smastelini. Is there a pipeline function that will remove NaN values? I could write my own, but it would be nice if there were a preprocessing function for removing NaNs.
Update: Found it: https://riverml.xyz/dev/api/preprocessing/StatImputer/
Versions
river version: 0.21.1
Python version: 3.10.14
Operating system: Ubuntu 20.04
Describe the bug
ARFClassifier throws a TypeError when at least one feature is always NaN. This may be a relatively common edge case with streaming data, so it might be worth handling the case where a feature has so far only been NaN.
Steps/code to reproduce