online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

Edge Case: TypeError with HistogramSplitter when Data is NaN #1552

Closed jpfeil closed 1 month ago

jpfeil commented 1 month ago

Versions

river version: 0.21.1 Python version: 3.10.14 Operating system: Ubuntu 20.04

Describe the bug

ARFClassifier throws a TypeError when at least one feature is always NaN. Since this is a fairly common edge case with streaming data, it might be worth handling the case where a feature's values have so far only been NaN.

Steps/code to reproduce

# Sample code to reproduce the problem
# Please do your best to provide a Minimal, Reproducible Example: https://stackoverflow.com/help/minimal-reproducible-example

from pprint import pprint
from river import datasets
import numpy as np
from random import choice
from river.forest import ARFClassifier
from river.tree.splitter import HistogramSplitter

dataset = datasets.Phishing()

model = ARFClassifier(splitter=HistogramSplitter())
corrupt = None
for x, y in dataset:
    if corrupt is None:
        corrupt = choice(list(x.keys()))
    x[corrupt] = np.nan
    y_pred = model.predict_one(x)
    model.learn_one(x, y)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 17
     15 x[corrupt] = np.nan
     16 y_pred = model.predict_one(x)
---> 17 model.learn_one(x, y)

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/forest/adaptive_random_forest.py:174, in BaseForest.learn_one(self, x, y, **kwargs)
    171 if not self._warning_detection_disabled and self._background[i] is not None:
    172     self._background[i].learn_one(x=x, y=y, w=k)  # type: ignore
--> 174 model.learn_one(x=x, y=y, w=k)
    176 drift_input = None
    177 if not self._warning_detection_disabled:

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/tree/hoeffding_tree_classifier.py:368, in HoeffdingTreeClassifier.learn_one(self, x, y, w)
    365     node = self._root
    367 if isinstance(node, HTLeaf):
--> 368     node.learn_one(x, y, w=w, tree=self)
    369     if self._growth_allowed and node.is_active():
    370         if node.depth >= self.max_depth:  # Max depth reached

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/tree/nodes/htc_nodes.py:193, in LeafNaiveBayesAdaptive.learn_one(self, x, y, w, tree)
    190     if len(nb_pred) > 0 and max(nb_pred, key=nb_pred.get) == y:
    191         self._nb_correct_weight += w
--> 193 super().learn_one(x, y, w=w, tree=tree)

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/tree/nodes/leaf.py:174, in HTLeaf.learn_one(self, x, y, w, tree)
    172 self.update_stats(y, w)
    173 if self.is_active():
--> 174     self.update_splitters(x, y, w, tree.nominal_attributes)

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/tree/nodes/leaf.py:109, in HTLeaf.update_splitters(self, x, y, w, nominal_attributes)
    106         splitter = self.splitter.clone()
    108     self.splitters[att_id] = splitter
--> 109 splitter.update(att_val, y, w)

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/tree/splitter/histogram_splitter.py:38, in HistogramSplitter.update(self, att_val, target_val, w)
     36 def update(self, att_val, target_val, w):
     37     for _ in range(int(w)):
---> 38         self.hists[target_val].update(att_val)

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/sketch/histogram.py:170, in Histogram.update(self, x)
    168 # Bins have to be merged if there are more than max_bins
    169 if len(self) > self.max_bins:
--> 170     self._shrink(1)

File ~/.pyenv/versions/river/lib/python3.10/site-packages/river/sketch/histogram.py:186, in Histogram._shrink(self, k)
    183             min_idx = idx
    185     # Merge the bins
--> 186     self[min_idx] += self.pop(min_idx + 1)
    187     return
    189 indexes = range(len(self) - 1)

File ~/.pyenv/versions/3.10.14/lib/python3.10/collections/__init__.py:1223, in UserList.__getitem__(self, i)
   1221     return self.__class__(self.data[i])
   1222 else:
-> 1223     return self.data[i]

TypeError: list indices must be integers or slices, not NoneType
smastelini commented 1 month ago

Hi @jpfeil, thanks for reporting.

Indeed, the tree models are not designed to handle NaN values. Since we work with a naturally sparse input representation, i.e., dictionaries, we expect a missing feature value to be represented by the absence of the corresponding key-value pair. The algorithms are robust to that situation.
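To illustrate the convention described above, a sketch using hypothetical feature names (not tied to any particular dataset): a missing feature should simply be absent from the dict, rather than present with a NaN value.

```python
import math

# With River's dict-based inputs, "missing" means the key is absent:
x_ok = {"age_of_domain": 1.0}  # second feature missing: handled fine
x_bad = {"age_of_domain": 1.0, "is_popular": float("nan")}  # NaN value breaks the splitter

# Converting the problematic form into the supported one by dropping the key:
x_fixed = dict(x_bad)
x_fixed.pop("is_popular")
assert x_fixed == x_ok
assert math.isnan(x_bad["is_popular"])
```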

jpfeil commented 1 month ago

Thanks, @smastelini. I was playing around with the frequency of NaNs: if the frequency is below roughly 0.4 it does not throw the error, but frequencies above that trigger it.

from pprint import pprint
from river import datasets
import numpy as np
from random import choice
from river.forest import ARFClassifier
from river.tree.splitter import HistogramSplitter
from numpy.random import binomial

dataset = datasets.Phishing()

model = ARFClassifier(splitter=HistogramSplitter(), grace_period=1)
corrupt = None

print(dataset.n_samples)
for i, (x, y) in enumerate(dataset):

    if corrupt is None:
        corrupt = choice(list(x.keys()))

    if binomial(1, 0.3, 1)[0]:
        x[corrupt] = np.nan

    print(i)
    print(corrupt)
    print(x)
    y_pred = model.predict_one(x)
    model.learn_one(x, y)

Do you recommend I drop these features or is there a potential fix for this? I suppose I don't necessarily know which features may become nan in an online setting, so this could be an issue for data streams that change over time.

smastelini commented 1 month ago

Hi, @jpfeil. River follows the principle that it is easier to "ask for forgiveness than permission". This theme has already come up in a couple of issues, which is why we don't perform any data checks in the algorithms.

Ensuring that NaNs are skipped for every instance would mean checking every input feature, every time. If we made such a change in ARF, we would also need to extend it to every algorithm in River. This kind of checking adds significant overhead to the algorithms, so ideally it should be handled on the application side.

jpfeil commented 1 month ago

Thanks for clarifying, @smastelini. Is there a pipeline function that will remove NaN values? I could write my own, but it would be nice if there were a preprocessing function for removing NaNs.

Update: Found it: https://riverml.xyz/dev/api/preprocessing/StatImputer/
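As a minimal application-side sketch of the approach suggested in this thread (the `drop_nan` helper below is hypothetical, not part of River): strip NaN-valued keys so the model sees them as missing features. Such a helper could also be wrapped in a custom transformer and piped before the model, possibly together with StatImputer if imputation is preferred over dropping.

```python
import math


def drop_nan(x: dict) -> dict:
    """Remove NaN-valued features so the model treats them as missing keys.

    Application-side helper (not part of River); call it on each sample
    before predict_one/learn_one.
    """
    return {
        k: v
        for k, v in x.items()
        if not (isinstance(v, float) and math.isnan(v))
    }


print(drop_nan({"a": 1.0, "b": float("nan")}))  # {'a': 1.0}
```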