online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

ARF can crash if the number of input features changes #1560

Closed: e10e3 closed this issue 2 weeks ago

e10e3 commented 2 weeks ago
## Versions

- River version: 0.21.1
- Python version: 3.12.4
- Operating system: macOS 14.5

## Describe the bug

When using an adaptive random forest (ARF), if the number of features in the input dictionary changes and drops below the number of features each leaf samples (`max_features`), the model crashes with a sampling error.

This situation can happen if feature selection is used, or simply if the number of features changes in the data stream.

This crash happens because the maximum number of features to consider is set when the leaves are created. If the effective number of features later drops below that value, the leaf calls `random.sample()` with a sample size larger than the population, which raises a `ValueError`.
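The failure mode can be reproduced with the standard library alone. The sketch below is only an illustration: `sample_features` is a hypothetical stand-alone helper, not River's method, and clamping `k` to the population size is one possible guard, not necessarily what the actual fix does. The value `max_features=2` is chosen for the example.

```python
import random


def sample_features(x, max_features, rng):
    """Pick up to `max_features` feature names from the sample `x`.

    Hypothetical helper for illustration; not River's implementation.
    """
    names = sorted(x.keys())
    # Clamp k so it never exceeds the number of features actually present.
    k = min(max_features, len(names))
    return rng.sample(names, k=k)


rng = random.Random(0)

# Unclamped call: asking for 2 features out of 1 raises ValueError.
try:
    rng.sample(sorted({"a": 13}.keys()), k=2)
except ValueError as err:
    print(f"unclamped call failed: {err}")

# Clamped call: falls back to the single available feature.
print(sample_features({"a": 13}, max_features=2, rng=rng))  # ['a']
```

With the clamp, a sample that carries fewer features than `max_features` simply yields all of its features instead of raising.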

## Code to reproduce

# Crashes when the number of features becomes less than "max_features"

from river import forest

arf = forest.ARFClassifier(seed=0)

xs = [
    ({"a": 0, "b": 2, "c": 0}, 1),
    ({"a": 1, "b": 2, "c": 1}, 2),
    ({"a": 1, "b": 2, "c": 2}, 3),
    ({"a": 2, "b": 2, "c": 0}, 4),
    ({"a": 3, "b": 2, "c": 1}, 5),
    ({"a": 5, "b": 2, "c": 2}, 6),
    ({"a": 8, "b": 2, "c": 0}, 7),
    ({"a": 13}, 0),
    ({"a": 21}, 0),
]

for x in xs:
    arf.learn_one(*x)

Output:

Traceback (most recent call last):
  […]
  File "/path/to/lib/python3.12/site-packages/river/tree/nodes/arf_htc_nodes.py", line 39, in _iter_features
    self.feature_indices = self._sample_features(x, self.max_features)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.12/site-packages/river/tree/nodes/arf_htc_nodes.py", line 47, in _sample_features
    return self.rng.sample(sorted(x.keys()), k=max_features)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/other/path/to/lib/python3.12/random.py", line 430, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative

smastelini commented 2 weeks ago

Thanks for reporting that @e10e3! I fixed this error in #1561.

e10e3 commented 2 weeks ago

Thank you for your quick response @smastelini!

When looking through the code, I found the test check_disappearing_features, which seems to ensure such behaviour does not happen. Do you know why the tests did not find this issue?

smastelini commented 2 weeks ago

Hi @e10e3, I think the tests were designed to check the robustness of ARF when a few features disappear, but not when the number of remaining features drops below the bare minimum. It is indeed an interesting corner case to catch :)
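As a rough illustration of this corner case (not River's actual test code), a check could feed the model a stream that loses features one at a time until only one is left. `shrinking_stream` below is a hypothetical helper written for this sketch.

```python
def shrinking_stream(x, y, min_features=1):
    """Yield (x, y) pairs in which x loses one feature at a time.

    Hypothetical helper for illustration; not part of River.
    """
    names = sorted(x)
    # Keep the first n_kept features, from all of them down to min_features.
    for n_kept in range(len(names), min_features - 1, -1):
        yield {name: x[name] for name in names[:n_kept]}, y


for features, label in shrinking_stream({"a": 1, "b": 2, "c": 3}, y=1):
    print(features, label)
```

Passing each pair to a model's `learn_one(x, y)` would exercise the situation where fewer features remain than the leaves expect to sample.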