online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

SRP can cause a recursion error on large datasets #1578

Closed · e10e3 closed this 2 months ago

e10e3 commented 3 months ago

Versions

River version: 0.21.1
Python version: 3.12.4
Operating system: macOS 14.5

Describe the bug

When used on large datasets, SRP (with the default arguments) can make its Hoeffding trees grow so much they go beyond Python's recursion limit.

In theory this could happen with bare Hoeffding trees as well, but I could not reproduce the crash with them.

Steps/code to reproduce

I used the Sensors/Intel Lab Data stream, which is freely available. The file is a bit large to include here, but I can share the files I used if needed.

from river import ensemble, evaluate, metrics

dataset = Sensors()  # a user-defined class that loads the stream, typically inheriting from FileDataset
model = ensemble.SRPClassifier(seed=10)
metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric, print_every=50_000)

If you try to reproduce, beware: the training time can be long on big datasets.

Output:

(previous lines elided for concision)
[1,700,000] Accuracy: 76.42%
[1,750,000] Accuracy: 76.58%
[1,800,000] Accuracy: 76.51%
[1,850,000] Accuracy: 76.32%
[1,900,000] Accuracy: 75.64%
[1,950,000] Accuracy: 75.11%
[2,000,000] Accuracy: 74.50%
[2,050,000] Accuracy: 74.14%
Traceback (most recent call last):
  File "/path/to/test-srp-limit.py", line 13, in <module>
    ev = evaluate.progressive_val_score(dataset, model, metric, print_every=50_000)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/site-packages/river/evaluate/progressive_validation.py", line 399, in progressive_val_score
    for checkpoint in checkpoints:
  File "/path/to/site-packages/river/evaluate/progressive_validation.py", line 220, in iter_progressive_val_score
    yield from _progressive_validation(
  File "/path/to/site-packages/river/evaluate/progressive_validation.py", line 92, in _progressive_validation
    model.learn_one(x, y, **kwargs)
  File "/path/to/site-packages/river/ensemble/streaming_random_patches.py", line 112, in learn_one
    model.learn_one(x=x, y=y, w=k, n_samples_seen=self._n_samples_seen)
  File "/path/to/site-packages/river/ensemble/streaming_random_patches.py", line 546, in learn_one
    self.model.learn_one(x=x_subset, y=y, **kwargs)
  File "/path/to/site-packages/river/tree/hoeffding_tree_classifier.py", line 379, in learn_one
    self._attempt_to_split(node, p_node, p_branch)
  File "/path/to/site-packages/river/tree/hoeffding_tree_classifier.py", line 319, in _attempt_to_split
    self._enforce_size_limit()
  File "/path/to/site-packages/river/tree/hoeffding_tree.py", line 252, in _enforce_size_limit
    leaves = self._find_leaves()
             ^^^^^^^^^^^^^^^^^^^
  File "/path/to/site-packages/river/tree/hoeffding_tree.py", line 325, in _find_leaves
    return [leaf for leaf in self._root.iter_leaves()]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/site-packages/river/tree/base.py", line 117, in iter_leaves
    yield from child.iter_leaves()
  File "/path/to/site-packages/river/tree/base.py", line 117, in iter_leaves
    yield from child.iter_leaves()
  File "/path/to/site-packages/river/tree/base.py", line 117, in iter_leaves
    yield from child.iter_leaves()
  [Previous line repeated 987 more times]
RecursionError: maximum recursion depth exceeded
e10e3 commented 3 months ago

A possible solution is to change the defaults in SRP, or maybe directly in the Hoeffding trees, to limit the trees' maximum depth so that Python's recursion limit is respected.

In CPython, the recursion limit is 1000 by default (it can be changed by the user).
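
For reference, the limit can be inspected and raised at run time with the standard sys module:

import sys

print(sys.getrecursionlimit())  # 1000 on a stock CPython build
sys.setrecursionlimit(2000)     # raising it works, but a deep enough stack can still crash the interpreter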

smastelini commented 3 months ago

Hi @e10e3, I suggest taking a look at the Hoeffding Trees guide to check the parameters that control memory usage and tree depth. I am open to changing default values, but I fear the requirements may vary depending on the data and the machine specs. If you have any suggestions, I would be happy to hear them!

As for a workaround, I suggest limiting the maximum memory usage and decreasing the interval at which the trees run their memory-enforcement routine. Maybe we could find good defaults for these parameters. What do you think?
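
Something along these lines (a sketch; max_size is the per-tree memory budget in MB and memory_estimate_period is the number of instances between memory checks):

from river import ensemble, tree

# Sketch of the suggested workaround: cap each tree's memory and run the
# memory-enforcement check more often than the default.
base_tree = tree.HoeffdingTreeClassifier(
    max_size=50,  # memory budget per tree, in MB
    memory_estimate_period=100_000,  # check memory every 100k instances
)
model = ensemble.SRPClassifier(model=base_tree, seed=10)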

e10e3 commented 3 months ago

When I investigated this crash, I found that setting the max_depth argument of the Hoeffding tree to 985 successfully avoided it. Setting it 15 below the limit leaves the user some headroom to add a few function calls around the model.
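
Concretely, the workaround looks like this (a sketch, assuming the base tree is passed to SRPClassifier through its model parameter):

from river import ensemble, tree

# Workaround: cap the tree depth below CPython's default recursion limit of 1000.
base_tree = tree.HoeffdingTreeClassifier(max_depth=985)
model = ensemble.SRPClassifier(model=base_tree, seed=10)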

The default limit of 1000 recursive calls is hard-coded in CPython, as can be seen in the source code. Users can change the value at run time, but the default is the same for everyone (unless you recompiled Python with a different limit, of course).

It is true that hitting this limit depends a lot on the data. As a matter of fact, I tested various datasets and only encountered this problem on Sensors! This makes me think that setting a default maximum depth for the Hoeffding trees would have a negligible impact on River's users while preventing a crash.

About the other avenues you mentioned, I fear that limiting the memory use may not be enough. If I understood the Hoeffding trees guide you linked to correctly, setting a memory limit only disables the splitters on the less promising leaves. It won't prevent leaves from being split.

smastelini commented 2 months ago

> About the other avenues you mentioned, I fear that limiting the memory use may not be enough. If I understood the Hoeffding trees guide you linked to correctly, setting a memory limit only disables the splitters on the less promising leaves. It won't prevent leaves from being split.

It will prevent the least promising ones from being split. But I see your point: if a leaf deep in the tree is "promising", it will still be split. In that case, do you think it would be doable to keep the parameter as None by default and, in that case, read the current recursion limit and internally set the maximum height to something just shy of it? I would rather go for something like that than hardcode a limit. What do you think, @e10e3?
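
A minimal sketch of that idea (hypothetical code, not river's actual implementation; the function name and margin are placeholders):

import sys

SAFETY_MARGIN = 20  # calls reserved for the code surrounding the tree traversal

def resolve_max_depth(max_depth=None):
    # When the user passes None, derive the depth cap from the interpreter's
    # current recursion limit instead of hardcoding a value.
    if max_depth is None:
        return sys.getrecursionlimit() - SAFETY_MARGIN
    return max_depth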

e10e3 commented 2 months ago

I think this is a good idea.

How big of a margin do we want to keep before hitting the recursion limit? In my simple example above, there are 10 calls before starting to iterate over the nodes of the tree. Should we set aside 15 calls? 20? 30? More?

e10e3 commented 2 months ago

I created a PR implementing your proposition: setting max_depth to None now stops tree growth before the recursion limit. I arbitrarily chose to set aside a margin of 20 calls.

The Hoeffding Tree recipe has not been updated yet, but it will need to be adapted.

(As a side note, I'll be busy in the next weeks and won't have time to work on this. You are welcome to take over the PR if you feel like it in the meantime.)