Closed · e10e3 closed this issue 2 months ago
A possible solution for this is to change the defaults in SRP, or maybe directly in Hoeffding trees, to limit the maximum depth of the trees in order to respect Python's recursion limit.
In CPython, the recursion limit is 1000 by default (it can be changed by the user).
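A minimal stdlib sketch of the failure mode: a recursive traversal of a degenerate (chain-shaped) structure, like walking a very deep tree node by node, blows past the interpreter's limit and raises `RecursionError`. The `depth` function and the dict-based chain are illustrative stand-ins for a tree traversal, not River code.

```python
import sys

# CPython's recursion limit (1000 by default unless the user changed it).
limit = sys.getrecursionlimit()

def depth(node):
    """Recursively walk a linked chain, like traversing one branch of a tree."""
    if node is None:
        return 0
    return 1 + depth(node["child"])

# Build a chain slightly deeper than the recursion limit.
chain = None
for _ in range(limit + 10):
    chain = {"child": chain}

try:
    depth(chain)
except RecursionError:
    print("RecursionError: maximum recursion depth exceeded")
```

`sys.setrecursionlimit(n)` raises the ceiling at run time, but the default is what most users run with, which is why a tree deeper than ~1000 nodes crashes out of the box.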
Hi @e10e3, I suggest taking a look at the Hoeffding Trees guideline to check the parameters that control memory usage and tree depth. I am open to changing default values, but I fear the requirements may vary depending on the data and the machine specs. If you have any suggestions, I would be happy to hear them!
As for a workaround, I suggest limiting the maximum memory usage and decreasing the interval at which the trees trigger the memory-management routine. Maybe we could find good defaults for these parameters. What do you think?
When I investigated this crash, I found that setting the `max_depth` argument of the Hoeffding tree to 985 successfully avoided it. Setting it 15 below the limit leaves the user some headroom to add a few function calls around the model.
The default limit of 1000 recursive calls is hard-coded in CPython, as can be seen in the source code. Users can change the value at run time, but the default is the same for everyone (unless you recompiled Python with a different limit, of course).
It is true that hitting this limit depends a lot on your data. As a matter of fact, I tested various datasets, and only encountered this problem on Sensors! This makes me think that setting a default maximum depth for the Hoeffding trees will have a negligible impact on River's users, while preventing a crash.
About the other avenues you mentioned, I fear that limiting the memory use may not be enough. If I understood the Hoeffding trees guidelines you linked to correctly, setting a memory limit will only disable the splitters on less-promising leaves. It won't prevent leaves from being split.
It will prevent the least promising ones from being split. But I see your point: if a leaf deep in the tree is "promising", it will still be split. In that case, do you think it would be doable to keep the parameter as `None` by default and, in that case, read the current recursion limit at run time and set (internally) the maximum depth to something just shy of it? I would go for something like that rather than hardcoding a limit. What do you think, @e10e3?
I think this is a good idea.
How big of a margin do we want to keep before hitting the recursion limit? In my simple example above, there are 10 calls before starting to iterate over the nodes of the tree. Should we set aside 15 calls? 20? 30? More?
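One way to pick the margin empirically is to count how many frames are already on the stack at the point where the model is invoked. This is an illustrative stdlib sketch; `wrapped_model_call` stands in for whatever pipeline code sits between user code and the tree traversal.

```python
import inspect

def frames_in_use():
    # len(inspect.stack()) counts the current frame plus all callers.
    return len(inspect.stack())

def wrapped_model_call():
    # Stand-in for the pipeline/ensemble code that eventually calls the tree.
    return frames_in_use()

# Frames already consumed before the tree's own recursion would begin.
print(wrapped_model_call())
```

Measuring this inside a realistic pipeline (e.g. under SRP) would give a lower bound on the margin to reserve.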
I created a PR with your proposition: leaving `max_depth` as `None` now stops growth before the recursion limit. I arbitrarily chose to set aside a margin of 20 calls.
The Hoeffding Tree recipe is not finished yet, and it will need to be adapted.
(As a side note, I'll be busy in the next weeks and won't have time to work on this. You are welcome to take over the PR if you feel like it in the meantime.)
Versions
River version: 0.21.1
Python version: 3.12.4
Operating system: macOS 14.5
Describe the bug
When used on large datasets, SRP (with the default arguments) can make its Hoeffding trees grow so much they go beyond Python's recursion limit.
In theory this could happen with bare Hoeffding trees as well, but I could not reproduce the crash with them.
Steps/code to reproduce
I used the Sensors/Intel Lab Data stream, which is freely available. The file is a bit large to attach here, but I can provide the files I used if needed.
If you try to reproduce, beware: the training time can be long on big datasets.
Output: