Closed · frankl1 closed this issue 1 year ago
Hi @frankl1! Long time no see, my friend!!
Thanks for reporting that. Actually, @danielnowakassis had mentioned this also happens with MOA and I didn't know whether this was the case with river too. Now we know and I don't think this should happen at all.
I will check this issue ASAP.
Given the following line in the split attempt procedure:
and,
what you report should not happen. I will need to dig deeper to find this issue.
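For context, split attempts in Hoeffding trees hinge on the Hoeffding bound. A minimal sketch of the usual decision (not river's actual code; the names and the `tie_threshold` default are illustrative):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(best_merit, second_merit, n,
                 delta=1e-7, value_range=1.0, tie_threshold=0.05):
    eps = hoeffding_bound(value_range, delta, n)
    # Split when the best candidate clearly beats the runner-up,
    # or when the bound is tight enough to call the candidates tied.
    return (best_merit - second_merit > eps) or (eps < tie_threshold)
```

Note that on a nearly pure leaf every candidate's merit is close to zero, so the difference never exceeds epsilon; but once `n` is large enough that `eps < tie_threshold`, the tie-breaking branch fires anyway. That is one route by which a split on an (almost) pure leaf can slip through.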
Hey @smastelini, I'm happy to finally join the river community :) I hope I will contribute soon.
And you are more than welcome to! I encourage you to join our Discord :)
@smastelini, I suggest you write a couple of unit tests once you find the bug :)
Yeap, that is a good idea @MaxHalford! I will try to check that on the weekend :D
Side note: @frankl1 was the one who introduced me to PySAD!
He visited the lab I was staying at, at the University of Porto. Michael's work on time series classification is awesome!
I was able to check that two samples with `y=1` appear in the example above. That's why the tree splits twice.
Technically speaking, although 99.999% of the instances in the leaf belong to the same class, there is one that does not.
We need to implement an additional heuristic to avoid such cases (and possibly add one additional param) or hardcode something.
Suggestions, @MaxHalford? The algorithm is working as designed. It sounds like a corner case. I guess the stats are reset and that's why we don't see class 1 with a very low probability in the corresponding leaves.
For reference, the tree regressors have a parameter to avoid splitting in such cases. We could require a minimum class proportion to enable splitting.
Any ideas for parameter names? Note that such a change will impact a lot of algorithms.
Thanks for digging @smastelini! In a way this is reassuring. As you say, it's an edge case. I think adding an extra parameter is a good idea. scikit-learn has a `min_samples_leaf` parameter which is a bit similar.
Could we add a rule to only split when `P(most_common_class)` is below a certain threshold? The parameter could be called `max_share_to_split`. It could default to something like 0.99, and prevent nodes from splitting if more than 99% of the samples belong to the same class. Of course feel free to reformulate/rename as appropriate, you're the tree hugger @smastelini :)
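Such a gate could be as simple as the following sketch (plain Python, using the `max_share_to_split` name proposed above; not river's API):

```python
def should_attempt_split(class_counts, max_share_to_split=0.99):
    """Only allow a split attempt if no single class dominates the leaf.

    class_counts maps each class label to its observed count in the leaf.
    """
    total = sum(class_counts.values())
    if total == 0:
        return False
    share = max(class_counts.values()) / total
    return share <= max_share_to_split
```

For example, a leaf holding 99,999 samples of class 0 and a single sample of class 1 has a majority share of 0.99999, so the gate blocks the split attempt.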
Thanks @smastelini and @MaxHalford for working on this issue.
Happy to see that the algorithm is working as expected :). So the real issue is related to how the tree is drawn (a visualization issue): I see in the leaf that $P(0) = 1.0$, but if there is at least one sample from class 1, I would expect to see $P(0)=0.99999$ and $P(1)=0.00001$. Of course, rounding to 2 decimal digits gives $P(0)=1.0$ and $P(1)=0.0$.
What about visualizing such leaves with $P(0) > 0.99$ and $P(1) < 0.01$?
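The rounding effect described above is easy to verify in plain Python; this just illustrates the display issue, not river's drawing code:

```python
p1 = 1e-5      # rare class
p0 = 1 - p1    # dominant class

# At 2 decimal digits the rare class disappears from the drawing.
assert round(p0, 2) == 1.0
assert round(p1, 2) == 0.0

# At 5 decimal digits it stays visible.
assert round(p0, 5) == 0.99999
assert round(p1, 5) == 1e-05
```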
In scikit-learn, there is the parameter `min_impurity_decrease`, which is used to decide whether a split is profitable. I think this is a meaningful parameter to control splits, as it is based on the information gain and can be used with any purity criterion (entropy, Gini, classification error, etc.). There is the parameter `predefined_ig_rejection_level` in sktime which serves the same purpose.
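A sketch of how such an impurity-decrease gate could work with entropy (the function names are illustrative, not river's or scikit-learn's API):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def impurity_decrease(parent, children):
    """Entropy of the parent minus the weighted entropy of the children.

    parent is a list of class counts; children is a list of such lists.
    """
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted
```

For a leaf that is 99.9% pure (e.g. counts `[9990, 10]`), the best achievable decrease is on the order of 0.001 bits, so even a small threshold such as 0.01 would block the split.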
@smastelini if I can see both classes in the leaves, even when one class is extremely rare, I would consider my issue solved. Regarding adding a parameter to prevent splitting, I think it is not related to this issue, but would be a nice parameter to have.
@smastelini can we close this issue?
Can we leave it open? I intend to tackle it soon :)
Yes no problem! I was just checking :)
Versions
river version: 0.14.0
Python version: 3.10.8
Operating system: Windows 10 Pro
Describe the bug
Hi,
I started using River a couple of weeks ago. I have to say that I'm very satisfied with the tool and would like to encourage the community behind it.
I'm confused about how attribute observers work. I trained a Hoeffding Tree classifier (HTC) on a binary dataset. After processing a certain number of samples, I drew the current tree to see what it looks like. I was surprised to find that some nodes are split even though they contain a single class.
Steps/code to reproduce
Here is the tree you should obtain. Look at the split on the attribute `dst_bytes`: both left and right children are pure leaves with class label 0. It was probably also the case when the split at the root occurred. I would have expected the split to happen only when the node is not pure. Is this normal behavior? Why?