online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

Splits happening on pure nodes. #1169

Closed · frankl1 closed this 1 year ago

frankl1 commented 1 year ago

Versions

river version: 0.14.0
Python version: 3.10.8
Operating system: Windows 10 Pro

Describe the bug

Hi,

I started using River a couple of weeks ago, and I have to say that I'm very satisfied with the tool and would like to encourage the community behind it.

I have a misunderstanding of how attribute observers work. I trained a Hoeffding Tree Classifier (HTC) on a binary dataset. After processing a certain number of samples, I drew the current tree to see what it looks like. I was surprised to realize that some nodes are split even though they contain a single class.

Steps/code to reproduce

# Sample code to reproduce the problem
from river import datasets
from river import tree
from IPython.display import display

# HTTP is a binary dataset from the KDD'99 cup (normal vs. anomalous connections)
dataset = datasets.HTTP()

model = tree.HoeffdingTreeClassifier(grace_period=20)

# Draw the tree once this many samples have been processed
display_after = 220000

for i, (x, y) in enumerate(dataset):
    model.learn_one(x, y)
    if i > 0 and not i % display_after:
        display(model.draw())
        break

Here is the tree you should obtain. Look at the split on the attribute dst_bytes: both left and right children are pure leaves with class label 0. It was probably also the case when the split at the root occurred. I would have expected a split to happen only when the node is not pure. Is this normal behavior? Why?

[Image: the drawn tree, with a split on dst_bytes whose two children are pure leaves labeled 0]

smastelini commented 1 year ago

Hi @frankl1! Long time no see, my friend!!

Thanks for reporting that. Actually, @danielnowakassis had mentioned this also happens with MOA and I didn't know whether this was the case with river too. Now we know and I don't think this should happen at all.

I will check this issue ASAP.

smastelini commented 1 year ago

Given the following line in the split attempt procedure:

https://github.com/online-ml/river/blob/78c1e4f116b4e2ca7956332bf3699acc795896bb/river/tree/hoeffding_tree_classifier.py#L244

and,

https://github.com/online-ml/river/blob/78c1e4f116b4e2ca7956332bf3699acc795896bb/river/tree/nodes/htc_nodes.py#L68-L82

what you report should not happen. I will need to dig deeper to find this issue.
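In broad strokes, a Hoeffding tree splits a leaf when the merit (e.g. information gain) of the best candidate split beats the runner-up by more than a Hoeffding bound, or when the bound falls below a tie-breaking threshold. Below is a simplified sketch of that test; the names and defaults are illustrative, not river's actual code.

import math

def hoeffding_bound(value_range, delta, n):
    # Epsilon such that the true mean lies within epsilon of the sample mean
    # with probability 1 - delta (Hoeffding's inequality).
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(split_merits, no_split_merit, n, delta=1e-7, tau=0.05):
    # split_merits: merits of the candidate splits (at least two assumed here).
    best, second = sorted(split_merits, reverse=True)[:2]
    eps = hoeffding_bound(1.0, delta, n)
    clearly_better = best - second > eps or eps < tau  # tau breaks near-ties
    # In a pure leaf every candidate split has zero gain, so requiring the
    # best split to beat the no-split option should rule out such splits.
    return clearly_better and best > no_split_merit

# A pure leaf: every candidate split has zero gain, so no split should occur
print(should_split([0.0, 0.0], no_split_merit=0.0, n=20_000))  # False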

frankl1 commented 1 year ago

Hey @smastelini, I'm happy to finally join the river community :) I hope I will contribute soon.

MaxHalford commented 1 year ago

And you are more than welcome to! I encourage you to join our Discord :)

@smastelini, I suggest you write a couple of unit tests once you find the bug :)

smastelini commented 1 year ago

Yeap, that is a good idea @MaxHalford! I will try to check that on the weekend :D

Side note: @frankl1 was the one who introduced me to PySAD!

He visited the lab where I was staying at the University of Porto. Michael's work on time series classification is awesome!

smastelini commented 1 year ago

I was able to check that two samples with y=1 appear in the example above. That's why the tree splits twice.

Technically speaking, although 99.999% of the instances in the leaf belong to the same class, there is one that does not.

We need to implement an additional heuristic to avoid such cases (and possibly add one additional param) or hardcode something.

Suggestions, @MaxHalford? The algorithm is working as designed, so this sounds like a corner case. I guess the stats are reset after a split, and that's why we don't see class 1 with a very low probability in the corresponding leaves.
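This count can be double-checked directly from the stream. A minimal sketch, assuming the same 220001 samples processed by the reproduction script above:

from itertools import islice
from river import datasets

# Count the minority-class samples among the first 220001 instances
n_pos = sum(1 for _, y in islice(datasets.HTTP(), 220_001) if y)
print(n_pos)  # per the analysis above, this should print 2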

smastelini commented 1 year ago

For reference, the tree regressors have a parameter to avoid splitting in such cases. We could require a minimum class proportion before enabling a split.

Any ideas for parameter names? Note that such a change will impact a lot of algorithms.

MaxHalford commented 1 year ago

Thanks for digging @smastelini! In a way this is reassuring. As you say, it's an edge case. I think adding an extra parameter is a good idea. scikit-learn has a min_samples_leaf parameter which is a bit similar.

Could we add a rule to only split when P(most_common_class) is below a certain threshold? The parameter could be called max_share_to_split. It could default to something like 0.99 and prevent nodes from splitting if more than 99% of the samples belong to the same class. Of course, feel free to reformulate/rename as appropriate; you're the tree hugger @smastelini :)
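As a rough illustration of the proposed guard (the name max_share_to_split comes from the suggestion above; the function itself is hypothetical, not river's API):

def allow_split(class_counts, max_share_to_split=0.99):
    # class_counts: mapping from class label to observed weight in the leaf
    total = sum(class_counts.values())
    if total == 0:
        return False
    majority_share = max(class_counts.values()) / total
    # Block the split when the leaf is (almost) pure
    return majority_share < max_share_to_split

# Example: a leaf with 219_999 samples of class 0 and 2 of class 1
print(allow_split({0: 219_999, 1: 2}))  # False: the leaf is >99% pure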

frankl1 commented 1 year ago

Thanks @smastelini and @MaxHalford for working on this issue.

Happy to see that the algorithm is working as expected :). So, the real issue is related to how the tree is drawn (a visualization issue): in the leaf I see that $P(0) = 1.0$. If there is at least one sample from class 1, I would expect to see $P(0)=0.99999$ and $P(1)=0.00001$. Of course, rounding to 2 decimal digits gives $P(0)=1.00$ and $P(1)=0.00$.

What about visualizing such leaves with $P(0) > 0.99$ and $P(1) < 0.01$?
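In the meantime, the unrounded probabilities can be read off the model directly rather than from the drawing. A sketch, where x is assumed to be any sample that reaches the leaf in question:

# Query the model for its raw class probabilities on a given sample x
proba = model.predict_proba_one(x)
print(proba)  # e.g. {0: 0.99999..., 1: 0.00000...} before any rounding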

In scikit-learn, there is the parameter min_impurity_decrease, which is used to decide whether a split is worthwhile or not. I think this is a meaningful parameter to control splits, as it is based on the impurity decrease and can be used with any purity criterion (entropy, Gini, classification error, etc.). There is the parameter predefined_ig_rejection_level in sktime which serves the same purpose.
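As an illustration of such a rule, here is a sketch of an impurity-decrease test with entropy as the criterion; the names and the default threshold are illustrative, not taken from either library:

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def split_is_profitable(parent, children, min_impurity_decrease=1e-3):
    # parent: class counts in the leaf; children: class counts per child
    n = sum(parent)
    weighted_children = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_children >= min_impurity_decrease

# A pure leaf gains nothing from splitting, so the rule rejects the split
print(split_is_profitable([200, 0], [[120, 0], [80, 0]]))  # False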

@smastelini if I can see both classes in the leaves, even when one class is extremely rare, I would consider my issue solved. Regarding adding a parameter to prevent splitting, I think it is not related to this issue, but would be a nice parameter to have.

MaxHalford commented 1 year ago

@smastelini can we close this issue?

smastelini commented 1 year ago

Can we leave it open? I intend to tackle it soon :)

MaxHalford commented 1 year ago

Yes no problem! I was just checking :)