scikit-learn-contrib / hiclass

A python library for hierarchical classification compatible with scikit-learn
BSD 3-Clause "New" or "Revised" License

Handling of common children in hierarchy tree #95

Closed: astronights closed this issue 9 months ago

astronights commented 9 months ago

Hi,

I have a use case where, at some level in the hierarchy, children of different parents share the same class name.

How does HiClass handle this? Is it safe, or do these labels need to be deduplicated beforehand?

I am looking for a representation like this:

     A
   /   \
  B     C
 |      |
 D      D
 |      | 
 E      F

And not like this:

     A
    / \
   B   C
    \ / 
     D
   /  \
  E    F

Could you provide some insight into whether this type of hierarchy tree is supported? Thanks!

mirand863 commented 9 months ago

Hi @astronights,

Thank you for the interest in HiClass!

HiClass recognizes ambiguities and handles them accordingly. Internally, it creates a separate node for each parent a class appears under, which means that in your example it would have 2 separate nodes for class "D". Here is a small example showing that the prediction is not affected by the ambiguity:

import numpy as np
from sklearn.linear_model import LogisticRegression

from hiclass import LocalClassifierPerNode

# Define data
X_train = [[1, 2], [3, 4]]
X_test = [[3, 4], [1, 2]]
Y_train = np.array(
    [
        ["A", "B", "D", "E"],
        ["A", "C", "D", "F"],
    ],
    dtype=object,
)

# Use a logistic regression classifier for every node
lr = LogisticRegression()
classifier = LocalClassifierPerNode(local_classifier=lr)

# Train local classifier per node
classifier.fit(X_train, Y_train)

# Predict
predictions = classifier.predict(X_test)
print(predictions)
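
To make the idea of "separate nodes per parent" concrete, here is a minimal sketch in plain Python. It is not HiClass's internal code: the `"::"` separator and the helper names `path_node_ids` and `build_edges` are hypothetical and chosen only for illustration. It assumes each node is identified by its full path from the root, so the two "D" labels become distinct nodes.

```python
# Illustrative sketch only; this is NOT HiClass's actual implementation.
# Assumption: a node is identified by its full path from the root.
SEPARATOR = "::"  # hypothetical separator chosen for this example


def path_node_ids(label_path, sep=SEPARATOR):
    """Return one node id per level by joining the labels seen so far."""
    return [sep.join(label_path[: i + 1]) for i in range(len(label_path))]


def build_edges(label_paths):
    """Collect parent -> child edges between the path-qualified node ids."""
    edges = set()
    for path in label_paths:
        ids = path_node_ids(path)
        edges.update(zip(ids[:-1], ids[1:]))
    return edges


# Same label paths as Y_train in the example above
paths = [["A", "B", "D", "E"], ["A", "C", "D", "F"]]
edges = build_edges(paths)

# Nodes whose own label is "D": one under B and one under C
d_nodes = sorted(
    {child for _, child in edges if child.split(SEPARATOR)[-1] == "D"}
)
print(d_nodes)  # ['A::B::D', 'A::C::D']
```

With path-qualified ids, "E" can only be reached through the "D" under "B", and "F" only through the "D" under "C", which matches the first tree in the question rather than the second.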

Please let me know if this answers your question.

astronights commented 9 months ago

Thank you for the clarification. Good to know that these discrepancies are handled.