scikit-learn-contrib / hiclass

A python library for hierarchical classification compatible with scikit-learn
BSD 3-Clause "New" or "Revised" License

Predicting confidence scores #81

Open Yasmen-Wahba opened 1 year ago

Yasmen-Wahba commented 1 year ago

Hello, is there a predict_proba() method for the LCPPN pipeline?

mirand863 commented 1 year ago

Hi @Yasmen-Wahba,

Not at the moment, but I can add it shortly. I have not added it yet because the probability scores become skewed, since the parent nodes are trained on subsets of the data. Would that be a problem for your application? There are some methods to calibrate/smooth the probability scores in hierarchical classification, but it might take me a while to find time to code them, since I am currently working on the multi-label problem.

channeng commented 10 months ago

Hi, thanks for building this library. It really makes it easy to perform hierarchical classification.

It would certainly be useful to have scores for each node along the category path. Then we could decide whether to fall back to a parent category instead of committing to a leaf category prediction.
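The fallback idea described here could look roughly like the sketch below. It is only an illustration and not part of hiclass: the function name `truncate_path`, the list of per-node scores, and the default threshold of 0.5 are all assumptions.

```python
from typing import List, Sequence


def truncate_path(
    path: Sequence[str], scores: Sequence[float], threshold: float = 0.5
) -> List[str]:
    """Keep the predicted path down to the last node whose score meets the threshold.

    `path` is ordered from the top level to the leaf, and `scores[i]` is the
    (hypothetical) confidence score of `path[i]`.
    """
    if len(path) != len(scores):
        raise ValueError("path and scores must have the same length")
    kept: List[str] = []
    for node, score in zip(path, scores):
        if score < threshold:
            # Stop at the first uncertain node; everything below it is dropped.
            break
        kept.append(node)
    return kept
```

For example, a path of `["animal", "mammal", "cat"]` with scores `[0.9, 0.8, 0.3]` would be cut back to `["animal", "mammal"]`, so the prediction stops at the parent category.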

PRFina commented 5 months ago

Hi @mirand863 and thanks for building this library! We're currently working on a multiclass classification problem achieving good performance with hierarchical models. To evaluate our models we need to get the confidence score, but as you already mentioned, the API doesn't expose the predict_proba method.

We are thinking of implementing it ourselves, simply traversing the DAG (a tree in our case) and multiplying the scores of the nodes along each path to get the leaf node score. What do you think about this very simple approach? Can you elaborate a little bit on the "skewness" issue? Can you point us to some literature on calibrating/smoothing the probability scores in hierarchical classification? If something good comes out, we'll be very happy to contribute with a PR :smiley:
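A minimal sketch of this path-product approach, for a tree only, is shown below. It does not use the hiclass API: the `children` mapping, the `local_prob` mapping (the score each local classifier assigns to a child given its parent, for one sample), and the function name `leaf_scores` are all hypothetical.

```python
from typing import Dict, List


def leaf_scores(
    children: Dict[str, List[str]], local_prob: Dict[str, float], root: str
) -> Dict[str, float]:
    """Multiply the conditional scores along every root-to-leaf path.

    `children[node]` lists the child nodes of `node` (leaves are absent or
    map to an empty list). `local_prob[node]` is the score of `node` given
    its parent, as produced by the parent's local classifier for one sample.
    """
    scores: Dict[str, float] = {}
    stack = [(root, 1.0)]  # the root itself has score 1
    while stack:
        node, score = stack.pop()
        kids = children.get(node, [])
        if not kids:
            scores[node] = score  # reached a leaf: store the path product
            continue
        for kid in kids:
            stack.append((kid, score * local_prob[kid]))
    return scores


# Toy hierarchy: root -> {A, B}, A -> {A1, A2}
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
local_prob = {"A": 0.7, "B": 0.3, "A1": 0.6, "A2": 0.4}
result = leaf_scores(children, local_prob, "root")
```

If the scores of the children of each parent sum to 1, the leaf scores also sum to 1. In the toy example, A1 gets 0.7 × 0.6 = 0.42, A2 gets 0.28, and B gets 0.3. Note that this does nothing about calibration: each factor is only as reliable as the local classifier that produced it.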

mirand863 commented 5 months ago

Hi @PRFina,

Glad to hear you are getting good results with hierarchical classifiers.

The problem that I mentioned is that the local classifiers are only trained on subsets of the data. Sometimes even a single data point is used for training leaf nodes. Hence, the probabilities returned for your test data become inaccurate. I hope this makes sense.

There is currently a master's student working on this issue for his master's thesis, but it might still take a few months before any code is released. Would it be OK for you to wait a while longer? Otherwise, I think the strategy you describe could work if you have a large amount of data. Another method that comes to mind is shrinkage.

Best regards, Fabio
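As a rough illustration of what shrinkage could mean here, the sketch below pulls a local classifier's probability vector toward a uniform distribution, with more pull when the node saw few training samples. This is one simple form of shrinkage chosen for illustration, not necessarily the method meant above, and not hiclass code: the function name `shrink`, the uniform prior, and the `strength` parameter are assumptions.

```python
from typing import List, Sequence


def shrink(probs: Sequence[float], n_samples: int, strength: float = 10.0) -> List[float]:
    """Interpolate `probs` with a uniform distribution over the same classes.

    The weight on the original estimate is n_samples / (n_samples + strength),
    so nodes trained on very few samples are pulled strongly toward uniform.
    """
    if n_samples < 0 or strength <= 0:
        raise ValueError("n_samples must be >= 0 and strength must be > 0")
    k = len(probs)
    lam = n_samples / (n_samples + strength)
    return [lam * p + (1.0 - lam) / k for p in probs]


# A node trained on a single data point: an overconfident [1.0, 0.0]
# is softened considerably.
softened = shrink([1.0, 0.0], n_samples=1, strength=9.0)
```

With `n_samples=1` and `strength=9.0` the weight on the original estimate is 0.1, so `[1.0, 0.0]` becomes `[0.55, 0.45]`. The output still sums to 1 whenever the input does.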

lukas-kania-ccmlp commented 1 month ago

Hi @mirand863, I wanted to check in on this work. It would be very useful to have the probabilities as output. Do you have an update on progress?