Multilabel classification training data

huggingface / setfit

Efficient few-shot learning with Sentence Transformers

https://hf.co/docs/setfit

Apache License 2.0


Multilabel classification training data #413

Open josh-yang92 opened 1 year ago

josh-yang92 commented 1 year ago

Hi guys, just have a question regarding the training data for multilabel classification.

So, for multiclass classification, you can play around with the number of samples (K) per label, and of course, the higher the K, the better the performance; this is straightforward as there is only 1 label per sample.

However, for multilabel classification, where there can be more than 1 label per sample (also in many different combinations), how are we supposed to construct our training data? For example, would it be best to give an equal number of samples for every combination of labels? This would exponentially increase the training data required, which would defeat the purpose of few-shot learning...?

I am asking because I tried training the model without taking the above into account and am getting not-so-great results (around 65% accuracy with 12 labels).
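It is not stated which accuracy the 65% refers to. If it is exact-match (subset) accuracy, where a prediction counts only when all 12 labels are right, it will look much lower than a per-label score for the same predictions. A small sketch with made-up predictions, using plain Python (the helper names `subset_accuracy` and `per_label_accuracy` are illustrative, not part of SetFit):

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    total = sum(len(t) for t in y_true)
    correct = sum(a == b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return correct / total


# Made-up multi-hot labels for 4 samples and 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

print(subset_accuracy(y_true, y_pred))     # 2 of 4 rows match exactly
print(per_label_accuracy(y_true, y_pred))  # 10 of 12 label decisions match
```

Checking both numbers may help tell whether the model is mostly right per label but rarely gets every label in a sample correct.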

Thank you!

MattiL commented 1 year ago

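I have used SetFit when only a limited amount of labeled training data is available. If you have ample data, you can also use other frameworks. I have used the following code to compensate for imbalanced classes in the training data.

from sentence_transformers import SentenceTransformer
from setfit import SetFitModel
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One logistic regression per label; class_weight="balanced" reweights rare positives
model = SetFitModel(
    model_body=SentenceTransformer('all-MiniLM-L6-v2'),
    model_head=OneVsRestClassifier(LogisticRegression(class_weight="balanced")),
    multi_target_strategy="one-vs-rest"
)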

josh-yang92 commented 1 year ago

@MattiL I am not sure you understood my question correctly. I am talking about the multilabel problem, where each input can have several of the labels at the same time, unlike the multiclass problem, where each input has exactly one label out of many.

Balancing the training data for the multiclass problem is easy: you just balance the data or adopt a method like yours. However, for the multilabel problem with n labels, there can be sum(nCr) label combinations (summed over r = 1..n, i.e. 2^n - 1). So, to my understanding, to achieve results similar to those reported, you would need an equal number of examples for each and every combination, which would defeat the purpose of few-shot learning.
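To put a number on the sum(nCr) argument, here is a quick check for the 12 labels mentioned earlier in the thread. The value K = 8 is a hypothetical per-combination sample count chosen only for illustration:

```python
from math import comb


def label_combinations(n_labels: int) -> int:
    """Number of non-empty label subsets: sum of C(n, r) for r = 1..n."""
    return sum(comb(n_labels, r) for r in range(1, n_labels + 1))


n = 12  # number of labels mentioned in the thread
total = label_combinations(n)
print(total)      # 4095, i.e. 2**12 - 1
print(total * 8)  # 32760 examples at a hypothetical K = 8 per combination
```

In practice many of these combinations may never occur in the data, so this is an upper bound on what "balance every combination" would require.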

Hopefully I have explained myself better...

MattiL commented 1 year ago

I have tried to use the balancing code for multilabel classification. I guess that might improve the accuracy. Multiclass has had little support in SetFit.

alejandrodumas commented 1 year ago

You could use scikit-multilearn to create a balanced training dataset, for example with iterative_train_test_split from skmultilearn.model_selection, which stratifies the split on the labels.
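The point of this suggestion is that you can balance each label rather than each combination of labels, which keeps the sample count linear in the number of labels. The sketch below is not scikit-multilearn's algorithm, only a simplified, dependency-free stand-in for the same idea: it greedily picks rows until every label has at least K positive examples. The function name `sample_k_per_label` and the toy data are hypothetical.

```python
import random


def sample_k_per_label(label_rows, k, seed=0):
    """Greedily pick row indices until every label has at least k positives
    (or until the positives for that label run out)."""
    rng = random.Random(seed)
    n_labels = len(label_rows[0])
    order = list(range(len(label_rows)))
    rng.shuffle(order)
    counts = [0] * n_labels
    chosen = []
    for idx in order:
        row = label_rows[idx]
        # Keep the row only if it helps a label that is still short of k.
        if any(row[j] == 1 and counts[j] < k for j in range(n_labels)):
            chosen.append(idx)
            for j in range(n_labels):
                counts[j] += row[j]
    return sorted(chosen), counts


# Toy multi-hot labels: 8 samples, 3 labels.
labels = [
    [1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1],
    [1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0],
]
chosen, counts = sample_k_per_label(labels, k=2)
print(chosen, counts)
```

Because one row can carry several labels, some labels may end up with more than K positives; rows with no positive label are never selected by this sketch.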

singularity014 commented 1 year ago

For anyone wondering what the data should look like, here is a sample format:

'text', 'label1', 'label2', 'label3'

'this is a sentence', 0, 0, 1
'this is a sentence2', 1, 0, 1
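A minimal sketch of turning rows in this format into a list of texts plus one multi-hot label vector per text, using only the standard library. It assumes the single-quoted, comma-separated layout shown above; whether your SetFit version expects the label vectors in exactly this shape is worth checking against its multilabel documentation.

```python
import csv
import io

# The sample rows from above, quoted with single quotes.
sample = """'text', 'label1', 'label2', 'label3'
'this is a sentence', 0, 0, 1
'this is a sentence2', 1, 0, 1
"""

reader = csv.reader(io.StringIO(sample), quotechar="'", skipinitialspace=True)
header = next(reader)
label_names = header[1:]  # every column after 'text' is one label

texts, labels = [], []
for row in reader:
    texts.append(row[0])
    labels.append([int(value) for value in row[1:]])  # multi-hot vector

print(label_names)
print(texts)
print(labels)
```

Each entry in `labels` has one 0/1 flag per label, so a sample with several labels simply has several ones in its vector.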