huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0
2.14k stars, 217 forks

No tutorial or guideline for Few-shot learning on multiclass text classification #408

Open. ByUnal opened 1 year ago

ByUnal commented 1 year ago

I want to use SBERT for few-shot multi-class text classification, but I couldn't find any tutorial or explanation for it. Can you explain which "multi_target_strategy" and loss function I should use for multi-class text classification?

tomaarsen commented 1 year ago

Hello! I'm afraid the documentation is indeed a bit lacking in that department. You can experiment with the different multi_target_strategy options from the README, but I think "multi-output" should be a good start. Beyond that, you don't have to override the default loss function; you can just leave it. The default is the recommended one.

ByUnal commented 1 year ago

I tried every option in the README, but in every case I encountered IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed. It seems that those multi_target_strategy options don't work for multi-class. Interestingly, if I don't use the multi_target_strategy parameter, training runs, but the success rate is terrible.

josh-yang92 commented 1 year ago

It's probably due to your input data dimensions (I would presume your label dimension). I have it working with the multi_target_strategy parameter, though my accuracy is not very good. Note that I am working on a multi-label problem, not a multi-class one.

joel-odlund commented 1 year ago

As far as I understand, multi-class refers to the setting where you predict one class out of multiple classes, whereas multi-label refers to the setting where you predict multiple labels (out of multiple classes). So if you have a multi-class setting in this sense, you would not want to enable the multi-target options.
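To make the distinction concrete, here is a minimal sketch of what the labels look like in each setting. The example data and the helper name `is_multilabel` are made up for illustration and are not part of SetFit.

```python
# Multi-class: every example belongs to exactly one class,
# so each label is a single integer class id.
multiclass_labels = [0, 2, 1, 0]

# Multi-label: every example can have zero or more classes,
# so each label is a binary vector with one slot per class.
multilabel_labels = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1]]


def is_multilabel(labels):
    """Return True if every label is a vector rather than a single id."""
    return all(isinstance(row, (list, tuple)) for row in labels)


print(is_multilabel(multiclass_labels))
print(is_multilabel(multilabel_labels))
```

Under this reading, a 1-dimensional list of ids is a multi-class dataset, and the multi-target options are only relevant when each label is itself a vector.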

utility-aagrawal commented 11 months ago

Hi All,

I have a question on this same topic. I am working on a multi-class text classification problem, and I have a couple of questions about the expected format for labels in the data: 1) Do labels need to be integers? 2) I understand that for binary classification they can be 0 and 1, but what about the case of more than 2 classes? I am working on a sentiment analysis problem with 3 classes: positive, negative and neutral. How should I format the labels in the dataset? I tried -1, 0, and 1 for negative, neutral and positive respectively, but training failed with the error: "setfit IndexError: Target -1 is out of bounds."

I could really use some help. Thanks!

utility-aagrawal commented 11 months ago

Just to add to my last question, my problem is just a multi-class text classification problem and not a multi-label problem. One sentence/example will have only one label out of positive, negative or neutral. Thanks!

josh-yang92 commented 11 months ago

@utility-aagrawal You should one-hot encode the labels so that they are [0 or 1, 0 or 1, 0 or 1], where the positions correspond to [negative, neutral, positive].

E.g., if a sentence is neutral, then your label should be [0, 1, 0].

This should answer both of your questions.

utility-aagrawal commented 11 months ago

@josh-yang92 Thanks a lot! And I don't need to use multi_target_strategy, right? That's for multi-label classification problems?

utility-aagrawal commented 11 months ago

I am getting the following error after encoding my target variables as [1, 0, 0] for negative, [0, 1, 0] for neutral, and [0, 0, 1] for positive. I am not using multi_target_strategy since I don't have multiple target variables.

[screenshot of the error; not preserved]

Suggestions are welcome!

utility-aagrawal commented 11 months ago

@ByUnal Were you able to make it work for a multiclass text classification problem? I would love to hear your experience with this. Thanks!

@tomaarsen Do you have any recommendations on how to handle labels for multi-class text classification? Thanks!

josh-yang92 commented 11 months ago

You could compare your data format to the format used in this example: https://github.com/huggingface/setfit/blob/main/notebooks/text-classification.ipynb

ByUnal commented 11 months ago

@utility-aagrawal Hello there, and sorry for the late answer. I think your problem would be solved if you used [0, 1, 2] as target values for [neutral, positive, negative] instead of combinations of 1 and 0. Your way works more like binary classification: you try to estimate whether each sample is positive or not (and the same for the other classes). Besides, do not define any multi_target_strategy for multi-class classification, since it didn't work in my case. I managed to train a model this way. Hope it works. Let me know if you need further help.
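A minimal sketch of the label formats discussed above, converted to zero-based integer class ids. The helper names are hypothetical, and the class order used for the string labels simply follows the [neutral, positive, negative] example in this comment; any fixed order works as long as it is used consistently.

```python
# Class order taken from the comment above; the ids are the list positions.
LABEL_NAMES = ["neutral", "positive", "negative"]
label2id = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def encode_names(names):
    """Map string labels to integer class ids starting at 0."""
    return [label2id[name] for name in names]


def one_hot_to_id(vector):
    """Convert a one-hot vector such as [0, 1, 0] to the index of its 1.

    The resulting id follows the position order of the one-hot vector.
    """
    if sorted(vector) != [0] * (len(vector) - 1) + [1]:
        raise ValueError(f"not a one-hot vector: {vector}")
    return vector.index(1)


def shift_to_zero_based(values):
    """Remap contiguous ids such as -1, 0, 1 to 0, 1, 2.

    Assumes the smallest id actually occurs in `values`.
    """
    lowest = min(values)
    return [v - lowest for v in values]


encoded = encode_names(["positive", "negative", "neutral"])
from_one_hot = [one_hot_to_id(v) for v in [[1, 0, 0], [0, 1, 0], [0, 0, 1]]]
shifted = shift_to_zero_based([-1, 0, 1, 1])
```

Negative ids such as -1 are what triggered the "Target -1 is out of bounds" error reported earlier in the thread, so shifting them to start at 0 avoids that failure.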

utility-aagrawal commented 11 months ago

@ByUnal Thanks a lot for your response! Using (0,1,2) without specifying any multi_target_strategy worked for me! I was able to train using that.
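For readers who want to reproduce this setup, here is a rough sketch of such a training run. It assumes SetFit 1.0 or later (which exposes `Trainer` and `TrainingArguments`; older releases used `SetFitTrainer` instead) together with the `datasets` package. The checkpoint name, the hyperparameters, the example sentences and the helper names `to_columns` and `train_multiclass` are illustrative assumptions, not values taken from this thread.

```python
def to_columns(examples):
    """Turn (text, label_id) pairs into a column-oriented dict."""
    texts, labels = zip(*examples)
    return {"text": list(texts), "label": list(labels)}


def train_multiclass(train_examples, eval_examples,
                     model_id="sentence-transformers/paraphrase-mpnet-base-v2"):
    """Train a SetFit model on integer class ids and return it with its metrics."""
    # Imported lazily so the data preparation above works without these packages.
    from datasets import Dataset
    from setfit import SetFitModel, Trainer, TrainingArguments

    train_ds = Dataset.from_dict(to_columns(train_examples))
    eval_ds = Dataset.from_dict(to_columns(eval_examples))

    # No multi_target_strategy here: each example has exactly one class id.
    model = SetFitModel.from_pretrained(model_id)
    args = TrainingArguments(batch_size=16, num_epochs=1)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return model, trainer.evaluate()


# Made-up examples using 0 = neutral, 1 = positive, 2 = negative.
train_examples = [("great product", 1), ("terrible service", 2), ("it is okay", 0)]
columns = to_columns(train_examples)
```

Calling `train_multiclass` downloads the checkpoint and runs the actual training, so it is left uncalled here.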

I have a couple of follow-up questions:

1) I trained with 8 and 16 examples per class, and accuracy is in the 60s, which is not bad but unfortunately not good enough for my use case. Should I experiment with more examples per class, or, if I can gather more training data, should I train from scratch or fine-tune a bigger model? Do you have experience with this?
2) Do you have any tips for choosing training examples for SetFit? Currently, I am randomly choosing n examples per class from my training data.

I appreciate your help!
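The "randomly choose n examples per class" step described above can be sketched with the standard library alone. The function name `sample_per_class` and the toy data are hypothetical; fixing the seed makes runs with 8 and 16 examples per class comparable.

```python
import random
from collections import defaultdict


def sample_per_class(texts, labels, n_per_class, seed=42):
    """Randomly pick up to n_per_class examples for every class id."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)  # local generator, so results are reproducible
    sampled_texts, sampled_labels = [], []
    for label in sorted(by_label):
        pool = by_label[label]
        # Classes with fewer than n_per_class examples contribute all they have.
        chosen = rng.sample(pool, min(n_per_class, len(pool)))
        sampled_texts.extend(chosen)
        sampled_labels.extend([label] * len(chosen))
    return sampled_texts, sampled_labels


# Toy data: 30 examples spread evenly over class ids 0, 1, 2.
texts = [f"example {i}" for i in range(30)]
labels = [i % 3 for i in range(30)]
few_texts, few_labels = sample_per_class(texts, labels, n_per_class=8)
```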

ByUnal commented 11 months ago

@utility-aagrawal The more data samples per class, the better. However, you need to run a bunch of experiments, because it depends on your data quality, data samples, classification model and so forth. I think you should observe your results and decide how to proceed. For example, you can use a confusion matrix to understand which classes the model confuses.
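A minimal sketch of the confusion-matrix check mentioned above, written in plain Python. The true and predicted labels are made-up values; in practice they would come from a held-out set and the trained model.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, num_classes):
    """Build a matrix where entry [t][p] counts examples of true class t predicted as p."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in range(num_classes)]
            for t in range(num_classes)]


# Made-up labels using 0 = neutral, 1 = positive, 2 = negative.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)

for true_id, row in enumerate(matrix):
    print(true_id, row)
```

Large off-diagonal entries show which pairs of classes the model mixes up, which can guide where to add or clean training examples.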

In my case, I had really imbalanced data, and the quality was low, so it didn't work as I expected. Anyway, you can try the following to increase your success rate:

This is all I can say from this point of view.

utility-aagrawal commented 11 months ago

Thanks @ByUnal ! I'll give that a try.