huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

SetFit for a large number of classes #145

Closed steve-marmalade closed 1 year ago

steve-marmalade commented 2 years ago

Hi there, thanks for releasing such an interesting library.

I am curious whether any experiments have been run using SetFit in the extreme multi-class setting, say with n_classes >= 100?

MosheWasserb commented 2 years ago

Hi, thanks for your feedback. I did run SetFit on clinc_oos 'plus' (150+1 classes): with 151x8 samples (batch size 32) I got 78% accuracy vs. 86% on the full data (>15K samples). It would be great if you could share more results with large n.
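For reference, a rough sketch of what such a run could look like (this is not the original script; the base model and any hyperparameter not mentioned above are assumptions):

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer, sample_dataset

dataset = load_dataset("clinc_oos", "plus")
# 151 classes x 8 samples each
train_dataset = sample_dataset(dataset["train"], label_column="intent", num_samples=8)

# Base model is an assumption, not necessarily the one used for the numbers above.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=dataset["test"],
    loss_class=CosineSimilarityLoss,
    batch_size=32,
    num_iterations=20,
    column_mapping={"text": "text", "intent": "label"},
)
trainer.train()
print(trainer.evaluate())
```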

PhilipMay commented 2 years ago

At the moment I am training on a German dataset with ~90 very unbalanced classes. The minority class has 20 samples. The majority class has 183 samples.

It works very well.

paul-khudan commented 1 year ago

> At the moment I am training on a German dataset with ~90 very unbalanced classes. The minority class has 20 samples. The majority class has 183 samples.
>
> It works very well.

Hello @PhilipMay, could you share some details about your training configuration?

  1. Do you have a multiclass or multilabel task?
  2. Do you use a sklearn head?
  3. Are there any minority classes that are fully "ignored" (show 0 recall)?
PhilipMay commented 1 year ago

> Do you have a multiclass or multilabel task?

I have a normal multi-class task.

> Do you use a sklearn head?

Yes.

> Are there any minority classes that are fully "ignored" (show 0 recall)?

No.
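For reference, a minimal sketch of a comparable setup (a multi-class task with SetFit's default scikit-learn logistic-regression head); the model name, placeholder data, and hyperparameters are assumptions, not PhilipMay's actual configuration:

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Placeholder multi-class data with integer class ids (~90 unbalanced classes in practice).
train_dataset = Dataset.from_dict({
    "text": ["Beispieltext A", "noch ein Text A", "Beispieltext B", "noch ein Text B"],
    "label": [0, 0, 1, 1],
})

# Without use_differentiable_head, SetFit uses a scikit-learn LogisticRegression head.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,
    num_epochs=1,
)
trainer.train()
```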

tomaarsen commented 1 year ago

I believe these questions have been answered now, so I'll close this. You're always free to reopen or make a new issue for further questions or issues :)

KnutJaegersberg commented 1 year ago

My aim is to use SetFit to create a generic, broad topic classifier for weak labeling and corpus exploration. I haven't gotten anything useful out of fitting the model on my training dataset of roughly 6k records and 182 classes (which I balanced). I tried both multi-class and multi-label classification, and would have expected to get something useful out of the latter, but no luck so far. The classes overlap; it is actually hierarchical classification with 17 broad classes. The dataset is artificial, based on class descriptions. I played around with the (somewhat larger) base dataset here until I got balanced classes (on the mid and high levels):

https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long

I used this code:

```python
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    multi_target_strategy="one-vs-rest",
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train,
    eval_dataset=evaluation,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=1,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,       # The number of epochs to use for contrastive learning
    column_mapping={"text": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)
```

For multi-label, I created a label column with dummy variables in an array. The prediction was 0 when it clearly shouldn't have been.

No class got predicted. All 0s. Could this have to do with the large number of classes?

KnutJaegersberg commented 1 year ago

Never mind, I think it was an issue with reticulate.

tomaarsen commented 1 year ago

Glad to hear that you got it working!

iHamzaKhanzada commented 1 year ago

I am working on a multi-label scenario with 400+ classes. The dataset I have right now has a categories field that contains the category names for each row, so some rows belong to 2 categories and some belong to 4, but there are 400+ classes in total. I am confused whether this dataset would be acceptable as-is, or whether I have to rebuild my labels and transform them into 0/1 arrays, i.e. text: "some text", label: [0,1,0,0,0,0,1,0,0] as opposed to text: "some text", label: ['class-23', 'class-44']?

tomaarsen commented 1 year ago

Hello!

The former style is required. You'll have to train using a multi-label model: https://github.com/huggingface/setfit#example-using-a-classification-head-from-scikit-learn
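As an illustration (not part of the linked example), one way to turn category-name labels into the required multi-hot arrays is scikit-learn's MultiLabelBinarizer; the class names below are hypothetical:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical full label set (442 classes in the scenario above).
all_classes = [f"class-{i}" for i in range(442)]
mlb = MultiLabelBinarizer(classes=all_classes)

raw_labels = [["class-23", "class-44"], ["class-7"]]
multi_hot = mlb.fit_transform(raw_labels)  # shape (n_rows, 442), entries are 0/1

# Each row of `multi_hot` becomes the `label` value for the corresponding text.
print(multi_hot.shape, multi_hot[0].sum())
```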

iHamzaKhanzada commented 1 year ago

Thank you for your response. Would you suggest which multi_target_strategy I should use? I am a beginner in ML and data science, working on my Master's dissertation on a multi-label few-shot classification problem. Here are some statistics about my dataset: total rows: 20k+, total unique categories: 442.

I have discarded the categories that had fewer than 6 rows. For the training/validation split, the categories with more than 30 rows (samples) are part of the training dataset and the rest are part of the test dataset,

so now there are 321 unique categories and 18K rows in the training dataset, and around 60 unique categories and 2k+ rows in the test dataset.

Based on this, can you suggest hyper-parameters for the best results?

tomaarsen commented 1 year ago

I would try out the multi-output option, but feel free to experiment with the others as well. As for the other hyperparameters, good values depend a lot on the exact problem, so I can't suggest any specific ones, but I would recommend lowering the batch_size if you get "Out of memory" exceptions and decreasing num_iterations if training takes too long. Beyond that, the default options are solid.
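A minimal sketch of what that could look like, assuming the multi-hot label format discussed earlier; the model name, placeholder data, and exact values are assumptions rather than recommendations:

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Tiny placeholder dataset with multi-hot labels (3 classes here; 442 in practice).
train_dataset = Dataset.from_dict({
    "text": ["first text", "second text", "third text", "fourth text"],
    "label": [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    multi_target_strategy="multi-output",
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=8,        # lower further on "Out of memory" errors
    num_iterations=10,   # fewer contrastive pairs -> shorter training
    num_epochs=1,
)
trainer.train()
```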

With that much data, you may also be able to use a "standard" text classification solution using 🤗 Transformers (docs).

nprasanthi7 commented 11 months ago

I have 53,658 samples of training data with 514 labels, and I took 30 samples per label. It took nearly 4 days to complete the training and gave 0% accuracy on the evaluation data. What might be the reason?

MosheWasserb commented 11 months ago

@nprasanthi7 Regarding the speed: did you run on a GPU? What is the average sequence length? Regarding accuracy, it is very hard to say; a few ideas:

  1. Try logistic regression on your data as a baseline to make sure the test set is OK (see the sketch below).
  2. Take a small subset of the data and labels to make sure it runs OK.
  3. It may be that the classes are very close to each other.
  4. Maybe the test data is extremely imbalanced.
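A minimal sketch of such a logistic-regression baseline on sentence-transformer embeddings (the model name and placeholder data are assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

encoder = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

# Placeholders for the real texts and (integer-encoded) labels.
train_texts, train_labels = ["text a", "text b", "text c", "text d"], [0, 0, 1, 1]
test_texts, test_labels = ["text e"], [1]

X_train = encoder.encode(train_texts)
X_test = encoder.encode(test_texts)

clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print("baseline accuracy:", accuracy_score(test_labels, clf.predict(X_test)))
```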

nprasanthi7 commented 11 months ago

Thank you @MosheWasserb for your response. Yes, I ran on a GPU and it took nearly 45 GB of GPU RAM. The average sequence length is 748.71. One sample: text "term loan lender term loan commitment", label: a long boolean array (one entry per label) that is all False except for a single True at the end. Can I use True and False instead of 0 and 1? I even trained with 1 sample per label and still got 0.

ya-stack commented 5 months ago

Hi @tomaarsen, I'm working with hierarchical data, specifically item taxonomy data, and my goal is to predict four levels (Product type, Product subtype, Merchandise type, and Item type) based on product descriptions and titles. I'm seeking advice on how to prepare the data for this multi-label classification problem while preserving the hierarchical structure. For instance, if the product type is "furniture", the model should classify the product subtype within the furniture category, and similarly for merchandise type and item type. Below is a snippet of the data:

| tcin | product_type_n | product_sub_type_n | merchandise_type_n | item_type_n |
|------|----------------|--------------------|--------------------|-------------|
| XYZ  | HOME | BEDDING | blankets and throws | Throw Blankets |
| BCD  | HOME | SOFT HOME | rugs, mats and grips | Rugs |
| PQR  | FURNITURE | seating and tables | standalone tables | Console Tables |
| ABC  | HOME | SOFT HOME | rugs, mats and grips | Rugs |
| EFG  | FURNITURE | bedroom furniture | beds and mattresses | Beds |

Thanks in advance.
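Purely as an illustration of one possible data preparation (this is not an answer from the maintainers): train one classifier per taxonomy level by deriving a separate label column for each level, and at inference time keep only lower-level predictions that are valid children of the predicted parent. The column names follow the snippet above; everything else is hypothetical:

```python
import pandas as pd
from datasets import Dataset

# Hypothetical rows mirroring the taxonomy columns shown above.
df = pd.DataFrame({
    "text": ["soft knit throw blanket", "mid-century console table"],
    "product_type_n": ["HOME", "FURNITURE"],
    "product_sub_type_n": ["BEDDING", "seating and tables"],
    "merchandise_type_n": ["blankets and throws", "standalone tables"],
    "item_type_n": ["Throw Blankets", "Console Tables"],
})

levels = ["product_type_n", "product_sub_type_n", "merchandise_type_n", "item_type_n"]
datasets_per_level = {
    level: Dataset.from_pandas(df[["text", level]].rename(columns={level: "label"}))
    for level in levels
}
# Each entry can then serve as the train_dataset of its own SetFitTrainer; at inference,
# filter each level's prediction to children of the level predicted above it.
```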