Hi, thanks for your feedback. I did try running clinc_oos 'plus' (150+1 classes): with 151x8 samples (BS=32) I got 78% accuracy vs. 86% with the full data (>15K samples). It would be great if you could share more results with large n.
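For reference, a rough sketch of how such a few-shot subset can be drawn (assuming a recent setfit version that ships the `sample_dataset` helper; this is not the exact configuration used above):

```python
from datasets import load_dataset
from setfit import sample_dataset

# clinc_oos "plus" has 150 intents + out-of-scope; the label column is "intent".
dataset = load_dataset("clinc_oos", "plus")

# Keep only 8 examples per class for the few-shot run (151 x 8 samples).
train_dataset = sample_dataset(dataset["train"], label_column="intent", num_samples=8)
test_dataset = dataset["test"]
```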
At the moment I am training on a German dataset with ~90 very unbalanced classes. The minority class has 20 samples. The majority class has 183 samples.
It works very well.
@PhilipMay, hello, could you add any details about your training configuration?
Do you have a multiclass or multilabel task?
I have a normal multi-class task.
Do you use a sklearn head?
yes
Are there any minority classes that are fully "ignored" (show 0 recall)?
no
I believe these questions have been answered now, so I'll close this. You're always free to reopen or make a new issue for further questions or issues :)
My aim is to use SetFit to create a generic, broad topic classifier for the purpose of weak labeling and corpus exploration. I haven't gotten anything useful out of trying to fit the model on my training dataset with roughly 6k records and 182 classes (which I balanced). I tried both multiclass and multilabel classification, and would have expected to get something useful out of the latter. No luck so far. The classes overlap; it is actually hierarchical classification with 17 broad classes. The dataset is artificial, based on class descriptions. I played around with the (somewhat larger) base dataset here until I got balanced classes (on the mid and high level):
https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long
I used this code:
```python
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    multi_target_strategy="one-vs-rest",
)
trainer = SetFitTrainer(
    model=model, train_dataset=train, eval_dataset=evaluation, loss_class=CosineSimilarityLoss,
    batch_size=1,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"text": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)
```
For multilabel, I created a label column with dummy variables in an array. The prediction was 0 when it clearly shouldn't have been.
No class got predicted, all 0s. Could this have to do with the large number of classes?
Never mind, I think it was an issue with reticulate.
Glad to hear that you got it working!
I am working on a multi-label scenario with 400+ classes. The dataset I have right now has a categories field that contains the category names for each row, so some rows belong to 2 categories and some rows belong to 4 categories, but there are 400+ classes in total. I am confused whether this dataset would be acceptable as-is, or whether I have to rebuild my labels and transform them to 0/1, i.e. text: "some text", label: [0,1,0,0,0,0,1,0,0] as opposed to text: "some text", label: ['class-23', 'class-44']?
Hello!
The former style is required. You'll have to train using a multi-label model: https://github.com/huggingface/setfit#example-using-a-classification-head-from-scikit-learn
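To illustrate, here is a minimal sketch (not from the linked docs) of converting category-name lists into the required 0/1 vectors with scikit-learn's `MultiLabelBinarizer`; the column names and example rows are made up:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows with a list of category names per example.
rows = [
    {"text": "some text", "categories": ["class-23", "class-44"]},
    {"text": "other text", "categories": ["class-7"]},
]

# Fit over all category lists so every class gets a fixed position in the vector.
mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform([row["categories"] for row in rows])

# Each row now carries a 0/1 vector over all classes, as expected for multi-label training.
records = [
    {"text": row["text"], "label": labels.tolist()}
    for row, labels in zip(rows, label_matrix)
]
```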
Thank you for your response. Would you suggest which `multi_target_strategy` I should use? I am a beginner in ML and data science, working on my Master's dissertation on a multi-label few-shot classification problem. Here are some statistics about my dataset: 20k+ total rows, 442 unique categories.
I have discarded the categories with fewer than 6 rows. For the training/validation split, the categories with more than 30 rows (samples) went into training_dataset and the rest into test_dataset.
So now there are 321 unique categories and 18K rows in training_dataset, and around 60 unique categories and 2k+ rows in testing_dataset.
Based on this, can you suggest hyperparameters for the best results?
I would try out the `multi-output` option, but feel free to experiment with the others as well. As for the other hyperparameters, good values depend a lot on the exact problem, so I can't suggest any specific ones, but I would recommend lowering the `batch_size` if you get "Out of memory" exceptions, and decreasing `num_iterations` if the training time is too long. Beyond that, the default options are solid.
With that much data, you may also be able to use a "standard" text classification solution using 🤗 Transformers (docs).
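To make that concrete, here is a rough sketch of a `multi-output` setup; the checkpoint and hyperparameter values are illustrative assumptions, not recommendations, and `train_dataset`/`eval_dataset` are assumed to already contain 0/1 label vectors in their "label" column:

```python
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    multi_target_strategy="multi-output",
)
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=8,       # lower this if you hit out-of-memory errors
    num_iterations=10,  # lower this if training takes too long
)
trainer.train()
```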
I have 53,658 samples of training data and 514 labels, and have taken 30 samples per label. It took nearly 4 days to complete the training and gave 0% accuracy on the evaluation data. What might be the reason?
@nprasanthi7 Regarding the speed: did you run on a GPU? What is the average sequence length? Regarding accuracy, it is very hard to say. A few ideas: 1. Try logistic regression on your data as a baseline to make sure the test set is OK (see the sketch below). 2. Take a small subset of data and labels to make sure it's running OK. 3. It may be that the classes are very close to each other. 4. Maybe the test data is extremely imbalanced.
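A minimal sketch of idea 1, a quick TF-IDF + logistic regression baseline to sanity-check the data before spending days on SetFit training; variable names like `train_texts`/`train_labels` are placeholders for your own splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# One-vs-rest wrapping lets the same baseline handle multi-hot label vectors too.
baseline = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
baseline.fit(train_texts, train_labels)
preds = baseline.predict(test_texts)
print(f1_score(test_labels, preds, average="micro"))
```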
Thank you @MosheWasserb for your response. Yes, I ran on a GPU and it took nearly 45 GB of GPU RAM. The average sequence length is 748.71. One sample: term loan lender term loan commitment, "[False False False ... False True]" (a long boolean label vector that is all False except for a final True). Can I have True and False instead of 0 and 1? I even trained with 1 sample per label and still got 0.
Hi @tomaarsen, I'm working with hierarchical data, specifically item taxonomy data, and my goal is to predict four levels: product type, product subtype, merchandise type, and item type, based on product descriptions and titles. I'm seeking advice on how to prepare the data for this multi-label classification problem while preserving the hierarchical structure. For instance, if the product type is "furniture", the model should classify the product subtype within the furniture category, and similarly for merchandise type and item type. Below is a snippet of the data:
| tcin | product_type_n | product_sub_type_n | merchandise_type_n | item_type_n |
|------|----------------|--------------------|--------------------|-------------|
| XYZ  | HOME           | BEDDING            | blankets and throws | Throw Blankets |
| BCD  | HOME           | SOFT HOME          | rugs, mats and grips | Rugs |
| PQR  | FURNITURE      | seating and tables | standalone tables   | Console Tables |
| ABC  | HOME           | SOFT HOME          | rugs, mats and grips | Rugs |
| EFG  | FURNITURE      | bedroom furniture  | beds and mattresses | Beds |
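For reference, a rough sketch of one layout I'm considering (an assumption on my side, not a settled approach): keeping one label column per taxonomy level so each level could get its own classifier, with the hierarchy kept implicitly through the shared rows. The data below is just the snippet above.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["XYZ", "HOME", "BEDDING", "blankets and throws", "Throw Blankets"],
        ["BCD", "HOME", "SOFT HOME", "rugs, mats and grips", "Rugs"],
        ["PQR", "FURNITURE", "seating and tables", "standalone tables", "Console Tables"],
        ["ABC", "HOME", "SOFT HOME", "rugs, mats and grips", "Rugs"],
        ["EFG", "FURNITURE", "bedroom furniture", "beds and mattresses", "Beds"],
    ],
    columns=["tcin", "product_type_n", "product_sub_type_n", "merchandise_type_n", "item_type_n"],
)

levels = ["product_type_n", "product_sub_type_n", "merchandise_type_n", "item_type_n"]
# One (tcin, label) frame per level, e.g. for training a separate classifier per level.
per_level = {level: df[["tcin", level]].rename(columns={level: "label"}) for level in levels}
```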
Thanks in advance.
Hi there, thanks for releasing such an interesting library.
I am curious if any experiments have been run using SetFit in the extreme multiclass setting, say with `n_classes >= 100`?