cassin-edwin opened this issue 1 year ago
@tomaarsen Could you please provide some insights?
Hello @cassin-edwin!
By "not captured", are you referring to a lower-than-expected performance or a bug? I'll assume the former for the rest of this message. How many classes (topics) are you trying to train on? I do not have a lot of experience with multi-label classification, but with single-label classification, the number of classes has a notable impact on the number of training samples that should be annotated for solid performance. The exact details behind these graphs are irrelevant, but they do show some interesting behaviour. Note that the boxplots show the performances of three very similar SetFit models:
bbc-news dataset
Important: This dataset has 2 classes! This is a fairly simple dataset, and SetFit is able to reach 90%+ accuracy with only 4 samples per class (e.g. 8 annotated samples total).
emotion dataset
Important: This dataset has 6 classes! This is a fairly difficult dataset, even for humans. SetFit will require more data before it reaches the point where additional data stops improving performance. This is caused both by the larger number of classes that need to be separated in the embedding space and by the relatively low data quality.
The takeaway here is that adding more data may be very fruitful for models that classify between more classes. For your specific case, I am going to guess that it will help, especially considering most models that I train do not improve beyond epoch 1: the training loss has generally already reached 0 then (although the evaluation loss is not 0).
Another interesting test for your situation is to take a "general" pretrained sentence transformer, rather than your domain adapted one. It would be interesting to see the performance differences.
As for hyperparameter tuning, I've been trying to find some better values myself, but the only improvement that I've found for my scenarios is that a learning rate of `8e-5` instead of `2e-5` sometimes works better, and that a slightly higher `num_iterations` (e.g. 30) can help, too.
However, I think the main thing to invest time in is getting some additional annotations.
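If it helps, here is a minimal sketch of plugging those values into the classic `SetFitTrainer` API. This is not your exact setup: the checkpoint, the toy multi-label data, and the label names are placeholder assumptions.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

# Tiny placeholder multi-label dataset: each label is a multi-hot vector [storage, auth]
train_ds = Dataset.from_dict({
    "text": [
        "Disk usage exceeded threshold on node-3",
        "Login failed: invalid credentials for user admin",
        "Disk full and repeated login failures observed",
    ],
    "label": [[1, 0], [0, 1], [1, 1]],
})

# "one-vs-rest" gives the classification head one binary classifier per label
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",  # placeholder checkpoint
    multi_target_strategy="one-vs-rest",
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_epochs=1,        # SetFit usually peaks at or near 1 epoch
    num_iterations=30,   # generate more contrastive pairs instead of more epochs
    learning_rate=8e-5,  # instead of the default 2e-5
)
trainer.train()

# Multi-label prediction returns one multi-hot vector per input text
preds = model.predict(["Cannot log in after the disk filled up"])
```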
@tomaarsen The total number of classes is 97 (multi-label). Each class has a minimum of 18 samples (with random_state = 81, so at least 16 samples per class end up in the training set). My data quality is quite poor (the contents are logs).
@tomaarsen,
Today I trained a few models with the following hyper-parameters:
- Domain-adapted model: Epochs 1, Iterations 30, Batch size 16 → Test accuracy 45%
- General sentence transformer model: Epochs 1, Iterations 30, Batch size 16 → Test accuracy 55%
- General sentence transformer model: Epochs 1, Iterations 60, Batch size 16 → Test accuracy 54%
- General sentence transformer model: Epochs 7, Iterations 30, Batch size 16 → Test accuracy 0%
When I tested a few samples, it was clear that the model was not predicting all of the topics it was supposed to predict for longer sentences containing many topics (which suggests the model was never trained on a single sentence carrying all of those topics together). Also, sometimes the training data itself does not get trained on all of the multi-label topics that were fed to it. As you can see, anything more than 1 epoch seems to overfit.
Is there any way to view the embedding space (e.g. with UMAP) to understand how the training went?
As you said, I think getting additional annotations is the only way, but it seems too time-consuming.
Can you suggest anything more to try? I appreciate it.
It is common for SetFit models to perform best at or near 1 epoch. If you want to do more training, it is recommended to increase `num_iterations` instead, which impacts the number of contrastive training pairs that are generated internally.
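As a rough back-of-the-envelope illustration (assuming the pair-sampling behaviour described in the SetFit docs, where each iteration draws one positive and one negative pair per training sample):

```python
# Hypothetical pair count for the setup discussed in this thread
num_samples = 97 * 16        # ~16 annotated samples for each of 97 classes
num_iterations = 30          # one positive + one negative pair per sample per iteration
num_pairs = 2 * num_iterations * num_samples
print(num_pairs)             # 93120 sentence pairs used for the embedding fine-tuning
```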
For non-multi-label datasets it is possible to apply dimensionality-reduction methods (PCA, t-SNE) to bring the embedding dimensions of your evaluation set down to 2 or 3, resulting in scatter plots of the evaluation samples in the embedding space.
However, I'm not sure if this can reasonably be done with multi-label data.
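For single-label data, a minimal sketch of producing such a plot could look like the following. It assumes `model` is a trained SetFitModel (whose `model_body` is the underlying SentenceTransformer), and `eval_texts` / `eval_labels` are placeholder names for your evaluation texts and numeric class ids.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Encode the evaluation texts with the fine-tuned sentence transformer body
embeddings = model.model_body.encode(eval_texts)        # shape: (n_samples, hidden_dim)

# Reduce the sentence embeddings to 2 dimensions for plotting
reduced = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(reduced[:, 0], reduced[:, 1], c=eval_labels, cmap="tab10", s=10)
plt.title("Evaluation set embeddings (PCA)")
plt.show()
```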
For data labeling, perhaps you'll enjoy using Argilla with SetFit; there's a tutorial here. Note: I am affiliated with Argilla.
@cassin-edwin - could you please tell how you did the domain adaptation? And why do you think that the domain-adapted model is doing worse than the original sentence transformer?
@vahuja4 Please refer to https://www.sbert.net/examples/domain_adaptation/README.html . I used the TSDAE method under Adaptive Pre-training. My dataset is actually log data, which means there is a lot of random information that is not needed. That's why I believe the performance wasn't better than an already-available pre-trained model.
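For anyone following along, a minimal TSDAE adaptive pre-training sketch along the lines of the sbert.net example looks roughly like this. The sentence list, checkpoint, and output path are placeholders, not the exact setup used here.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# Unlabeled in-domain sentences (placeholder; in this thread these would be log lines)
unlabeled_sentences = [
    "service restarted after timeout",
    "disk quota exceeded on node-3",
]

# Build a SentenceTransformer on top of a plain BERT checkpoint with CLS pooling
word_embedding = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding, pooling])

# TSDAE: the dataset adds noise (token deletion) and the loss reconstructs the input
train_dataset = datasets.DenoisingAutoEncoderDataset(unlabeled_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)
model.save("output/tsdae-bert-domain-adapted")
```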
@cassin-edwin - thank you for your reply! If you don't mind, can you please clarify the following questions:
@vahuja4
In the embedding space, I believe words are positioned based on their semantics and syntactic structure, and each word is given a vector. As I understand it, the concept behind sentence transformers is that the sentence embedding is the average of all the word embeddings/vectors. I don't know how a sentence would be given a vector directly.
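To make that mean-pooling idea concrete, a hedged illustration (the checkpoint and example sentence are placeholders) could look like:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Service restarted after the timeout error."  # placeholder example
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# Mean-pool the token vectors (ignoring padding) to get a single sentence vector
mask = inputs["attention_mask"].unsqueeze(-1)              # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)                            # torch.Size([1, 768])
```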
I used a bert-base-uncased model to do the domain adaptation. However, you can try it on any pre-trained model that you think was trained on data relevant to your fine-tuning dataset. This is just my opinion; look for more suggestions on this.
I am working on a task to classify a sentence into multiple topics (multi-label classification). Initially, I domain-adapted a bert-base model on the entire 1.5M unlabeled samples. Then I manually annotated the labels with at least 10 examples each and fine-tuned it. The test accuracy I received was 35% for 1 epoch and 46% for 7 epochs. I see that the model is capturing negations too, but it is not capturing multi-labels. I know that there is no way to measure accuracy on the train/test dataset after every epoch to tell whether it is overfitting or underfitting. Can you suggest a good range of epochs for this multi-label classification, or changes to the hyper-parameter optimization settings? Or maybe I should increase the quality of the labeled sentences? Suggestions please.