huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Multi-label topics not captured #316

Open cassin-edwin opened 1 year ago

cassin-edwin commented 1 year ago

I am working on a task to classify a sentence into multiple topics (multi-label classification). Initially, I domain-adapted a BERT base model on the entire 1.5M unlabeled sentences. Then I manually annotated the labels with at least 10 examples each and fine-tuned it. The test accuracy I received was 35% after 1 epoch and 46% after 7 epochs. I see that the model is capturing negations, but it is not capturing multiple labels. I know that there is no way to measure train/test accuracy after every epoch to tell whether it is overfitting or underfitting. Can you suggest a good range of epochs for this multi-label classification, or changes to the hyperparameter settings? Or maybe I should improve the quality of the labeled sentences? Suggestions please.
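For reference, a minimal sketch of a multi-label SetFit setup of the kind described above, assuming the pre-1.0 `SetFitTrainer` API; the model name, toy texts, and topic labels are placeholders, not the actual data from this issue:

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Toy multi-label data: each label is a multi-hot vector over the topics,
# here [storage, auth] purely for illustration.
train_ds = Dataset.from_dict({
    "text": [
        "disk full on node-3",
        "login failed for user admin",
        "login failed and disk almost full",
        "service restarted successfully",
    ],
    "label": [[1, 0], [0, 1], [1, 1], [0, 0]],
})

# multi_target_strategy turns the classification head into a multi-label one
# ("one-vs-rest", "multi-output" or "classifier-chain").
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",  # or a domain-adapted body
    multi_target_strategy="one-vs-rest",
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,  # number of contrastive pairs generated per sample
    num_epochs=1,
    batch_size=16,
)
trainer.train()

print(model(["disk full during login"]))  # multi-hot prediction per topic
```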

cassin-edwin commented 1 year ago

@tomaarsen Could you please provide some insights?

tomaarsen commented 1 year ago

Hello @cassin-edwin!

By "not captured", are you referring to a lower-than-expected performance or a bug? I'll assume the former for the rest of this message. How many classes (topics) are you trying to train on? I do not have a lot of experience with multi-label classification, but with single-label classification, the number of classes has a notable impact on the number of training samples that should be annotated for solid performance. The exact details behind these graphs are irrelevant, but they do show some interesting behaviour. Note that the boxplots show the performances of three very similar SetFit models:

Performance of k-shot on the bbc-news dataset

Important: This dataset has 2 classes! This is a fairly simple dataset, and SetFit is able to reach 90%+ accuracy with only 4 samples per class (i.e. 8 annotated samples total).

Performance of k-shot on the emotion dataset

Important: This dataset has 6 classes! This is a fairly difficult dataset, even for humans. SetFit will require more data before it reaches the point where additional data stops improving performance. This is caused both by there being more classes that need to be separated in the embedding space and by the relatively low data quality.

The takeaway here is that adding more data may be very fruitful for models that classify between more classes. For your specific case, I am going to guess that it will help, especially considering most models that I train do not improve beyond epoch 1: the training loss has generally already reached 0 then (although the evaluation loss is not 0).

Another interesting test for your situation is to take a "general" pretrained sentence transformer, rather than your domain adapted one. It would be interesting to see the performance differences.

As for hyperparameter tuning, I've been trying to find some better values myself, but the only improvement that I've found for my scenarios is that a learning rate of 8e-5 instead of 2e-5 sometimes works better, and that a slightly higher num_iterations (e.g. 30) can help, too. However, I think the main thing to invest time in is getting some additional annotations.
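To make the hyperparameter tuning part concrete, here is a rough sketch of searching around those values with `SetFitTrainer.hyperparameter_search`, assuming the Optuna backend is installed (`pip install setfit[optuna]`); `train_ds`, `eval_ds`, and the base model are placeholders:

```python
from setfit import SetFitModel, SetFitTrainer

def model_init(params=None):
    # A fresh model per trial; multi_target_strategy keeps the multi-label head.
    return SetFitModel.from_pretrained(
        "sentence-transformers/paraphrase-mpnet-base-v2",
        multi_target_strategy="one-vs-rest",
    )

def hp_space(trial):
    # Search around the values mentioned above (learning rate ~8e-5, num_iterations ~30).
    return {
        "learning_rate": trial.suggest_float("learning_rate", 2e-5, 1e-4, log=True),
        "num_iterations": trial.suggest_categorical("num_iterations", [20, 30, 40]),
        "num_epochs": trial.suggest_int("num_epochs", 1, 2),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32]),
    }

trainer = SetFitTrainer(
    model_init=model_init,
    train_dataset=train_ds,  # your annotated few-shot splits
    eval_dataset=eval_ds,
)
best_run = trainer.hyperparameter_search(direction="maximize", hp_space=hp_space, n_trials=10)

# Retrain one final model with the best hyperparameters found.
trainer.apply_hyperparameters(best_run.hyperparameters, final_model=True)
trainer.train()
print(trainer.evaluate())
```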

cassin-edwin commented 1 year ago

@tomaarsen The total number of classes is 97 (multi-label). Each class has a minimum of 18 samples (with random_state = 81, so at least 16 samples end up in the train set). My data quality is quite poor (the contents come from logs).

cassin-edwin commented 1 year ago

@tomaarsen,

Today I trained a few models with the following hyperparameters:

- Domain Adaptation model: Epochs - 1, Iterations - 30, Batch_size - 16 => Test Accuracy - 45%
- General Sentence Transformer model: Epochs - 1, Iterations - 30, Batch_size - 16 => Test Accuracy - 55%
- General Sentence Transformer model: Epochs - 1, Iterations - 60, Batch_size - 16 => Test Accuracy - 54%
- General Sentence Transformer model: Epochs - 7, Iterations - 30, Batch_size - 16 => Test Accuracy - 0%

When I tested a few samples, it was clear that the model was not predicting 'all' the topics it was supposed to predict for longer sentences that contain many topics (which means the model was never trained on a particular sentence carrying all of those topics together). Also, sometimes the model does not even learn 'all' the multi-label topics present in the training data that was fed to it. As you can see, anything more than 1 epoch could overfit.

Is there any way to view the embedding space (e.g. with UMAP) to see how the training went?

As you said, I think getting additional annotations is the only way, but it seems too time-consuming.

Can you suggest anything more to try? I appreciate it.

tomaarsen commented 1 year ago

It is common for SetFit models to perform best at or near 1 epoch. If you want to do more training, it is recommended to increase num_iterations instead, which controls the number of contrastive training pairs that are generated internally. For non-multi-label datasets it is possible to apply dimensionality reduction methods (PCA, t-SNE) to bring the embeddings of your evaluation set down to 2 or 3 dimensions and plot them (for example, the emotion 64-shot, 20 num_iterations graph shown earlier). However, I'm not sure if this can reasonably be done with multi-label data.
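As a rough illustration of the dimensionality-reduction idea, here is a sketch that encodes an evaluation set with the fine-tuned embedding body and plots a 2D PCA projection, colouring points by one topic at a time; it assumes the Sentence Transformer body is reachable as `model.model_body`, and `eval_texts`/`eval_labels` are placeholders:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Encode the evaluation sentences with the fine-tuned embedding body.
embeddings = model.model_body.encode(eval_texts)  # eval_texts: list[str]

# Reduce to 2D for plotting (t-SNE or UMAP could be swapped in here).
coords = PCA(n_components=2).fit_transform(embeddings)

# For multi-label data, colour the points by a single chosen topic.
topic_idx = 0  # index of the topic to inspect
has_topic = np.array([labels[topic_idx] for labels in eval_labels], dtype=bool)

plt.scatter(coords[~has_topic, 0], coords[~has_topic, 1], s=8, label="without topic")
plt.scatter(coords[has_topic, 0], coords[has_topic, 1], s=8, label="with topic")
plt.legend()
plt.title("Evaluation embeddings after SetFit fine-tuning")
plt.show()
```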

For data labeling, perhaps you'll enjoy using Argilla with SetFit; there's a tutorial here. Note: I am affiliated with Argilla.

vahuja4 commented 1 year ago

@cassin-edwin - could you please tell me how you did the domain adaptation? And why do you think the domain-adapted model is doing worse than the original sentence transformer?

cassin-edwin commented 1 year ago

@vahuja4 Please refer to https://www.sbert.net/examples/domain_adaptation/README.html. I used the TSDAE method under Adaptive Pre-Training. My dataset is actually log data, which means there is a lot of random information that is not needed. That's why I believe the performance wasn't better than an already available pre-trained model.
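For anyone following along, a condensed sketch of that TSDAE recipe, mirroring the linked SBERT example; the sentences below are placeholder log lines, and the default noising step additionally needs `nltk` installed:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

model_name = "bert-base-uncased"

# Build a SentenceTransformer from a plain BERT encoder with CLS pooling.
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# TSDAE: corrupt each in-domain sentence and train the encoder so that a
# tied decoder can reconstruct the original.
train_sentences = [
    "2023-04-01 12:00:01 node-3 disk usage at 98%",   # placeholder log lines
    "2023-04-01 12:00:05 auth failure for user admin",
]
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)
model.save("tsdae-domain-adapted")
```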

vahuja4 commented 1 year ago

@cassin-edwin - thank you for your reply! If you don't mind, can you please clarify the following questions:

  1. The sample code uses BERT as both the encoder and the decoder to do the domain adaptation. I am a little confused about that, because I thought the whole idea behind sentence transformers is to generate sentence embeddings directly, as opposed to averaging the word embeddings coming out of an LLM.
  2. https://www.sbert.net/docs/pretrained_models.html shows all the models that have been trained to produce sentence embeddings using sentence pairs (Siamese style). Would you know if we can use TSDAE to adapt these to our dataset? If so, can you please point to sample code?

cassin-edwin commented 1 year ago

@vahuja4

  1. In the embedding space, I believe words are positioned based on their semantics and syntactic structure and are each given a vector. So, as I understand it, the concept behind sentence transformers is that the sentence embedding is the average of all the word embeddings/vectors (see the sketch after this list). I don't know how a sentence would be given a vector directly.

  2. I used a bert-base-uncased model to do the domain adaptation. However, you can try it on any pre-trained model that you think was trained on data relevant to your fine-tuning dataset. This is just my opinion; look for more suggestions on this.
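To illustrate the averaging described in point 1, here is a small sketch of mean pooling token embeddings into a single sentence vector; the model name and sentence are only for illustration, and actual sentence transformer models may use different pooling strategies (mean, CLS, etc.):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentence = "SetFit makes few-shot learning efficient."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)

# Mean-pool over real tokens only (padding masked out) to get one vector.
mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```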