omriel1 opened this issue 4 weeks ago
Note: these results use non-normalized embeddings! See the next comment.
First evaluation results, using the sklearn_classifier.py script: https://github.com/omriel1/llm2vec/blob/development/nlp_course/sklearn_classifier.py
See https://github.com/omriel1/llm2vec/blob/development/nlp_course/experiments/v1/results.json
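For reference, the pipeline behind these numbers is essentially "fit a scikit-learn classifier on precomputed sentence embeddings and print a classification report". A minimal sketch (variable names and the default hyperparameters here are illustrative, not taken from the script):

```python
# Minimal sketch: fit a logistic regression on precomputed sentence embeddings
# (2D arrays, one row per text) and print accuracy plus the per-class report.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

def evaluate(train_emb, train_labels, test_emb, test_labels):
    clf = LogisticRegression()  # hyperparameters for specific runs are given where relevant
    clf.fit(train_emb, train_labels)
    preds = clf.predict(test_emb)
    print("accuracy:", accuracy_score(test_labels, preds))
    print(classification_report(test_labels, preds))  # Negative / Neutral / Positive
```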
accuracy: 0.7723502304147466, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.70 | 0.75 | 0.72 | 433 |
| Neutral | 0.82 | 0.80 | 0.81 | 1233 |
| Positive | 0.73 | 0.72 | 0.73 | 503 |
| accuracy | | | 0.77 | 2170 |
| macro avg | 0.56 | 0.57 | 0.56 | 2170 |
| weighted avg | 0.77 | 0.77 | 0.77 | 2170 |

accuracy: 0.5682027649769585, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.00 | 0.00 | 0.00 | 433 |
| Neutral | 0.57 | 1.00 | 0.72 | 1233 |
| Positive | 0.00 | 0.00 | 0.00 | 503 |
| accuracy | | | 0.57 | 2170 |
| macro avg | 0.14 | 0.25 | 0.18 | 2170 |
| weighted avg | 0.32 | 0.57 | 0.41 | 2170 |

accuracy: 0.22995391705069124, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.19 | 0.25 | 0.22 | 433 |
| Neutral | 0.53 | 0.23 | 0.32 | 1233 |
| Positive | 0.20 | 0.21 | 0.20 | 503 |
| accuracy | | | 0.23 | 2170 |
| macro avg | 0.23 | 0.17 | 0.19 | 2170 |
| weighted avg | 0.39 | 0.23 | 0.27 | 2170 |
An additional run with normalized embeddings (following https://github.com/mistralai/mistral-inference/blob/main/tutorials/classifier.ipynb):
LogisticRegression with 20 iterations (max_iter=20), C=1.0, random_state=0.
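The normalization itself can be as simple as row-wise L2 normalization before fitting; a sketch under the assumption that the embeddings are plain numpy arrays (the helper name is illustrative):

```python
# Sketch: L2-normalize each embedding vector, then fit the classifier with the
# configuration above (max_iter=20, C=1.0, random_state=0).
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

def fit_normalized(train_emb, train_labels):
    clf = LogisticRegression(max_iter=20, C=1.0, random_state=0)
    clf.fit(normalize(train_emb), train_labels)  # normalize() defaults to row-wise L2
    return clf
```

At prediction time the test embeddings are normalized the same way.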
accuracy: 0.8419354838709677, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.82 | 0.81 | 0.82 | 433 |
| Neutral | 0.85 | 0.89 | 0.87 | 1234 |
| Positive | 0.83 | 0.75 | 0.79 | 503 |
| accuracy | | | 0.84 | 2170 |
| macro avg | 0.83 | 0.82 | 0.83 | 2170 |
| weighted avg | 0.84 | 0.84 | 0.84 | 2170 |

accuracy: 0.5686635944700461, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.00 | 0.00 | 0.00 | 433 |
| Neutral | 0.57 | 1.00 | 0.73 | 1234 |
| Positive | 0.00 | 0.00 | 0.00 | 503 |
| accuracy | | | 0.57 | 2170 |
| macro avg | 0.19 | 0.33 | 0.24 | 2170 |
| weighted avg | 0.32 | 0.57 | 0.41 | 2170 |

accuracy: 0.35069124423963133, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.21 | 0.36 | 0.26 | 433 |
| Neutral | 0.59 | 0.35 | 0.44 | 1234 |
| Positive | 0.26 | 0.36 | 0.30 | 503 |
| accuracy | | | 0.35 | 2170 |
| macro avg | 0.35 | 0.35 | 0.33 | 2170 |
| weighted avg | 0.44 | 0.35 | 0.37 | 2170 |
Additional results from training a classification head on top of the frozen model (the mntp-simcse model).
That is, the model itself is frozen, and we train a neural-network classification head with the following architecture:
```python
self.linear_layer_stack = nn.Sequential(
    nn.Linear(self.embedding_dim, self.hidden_units, bias=False),
    nn.ReLU(),
    nn.Linear(self.hidden_units, self.num_labels, bias=False),
)
```
The output logits are then passed through a softmax to produce class probabilities.
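For context, a minimal self-contained sketch of such a head (only the layer stack itself is taken from the description above; the embedding dimension and training hyperparameters below are assumptions):

```python
# Sketch of the classification head trained on top of the frozen encoder.
# Assumptions: embedding_dim matches the frozen model's hidden size, and the
# softmax is applied implicitly through CrossEntropyLoss during training.
import torch
from torch import nn

class ClassificationHead(nn.Module):
    def __init__(self, embedding_dim: int, hidden_units: int, num_labels: int):
        super().__init__()
        self.linear_layer_stack = nn.Sequential(
            nn.Linear(embedding_dim, hidden_units, bias=False),
            nn.ReLU(),
            nn.Linear(hidden_units, num_labels, bias=False),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.linear_layer_stack(embeddings)  # raw logits

# Only the head's parameters are optimized; the encoder stays frozen.
head = ClassificationHead(embedding_dim=4096, hidden_units=8, num_labels=3)  # embedding_dim assumed
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # learning rate is an assumption
```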
After 10 epochs of training we got these poor results: accuracy: 0.48847926267281105, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.30 | 0.91 | 0.45 | 433 |
| Neutral | 0.79 | 0.52 | 0.63 | 1234 |
| Positive | 0.72 | 0.04 | 0.08 | 503 |
| accuracy | | | 0.49 | 2170 |
| macro avg | 0.60 | 0.49 | 0.39 | 2170 |
| weighted avg | 0.68 | 0.49 | 0.47 | 2170 |
Comparison to other Hebrew encoders:
Hebrew encoders:
-> Note that both of these models are BERT-based and hence output a vector embedding for each token. As usually done (see Jay Alammar's blog linked above, and as learned in class), text classification in this setting uses the embedding of the first token, the [CLS] token (see the sketch below).
Multilingual:
Fine-tuned Hebrew sentiment analysis models
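As mentioned in the note above, classification with those BERT-style encoders is done from the [CLS] embedding. A minimal sketch with HuggingFace transformers (the checkpoint name is just an example, not necessarily one of the models compared here):

```python
# Sketch: extract [CLS] embeddings from a BERT-style Hebrew encoder; these can
# then be fed to the same logistic regression classifier as above.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "onlplab/alephbert-base"  # example Hebrew encoder (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def cls_embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :]  # embedding of the first ([CLS]) token
```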
A LogisticRegression classifier with max_iter of 20 gets the best results for our llm2vec model.
accuracy: 0.840552995391705, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.82 | 0.80 | 0.81 | 433 |
| Neutral | 0.86 | 0.88 | 0.87 | 1234 |
| Positive | 0.82 | 0.77 | 0.79 | 503 |
| accuracy | | | 0.84 | 2170 |
| macro avg | 0.83 | 0.82 | 0.82 | 2170 |
| weighted avg | 0.84 | 0.84 | 0.84 | 2170 |
accuracy: 0.5686635944700461, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.00 | 0.00 | 0.00 | 433 |
| Neutral | 0.57 | 1.00 | 0.73 | 1234 |
| Positive | 0.00 | 0.00 | 0.00 | 503 |
| accuracy | | | 0.57 | 2170 |
| macro avg | 0.19 | 0.33 | 0.24 | 2170 |
| weighted avg | 0.32 | 0.57 | 0.41 | 2170 |

accuracy: 0.35069124423963133, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.21 | 0.36 | 0.26 | 433 |
| Neutral | 0.59 | 0.35 | 0.44 | 1234 |
| Positive | 0.26 | 0.36 | 0.30 | 503 |
| accuracy | | | 0.35 | 2170 |
| macro avg | 0.35 | 0.35 | 0.33 | 2170 |
| weighted avg | 0.44 | 0.35 | 0.37 | 2170 |
Additional results from training a classification head on top of the frozen model (the mntp-simcse model).
That is, the model itself is frozen, and we train a neural-network classification head with the following architecture (we used hidden_units=8):
```python
self.linear_layer_stack = nn.Sequential(
    nn.Linear(self.embedding_dim, self.hidden_units, bias=False),
    nn.ReLU(),
    nn.Linear(self.hidden_units, self.num_labels, bias=False),
)
```
The output logits are then passed through a softmax to produce class probabilities.
After 10 epochs of training we got these poor results: accuracy: 0.48847926267281105, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.30 | 0.91 | 0.45 | 433 |
| Neutral | 0.79 | 0.52 | 0.63 | 1234 |
| Positive | 0.72 | 0.04 | 0.08 | 503 |
| accuracy | | | 0.49 | 2170 |
| macro avg | 0.60 | 0.49 | 0.39 | 2170 |
| weighted avg | 0.68 | 0.49 | 0.47 | 2170 |
All results here are for the same logistic regression classifiers, with the same configurations as above.
accuracy: 0.7889400921658987, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.76 | 0.70 | 0.73 | 433 |
| Neutral | 0.81 | 0.87 | 0.84 | 1234 |
| Positive | 0.76 | 0.67 | 0.71 | 503 |
| accuracy | | | 0.79 | 2170 |
| macro avg | 0.78 | 0.75 | 0.76 | 2170 |
| weighted avg | 0.79 | 0.79 | 0.79 | 2170 |

accuracy: 0.7912442396313364, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.80 | 0.70 | 0.74 | 433 |
| Neutral | 0.80 | 0.87 | 0.83 | 1234 |
| Positive | 0.76 | 0.67 | 0.71 | 503 |
| accuracy | | | 0.79 | 2170 |
| macro avg | 0.79 | 0.75 | 0.76 | 2170 |
| weighted avg | 0.79 | 0.79 | 0.79 | 2170 |

accuracy: 0.8064516129032258, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.81 | 0.75 | 0.78 | 433 |
| Neutral | 0.82 | 0.86 | 0.84 | 1234 |
| Positive | 0.76 | 0.72 | 0.74 | 503 |
| accuracy | | | 0.81 | 2170 |
| macro avg | 0.80 | 0.78 | 0.79 | 2170 |
| weighted avg | 0.81 | 0.81 | 0.81 | 2170 |

accuracy: 0.7903225806451613, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.76 | 0.73 | 0.75 | 433 |
| Neutral | 0.80 | 0.86 | 0.83 | 1234 |
| Positive | 0.79 | 0.67 | 0.72 | 503 |
| accuracy | | | 0.79 | 2170 |
| macro avg | 0.78 | 0.75 | 0.77 | 2170 |
| weighted avg | 0.79 | 0.79 | 0.79 | 2170 |
Original Task
Citing from the original course task:
See this GitHub issue: https://github.com/UKPLab/sentence-transformers/issues/2547#issuecomment-2020153378 and read https://huggingface.co/docs/setfit/conceptual_guides/setfit#classifier-training-phase
Data
Hebrew sentiment analysis dataset - https://huggingface.co/datasets/HebArabNlpProject/HebrewSentiment
We chose this dataset because the Hebrew sentiment benchmarks used for AlephBERT etc. were shown to be leaked.
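Loading it from the Hub is straightforward; a sketch, assuming the dataset loads with its default configuration (split and column names should be checked against the dataset card):

```python
# Sketch: load the Hebrew sentiment dataset and inspect its splits and columns.
from datasets import load_dataset

ds = load_dataset("HebArabNlpProject/HebrewSentiment")
print(ds)
```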
Classifier
We used the recommended approach of training a Logistic Regression classifier on top of the model embeddings, as recommended in: