omriel1 opened this issue 4 weeks ago
Note: these results use non-normalized embeddings! See the next comment.
First evaluation results, using the sklearn_classifier.py script: https://github.com/omriel1/llm2vec/blob/development/nlp_course/sklearn_classifier.py
See https://github.com/omriel1/llm2vec/blob/development/nlp_course/experiments/v1/results.json
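For reference, the pipeline behind these numbers is essentially "fit a scikit-learn classifier on precomputed sentence embeddings and print a classification report". A minimal sketch (variable names and the default hyperparameters here are illustrative, not taken from the script):

```python
# Minimal sketch: fit a logistic regression on precomputed sentence embeddings
# (2D arrays, one row per text) and print accuracy plus the per-class report.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

def evaluate(train_emb, train_labels, test_emb, test_labels):
    clf = LogisticRegression()  # hyperparameters for specific runs are given where relevant
    clf.fit(train_emb, train_labels)
    preds = clf.predict(test_emb)
    print("accuracy:", accuracy_score(test_labels, preds))
    print(classification_report(test_labels, preds))  # Negative / Neutral / Positive
```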
accuracy: 0.7723502304147466, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.70 | 0.75 | 0.72 | 433 |
| Neutral | 0.82 | 0.80 | 0.81 | 1233 |
| Positive | 0.73 | 0.72 | 0.73 | 503 |
| accuracy | | | 0.77 | 2170 |
| macro avg | 0.56 | 0.57 | 0.56 | 2170 |
| weighted avg | 0.77 | 0.77 | 0.77 | 2170 |

accuracy: 0.5682027649769585, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.00 | 0.00 | 0.00 | 433 |
| Neutral | 0.57 | 1.00 | 0.72 | 1233 |
| Positive | 0.00 | 0.00 | 0.00 | 503 |
| accuracy | | | 0.57 | 2170 |
| macro avg | 0.14 | 0.25 | 0.18 | 2170 |
| weighted avg | 0.32 | 0.57 | 0.41 | 2170 |

accuracy: 0.22995391705069124, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.19 | 0.25 | 0.22 | 433 |
| Neutral | 0.53 | 0.23 | 0.32 | 1233 |
| Positive | 0.20 | 0.21 | 0.20 | 503 |
| accuracy | | | 0.23 | 2170 |
| macro avg | 0.23 | 0.17 | 0.19 | 2170 |
| weighted avg | 0.39 | 0.23 | 0.27 | 2170 |
An additional run with normalized embeddings (following https://github.com/mistralai/mistral-inference/blob/main/tutorials/classifier.ipynb):
LogisticRegression with 20 iterations (max_iter=20), C=1.0, random_state=0.
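The normalization itself can be as simple as row-wise L2 normalization before fitting; a sketch under the assumption that the embeddings are plain numpy arrays (the helper name is illustrative):

```python
# Sketch: L2-normalize each embedding vector, then fit the classifier with the
# configuration above (max_iter=20, C=1.0, random_state=0).
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

def fit_normalized(train_emb, train_labels):
    clf = LogisticRegression(max_iter=20, C=1.0, random_state=0)
    clf.fit(normalize(train_emb), train_labels)  # normalize() defaults to row-wise L2
    return clf
```

At prediction time the test embeddings are normalized the same way.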
accuracy: 0.8419354838709677, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.82 | 0.81 | 0.82 | 433 |
| Neutral | 0.85 | 0.89 | 0.87 | 1234 |
| Positive | 0.83 | 0.75 | 0.79 | 503 |
| accuracy | | | 0.84 | 2170 |
| macro avg | 0.83 | 0.82 | 0.83 | 2170 |
| weighted avg | 0.84 | 0.84 | 0.84 | 2170 |

accuracy: 0.5686635944700461, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.00 | 0.00 | 0.00 | 433 |
| Neutral | 0.57 | 1.00 | 0.73 | 1234 |
| Positive | 0.00 | 0.00 | 0.00 | 503 |
| accuracy | | | 0.57 | 2170 |
| macro avg | 0.19 | 0.33 | 0.24 | 2170 |
| weighted avg | 0.32 | 0.57 | 0.41 | 2170 |

accuracy: 0.35069124423963133, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.21 | 0.36 | 0.26 | 433 |
| Neutral | 0.59 | 0.35 | 0.44 | 1234 |
| Positive | 0.26 | 0.36 | 0.30 | 503 |
| accuracy | | | 0.35 | 2170 |
| macro avg | 0.35 | 0.35 | 0.33 | 2170 |
| weighted avg | 0.44 | 0.35 | 0.37 | 2170 |
Additional results from training a classification head on top of the frozen model (the mntp-simcse model).
That is, the model itself is frozen, and we train a neural-network classification head with the following architecture:
```python
self.linear_layer_stack = nn.Sequential(
    nn.Linear(self.embedding_dim, self.hidden_units, bias=False),
    nn.ReLU(),
    nn.Linear(self.hidden_units, self.num_labels, bias=False),
)
```
The output logits are then passed through a softmax to produce class probabilities.
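For context, a minimal self-contained sketch of such a head (only the layer stack itself is taken from the description above; the embedding dimension and training hyperparameters below are assumptions):

```python
# Sketch of the classification head trained on top of the frozen encoder.
# Assumptions: embedding_dim matches the frozen model's hidden size, and the
# softmax is applied implicitly through CrossEntropyLoss during training.
import torch
from torch import nn

class ClassificationHead(nn.Module):
    def __init__(self, embedding_dim: int, hidden_units: int, num_labels: int):
        super().__init__()
        self.linear_layer_stack = nn.Sequential(
            nn.Linear(embedding_dim, hidden_units, bias=False),
            nn.ReLU(),
            nn.Linear(hidden_units, num_labels, bias=False),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.linear_layer_stack(embeddings)  # raw logits

# Only the head's parameters are optimized; the encoder stays frozen.
head = ClassificationHead(embedding_dim=4096, hidden_units=8, num_labels=3)  # embedding_dim assumed
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # learning rate is an assumption
```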
After 10 epochs of training we got these poor results: accuracy: 0.48847926267281105, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.30 | 0.91 | 0.45 | 433 |
| Neutral | 0.79 | 0.52 | 0.63 | 1234 |
| Positive | 0.72 | 0.04 | 0.08 | 503 |
| accuracy | | | 0.49 | 2170 |
| macro avg | 0.60 | 0.49 | 0.39 | 2170 |
| weighted avg | 0.68 | 0.49 | 0.47 | 2170 |
Comparison to other Hebrew encoders:
Hebrew encoders:
-> Note that both of these models are BERT-based and hence output a vector embedding for each token. As usually done (see Jay Alammar's blog linked above, and as learned in class), text classification in this setting uses the embedding of the first token, the [CLS] token (see the sketch below).
Multilingual:
Fine-tuned Hebrew sentiment analysis models
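As mentioned in the note above, classification with those BERT-style encoders is done from the [CLS] embedding. A minimal sketch with HuggingFace transformers (the checkpoint name is just an example, not necessarily one of the models compared here):

```python
# Sketch: extract [CLS] embeddings from a BERT-style Hebrew encoder; these can
# then be fed to the same logistic regression classifier as above.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "onlplab/alephbert-base"  # example Hebrew encoder (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def cls_embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :]  # embedding of the first ([CLS]) token
```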
A LogisticRegression classifier with max_iter of 20 gets the best results for our llm2vec model.
accuracy: 0.840552995391705, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.82 | 0.80 | 0.81 | 433 |
| Neutral | 0.86 | 0.88 | 0.87 | 1234 |
| Positive | 0.82 | 0.77 | 0.79 | 503 |
| accuracy | | | 0.84 | 2170 |
| macro avg | 0.83 | 0.82 | 0.82 | 2170 |
| weighted avg | 0.84 | 0.84 | 0.84 | 2170 |
accuracy: 0.5686635944700461, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.00 | 0.00 | 0.00 | 433 |
| Neutral | 0.57 | 1.00 | 0.73 | 1234 |
| Positive | 0.00 | 0.00 | 0.00 | 503 |
| accuracy | | | 0.57 | 2170 |
| macro avg | 0.19 | 0.33 | 0.24 | 2170 |
| weighted avg | 0.32 | 0.57 | 0.41 | 2170 |

accuracy: 0.35069124423963133, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.21 | 0.36 | 0.26 | 433 |
| Neutral | 0.59 | 0.35 | 0.44 | 1234 |
| Positive | 0.26 | 0.36 | 0.30 | 503 |
| accuracy | | | 0.35 | 2170 |
| macro avg | 0.35 | 0.35 | 0.33 | 2170 |
| weighted avg | 0.44 | 0.35 | 0.37 | 2170 |
Additional results from training a classification head on top of the frozen model (the mntp-simcse model).
That is, the model itself is frozen, and we train a neural-network classification head with the following architecture (we used hidden_units=8):
```python
self.linear_layer_stack = nn.Sequential(
    nn.Linear(self.embedding_dim, self.hidden_units, bias=False),
    nn.ReLU(),
    nn.Linear(self.hidden_units, self.num_labels, bias=False),
)
```
The output logits are then passed through a softmax to produce class probabilities.
After 10 epochs of training we got these poor results: accuracy: 0.48847926267281105, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.30 | 0.91 | 0.45 | 433 |
| Neutral | 0.79 | 0.52 | 0.63 | 1234 |
| Positive | 0.72 | 0.04 | 0.08 | 503 |
| accuracy | | | 0.49 | 2170 |
| macro avg | 0.60 | 0.49 | 0.39 | 2170 |
| weighted avg | 0.68 | 0.49 | 0.47 | 2170 |
All results here are for the same logistic regression classifiers, with the same configurations as above.
accuracy: 0.7889400921658987, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.76 | 0.70 | 0.73 | 433 |
| Neutral | 0.81 | 0.87 | 0.84 | 1234 |
| Positive | 0.76 | 0.67 | 0.71 | 503 |
| accuracy | | | 0.79 | 2170 |
| macro avg | 0.78 | 0.75 | 0.76 | 2170 |
| weighted avg | 0.79 | 0.79 | 0.79 | 2170 |

accuracy: 0.7912442396313364, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.80 | 0.70 | 0.74 | 433 |
| Neutral | 0.80 | 0.87 | 0.83 | 1234 |
| Positive | 0.76 | 0.67 | 0.71 | 503 |
| accuracy | | | 0.79 | 2170 |
| macro avg | 0.79 | 0.75 | 0.76 | 2170 |
| weighted avg | 0.79 | 0.79 | 0.79 | 2170 |

accuracy: 0.8064516129032258, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.81 | 0.75 | 0.78 | 433 |
| Neutral | 0.82 | 0.86 | 0.84 | 1234 |
| Positive | 0.76 | 0.72 | 0.74 | 503 |
| accuracy | | | 0.81 | 2170 |
| macro avg | 0.80 | 0.78 | 0.79 | 2170 |
| weighted avg | 0.81 | 0.81 | 0.81 | 2170 |

accuracy: 0.7903225806451613, classification_report:

| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Negative | 0.76 | 0.73 | 0.75 | 433 |
| Neutral | 0.80 | 0.86 | 0.83 | 1234 |
| Positive | 0.79 | 0.67 | 0.72 | 503 |
| accuracy | | | 0.79 | 2170 |
| macro avg | 0.78 | 0.75 | 0.77 | 2170 |
| weighted avg | 0.79 | 0.79 | 0.79 | 2170 |
Original Task
Citing from the original course task:
See this GitHub issue: https://github.com/UKPLab/sentence-transformers/issues/2547#issuecomment-2020153378 and read https://huggingface.co/docs/setfit/conceptual_guides/setfit#classifier-training-phase
Data
Hebrew sentiment analysis dataset - https://huggingface.co/datasets/HebArabNlpProject/HebrewSentiment
We chose this dataset because the Hebrew sentiment benchmarks used for AlephBERT etc. were shown to be leaked.
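Loading it from the Hub is straightforward; a sketch, assuming the dataset loads with its default configuration (split and column names should be checked against the dataset card):

```python
# Sketch: load the Hebrew sentiment dataset and inspect its splits and columns.
from datasets import load_dataset

ds = load_dataset("HebArabNlpProject/HebrewSentiment")
print(ds)
```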
Classifier
We used the recommended approach of training a Logistic Regression classifier on top of the model embeddings, as recommended in: