omriel1 / llm2vec

Applying 'LLM2Vec' onto 'dictaLM2.0'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Applying LLM2Vec to DictaLM #3

Open omriel1 opened 1 month ago

omriel1 commented 1 month ago

In order to apply LLM2Vec to DictaLM we need to: identify the base model to start from, choose training data for the MNTP and SimCSE steps, and verify that the bidirectional-attention modification supports DictaLM's architecture. Each of these is covered in the comments below.

omriel1 commented 1 month ago

Identifying base model

Looking at the paper (section 2.2, "Models"), at their Hugging Face page, and at the code, it seems the technique was applied to chat/instruct models:

  1. Llama-2-7B-chat
  2. Mistral-7B-Instruct-v0.2
  3. Meta-Llama-3-8B-Instruct
  4. Qwen/Qwen2-7B-Instruct

Hence, a reasonable choice is to use the instruct version of DictaLM: https://huggingface.co/dicta-il/dictalm2.0-instruct
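
As a quick sanity check, something like the following sketch (standard transformers APIs only; the expected values in the comments are assumptions, and the architecture is checked more thoroughly in a later comment) can confirm the instruct checkpoint loads:

from transformers import AutoConfig, AutoTokenizer

model_id = "dicta-il/dictalm2.0-instruct"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.model_type)                    # expected: "mistral"
print(tokenizer.chat_template is not None)  # instruct checkpoints should ship a chat template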

omriel1 commented 1 month ago

Data

As the paper states (section 2.2, "Training data"):

We perform both the MNTP and the unsupervised SimCSE step using data from English Wikipedia. We select data from Wikipedia as it is presumably included in the pre-training mixture of all the models we experiment with. It is therefore fair to assume that these two adaptation steps are not teaching the model any new knowledge beyond how to attend to future tokens and how to construct sequence representations

Hence, it's reasonable to use data that the model has already seen during pre-training. Relevant Hebrew corpus collections:

  1. https://github.com/NNLP-IL/Hebrew-Resources/blob/master/corpora_and_data_resources.rst
  2. https://resources.nnlp-il.mafat.ai/

Note that reproducing the exact data used in the paper can be hard, as it only mentions the datasets in general terms, together with the processing applied to make them suitable for training (deduplication etc.). Hence, it'll be easier to start with data that is already filtered and cleaned, even if it's not exactly what was used for training (a loading sketch follows the list). See:

  1. HeDC4, used for HeRo {Apache License 2.0} - a Hebrew Deduplicated and Cleaned Common Crawl Corpus; a thoroughly cleaned and approximately deduplicated dataset for unsupervised learning.
  2. The Wikipedia corpora used for AlephBERT {Apache License 2.0} - all the texts in Hebrew Wikipedia were extracted to pre-train OnlpLab's AlephBERT, using Attardi's WikiExtractor.
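
For instance, a minimal loading sketch with the datasets library (the Hub identifiers below - wikimedia/wikipedia with the 20231101.he config and HeNLP/HeDC4 - are assumptions and should be verified before use):

from datasets import load_dataset

# Hebrew Wikipedia dump (config name is an assumption; check the available dumps on the Hub).
he_wiki = load_dataset("wikimedia/wikipedia", "20231101.he", split="train")
print(he_wiki[0]["text"][:200])

# HeDC4 (Hub path is an assumption; streaming avoids downloading the full corpus up front).
hedc4 = load_dataset("HeNLP/HeDC4", split="train", streaming=True)
print(next(iter(hedc4)))
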
omriel1 commented 1 month ago

Applying bidirectional encoding

LLM2Vec provides modified (bidirectional) architectures for only 4 model families:

  1. llama
  2. gemma
  3. mistral
  4. qwen

Hopefully, DictaLM2.0 is a fine-tuned version of one of the models mentioned above, in which case the scripts will work "out of the box".

As the DictaLM2.0 paper states:

We chose to initialize our model from the Mistral-7B-v0.1 checkpoint

Indeed, we can check the model's underlying architecture:

from transformers import AutoConfig

# get_model_class (from the repo's experiments/run_mntp.py) maps a Hugging Face
# config to the corresponding LLM2Vec bidirectional model class.
from experiments.run_mntp import get_model_class


def main():
    # Loading only the config is enough to determine the underlying architecture.
    config = AutoConfig.from_pretrained("dicta-il/dictalm2.0-instruct")
    print(type(config).__name__)

    # The LLM2Vec model class that run_mntp.py would instantiate for this config.
    model_class = get_model_class(config)
    print(model_class)


if __name__ == "__main__":
    main()

>> MistralConfig
>> <class 'llm2vec.models.bidirectional_mistral.MistralBiForMNTP'>

Hence, it's supported out of the box by the run_mntp.py script.
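
Beyond the MNTP script, we could also wrap the model directly with the LLM2Vec class as a smoke test for the bidirectional conversion. This is a minimal sketch assuming the constructor arguments from the upstream README (enable_bidirectional, pooling_mode, max_length, etc.) apply unchanged to dictalm2.0-instruct; the embeddings are not expected to be useful before the MNTP/SimCSE adaptation:

import torch
from llm2vec import LLM2Vec

# Wrap the instruct model with LLM2Vec's bidirectional Mistral implementation
# (argument names follow the upstream README; treat them as assumptions until verified).
l2v = LLM2Vec.from_pretrained(
    "dicta-il/dictalm2.0-instruct",
    enable_bidirectional=True,
    pooling_mode="mean",
    max_length=512,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# Encode a couple of Hebrew sentences and check the output shape.
embeddings = l2v.encode(["שלום עולם", "מה שלומך היום?"])
print(embeddings.shape)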

omriel1 commented 2 weeks ago

A note regarding SimCSE data: the original paper used Wiki1M, see https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/viewer/default/train?p=2&row=208. Note that the sentences are relatively short (see the distribution and the samples), as opposed to the data they used for MNTP training (https://huggingface.co/datasets/Salesforce/wikitext?row=47). This makes sense, since we want relatively short sentences anyway and we have a limitation on the input size.
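
Given that, one option is to build a Wiki1M-style file of short Hebrew sentences ourselves. A rough sketch (the dataset identifier/config, the naive regex sentence splitting, and the length thresholds are all assumptions to tune):

import random
import re

from datasets import load_dataset

# Hebrew Wikipedia dump (identifier/config are assumptions; verify on the Hub).
wiki = load_dataset("wikimedia/wikipedia", "20231101.he", split="train")

sentences = []
for article in wiki.select(range(min(50_000, len(wiki)))):  # subsample articles to keep the sketch cheap
    for sent in re.split(r"(?<=[.!?])\s+", article["text"]):
        sent = sent.strip()
        if 20 <= len(sent) <= 200:  # keep relatively short sentences, as in Wiki1M
            sentences.append(sent)

random.seed(42)
random.shuffle(sentences)
with open("he_wiki1m.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences[:1_000_000]))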