Looking at the paper (Section 2.2, "Models"), at their Hugging Face models, and at their code, it seems the technique was applied to chat/instruct models.
Hence, a reasonable choice is to use the instruct version of DictaLM: https://huggingface.co/dicta-il/dictalm2.0-instruct
As the paper states (Section 2.2, "Training data"):

> We perform both the MNTP and the unsupervised SimCSE step using data from English Wikipedia. We select data from Wikipedia as it is presumably included in the pre-training mixture of all the models we experiment with. It is therefore fair to assume that these two adaptation steps are not teaching the model any new knowledge beyond how to attend to future tokens and how to construct sequence representations.
Hence, it's reasonable to use data that the model has already seen during training.
Note that using the exact data from the paper can be hard: the paper only mentions the datasets in general terms, along with the processing applied to make them suitable for training (deduplication, etc.). It will therefore be easier to start from data that is already filtered and cleaned, even if it's not exactly what was used for training (see, e.g., the wikitext dataset linked below).
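For illustration, here is a minimal sketch of pulling such a pre-cleaned corpus with the `datasets` library. The dataset and config names below are the standard wikitext ones (linked later in this issue), not necessarily the exact data the LLM2Vec authors used:

```python
from datasets import load_dataset

# Pre-cleaned English Wikipedia text (wikitext-103, raw variant).
# This is an illustrative choice; any deduplicated/cleaned Wikipedia dump
# the model has already seen would serve the MNTP adaptation step.
wiki = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")

# Drop empty lines and section-header lines so only prose paragraphs remain.
wiki = wiki.filter(
    lambda ex: len(ex["text"].strip()) > 0 and not ex["text"].strip().startswith("=")
)

print(len(wiki), wiki[0]["text"][:200])
```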
LLM2Vec provides modified (bidirectional) architectures for 4 model families only:
- LLaMA
- Gemma
- Mistral
- Qwen
Hopefully, DictaLM2.0 is a fine-tuned version of one of the models mentioned above, in which case the scripts will work "out of the box".
As the DictaLM2.0 paper states:

> We chose to initialize our model from the Mistral-7B-v0.1 checkpoint
Indeed, we can check the model's underlying architecture:

```python
from transformers import AutoConfig
from experiments.run_mntp import get_model_class


def main():
    # Load only the config (no weights) and ask LLM2Vec which bidirectional
    # model class it maps to.
    config = AutoConfig.from_pretrained("dicta-il/dictalm2.0-instruct")
    model_class = get_model_class(config)
    print(config.__class__.__name__)
    print(model_class)


if __name__ == "__main__":
    main()
```

```
>> MistralConfig
>> <class 'llm2vec.models.bidirectional_mistral.MistralBiForMNTP'>
```
Hence it's supported out of the box by the `run_mntp.py` script.
A note regarding the SimCSE data:
The original paper used Wiki1M, see https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/viewer/default/train?p=2&row=208
Note that the sentences are relatively short (see the distribution and the samples), as opposed to the data they used for MNTP training (https://huggingface.co/datasets/Salesforce/wikitext?row=47). This is reasonable, since we want relatively short sentences anyway and we have a limitation on the input size.
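For completeness, a small sketch of how one could load that Wiki1M file and eyeball the sentence lengths. The file name (`wiki1m_for_simcse.txt`) is taken from the SimCSE release and should be verified against the dataset page above:

```python
from datasets import load_dataset

# Wiki1M sentences used for unsupervised SimCSE (one sentence per line).
# The exact file name/URL is an assumption based on the SimCSE download scripts.
wiki1m = load_dataset(
    "text",
    data_files="https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt",
    split="train",
)

# Rough whitespace-token length distribution over a sample, to confirm these
# sentences are much shorter than the wikitext paragraphs used for MNTP.
sample = wiki1m.select(range(10_000))
lengths = sorted(len(ex["text"].split()) for ex in sample)
print("median:", lengths[len(lengths) // 2], "p95:", lengths[int(len(lengths) * 0.95)])
```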
In order to apply LLM2Vec to DictaLM we need to:
- Run the `run_mntp` script against the model, with a suitable configuration file (see the resources above), as sketched below.
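Here is a sketch of what that could look like, assuming the JSON-config convention used by the MNTP configs in the LLM2Vec repo (`train_configs/mntp/`). The config file name is hypothetical, and the field values should be checked against an existing config such as the Mistral one:

```python
import json

# Hypothetical MNTP config for DictaLM2.0-instruct, modeled on the Mistral
# config shipped with LLM2Vec. run_mntp.py builds on the standard HF
# masked-LM training arguments, so these field names should be familiar,
# but verify them against a shipped config before running.
mntp_config = {
    "model_name_or_path": "dicta-il/dictalm2.0-instruct",
    "dataset_name": "Salesforce/wikitext",
    "dataset_config_name": "wikitext-103-raw-v1",
    "do_train": True,
    "do_eval": True,
    "max_seq_length": 512,
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "output_dir": "output/mntp/dictalm2.0-instruct",
}

with open("train_configs/mntp/DictaLM.json", "w") as f:
    json.dump(mntp_config, f, indent=4)

# Then, from the LLM2Vec repo root:
#   python experiments/run_mntp.py train_configs/mntp/DictaLM.json
```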