jphme / EM_German

Repository for the EM German Model

What's the code you use to continue pretraining in German? #5

Closed nps798 closed 11 months ago

nps798 commented 11 months ago

Hi, thanks for all your work! I am wondering whether you would be willing to share the code you use to continue pretraining and fine-tuning in German? Do you extend the original tokenizer vocabulary?

πŸ™πŸ™πŸ‘

jphme commented 11 months ago

Hi, for the pretraining please have a look at LeoLM. They will also publish a paper with all the details on the training soon. (I also did some pretraining for the 7b Llama2 model, but only a fraction of LeoLM's.)
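
For readers looking for a starting point, a minimal sketch of continued causal-LM pretraining on German text with Hugging Face Transformers might look like the following. The base model, dataset, and hyperparameters are illustrative assumptions only, not the setup actually used for EM German or LeoLM.

```python
# Hedged sketch of continued causal-LM pretraining on German text.
# Model name, dataset, and hyperparameters are placeholder assumptions,
# NOT the actual EM German / LeoLM training configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"   # assumed base model (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any large German text corpus works here; the German OSCAR split is one option.
raw = load_dataset("oscar", "unshuffled_deduplicated_de", split="train[:1%]")

def tokenize(batch):
    # Plain next-token prediction: just tokenize the raw German text.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-de-continued",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM labels (shifted input ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```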

We didn't extend the original tokenizer vocabulary. The Llama2 tokenizer is actually not well suited for German text, and I was surprised myself that the pretrained Mistral model is able to generate such good text despite this (though I don't know whether that is different for non-Romance languages).
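
A rough way to see how well a tokenizer handles German is to compare tokens-per-word (fertility) on a German sample. The snippet below is a hedged sketch; the model identifiers are just examples (the official Llama 2 checkpoint is gated on the Hub), and the numbers say nothing about the models EM German actually uses.

```python
# Compare how many tokens different tokenizers need for the same German sentence.
# Higher tokens-per-word (fertility) roughly means the tokenizer is less
# efficient for German. Model names are examples, not a claim about EM German.
from transformers import AutoTokenizer

sample = (
    "Die Wissenschaftlerinnen veröffentlichten ihre Forschungsergebnisse "
    "über die Geschwindigkeitsbegrenzung auf deutschen Autobahnen."
)

for name in ["meta-llama/Llama-2-7b-hf", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.tokenize(sample))
    n_words = len(sample.split())
    print(f"{name}: {n_tokens} tokens for {n_words} words "
          f"(fertility ~ {n_tokens / n_words:.2f})")
```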

nps798 commented 11 months ago

Will have a look and wait for their paper. Thanks!

Best regards.