jphme / EM_German

Repository for the EM German Model

What's the code you use to continue pretraining in German? #5

Closed nps798 closed 11 months ago

nps798 commented 11 months ago

Hi, thanks for all your work! I am wondering whether you would be willing to share the code you use to continue pretraining and fine-tuning in German? Do you extend the original tokenizer vocabulary?

πŸ™πŸ™πŸ‘

jphme commented 11 months ago

Hi, for the pretraining please have a look at LeoLM. They will also publish a paper with all the details on the training soon. (I also did some pretraining for the 7b Llama2 model, but only a fraction of LeoLM's.)
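
For readers looking for a starting point, a minimal sketch of continued causal-LM pretraining on German text with Hugging Face Transformers might look like the following. The base model, dataset, and hyperparameters are illustrative assumptions only, not the setup actually used for EM German or LeoLM.

```python
# Hedged sketch of continued causal-LM pretraining on German text.
# Model name, dataset, and hyperparameters are placeholder assumptions,
# NOT the actual EM German / LeoLM training configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"   # assumed base model (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any large German text corpus works here; the German OSCAR split is one option.
raw = load_dataset("oscar", "unshuffled_deduplicated_de", split="train[:1%]")

def tokenize(batch):
    # Plain next-token prediction: just tokenize the raw German text.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-de-continued",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM labels (shifted input ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```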

We didn't extend the original tokenizer vocabulary. The Llama2 tokenizer is actually not well suited for German text, and I was surprised myself that the pretrained Mistral model is able to generate such good text despite this (though I don't know whether that is different for non-Romance languages).
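
A rough way to see how well a tokenizer handles German is to compare tokens-per-word (fertility) on a German sample. The snippet below is a hedged sketch; the model identifiers are just examples (the official Llama 2 checkpoint is gated on the Hub), and the numbers say nothing about the models EM German actually uses.

```python
# Compare how many tokens different tokenizers need for the same German sentence.
# Higher tokens-per-word (fertility) roughly means the tokenizer is less
# efficient for German. Model names are examples, not a claim about EM German.
from transformers import AutoTokenizer

sample = (
    "Die Wissenschaftlerinnen veröffentlichten ihre Forschungsergebnisse "
    "über die Geschwindigkeitsbegrenzung auf deutschen Autobahnen."
)

for name in ["meta-llama/Llama-2-7b-hf", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.tokenize(sample))
    n_words = len(sample.split())
    print(f"{name}: {n_tokens} tokens for {n_words} words "
          f"(fertility ~ {n_tokens / n_words:.2f})")
```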

nps798 commented 11 months ago

Will have a look and wait for their paper. Thanks!

Best regards.