jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

Does it support the Spanish language? #141

Open wilfoderek opened 5 months ago

wilfoderek commented 5 months ago

Excellent work, guys! My question is: does this model support the Spanish language, or which languages does it support? Can it be trained in Spanish? How much time and what resources would be necessary for this?

Have an excellent day, everyone on the team!

ChaosCodes commented 5 months ago

Hi, currently our training datasets mainly contain English text, so the model saw very little Spanish during pretraining. However, you could collect roughly 50B tokens of high-quality Spanish text, mix it with the SlimPajama corpus, and continually pretrain our model on that data.
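
A minimal sketch of one way to do that mixing with the Hugging Face `datasets` library. The Spanish dataset ID is a placeholder and the 70/30 mixing ratio is only an illustrative assumption, not the recipe used for TinyLlama's own pretraining:

```python
# Hedged sketch: mix a Spanish corpus with SlimPajama for continual pretraining.
# The Spanish dataset ID and the 70/30 ratio below are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

# SlimPajama (English) in streaming mode to avoid downloading the full corpus.
slimpajama = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

# A hypothetical Spanish corpus; swap in whatever high-quality data you collect.
spanish = load_dataset("my_org/spanish_corpus", split="train", streaming=True)  # placeholder ID

# Interleave the two streams, e.g. 70% Spanish / 30% SlimPajama, so the model
# learns Spanish while still seeing English data (ratio should be tuned).
mixed = interleave_datasets(
    [spanish, slimpajama],
    probabilities=[0.7, 0.3],
    seed=42,
)

# Peek at a few mixed examples.
for example in mixed.take(3):
    print(example["text"][:100])
```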

wilfoderek commented 5 months ago

How much would it cost to train a TinyLlama in Spanish?

wilfoderek commented 5 months ago

Thanks for your answer.

ChaosCodes commented 5 months ago

> How much would it cost to train a TinyLlama in Spanish?

It depends on the number of tokens. For example, you would need about half a month for ~250B tokens on 8 A40s.

wilfoderek commented 5 months ago

Considering current prices and your estimated time, that's approximately $3,168.
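
For reference, a back-of-the-envelope version of that estimate. The ~$1.10/hour A40 rental rate is an assumption; actual cloud prices vary:

```python
# Rough cost estimate for ~250B tokens on 8 A40s over ~half a month.
# The hourly price is an assumed cloud rental rate, not a quoted figure.
gpus = 8
hours = 15 * 24            # ~half a month of wall-clock time
price_per_gpu_hour = 1.10  # assumed A40 rental price in USD

cost = gpus * hours * price_per_gpu_hour
print(f"~${cost:,.0f}")    # ~$3,168
```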

ChaosCodes commented 5 months ago

I am not sure how many tokens are required to get a good continually pretrained model for Spanish. It may be less than 250B. Sorry, I have no experience with that.

demetera commented 4 months ago

Thank you for the explanations and your awesome model. I have a small question about mixing a non-English corpus with SlimPajama. Is it mandatory? In what proportion should it be done? If I have a book corpus, can I split it by sentence and train on a small context size (32-64?) with max_length padding? A sketch of what I mean is below.
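
A minimal sketch of that split-by-sentence-and-pad idea, just to make the question concrete. The regex splitter, the tokenizer checkpoint, and max_length=64 are illustrative assumptions; whether such short contexts are a good idea is exactly the open question here:

```python
# Hedged sketch: split a book corpus into sentences and pad each to a fixed length.
import re
from transformers import AutoTokenizer

# A naive sentence splitter for illustration; a proper Spanish-aware splitter
# would be preferable in practice.
book_text = "Primera frase del libro. Segunda frase. Tercera frase algo más larga."
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", book_text) if s.strip()]

# Assumed tokenizer checkpoint; any Llama-family tokenizer behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")

batch = tokenizer(
    sentences,
    padding="max_length",  # pad every sentence to the fixed context size
    truncation=True,
    max_length=64,         # the small context size asked about above
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (num_sentences, 64)
```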