meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama 3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, plus a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Demo apps showcase Meta Llama 3 for WhatsApp & Messenger.

llama 3 multilingual recipe #509

Open woohwan opened 1 month ago

woohwan commented 1 month ago

🚀 The feature, motivation and pitch

The current multilingual recipes are for Llama 2. I would like to see Llama 3 multilingual recipes added.

Thank you.

Alternatives

No response

Additional context

Adding multilingual tokens via the Hugging Face tokenizer does not work.

I followed the documentation here: https://huggingface.co/learn/nlp-course/chapter6/2
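For context, that chapter trains a brand-new tokenizer with `train_new_from_iterator`. Below is a minimal sketch of that workflow applied to Llama 3, alongside the alternative of grafting tokens onto the existing tokenizer; the corpus path, vocab size, and sample tokens are illustrative placeholders, not from this thread:

```python
# Sketch of the two common routes from the linked course chapter.
# Assumes access to meta-llama/Meta-Llama-3-8B and a local corpus file.
from transformers import AutoTokenizer, AutoModelForCausalLM

base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Route 1: train a new tokenizer from scratch on a multilingual corpus.
# train_new_from_iterator only works for "fast" (Rust-backed) tokenizers.
def corpus_iterator():
    with open("multilingual_corpus.txt", encoding="utf-8") as f:  # placeholder path
        for line in f:
            yield line

new_tok = base.train_new_from_iterator(corpus_iterator(), vocab_size=32000)

# Route 2: add new tokens to the existing tokenizer and resize the model's
# embedding matrix so the new ids get (randomly initialized) rows.
num_added = base.add_tokens(["가나다", "мир", "नमस्ते"])  # example tokens
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model.resize_token_embeddings(len(base))
```

One pitfall with the `add_tokens` route is that the new embedding rows start randomly initialized, so the model needs further training before the added tokens produce sensible outputs.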

HamidShojanazeri commented 1 month ago

@woohwan thanks for the feature request. Just note that the e2e recipe is more geared toward showing the process. I wonder if you would be interested in contributing a Llama 3 case study?

woohwan commented 1 month ago

Sorry, I'm a newbie in the LLM field.

savanth14 commented 1 month ago

@HamidShojanazeri I am also interested in merging the Llama 3 tokenizer with a new custom tokenizer that I trained from scratch. I understand that the Llama 1 and 2 tokenizers are based on SentencePiece, and the current llama-recipes provide code to merge two SentencePiece tokenizers. However, the Llama 3 tokenizer is based on tiktoken, and there is no official training script available to train a tiktoken tokenizer, let alone merge two of them together. Can you help with the code, or point me in the right direction as to how to merge two tiktoken-based tokenizers? Thanks in advance.
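Since there is no official tiktoken training or merging utility, one rough approach is to treat the vocabulary as what it is in tiktoken, a bytes-to-rank table, and append the second tokenizer's entries under fresh ranks. The following is a hedged sketch, not an official recipe: the file path and token list are placeholders, the pre-tokenization regex must be copied from Meta's llama3 repo, and appended tokens are only matched as whole regex pieces rather than built up through learned BPE merges.

```python
# Sketch: extend a tiktoken-based vocabulary by appending new byte sequences.
# Assumes Llama 3's tokenizer.model (a tiktoken BPE rank file) is available
# locally and that new_tokens comes from your custom tokenizer's vocabulary.
import tiktoken
from tiktoken.load import load_tiktoken_bpe

base_ranks = load_tiktoken_bpe("tokenizer.model")  # dict: bytes -> rank

new_tokens = ["가나다", "مرحبا"]  # placeholder vocab items from the second tokenizer
ranks = dict(base_ranks)
next_rank = max(ranks.values()) + 1
for tok in new_tokens:
    b = tok.encode("utf-8")
    if b not in ranks:  # skip tokens the base vocab already covers
        ranks[b] = next_rank
        next_rank += 1

# Placeholder: copy the exact pat_str from meta-llama/llama3's tokenizer.py,
# since encoding behavior depends on matching the base pre-tokenization regex.
LLAMA3_PAT = r"..."

merged = tiktoken.Encoding(
    name="llama3_extended",
    pat_str=LLAMA3_PAT,
    mergeable_ranks=ranks,
    special_tokens={},  # re-add Llama 3's special tokens in practice
)
```

Whether this behaves acceptably depends on the regex splitting text cleanly at the new tokens' boundaries, and the model's embedding table still has to be extended and trained for the new ids.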

wukaixingxp commented 1 month ago

Hi! Here is the multilingual recipe! Please take a look and let me know if there are any questions!