TransLLM: Why Not Transform Chat Large Language Models to Non-English?

TransLLM is implemented based on the Chinese-LLaMA-Alpaca-2 project.

Data

We provide the following data:

Recovery KD data in English: ./code/train/distil_alpaca_en_52k_llama2-7b-chat.json
Recovery KD data in Thai: ./code/train/distil_alpaca_en_52k_llama2-7b-chat_th_googlemt.json
Alpaca-GPT-4 data in English: ./code/train/alpaca_gpt4_data_en.json
Alpaca-GPT-4 data in Thai: ./code/train/alpaca_gpt4_data_th_googlemt.json
MT-Bench in Thai: ./code/test/mt_bench_question.xlsx
Alpaca-Eval in Thai: ./code/test/alpaca_eval.xlsx
Example data format of experiments: ./code/train/example

Traning

Model Extension

Use SentencePiece to learn the Thai vocabulary on mc4-TH. Merege the vocabulary as described in Chinese-LLaMA-Alpaca-2.

Target Language Pre-Training

Prepare mc4-TH in txt format, and the target chat model (such as llama2-chat-7b-hf).
Change the data path and model path in the ./train/run_pt_1.sh.
Run run_pt_1.sh.

Translation Pre-Training

Prepare Pile data and EN-TH parallel data in txt format
Change the data path and model path in the ./train/run_pt_2.sh.
Run run_pt_2.sh.

Transfer Fine-Tuning

Translate the Recovery KD data to Thai, organize TCOT data and SFT Translation data.
Change the data path and model path in the ./train/run_sft.sh.
Run run_sft.sh.

Evluation

We provide the following scripts for evaluation

Merge the LoRA model: ./Chinese-LLaMA-Alpaca-2/scripts/merge_llama2_with_chinese_lora_low_mem.py
Generate output for mt_bench: ./eval/mt_bench_generate.py
Generate output for alpaca_eval: ./eval/alpaca_eval_generate.py
Generate GPT-4 evaluations: ./eval/gpt4_eval.py

Notice

We have modified some files in ./Chinese-LLaMA-Alpaca-2/scripts/training.

run_clm_pt_with_peft.py
run_clm_sft_with_peft.py
build_dataset.py
build_distil_dataset.py

License

The code and data is released under Apache License 2.0.

Citation

Please cite as:

@misc{geng2024TransLLM,
      title={Why Not Transform Chat Large Language Models to Non-English?}, 
      author={Xiang Geng and Ming Zhu and Jiahuan Li and Zhejian Lai and Wei Zou and Shuaijie She and Jiaxin Guo and Xiaofeng Zhao and Yinglu Li and Yuang Li and Chang Su and Yanqing Zhao and Min Zhang and Hao Yang and Xinglin Lyu and Jiajun Chen and Shujian Huang},
      year={2024},
      eprint={2405.13923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

hy5468 / TransLLM

readme