🐫 MBZUAI Bactrian-X

A Multilingual Replicable Instruction-Following Model

Haonan Li*, Fajri Koto*, Minghao Wu, Alham Fikri Aji, Timothy Baldwin (*equal contribution)

Overview

The Bactrian-X dataset contains 3.4M pairs of instructions and responses in 52 languages. The instructions were obtained from alpaca-52k and dolly-15k, and translated into 52 languages (52 languages x 67k instances = 3.4M instances). The responses in the 52 languages were generated with the gpt-3.5-turbo model.

Bactrian-X models are a series of LLMs fine-tuned on the Bactrian-X dataset using low-rank adaptation (LoRA).

Usage and License Notices: Bactrian-X is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

Dataset

We curate our Bactrian instruction dataset with the following steps:

  1. Collecting English instructions: The English instructions are obtained from alpaca-52k and dolly-15k, and they are saved to instructions.json.
  2. Translating the English instructions into foreign languages: The instructions (and the corresponding inputs, if any) are translated into 51 languages using the Google Translate API (conducted in April 2023).
  3. Generating the responses: We generate output from gpt-3.5-turbo for the instructions in each language (conducted in April 2023).
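
The released dataset can be inspected with the Hugging Face datasets library. The snippet below is a minimal sketch: the repository id (MBZUAI/Bactrian-X), the per-language configuration name (en), and the column names are assumptions inferred from this README, so check the dataset card for the exact identifiers.

# Minimal sketch (assumed repo id, config, and column names): load and
# inspect one language of the Bactrian-X dataset from the Hugging Face Hub.
from datasets import load_dataset

dataset = load_dataset("MBZUAI/Bactrian-X", "en", split="train")
print(dataset)                       # row count and column names
example = dataset[0]
print(example["instruction"])        # instruction from alpaca-52k / dolly-15k
print(example.get("input", ""))      # optional input/context field
print(example["output"])             # response generated by gpt-3.5-turbo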

Models

With our dataset and Low-Rank Adaptation (LoRA), we present a family of multilingual and monolingual models based on LLaMA and BLOOM. Our instruction-tuned multilingual Bactrian-X adapters are available on the Hugging Face model hub under the MBZUAI organization (e.g., MBZUAI/bactrian-x-llama-7b-lora, used in the inference example below).

Note: We are continually updating this repository. The number of covered languages will grow beyond 52 in the future, and the current models are mostly 7B in size. We welcome any collaborators who are willing to contribute larger models.
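
For a quick programmatic check, an adapter can be attached to its base model with transformers and peft, as in the minimal sketch below. It assumes the multilingual LLaMA-7B adapter MBZUAI/bactrian-x-llama-7b-lora that is also used in the inference command later in this README; the BLOOM-based adapters load analogously with their own base model.

# Minimal sketch: load the base model and wrap it with a Bactrian-X LoRA adapter.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

# Attach the trained LoRA weights on top of the frozen base model.
model = PeftModel.from_pretrained(base_model, "MBZUAI/bactrian-x-llama-7b-lora")
model.eval()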

Hands-on Bactrian-X

Setting up the Environment

conda create -n bactrian python=3.9
conda activate bactrian
pip install -r requirements.txt

Training

Models are trained with the following hyperparameters:

Hyper-parameter   Bactrian-X
batch_size        128
num_epochs        4
learning_rate     3e-4
cutoff_len        768
lora_r            64
lora_alpha        16
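
For reference, these settings map onto a peft LoraConfig roughly as sketched below; this is illustrative rather than the exact object constructed inside finetune.py, with lora_dropout and target_modules taken from the training command that follows.

# Illustrative sketch: the hyper-parameters above expressed as a peft LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                       # lora_r
    lora_alpha=16,              # lora_alpha
    lora_dropout=0.05,          # from the training command below
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)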

Below is a command to train a LLaMA-7B adapter on our dataset in specific language(s). Replace <lang_iso> with one or more ISO 639-1 language codes separated by commas (e.g., en,zh for English and Chinese), and <your_output_dir> with the directory in which to store the outputs.

# Script to train on 4x Nvidia A100 80GB GPUs
export WORLD_SIZE=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nproc_per_node=4 --master_port=1234 finetune.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --lang <lang_iso> \
    --output_dir <your_output_dir> \
    --load_in_8bit \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 8 \
    --model_max_length 768 \
    --learning_rate 3e-4 \
    --val_set_size 2000 \
    --warmup_steps 200 \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_target_modules 'q_proj,k_proj,v_proj,o_proj' \
    --group_by_length 

Inference

The following command loads both the foundation model and the Bactrian LoRA weights from the Hugging Face model hub, and runs a Gradio interface for inference on a specified input.

python generate.py \
    --load_8bit \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_weights 'MBZUAI/bactrian-x-llama-7b-lora' \
    --share_gradio 
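
For inference without Gradio, generation can also be scripted directly. The sketch below reuses the model and tokenizer from the loading sketch in the Models section; the Alpaca-style prompt template is an assumption, so check the prompt construction in generate.py for the exact format used by this repository.

# Minimal sketch of a direct generation call, reusing `model` and `tokenizer`
# from the loading sketch in the Models section. The prompt template is an
# assumed Alpaca-style format, not necessarily the one used in generate.py.
import torch

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat are the colors of the Indonesian flag?\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))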

Checkpoint export

To merge the LoRA weights back into the base model for export to Hugging Face format and to PyTorch state_dicts, go to Alpaca-LoRA. This should help users who want to run inference in projects like llama.cpp or alpaca.cpp.
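
Alternatively, recent versions of peft can merge the adapter directly via merge_and_unload; the sketch below is one way to do this, with the output directory name chosen purely for illustration.

# Minimal sketch: fold the LoRA weights into the base model with peft and save
# a standalone Hugging Face checkpoint. The output directory is illustrative.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "MBZUAI/bactrian-x-llama-7b-lora")

merged = model.merge_and_unload()   # returns the base model with LoRA deltas merged in
merged.save_pretrained("bactrian-x-llama-7b-merged")
LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf").save_pretrained(
    "bactrian-x-llama-7b-merged"
)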

Output Examples

Please check output examples here.

Citation

Please cite this repository if you use its data, models, or code. A paper will be released very soon.

@misc{li2023bactrianx,
      title={Bactrian-X: A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation}, 
      author={Haonan Li and Fajri Koto and Minghao Wu and Alham Fikri Aji and Timothy Baldwin},
      year={2023},
      eprint={2305.15011},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Naturally, you should also cite the original LLaMA paper, the Self-Instruct paper, and the Stanford Alpaca repo.

Acknowledgements

We are standing on the shoulders of giants and would like to especially acknowledge the previous efforts of the following works:

  1. Stanford Alpaca
  2. Alpaca-LoRA
  3. Low-Rank Adaptation (LoRA)
  4. PEFT
  5. LLM.int8()