irthomasthomas / undecidability


Qwen - supervised finetuning script and guide for SFT. #660

Open irthomasthomas opened 4 months ago

irthomasthomas commented 4 months ago

Example - Qwen

DESCRIPTION:
Here we provide a very simple script for supervised finetuning (SFT), revised from the training script in FastChat. The script finetunes Qwen with the Hugging Face Trainer; you can check the script in the examples/sft directory of the Qwen repository. It has the following features:

- support for single-GPU and multi-GPU training;
- support for full-parameter tuning, LoRA, and Q-LoRA.

In the following, we introduce more details about the usage of the script.

Installation
Before you start, make sure you have installed the following packages:

```bash
pip install peft deepspeed optimum accelerate
```

Data Preparation
For data preparation, we advise you to organize the data in a jsonl file, where each line is a dictionary as demonstrated below (pretty-printed here for readability; in the actual file each sample occupies a single line):

```json
{
    "type": "chatml",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Tell me something about large language models."
        },
        {
            "role": "assistant",
            "content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."
        }
    ],
    "source": "unknown"
}
{
    "type": "chatml",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is your name?"
        },
        {
            "role": "assistant",
            "content": "My name is Qwen."
        }
    ],
    "source": "self-made"
}
```

Above are two examples of data samples from the dataset. Each sample is a JSON object with the following fields: type, messages, and source. messages is required, while the other two are optional labels for the data format and data source. The messages field is a list of JSON objects, each of which has two fields: role and content. role can be system, user, or assistant, and content is the text of the message. source indicates where the data comes from, which can be self-made, alpaca, open-hermes, or any other string.
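As an illustration only, such samples can be assembled as plain Python dictionaries before being written out; the conversation text and source value below are placeholders, not taken from the Qwen documentation:

```python
# Illustration: build a list of samples that follow the schema described above.
# The conversation content and "source" value are placeholders.
samples = [
    {
        "type": "chatml",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is supervised finetuning?"},
            {"role": "assistant", "content": "Supervised finetuning trains a model on curated prompt-response pairs..."},
        ],
        "source": "self-made",
    },
]
```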

To create the jsonl file, you can use the json module to write the list of dictionaries, one per line:

```python
import json

# Write each sample as one JSON object per line (the jsonl format).
with open('data.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')
```
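It can also help to sanity-check the resulting file before training. The following is a minimal sketch (not part of the official script) that verifies each line parses and contains the required messages field:

```python
import json

# Sketch: check that every line is valid JSON and carries a well-formed "messages" list.
with open('data.jsonl') as f:
    for i, line in enumerate(f, start=1):
        sample = json.loads(line)
        messages = sample.get("messages", [])
        assert messages, f"line {i}: 'messages' is required"
        for msg in messages:
            assert msg.get("role") in {"system", "user", "assistant"}, f"line {i}: unexpected role"
            assert isinstance(msg.get("content"), str), f"line {i}: 'content' must be a string"
```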

Quickstart
To start finetuning quickly, we provide a shell script that you can run without worrying about the details. Note that different types of training require different hyperparameters, e.g., single-GPU vs. multi-GPU training, full-parameter tuning, LoRA, or Q-LoRA.

```bash
cd examples/sft
bash finetune.sh -m <model_path> -d <data_path> --deepspeed <config_path> [--use_lora True] [--q_lora True]
```

Specify <model_path> for your model, <data_path> for your data, and <config_path> for your DeepSpeed configuration. If you use LoRA or Q-LoRA, add --use_lora True or --q_lora True as required. This is the simplest way to start finetuning. If you want to change more hyperparameters, you can dive into the script and modify them there.
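For context, --use_lora relies on the peft package installed earlier. The sketch below only illustrates what a LoRA setup with peft generally looks like; the rank, alpha, and target modules are illustrative assumptions, and the actual values are defined inside the finetune script:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA configuration; the finetune script sets its own defaults.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # rank of the low-rank update matrices
    lora_alpha=16,       # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

model = AutoModelForCausalLM.from_pretrained("<model_path>")  # placeholder path
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```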

Advanced Usages
In this section, we introduce the details of the scripts, including the core Python script as well as the corresponding shell script.

Shell Script
Before we introduce the Python code, we give a brief introduction to the shell script and its commands. We provide some guidance inside the shell script itself, and here we take finetune.sh as an example.

To set up the environment variables for distributed training (or single-GPU training), specify the following variables: GPUS_PER_NODE, NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT. There is no need to worry too much about them, as we provide default settings for you. In the command, you can pass in the arguments -m and -d to specify the model path and data path, respectively. You can also pass in --deepspeed to specify the DeepSpeed configuration file. We provide two configuration files, for ZeRO2 and ZeRO3, and you can choose one based on your requirements. In most cases we recommend ZeRO3 for multi-GPU training, except for Q-LoRA, where we recommend ZeRO2.
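The repository ships its own ZeRO2 and ZeRO3 configuration files; the snippet below is only a minimal sketch of what a ZeRO3-style DeepSpeed configuration contains, written as a Python dict so it can be dumped to JSON. The key names follow the standard DeepSpeed schema, the values are illustrative, and the file name is just an example:

```python
import json

# Minimal ZeRO-3 style DeepSpeed config sketch (illustrative, not the shipped file).
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```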

There is a series of hyperparameters to tune. Pass in --bf16 or --fp16 to specify the precision for mixed-precision training. The other significant hyperparameters, such as the learning rate, batch size, and number of epochs, are described in the full guide linked below.
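Since the script drives the Hugging Face Trainer, these flags ultimately map onto transformers.TrainingArguments. The sketch below shows roughly how precision, batch size, and the DeepSpeed config come together; the values are illustrative rather than the script's defaults:

```python
from transformers import TrainingArguments

# Illustrative mapping of the shell flags onto TrainingArguments;
# the real defaults live in the finetune script itself.
training_args = TrainingArguments(
    output_dir="output_qwen",          # placeholder output directory
    bf16=True,                         # or fp16=True on GPUs without bf16 support
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=3,
    deepspeed="ds_config_zero3.json",  # path to the chosen DeepSpeed config
)
```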

URL: https://qwen.readthedocs.io/en/latest/training/SFT/example.html

Suggested labels

irthomasthomas commented 4 months ago

Related issues

324: bigcode/tiny_starcoder_py · Hugging Face

### DetailsSimilarity score: 0.9 > **Note:** > > [bigcode/tiny_starcoder_py · Hugging Face](https://huggingface.co/bigcode/tiny_starcoder_py) > > TinyStarCoderPy > > This is a 164M parameters model with the same architecture as StarCoder (8k context length, MQA & FIM). It was trained on the Python data from StarCoderData for ~6 epochs which amounts to 100B tokens. > > Use > > Intended use > > The model was trained on GitHub code, to assist with some tasks like Assisted Generation. For pure code completion, we advise using our 15B models StarCoder or StarCoderBase. > > Generation > > ```python > # pip install -q transformers > from transformers import AutoModelForCausalLM, AutoTokenizer > > checkpoint = "bigcode/tiny_starcoder_py" > device = "cuda" # for GPU usage or "cpu" for CPU usage > > tokenizer = AutoTokenizer.from_pretrained(checkpoint) > model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) > > inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Fill-in-the-middle > > Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output: > > ```python > input_text = "def print_one_two_three():\n print('one')\n \n print('three')" > inputs = tokenizer.encode(input_text, return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Training > > Model > > - Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective > - Pretraining steps: 50k > - Pretraining tokens: 100 billion > - Precision: bfloat16 > > Hardware > > - GPUs: 32 Tesla A100 > - Training time: 18 hours > > Software > > - Orchestration: Megatron-LM > - Neural networks: PyTorch > - BP16 if applicable: apex > > License > > The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/bigcode/tiny_starcoder_py/blob/main/LICENSE). > > #### Suggested labels > > - { "key": "llm-pretraining", "value": "Information related to the pretraining process of Large Language Models" }

167: Model training code from NousResearch/StripedHyenaTrainer

### Details

Similarity score: 0.89

- [ ] [NousResearch/StripedHyenaTrainer](https://github.com/NousResearch/StripedHyenaTrainer)

  This is the training code used to train StripedHyena-Nous-7B. First, tokenize your data:

  ```bash
  python tokenization.py \
      --dataset your-super-cool-sharegpt-format-dataset \
      --tokenizer togethercomputer/StripedHyena-Hessian-7B \
      --output tokenized \
      --num-proc 32 \
      --pad-to-length 4096 \
      --truncate
  ```

  Make sure you have done `accelerate config` -- we used the provided DeepSpeed configuration. Then, train!

  ```bash
  accelerate launch finetune.py \
      --model togethercomputer/StripedHyena-Hessian-7B \
      --dataset tokenized \
      --output-dir output \
      --epochs 4 \
      --batch-size 12 \
      --gradient-accumulate-every 12 \
      --warmup-steps 350 \
      --learning-rate 0.000004 \
      --lr-schedule linear \
      --weight-decay 0.1 \
      --checkpointing-steps 1000 \
      --no-decay poles residues
  ```

  The --no-decay option disables weight decay on only the specified parameters. For StripedHyena, we've found that disabling weight decay on the Hyena operator's poles and residues parameters improves performance. There is also an option --frozen that can completely freeze select parameter groups.

389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

### DetailsSimilarity score: 0.89 - [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569) **Quantization and Acceleration** We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax: ```bash python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors ``` * `TheBloke/Nous-Hermes-Llama2-AWQ`: The name of the repository/model on the Hugging Face Hub. * `output`: Specifies the target directory and model name you want to save. * `format`: Optionally, you can save as safetensors. For llama-like models, we download the `tokenizer.model` and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config: ```yaml transforms: [sentencepiece] #### Subword src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" # Model info model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" # Inference # ... ``` When considering your priority: - For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes. - AWQ models are faster than FP16 for batch size 1. Please read more here: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ) **Important Note:** - There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV. - The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ. - If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible. **Offline Quantizer Script:** - We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT. Enjoy! --- **VS**: Fast Inference with vLLM Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows: - Batch size 1: 80.5 tokens/second - Batch size 60: 98 tokens/second, with GEMV being 20-25% faster. This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time. #### Suggested labels #### { "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

431: awq llama quantization

### DetailsSimilarity score: 0.89 - [ ] [awq llama quantization](huggingface.co) Quantization and Acceleration ---------------------------- We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. ### Model Conversion Here's an example of the syntax for converting a model: ```python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors ``` - `TheBloke/Nous-Hermes-Llama2-AWQ`: The name of the repository/model on the Hugging Face Hub. - `output`: Specifies the target directory and model name you want to save. - `format`: Optionally, you can save as safetensors. For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. ### Config File After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config: ```yaml transforms: [sentencepiece] Subword: src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" Model info: model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" Inference: # ... ``` ### Priority When considering your priority: - For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes. - AWQ models are faster than FP16 for batch size 1. - Read more: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ) ### Important Note - There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV. - The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ. - If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible. ### Offline Quantizer Script We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT. ### vLLM Performance Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows: - Batch size 1: 80.5 tokens/second - Batch size 60: 98 tokens/second, with GEMV being 20-25% faster. - This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time. #### Suggested labels #### null

383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### DetailsSimilarity score: 0.89 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base) Deepseek Coder Introduction ---------------------------- Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks. ### Key Features - **Massive Training Data:** Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages. - **Highly Flexible & Scalable:** Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements. - **Superior Model Performance:** State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks. - **Advanced Code Completion Capabilities:** A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks. ### Model Summary - **deepseek-coder-5.7bmqa-base:** A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens. - **Home Page:** [DeepSeek](http://deepseek.com) - **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder) - **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://github.com/deepseek-ai/deepseek-coder/discussions) ### How to Use This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks. 
#### Code Completion ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = "#write a quick sort algorithm" inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_length=128) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` #### Code Insertion ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = """<|begin|>def quick_sort(arr): if len(arr) <= 1: return arr pivot = arr[0] left = [] right = [] <|hole|> if arr[i] < pivot: left.append(arr[i]) else: right.append(arr[i]) return quick_sort(left) + [pivot] + quick_sort(right)<|end|>""" inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_length=128) print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):]) ``` #### Repository Level Code Completion ```python from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = """#utils.py import torch from sklearn import datasets from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import accuracy_score def load_data(): iris = datasets.load_iris() X = iris.data y = iris.target # Standardize the data scaler = StandardScaler() X = scaler.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Convert numpy data to PyTorch tensors X_train = torch.tensor(X_train, dtype=torch.float32) X_test = torch.tensor(X_test, dtype=torch.float32) y_train = torch.tensor(y_train, dtype=torch.int64) y_test = torch.tensor(y_test, dtype=torch.int64) return X_train, X_test, y_train, y_test def evaluate_predictions(y_test, y_pred): return accuracy_score(y_test, y_pred) #model.py import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import DataLoader, TensorDataset class IrisClassifier(nn.Module): def __init__(self): super(IrisClassifier, self).__init__() self.fc = nn.Sequential( nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3) ) def forward(self, x): return self.fc(x) def train_model(self, X_train, y_train, epochs, lr, batch_size): criterion = nn.CrossEntropyLoss() optimizer = optim.Adam(self.parameters(), lr=lr) # Create DataLoader for batches dataset = TensorDataset(X_train, y_train) dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) for epoch in range(epochs): for batch_X, batch_y in dataloader: optimizer.zero_grad() outputs = self(batch_X) loss = criterion(outputs, batch_y) loss.backward() optimizer.step() def predict(self, X_test): with torch.no_grad(): outputs = self(X_test) _, predicted = outputs.max(1) return predicted.numpy() #main.py from utils import load_data, evaluate_predictions from model import IrisClassifier as Classifier def main(): 
# Model training and evaluation """ inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=140) print(tokenizer.decode(outputs[0])) ``` License ------- This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use. See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details. Contact ------- If you have any questions, please raise an issue or contact us at [agi\_code@deepseek.com](mailto:agi_code@deepseek.com). #### Suggested labels #### { "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

309: openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"

### DetailsSimilarity score: 0.88 - [ ] [openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"](https://github.com/openai/human-eval) HumanEval: Hand-Written Evaluation Set This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code". Installation Make sure to use python 3.7 or later: $ conda create -n codex python=3.7 $ conda activate codex Check out and install this repository: $ git clone https://github.com/openai/human-eval $ pip install -e human-eval Usage This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions. After following the above instructions to enable execution, generate samples and save them in the following JSON Lines (jsonl) format, where each sample is formatted into a single line like so: {"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"} We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging. Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl. from human_eval.data import write_jsonl, read_problems problems = read_problems() num_samples_per_task = 200 samples = [ dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"])) for task_id in problems for _ in range(num_samples_per_task) ] write_jsonl("samples.jsonl", samples) To evaluate the samples, run $ evaluate_functional_correctness samples.jsonl Reading samples... 32800it [00:01, 23787.50it/s] Running test suites... 100%|...| 32800/32800 [16:11<00:00, 33.76it/s] Writing results to samples.jsonl_results.jsonl... 100%|...| 32800/32800 [00:00<00:00, 42876.84it/s] {'pass@1': ..., 'pass@10': ..., 'pass@100': ...} This script provides more fine-grained information in a new file ending in _results.jsonl. Each row now contains whether the completion passed along with the execution result which is one of "passed", "timed out", or "failed". As a quick sanity-check, the example samples should yield 0.5 pass@1. $ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl Reading samples... 6it [00:00, 3397.11it/s] Running example suites... 100%|...| 6/6 [00:03<00:00, 1.96it/s] Writing results to data/example_samples.jsonl_results.jsonl... 100%|...| 6/6 [00:00<00:00, 6148.50it/s] {'pass@1': 0.4999999999999999} Because there is no unbiased way of estimating pass@k when there are fewer samples than k, the script does not evaluate pass@k for these cases. To evaluate with other k values, pass --k=. For other options, see $ evaluate_functional_correctness --help However, we recommend that you use the default values for the rest. Known Issues While evaluation uses very little memory, you might see the following error message when the system is running out of RAM. Since this may cause some correct programs to fail, we recommend that you free some memory and try again. 
malloc: can't allocate region Citation Please cite using the following bibtex entry: @article{chen2021codex, title={Evaluating Large Language Models Trained on Code}, author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba}, year={2021}, eprint={2107.03374}, archivePrefix={arXiv}, primaryClass={cs.LG} } #### Suggested labels #### { "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }