Closed lucasliunju closed 3 months ago
In addition, may I ask how to evaluate the trained model, could you please provide some examples?
Thanks for your interest in our work.
Due to the data of instruction tuning and continual pretraining is large, we tokenize data before training.
To run instruction tuning, we use allenai/tulu-v2-sft-mixture
with following process code based on
https://github.com/allenai/open-instruct:
You can then run it with following command
RANK=8
deepspeed --num_gpus=8 --num_nodes=2 train.py \
--base_model <LLAMA-2> --micro_batch_size 1\
--wandb_run_name mora_math_r8 --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,down_proj,up_proj \
--num_epochs 2 --deepspeed ds.config --wandb_project lora-instruction --lora_r $RANK --batch_size 128 \
--data_path open-instruct-tokenized \
--save_steps 3000 \
--learning_rate 2e-4 --mora_type 6 \
--logging_steps 5 --use_bf16 --use_16bit --use_mora \
--lr_scheduler_type "cosine" --warmup_steps 150
For continual pretraining, we follow adaptLLM and leverage LLM to process domain data from PubMed abstracts and financial news mixed with instruction data from open-orca.
For evaluation, we first merge the MoRA into the LLM.
To evaluate on math, we directly use eval_gsm8k.py
and eval_math.py
from https://github.com/meta-math/MetaMath?tab=readme-ov-file#evaluation.
For instruction evaluation, we use a similar script from https://github.com/allenai/open-instruct/blob/main/scripts/eval/mmlu.sh to get MMLU results.
To evaluate on continual pretraining, we collect corresponding tasks and obtain results through in-context learning.
Hi @kongds
Thanks for your detailed response.
I find the training prompt is
"Below is an instruction that describes a task, paired with an input that provides further context. " "Write a response that appropriately completes the request.\n\n" "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
But the eval prompt in eval_gsm8k.py
is
"Below is an instruction that describes a task. " "Write a response that appropriately completes the request.\n\n" "### Instruction:\n{instruction}\n\n### Response: Let's think step by step."
I'm not sure whether that will cause some difference.
Thank you very much in advance!
This training prompt is also from MetaMath
, you can find it in
https://github.com/meta-math/MetaMath/blob/fe667b14f9a4c51bde9809adafb82b6b5d255800/train_math.py#L48-L52
Thanks for your help!
Thanks for your great repo again! I think your code is pretty good, we can also reproduce the performance of other baselines.
I have reproduced the results on metamath and currently I want to reproduce the results on open-instruct tasks. I would like to ask something about the hyper-parametes. I find the hyper-parameters on the appendix of your paper: warmup steps is 500. But in your provided example, warmup steps is 150. I would like to ask whether this warmup step can cause some difference.
Thank you very much for you help in advance.
Sorry for the confusion.
I checked the experiments and paper, and it seems that the warmup steps mentioned in the paper are incorrect (I will fix it in next version). The warmup steps in our experiments are 150. The 150 warmup steps are from open-instruction
. They use a 0.03 warmup ratio. Since the training steps are 5000, we use 150 warmup steps.
Regarding the influence of different warmup steps, I haven't searched it before. It seems that smaller warmup steps lead to faster convergence compared to larger steps, but I don't know how it influences the final results.
Thanks for your quick reply, I will try it.
Thanks for your interest in our work.
Due to the data of instruction tuning and continual pretraining is large, we tokenize data before training. To run instruction tuning, we use
allenai/tulu-v2-sft-mixture
with following process code based on https://github.com/allenai/open-instruct:click to collapse
from functools import partial import torch from transformers import AutoTokenizer import datasets def encode_with_messages_format(example, tokenizer, max_seq_length, add_bos=False): ''' Here we assume each example has a 'messages' field Each message is a dict with 'role' and 'content' fields. We concatenate all messages with the roles as delimiters and tokenize them together. ''' messages = example['messages'] if len(messages) == 0: raise ValueError('messages field is empty.') def _concat_messages(messages): message_text = "" for message in messages: if message["role"] == "system": message_text += "<|system|>\n" + message["content"].strip() + "\n" elif message["role"] == "user": message_text += "<|user|>\n" + message["content"].strip() + "\n" elif message["role"] == "assistant": message_text += "<|assistant|>\n" + message["content"].strip() + tokenizer.eos_token + "\n" else: raise ValueError("Invalid role: {}".format(message["role"])) return message_text example_text = _concat_messages(messages).strip() tokenized_example = tokenizer(example_text, return_tensors='pt', max_length=max_seq_length, truncation=True) input_ids = tokenized_example.input_ids labels = input_ids.clone() # mask the non-assistant part for avoiding loss for message_idx, message in enumerate(messages): if message["role"] != "assistant": if message_idx == 0: message_start_idx = 0 else: message_start_idx = tokenizer( _concat_messages(messages[:message_idx]), return_tensors='pt', max_length=max_seq_length, truncation=True ).input_ids.shape[1] if message_idx < len(messages) - 1 and messages[message_idx+1]["role"] == "assistant": # here we also ignore the role of the assistant messages_so_far = _concat_messages(messages[:message_idx+1]) + "<|assistant|>\n" else: messages_so_far = _concat_messages(messages[:message_idx+1]) message_end_idx = tokenizer( messages_so_far, return_tensors='pt', max_length=max_seq_length, truncation=True ).input_ids.shape[1] labels[:, message_start_idx:message_end_idx] = -100 if message_end_idx >= max_seq_length: break attention_mask = torch.ones_like(input_ids) return { 'input_ids': input_ids.flatten().tolist(), 'labels': labels.flatten().tolist(), 'attention_mask': attention_mask.flatten().tolist(), } tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf",use_fast=False) dataset = datasets.load_dataset('allenai/tulu-v2-sft-mixture') max_seq_length = 2048 num_added_tokens = tokenizer.add_special_tokens({ "bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", }) encode_function = partial( encode_with_messages_format, tokenizer=tokenizer, max_seq_length=max_seq_length, add_bos=add_bos, ) lm_datasets = dataset['train'].map( encode_function, batched=False, num_proc=40, remove_columns=dataset["train"].column_names, desc="Tokenizing and reformatting instruction data", ) lm_datasets = lm_datasets.filter(lambda example: (torch.LongTensor(example['labels']) != -100).any()) lm_datasets.save_to_disk('open-instruct-tokenized')
You can then run it with following command
RANK=8 deepspeed --num_gpus=8 --num_nodes=2 train.py \ --base_model <LLAMA-2> --micro_batch_size 1\ --wandb_run_name mora_math_r8 --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,down_proj,up_proj \ --num_epochs 2 --deepspeed ds.config --wandb_project lora-instruction --lora_r $RANK --batch_size 128 \ --data_path open-instruct-tokenized \ --save_steps 3000 \ --learning_rate 2e-4 --mora_type 6 \ --logging_steps 5 --use_bf16 --use_16bit --use_mora \ --lr_scheduler_type "cosine" --warmup_steps 150
For continual pretraining, we follow adaptLLM and leverage LLM to process domain data from PubMed abstracts and financial news mixed with instruction data from open-orca.
Hi, I find we still have not define add_bos
and I find the default add_bos=False
cannot work and I guess that should be True
. I would like to ask whether add_bos
should be True
.
Hi, I forgot to remove some code related to add_bos
. (Setting add_bos
to False
should work.)
Here is the revised version. which should run without errors.
from functools import partial
import torch
from transformers import AutoTokenizer
import datasets
def encode_with_messages_format(example, tokenizer, max_seq_length, add_bos=False):
'''
Here we assume each example has a 'messages' field Each message is a dict with 'role' and 'content' fields.
We concatenate all messages with the roles as delimiters and tokenize them together.
'''
messages = example['messages']
if len(messages) == 0:
raise ValueError('messages field is empty.')
def _concat_messages(messages):
message_text = ""
for message in messages:
if message["role"] == "system":
message_text += "<|system|>\n" + message["content"].strip() + "\n"
elif message["role"] == "user":
message_text += "<|user|>\n" + message["content"].strip() + "\n"
elif message["role"] == "assistant":
message_text += "<|assistant|>\n" + message["content"].strip() + tokenizer.eos_token + "\n"
else:
raise ValueError("Invalid role: {}".format(message["role"]))
return message_text
example_text = _concat_messages(messages).strip()
tokenized_example = tokenizer(example_text, return_tensors='pt', max_length=max_seq_length, truncation=True)
input_ids = tokenized_example.input_ids
labels = input_ids.clone()
# mask the non-assistant part for avoiding loss
for message_idx, message in enumerate(messages):
if message["role"] != "assistant":
if message_idx == 0:
message_start_idx = 0
else:
message_start_idx = tokenizer(
_concat_messages(messages[:message_idx]), return_tensors='pt', max_length=max_seq_length, truncation=True
).input_ids.shape[1]
if message_idx < len(messages) - 1 and messages[message_idx+1]["role"] == "assistant":
# here we also ignore the role of the assistant
messages_so_far = _concat_messages(messages[:message_idx+1]) + "<|assistant|>\n"
else:
messages_so_far = _concat_messages(messages[:message_idx+1])
message_end_idx = tokenizer(
messages_so_far,
return_tensors='pt',
max_length=max_seq_length,
truncation=True
).input_ids.shape[1]
labels[:, message_start_idx:message_end_idx] = -100
if message_end_idx >= max_seq_length:
break
attention_mask = torch.ones_like(input_ids)
return {
'input_ids': input_ids.flatten().tolist(),
'labels': labels.flatten().tolist(),
'attention_mask': attention_mask.flatten().tolist(),
}
tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf",use_fast=False)
dataset = datasets.load_dataset('allenai/tulu-v2-sft-mixture')
max_seq_length = 2048
num_added_tokens = tokenizer.add_special_tokens({
"bos_token": "<s>",
"eos_token": "</s>",
"unk_token": "<unk>",
"pad_token": "<pad>",
})
encode_function = partial(
encode_with_messages_format,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
)
lm_datasets = dataset['train'].map(
encode_function,
batched=False,
num_proc=40,
remove_columns=dataset["train"].column_names,
desc="Tokenizing and reformatting instruction data",
)
lm_datasets = lm_datasets.filter(lambda example: (torch.LongTensor(example['labels']) != -100).any())
lm_datasets.save_to_disk('open-instruct-tokenized')
Thanks for your reply. I will try it.
Hi, I still encounter an error at 904 step
| 904/5010 [5:33:04<25:04:33, 21.99s/it]../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [835,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I'm not sure whether this is because the cuda version is not match.
Hi,
I've switched to using another cluster, and it has successfully run over 1000 steps, which looks promising. I'll check if there are any similar errors to what we encountered before.
Interestingly, although they use the term random_seed
I've noticed that the loss curves are different. I suspect that the set_seed
function may not guarantee full reproducibility of results.
For this error, we set the out index token to 0 if the data_path contains open-instruct-tokenized
in following. If you rename the path, which may cause this error for not replacing this out index token.
https://github.com/kongds/MoRA/blob/0ff64b144e60b54fe7c0ff7b4e76c99c949e923d/train.py#L624-L639
For the reproducibility, set_seed
should works in my testing. The difference may cause from the environment or different micro_batch_size.
Hi, thanks for your reply.
I just check my data path and I use the same path as open-instruct-tokenized
.
I check the codes and experiments again.
I forget add --new_pad_token
in command, following command should work.
Sorry for this mistake.
RANK=8
deepspeed --num_gpus=8 --num_nodes=2 train.py \
--base_model <LLAMA-2> --micro_batch_size 1\
--wandb_run_name mora_math_r8 --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,down_proj,up_proj \
--num_epochs 2 --deepspeed ds.config --wandb_project lora-instruction --lora_r $RANK --batch_size 128 \
--data_path open-instruct-tokenized \
--save_steps 3000 \
--learning_rate 2e-4 --mora_type 6 \
--logging_steps 5 --use_bf16 --use_16bit --use_mora \
--lr_scheduler_type "cosine" --warmup_steps 150 --new_pad_token
This error is caused by the tokenizer using 32000 as the pad token. However, there is text containing "< pad >" in open-instruction, which results in a token with an index of 32000.
Using --new_pad_token
to resize the token embedding from 32000 to 32001 can solve this problem.
Hi,
Thanks for your reply. So I just need to add --new_pad_token
and don't need to change the code of train.py
.
Yes
Thanks, I will try it.
Hi, I find the result is quite different when I use single GPU and 2 gpus to run the code. In addition, I find the loss value is very similar if we use two gpus with different seeds (That is not normal because we hope get different result with different seeds). But, if we use single GPU, the loss value with different seeds is different (normal, the results are different). I think the main reason is that set_seed
does not work when we use two gpus.
Sorry I have so many questions.
It's okay. Thanks for reproducing our work.
Regarding the issue with seed
, we use the seed not only in set_seed
but also in transformers.TrainingArguments
. For transformers.TrainingArguments
, the seed is set within the transformer library again to ensure it reproduce.
https://github.com/kongds/MoRA/blob/0ff64b144e60b54fe7c0ff7b4e76c99c949e923d/train.py#L673
I think the difference may be caused by different gradient_accumulation_steps
. The single GPU uses a large gradient_accumulation_steps
, which may cause the training different.
Hi, thanks for your reply.
I find the loss curve of multi-gpu training is very similar when I use different seeds. That make me consider whether we really set the seed for multi-gpu training. But when I use single GPU to train the model with different seeds, the loss value is different. That means the random seed can work for single GPU but not for multi-gpu. I guess maybe that is from the random of deepspeed.
We use the trainer from transformer
and the seed is also set by transformer
. I think it is compatible with deepspeed
.
Hi, thanks for your reply.
I'm trying to reproduce the results and let you know when I finish it.
Thank you. But I still think the seed
is correctly set even with multi GPUs (we train it on 32 GPUs, and the seed works for us). And the reason between this difference is not the seed
.
Regarding the difference in loss, I think it is acceptable for the loss to be similar even if we use different seeds. This is because we use the same type of data, and the losses are averaged with a sufficiently large batch size, such as 128.
Hi, thanks for your reply.
Do you try to use different seeds to run the code when you use 32 GPUs? I try to change the seed and I find the loss curve is very similar. That means we can get same loss value with different seeds.
This is my loss curve with different seeds when use 2 gpus, each device has one data:
But when I use single gpu to train and we can get different loss values with different seeds.
Here is my training loss with different seeds, each runs on 16 GPUs.
So sad. Let me check my environment.
Thanks for your help.
Hi I have finished the training and I find the test accuracy of LoRA and MoRA on MMLU-0 are about 49.2, 49.4. I'm not sure whether these results are correct since I find the reported result of LoRA in the paper is 50.2.
Hello, It seems the settings in your experiment are the same as ours (we also use a 2e-4 learning rate to run LoRA in our paper). However, the gradient_accumulation_steps in your experiment differ from ours, I am not sure weather this may cause the different results of LoRA. For example, we use a micro batch size of 1 with 16 GPUs, and the gradient_accumulation_steps is set to 8. But it seems you use 2 GPUs, where the gradient_accumulation_steps can be set to 64.
Thanks for your detailed reply.
Yes, let me try to use more GPUs.
Hi, may I ask the best learning rate for MoRA and LoRA.
In our experiment, 2e-4 seems better for rank 8 of MoRA and LoRA on instruction tuning.
Hi, may I ask a question which is not related MoRA. I am trying to use this codebase to fine-tune llama3-8b, but I find I cannot use vllm to inference and evaluate it if I directly use the inference code from metamath repo. It could be better if you could give me some advice. Thank you very much in advance.
Hello, do you merge the mora or lora before inference?
yes, I have merged lora or mora to the base model and then inference and evaluate with vllm.
but llama2 can work well.
It seems to be a problem with vllm. Maybe you need to upgrade vllm to support llama3.
Thanks for your reply! I'll have a try. By the way, I find the training performance is quite different if I change the version of transformers and peft.
Hi, May I ask how to run the code on Instruction Tuning task and Continual Pretraining.
Thank you very much in advance!