Amazing work @mnoukhov !!
Will review the PRs asap, as a side note, it seems that I can't see the figures on the wandb report :/
Also, could you confirm which versions of the libraries you used?
Thanks a lot!
Sorry, I moved the runs to the workspace so the graphs should be fixed.
My libraries are
accelerate==0.18.0
evaluate==0.4.0
huggingface-hub==0.13.3
torch==2.0.0
transformers==4.28.1
and the latest version of trl built from source
Hi @mnoukhov, could you explain when/what exactly to merge? I'm following the readme and would really appreciate your help. Specifically, when you say to merge `huggyllama/llama-7b` and `llama-7b-se-peft` to create `llama-7b-se`, is `llama-7b-se-peft` referring to the model output after running Step 1 (with `huggyllama/llama-7b`)?
And when you say to then merge `llama-7b-se` with `llama-7b-se-rm-peft` to create the reward model, does `llama-7b-se-rm-peft` refer to the model output after running Step 2 (with `llama-7b-se`)?
That's correct.

base `huggyllama/llama-7b` + `llama-7b-se-peft` = `llama-7b-se`
base `llama-7b-se` + `llama-7b-se-rm-peft` = `llama-7b-se-rm`
base `llama-7b-se` + `llama-7b-se-rl-peft` = `llama-7b-se-rl`
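In code, each of these merges is roughly the following minimal sketch with `peft` (the model ids, dtype, and output path here are illustrative; the `merge_peft` script in #398 is the canonical version):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "huggyllama/llama-7b"
adapter_name = "mnoukhov/llama-7b-se-peft"
output_dir = "llama-7b-se"

# Load the base model, attach the LoRA adapter, then fold the adapter
# weights into the base weights so the result is a plain transformers model.
base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_name)
model = model.merge_and_unload()

model.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_dir)
```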
Do you mind sharing the arguments / shell script you used for each step? I'm using what's listed in the repo and running into memory issues, which seems odd given PEFT + LoRA.
I use the same hyperparameters as those listed, with the slight change that I'm running on 4 GPUs instead of 8, so I change gradient accumulation steps from 4 to 8. I find that I need ~40GB of GPU memory for the RL finetuning step, but it goes back and forth and can get as high as 60GB per GPU. I'm currently working on #436, which should reduce memory requirements enough to allow training on 32GB GPUs.
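The reasoning behind the gradient accumulation change is just keeping the effective batch size constant across GPU counts. A quick sketch (the per-device batch size of 4 is an assumption for illustration, not necessarily what the script sets):

```python
# Effective batch size = per-device batch size * gradient accumulation steps * num GPUs
per_device_batch_size = 4  # illustrative value

on_8_gpus = per_device_batch_size * 4 * 8  # grad_accum=4 on 8 GPUs
on_4_gpus = per_device_batch_size * 8 * 4  # grad_accum=8 on 4 GPUs
assert on_8_gpus == on_4_gpus == 128
```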
I'm seeing similar behavior. Around 40GB of GPU memory is needed for the second step (training the reward model) too, right?
You can check the exact memory and compute things by looking at the runs linked in the wandb report (e.g. my RLHF run shows the memory consumption in the "System" charts)
Given that I've essentially repro'd the results and my PRs have been merged, I'm closing this issue and continuing in #471 with a repro using the more compute-efficient multi-adapter paradigm. Feel free to keep commenting about the repro and I'll try to respond.
Hi Michael, continuing the conversation from #401 here. When I try to run the supervised finetuning script out of the box, I get the following warning messages:
Training...
Using pad_token, but it is not set yet.
UserWarning: You passed `packing=True` to the SFTTrainer, and you are training your model with `max_steps` strategy. The dataset will be iterated until the `max_steps` are reached.
FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
Token indices sequence length is longer than the specified maximum sequence length for this model (4899 > 2048). Running this sequence through the model will result in indexing errors
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
I'm getting a sudden spike in train loss around 800 global steps in, and I'm wondering if these warning messages have anything to do with it. Any ideas?
None of those messages are related. There are other issues about training instability, and without more info it's hard to diagnose the problem. If you want advice specific to your situation, it would help to know things like reward, KL, etc.
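As an aside, the pad_token warning can usually be silenced by giving the tokenizer an explicit pad token before training. A generic sketch of the common pattern for LLaMA tokenizers, not necessarily what the example script does:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# LLaMA tokenizers ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```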
If you just want to try some things out, #462 found that setting a larger minibatch size and a larger target KL can help.
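For example, something along these lines in the PPO config (the values are purely illustrative, not a recommendation, and the field names assume trl's `PPOConfig`):

```python
from trl import PPOConfig

config = PPOConfig(
    model_name="llama-7b-se",
    learning_rate=1.4e-5,
    batch_size=64,
    mini_batch_size=16,   # larger PPO minibatch than the default
    ppo_epochs=4,
    adap_kl_ctrl=True,
    init_kl_coef=0.2,
    target=10.0,          # larger target KL for the adaptive controller
)
```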
@mnoukhov Hi Michael, when using the reward model to build the sentiment pipeline in rl_training.py, the output rewards/scores vary drastically depending on the batch_size used in sent_kwargs. I'm wondering if you have investigated this issue in more depth.
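This is roughly how I'm checking it (the model path is a placeholder for my merged reward model, and the pipeline kwargs follow what rl_training.py passes):

```python
import torch
from transformers import pipeline

# Placeholder path for the merged reward model; kwargs follow rl_training.py.
reward_pipe = pipeline(
    "sentiment-analysis",
    model="path/to/llama-7b-se-rm",
    tokenizer="path/to/llama-7b-se-rm",
    device=0 if torch.cuda.is_available() else -1,
)

texts = [
    "Question: How do I reverse a list in Python?\n\nAnswer: Use reversed(my_list).",
    "Question: How do I reverse a list in Python?\n\nAnswer: no idea, sorry",
]
# Score the same texts at different batch sizes and compare.
for bs in (1, 4, 16):
    out = reward_pipe(texts, function_to_apply="none", batch_size=bs, truncation=True)
    print(bs, [round(o["score"], 4) for o in out])
```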
I'm also wondering whether the padding strategy in reward_modeling.py should be `True` or `"max_length"`. Currently, I get the following warning: `UserWarning: max_length is ignored when padding=True and there is no truncation strategy. To pad to max length, use padding='max_length'.`
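Concretely, I'm deciding between calls like these (the tokenizer, text, and `max_length=512` are placeholders roughly following reward_modeling.py):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
text_j = "Question: ...\n\nAnswer: ..."

# What I have now: padding=True plus max_length but no truncation,
# which is what triggers the warning (max_length is ignored in that case).
current = tokenizer(text_j, padding=True, max_length=512)

# Option A: dynamic padding, with truncation so max_length is actually used.
option_a = tokenizer(text_j, padding=True, truncation=True, max_length=512)

# Option B: pad every example to the fixed max_length.
option_b = tokenizer(text_j, padding="max_length", truncation=True, max_length=512)
```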
Hi, I used your RL model (https://huggingface.co/mnoukhov/llama-7b-se-rl-peft) to test the SuperGLUE benchmark using the lm-evaluation-harness. The results are as follows:
Original Model Results:

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| boolq | 1 | acc | 0.7642 | ± | 0.0074 |
| cb | 1 | acc | 0.5536 | ± | 0.0670 |
| | | f1 | 0.4248 | | |
| copa | 0 | acc | 0.8800 | ± | 0.0500 |
| multirc | 1 | acc | 0.0084 | ± | 0.0018 |
| record | 0 | f1 | 0.9119 | ± | 0.0032 |
| | | em | 0.9044 | ± | 0.0032 |
| rte | 0 | acc | 0.6282 | ± | 0.0301 |
| wic | 0 | acc | 0.4953 | ± | 0.0198 |
| wsc | 0 | acc | 0.5673 | ± | 0.0474 |
However, when I trained my own RL model using the following command:

```
accelerate launch --multi_gpu --num_machines 1 --num_processes 8 rl_training.py --model_name=【mnoukhov se model】 --reward_model_name=【mnoukhov rm model】 --adafactor=False --tokenizer_name=【mnoukhov se model】 --save_freq=100 --output_max_length=128 --batch_size=8 --gradient_accumulation_steps=8 --batched_gen=True --ppo_epochs=4 --seed=0 --learning_rate=1.4e-5 --early_stopping=True --output_dir=llama-se-rl-finetune-128-8-8-1.4e-5_adam
```
After one day of training, the result of my own RL model (llama-se-rl-finetune-128-8-8-1.4e-5_adam, trained up to step 1300) is as follows:
Trained Model Results:

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| boolq | 1 | acc | 0.3783 | ± | 0.0085 |
| cb | 1 | acc | 0.4107 | ± | 0.0663 |
| | | f1 | 0.1941 | | |
| copa | 0 | acc | 0.5500 | ± | 0.0500 |
| multirc | 1 | acc | 0.0031 | ± | 0.0018 |
| record | 0 | f1 | 0.1186 | ± | 0.0032 |
| | | em | 0.1151 | ± | 0.0032 |
| rte | 0 | acc | 0.5271 | ± | 0.0301 |
| wic | 0 | acc | 0.5000 | ± | 0.0198 |
| wsc | 0 | acc | 0.6346 | ± | 0.0474 |
It appears that the training of the model did not achieve the desired performance.
Can you share the logs from the RL training? E.g. mean rewards and objective/kl are usually helpful metrics to look at to see if the model learned something.
Hi, here are the logs: https://wandb.ai/630191510/trl/runs/eb02d7zh?workspace=user-630191510
Looks like there was an issue at step ~50: the reward went down significantly. Could you try a different seed or a lower learning rate? Also, we added some stability measures in the latest release, so try updating `trl`.
Hello! Here is the latest log: https://wandb.ai/630191510/trl/runs/kj2kkbq9?workspace=user-630191510
The loss curve looks normal, and the accuracy in SuperGLUE is also normal.
I suppose the difference in ppo_trainer.py between the two branches is the key.
That's great! So updating helped?
Can you tell me the differences between branch d78d91788017a34ba2536fc1dc5f6461e3533089 and branch e448bb69f05c8e88f88fd204b2b72ef46b872bc5 in terms of PPO training? Their ppo_trainer.py files are very similar.
Great work!
I've reproduced the whole StackLLaMA pipeline using the changes in #398 #399 #400
Here is the corresponding wandb report
A couple of notes:
- The base model is `huggyllama/llama-7b`
- I've also published my adapter weights on the Hub:
  - https://huggingface.co/mnoukhov/llama-7b-se-peft
  - https://huggingface.co/mnoukhov/llama-7b-se-rm-peft
  - https://huggingface.co/mnoukhov/llama-7b-se-rl-peft
Use the `merge_peft` script in #398 to merge `huggyllama/llama-7b` and `llama-7b-se-peft` to make `llama-7b-se`. Then merge `llama-7b-se` with `llama-7b-se-rm-peft` to make the reward model, and with `llama-7b-se-rl-peft` to make StackLLaMA.
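If you just want to try the RL model without merging, loading the adapter directly on top of the base model should also work. A rough sketch (the prompt format and generation settings here are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and attach the published RL adapter from the Hub.
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "mnoukhov/llama-7b-se-rl-peft")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

prompt = "Question: How do I merge two dicts in Python?\n\nAnswer: "
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```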