huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Reproducing StackLLaMA #401

Closed mnoukhov closed 1 year ago

mnoukhov commented 1 year ago

I've reproduced the whole StackLLaMA pipeline using the changes in #398, #399, and #400.

Here is the corresponding wandb report

A couple of notes:

I've also published my adapter weights on the hub:

- https://huggingface.co/mnoukhov/llama-7b-se-peft
- https://huggingface.co/mnoukhov/llama-7b-se-rm-peft
- https://huggingface.co/mnoukhov/llama-7b-se-rl-peft

Use the merge_peft script in #398 to merge huggyllama/llama-7b and llama-7b-se-peft to make llama-7b-se. Then merge llama-7b-se with llama-7b-se-rm-peft to make the reward model, and merge llama-7b-se with llama-7b-se-rl-peft to make StackLLaMA.
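For reference, this is roughly what each merge does with the peft API (the merge_peft script in #398 is the canonical way; treat this as a sketch, not the script itself — only the model names above are taken from the thread):

```python
# Merge a LoRA adapter into its base model and save the result.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "mnoukhov/llama-7b-se-peft")
merged = model.merge_and_unload()  # fold the LoRA weights into the base weights

merged.save_pretrained("llama-7b-se")
AutoTokenizer.from_pretrained("huggyllama/llama-7b").save_pretrained("llama-7b-se")
```

Repeating the same pattern with llama-7b-se as the base and the rm/rl adapters gives the reward model and StackLLaMA.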

younesbelkada commented 1 year ago

Amazing work @mnoukhov!! Will review the PRs asap. As a side note, it seems that I can't see the figures on the wandb report :/
Also, could you confirm which versions of the libraries you used? Thanks a lot!

mnoukhov commented 1 year ago

Sorry, I moved the runs to the workspace so the graphs should be fixed.

My libraries are

accelerate==0.18.0
evaluate==0.4.0
huggingface-hub==0.13.3
torch==2.0.0
transformers==4.28.1

and the latest version of trl built from source

dh2shin commented 1 year ago

Hi @mnoukhov , could you explain when/what exactly to merge? I'm following the readme and would really appreciate your help. Specifically, when you say to merge huggyllama/llama-7b and llama-7b-se-peft to create llama-7b-se, is llama-7b-se-peft referring to the model outputted after running Step 1 (with huggyllama/llama-7b)? And when you say then to merge llama-7b-se with llama-7b-se-rm-peft to create the reward model, does llama-7b-se-rm-peft refer to the model outputted after running Step 2 (with llama-7b-se)?

mnoukhov commented 1 year ago

That's correct.

- base huggyllama/llama-7b + llama-7b-se-peft = llama-7b-se
- base llama-7b-se + llama-7b-se-rm-peft = llama-7b-se-rm
- base llama-7b-se + llama-7b-se-rl-peft = llama-7b-se-rl

dh2shin commented 1 year ago

Do you mind sharing the arguments / shell script you used for each step? I'm using what's listed in the repo and running into memory issues, which seems odd given PEFT + LoRA.

mnoukhov commented 1 year ago

I use the same hyperparameters as those listed, with the slight change that I am running on 4 GPUs instead of 8, so I change gradient accumulation steps from 4 to 8. I find that I need ~40GB of GPU memory for the RL finetuning step, but it fluctuates and can get as high as 60GB per GPU. I'm currently working on #436, which should reduce memory requirements enough to allow training on 32GB GPUs.
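In other words, the number of GPUs times the gradient accumulation steps stays constant, so the effective batch size matches the original 8-GPU setup. A quick sanity check (the per-device batch size of 8 is illustrative, not necessarily the script's default):

```python
# Effective batch size = per-device batch size x number of GPUs x grad accumulation steps.
per_device_batch = 8  # illustrative value

original_setup = per_device_batch * 8 * 4  # 8 GPUs, gradient accumulation 4
adjusted_setup = per_device_batch * 4 * 8  # 4 GPUs, gradient accumulation 8
assert original_setup == adjusted_setup    # both give the same effective batch size
```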

dh2shin commented 1 year ago

I'm experiencing something similar. Roughly ~40GB of GPU memory is also needed for the second step (training the reward model), right?

mnoukhov commented 1 year ago

You can check the exact memory and compute usage by looking at the runs linked in the wandb report (e.g. my RLHF run shows the memory consumption in the "System" charts).
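If you prefer checking it locally rather than on wandb, one option (plain PyTorch, not something the example scripts do) is to print the peak allocated memory per device:

```python
import torch

# Peak GPU memory allocated by tensors on the current device since the start of the
# process (or since the last torch.cuda.reset_peak_memory_stats() call), in GiB.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated GPU memory: {peak_gb:.1f} GiB")
```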

Given that I've essentially repro'd the results and my PRs have been merged, I'm closing this issue and continuing in #471 with a repro using the more compute-efficient multi-adapter paradigm. Feel free to keep commenting about the repro and I'll try to respond.

dh2shin commented 1 year ago

Hi Michael, continuing the conversation from #401 here. When I try to run the supervised finetuning script out of the box, I get the following warning messages:

Training...
Using pad_token, but it is not set yet.
UserWarning: You passed `packing=True` to the SFTTrainer, and you are training your model with `max_steps` strategy. The dataset will be iterated until the `max_steps` are reached.
FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
Token indices sequence length is longer than the specified maximum sequence length for this model (4899 > 2048). Running this sequence through the model will result in indexing errors
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

I'm getting a sudden spike in train loss around 800 global steps in, and I'm wondering if these warning messages have anything to do with it. Any ideas?

mnoukhov commented 1 year ago

None of those messages are related. There are other open issues about instability in training, and without more info it's hard to diagnose the problem. If you want advice specific to your situation, it would help to share metrics like reward, KL, etc.

If you just want to try some things out, #462 found that setting a larger minibatch size and a larger target KL could help.
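In PPOConfig those knobs look roughly like this (values are illustrative, and argument names correspond to the trl version used above, so double-check them against your install):

```python
from trl import PPOConfig

config = PPOConfig(
    model_name="llama-7b-se",
    learning_rate=1.4e-5,
    batch_size=8,
    mini_batch_size=8,  # larger PPO minibatch, per the suggestion in #462
    target=10.0,        # larger target KL for the adaptive KL controller
)
```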

dh2shin commented 1 year ago

@mnoukhov Hi Michael, when using the reward model to build the sentiment pipeline in rl_training.py, the output rewards/scores vary drastically depending on the batch_size used in sent_kwargs. I am wondering if you have investigated this issue in more depth.

I'm also wondering whether the padding strategy in reward_modeling.py should be True or max_length. Currently, I get the following warning: UserWarning: max_length is ignored when padding=True and there is no truncation strategy. To pad to max length, use padding='max_length'.
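For what it's worth, that warning is about the tokenizer call: with padding=True and no truncation strategy, max_length is silently ignored. A hedged sketch of a call that avoids it (the inputs and model name are hypothetical, not taken from reward_modeling.py):

```python
from transformers import AutoTokenizer

# Hypothetical example; LLaMA tokenizers ship without a pad token, so set one.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token

tokenized = tokenizer(
    "Question: how do I sort a list? Answer: use sorted().",
    padding="max_length",  # pad every example to max_length
    truncation=True,       # required, otherwise max_length is ignored with padding
    max_length=512,
)
```

One common culprit when pipeline scores change with batch_size is padding, since batching changes how much padding each sequence receives, so pinning down padding/truncation is a reasonable first thing to check.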

wangzhao88 commented 1 year ago

Hi, I used your RL model (https://huggingface.co/mnoukhov/llama-7b-se-rl-peft) to run the SuperGLUE benchmark with the lm-evaluation-harness. The results are as follows:

Original Model Results:

| Task | Version | Metric | Value | | Stderr |
|---------|---------|--------|--------|---|--------|
| boolq | 1 | acc | 0.7642 | ± | 0.0074 |
| cb | 1 | acc | 0.5536 | ± | 0.0670 |
| | | f1 | 0.4248 | | |
| copa | 0 | acc | 0.8800 | ± | 0.0500 |
| multirc | 1 | acc | 0.0084 | ± | 0.0018 |
| record | 0 | f1 | 0.9119 | ± | 0.0032 |
| | | em | 0.9044 | ± | 0.0032 |
| rte | 0 | acc | 0.6282 | ± | 0.0301 |
| wic | 0 | acc | 0.4953 | ± | 0.0198 |
| wsc | 0 | acc | 0.5673 | ± | 0.0474 |

However, when I trained my own RL model using the following command:

'''
accelerate launch --multi_gpu --num_machines 1 --num_processes 8 rl_training.py \
    --model_name=【mnoukhov se model】 --reward_model_name=【mnoukhov rm model】 \
    --adafactor=False --tokenizer_name=【mnoukhov se model】 --save_freq=100 \
    --output_max_length=128 --batch_size=8 --gradient_accumulation_steps=8 \
    --batched_gen=True --ppo_epochs=4 --seed=0 --learning_rate=1.4e-5 \
    --early_stopping=True --output_dir=llama-se-rl-finetune-128-8-8-1.4e-5_adam
'''

After one day of training, the result of my own RL model (llama-se-rl-finetune-128-8-8-1.4e-5_adam, trained up to step 1300) is as follows:

Trained Model Results:

| Task | Version | Metric | Value | | Stderr |
|---------|---------|--------|--------|---|--------|
| boolq | 1 | acc | 0.3783 | ± | 0.0085 |
| cb | 1 | acc | 0.4107 | ± | 0.0663 |
| | | f1 | 0.1941 | | |
| copa | 0 | acc | 0.5500 | ± | 0.0500 |
| multirc | 1 | acc | 0.0031 | ± | 0.0018 |
| record | 0 | f1 | 0.1186 | ± | 0.0032 |
| | | em | 0.1151 | ± | 0.0032 |
| rte | 0 | acc | 0.5271 | ± | 0.0301 |
| wic | 0 | acc | 0.5000 | ± | 0.0198 |
| wsc | 0 | acc | 0.6346 | ± | 0.0474 |

It appears that the training of the model did not achieve the desired performance.


lvwerra commented 1 year ago

Can you share the logs from the RL training? E.g. mean rewards and objective/kl are usually helpful metrics to look at to see if the model learned something.

wangzhao88 commented 1 year ago

> Can you share the logs from the RL training? E.g. mean rewards and objective/kl are usually helpful metrics to look at to see if the model learned something.

Hi, here are the logs: https://wandb.ai/630191510/trl/runs/eb02d7zh?workspace=user-630191510

lvwerra commented 1 year ago

Looks like there was an issue at step ~50: the reward went down significantly. Could you try a different seed or a lower learning rate? Also, we added some stability measures in the latest release, so try updating trl.
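Concretely, a retry along those lines might look like this (values are only illustrative; the script's --seed and --learning_rate flags presumably end up in PPOConfig):

```python
from trl import PPOConfig, set_seed

set_seed(1)  # try a different seed than 0

config = PPOConfig(
    model_name="llama-7b-se",
    learning_rate=7e-6,  # roughly half of 1.4e-5, as one thing to try
    seed=1,
)
```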

wangzhao88 commented 1 year ago

Hello! Here is the latest log: https://wandb.ai/630191510/trl/runs/kj2kkbq9?workspace=user-630191510

The loss curve looks normal, and the accuracy in SuperGLUE is also normal.

I suppose the difference in ppo_trainer.py between the two branches is what was crucial.

lvwerra commented 1 year ago

That's great! So updating helped?

wangzhao88 commented 1 year ago

Can you tell me the differences between commit d78d91788017a34ba2536fc1dc5f6461e3533089 and commit e448bb69f05c8e88f88fd204b2b72ef46b872bc5 in terms of PPO training? Their ppo_trainer.py files look very similar.

zhangfudiyi commented 1 year ago

It's great work!