voidful / TextRL

Implementation of ChatGPT-style RLHF (Reinforcement Learning with Human Feedback) on any generation model in Hugging Face's transformers (bloomz-176B/bloom/gpt/bart/T5/MetaICL)
MIT License

unfreeze_layer_from_past parameter #25

Open JhonDan1999 opened 1 year ago

JhonDan1999 commented 1 year ago

Nice repo!!!

It seems that the default parameter for the policy freezes all the layers of the language model and only updates the lm_head. I tried the provided flan-T5 example here: https://colab.research.google.com/drive/1DYHt0mi6cyl8ZTMJEkMNpsSZCCvR4jM1?usp=sharing

When I changed the value of unfreeze_layer_from_past to 1 so that the weights of flan-T5's final layer are also updated, like this: [screenshot: 2023-09-20 1:04 PM]
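
For reference, the change presumably amounts to passing the parameter when constructing the actor. A minimal sketch, assuming the TextRLActor/TextRLEnv setup shown in the repo's README (argument names and the observation format may differ across TextRL versions):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from textrl import TextRLActor, TextRLEnv

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
env = TextRLEnv(model, tokenizer, observation_input=[{"input": "a prompt"}])

# Default is unfreeze_layer_from_past=0, which freezes every transformer
# layer and trains only lm_head; 1 should additionally unfreeze the final layer.
actor = TextRLActor(env, model, tokenizer, unfreeze_layer_from_past=1)
```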

the behavior changed and the actor started generating empty text: [screenshot: 2023-09-20 1:08 PM]

Also, after training it gave me empty text:

[screenshot: 2023-09-20 1:09 PM]

What is the reason for this behavior?

NOTE: I did not change anything else in the flan-T5 code example.

barthelemymp commented 1 year ago

I observed the same thing. I also tried penalizing generation of the '_' token directly in the reward function. Unfortunately, it does not seem to learn to stop generating the blank token...
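
For anyone trying the same thing, here is a minimal sketch of that penalty, assuming TextRL's documented TextRLEnv.get_reward(input_item, predicted_list, finish) hook. The '▁' check targets T5's SentencePiece blank token, and the penalty weight is arbitrary:

```python
from textrl import TextRLEnv

class BlankPenaltyEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        reward = [0]
        if finish:
            tokens = predicted_list[0]  # tokens generated for the first sample
            # count tokens that are empty or consist only of SentencePiece blanks
            blanks = sum(1 for tok in tokens if tok.strip("▁") == "")
            # reward non-blank tokens and subtract a (tunable) penalty per blank
            reward = [(len(tokens) - blanks) - 2.0 * blanks]
        return reward
```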

voidful commented 1 year ago

Hi all, the issue is probably caused by https://github.com/huggingface/transformers/blob/bffac926ca6bc6c965a92bfbfd00c567a2c0fb90/src/transformers/models/t5/modeling_t5.py#L1147C8-L1147C8

It adds a position_bias after each layer output, so the newly initialized model performs badly.
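
For context, a small runnable check (a sketch, assuming a recent transformers version) showing that this relative position bias lives only in the first T5 block, while every later block reuses the bias computed there, which is why the layer stack cannot simply be split and partially retrained:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
for i, block in enumerate(model.decoder.block):
    # only block 0 owns the relative position bias; later blocks reuse it
    has_bias = block.layer[0].SelfAttention.has_relative_attention_bias
    print(f"decoder block {i}: has_relative_attention_bias={has_bias}")
# Expected: True for block 0 only; the other blocks receive the shared position_bias.
```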

daniellucs2002 commented 8 months ago

Hey! Did you guys figure out a solution to this problem? Thanks!

JhonDan1999 commented 8 months ago

> Hey! Did you guys figure out a solution to this problem? Thanks!

Unfortunately not yet. I spent a lot of time trying to find a way to do it with this library, but I ended up setting it aside (at least for now).