voidful / TextRL

Implementation of ChatGPT-style RLHF (Reinforcement Learning from Human Feedback) on any generation model in Hugging Face's transformers (bloomz-176B/bloom/gpt/bart/T5/MetaICL)
MIT License

Are there any examples for T5 or BART? Why do T5 and BART give the same output before/after training? #15

Closed YuXiangLin1234 closed 1 year ago

YuXiangLin1234 commented 1 year ago

Hello,

I used this package to fine-tune a sequence-to-sequence LM, but the predictions after PPO training are always the same as the predictions before training.

What I tried was to modify the Colab sample code elon_musk_gpt.ipynb: change the model name and switch from AutoModelWithLMHead to AutoModelForSeq2SeqLM.

[screenshot]
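For reference, the model-loading change described above might look roughly like this (the checkpoint name below is just a placeholder, not the one used in the notebook):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint; any seq2seq model (T5/BART) from the Hub could go here.
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# AutoModelForSeq2SeqLM replaces AutoModelWithLMHead from the original GPT notebook
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()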

When I print out the decoded sentences during training, I can see that the predicted sentences change at each iteration, yet the predictions after PPO training are always the same as the predictions before training. Is there anything I need to take care of? Or is this package not applicable to sequence-to-sequence LMs?

Prediction before training:

[screenshot]

Prediction during iteration:

[screenshot]

Prediction after training (obtained by running the code below):

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=300,                    # total number of training steps
    eval_n_steps=None,            # evaluate by episodes rather than by steps
    eval_n_episodes=1,            # episodes per evaluation
    train_max_episode_len=100,    # cap on episode length during training
    eval_interval=10,             # evaluate every 10 steps
    outdir='elon_musk_dogecoin',  # checkpoints are written here
)

agent.load("./elon_musk_dogecoin/best")  # load the best checkpoint found during evaluation
actor.predict(observaton_list[0])  # <------- prediction after training


[screenshot]
voidful commented 1 year ago

Hi

Thank you for your feedback. I found that it was related to the optimizer, so I have released a new version that enables optimizer settings.

Here is an example using flan-t5: https://colab.research.google.com/drive/1DYHt0mi6cyl8ZTMJEkMNpsSZCCvR4jM1?usp=sharing
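Roughly, the setup follows the usual TextRL pattern, just with a seq2seq model. A sketch only; the exact argument names may differ between versions, the checkpoint and prompt are illustrative, and the reward below is a placeholder:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from textrl import TextRLEnv, TextRLActor

# Illustrative checkpoint; the linked notebook uses a flan-t5 model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

observaton_list = [{"input": "explain dogecoin to me"}]  # illustrative prompt

class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # Placeholder reward: score every generated candidate equally;
        # replace with a real reward function or reward model.
        return [1.0] * len(predicted_list)

env = MyRLEnv(model, tokenizer, observation_input=observaton_list)
# The optimizer argument is the newly exposed setting (name assumed here; check the current README)
actor = TextRLActor(env, model, tokenizer, optimizer='adamw')
agent = actor.agent_ppo(update_interval=10, minibatch_size=2, epochs=10)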

YuXiangLin1234 commented 1 year ago

Thank you very much!