huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Seq2seq model with ppo_trainer samples strange output! #1633

Closed: sajastu closed this issue 3 months ago

sajastu commented 5 months ago

Hi,

I'm using PPO with BART (with some slight changes I made to ppo_trainer to adapt it for seq2seq modeling). The general idea I followed is sketched below.
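
Roughly, the setup looks like this (a simplified sketch, not my exact code: the checkpoint name and config values are placeholders, and I'm showing trl's stock `AutoModelForSeq2SeqLMWithValueHead` instead of my modified trainer):

```python
from transformers import AutoTokenizer
from trl import AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer

model_name = "facebook/bart-large-cnn"  # placeholder checkpoint

# trl wraps seq2seq models with a value head, so BART plugs into PPOTrainer
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tiny batch sizes just to keep the sketch runnable
config = PPOConfig(model_name=model_name, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
```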

Problem: the model (BART-based) starts generating weird, nonsensical text after the first few training samples have been processed.

Based on related issues and the solutions proposed in them, I settled on fixed generation kwargs. Here are the generation kwargs used for sampling:

```python
generation_kwargs = {
    "min_length": -1,    # disable the minimum-length constraint
    "top_k": 0,          # 0 (an int) disables top-k filtering
    "top_p": 1.0,        # 1.0 disables nucleus filtering (pure sampling)
    "do_sample": True,   # sample instead of greedy/beam decoding
    "max_length": 128,
    "eos_token_id": -1,  # don't stop early on EOS, as trl's examples suggest
}
```
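
For context, these kwargs feed into the sampling step of the training loop, roughly like this (continuing the sketch above; the query text and the constant reward are placeholders for my real data and reward model):

```python
import torch

# placeholder query; in reality these come from my dataset
query = "placeholder source document to summarize"
query_tensors = [tokenizer(query, return_tensors="pt").input_ids.squeeze(0)]

# sample responses with the kwargs above (the list length must match
# config.batch_size)
response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

# placeholder reward: one scalar tensor per sample; a real reward model goes here
rewards = [torch.tensor(1.0) for _ in responses]

stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```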

Here's a sample screenshot:

[screenshot: samples of the strange generated output]

Any suggestions on how to fix this?

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.