I'm using PPO with BART (via some slight changes I made to `ppo_trainer` to adapt it for seq2seq modeling). The general idea I followed:
1. Sample outputs from the model and from the reference model: `y` and `y_b`.
2. Use `y` and `y_b` (along with the gold data) to compute some kind of reward signal.
3. Feed `y` into `ppo_trainer`'s `step` function, where the sampled response is passed through both the model and the reference model to get log-probs; the rest of the computation stays as-is (KL divergence, etc.). A rough sketch of this loop is given after the list.
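Concretely, a minimal sketch of that loop, assuming a trl-style `PPOTrainer` (with my seq2seq patches); `reward_fn` and `dataloader` are placeholders for my own code, not trl built-ins:

```python
import torch

for batch in dataloader:
    query_tensors = batch["input_ids"]

    # Sample y from the policy and y_b from the frozen reference model.
    response_tensors = ppo_trainer.model.generate(query_tensors, **generation_kwargs)
    ref_response_tensors = ppo_trainer.ref_model.generate(query_tensors, **generation_kwargs)

    # Turn y, y_b, and the gold targets into one scalar reward per sample.
    rewards = [
        torch.tensor(reward_fn(y, y_b, gold))
        for y, y_b, gold in zip(response_tensors, ref_response_tensors, batch["labels"])
    ]

    # step() re-scores y under the policy and the reference model to get
    # log-probs, then runs the PPO update (KL penalty etc. as usual).
    stats = ppo_trainer.step(list(query_tensors), list(response_tensors), rewards)
```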
Problem: the (BART-based) model starts generating weird/nonsensical text after only a few training samples have been visited.
Based on some related issues and the solutions proposed there, I settled on a fixed set of generation kwargs. Here are the generation kwargs used for sampling:
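(The block below is a representative configuration, not necessarily the exact values used; `tokenizer` is assumed to be the BART tokenizer.)

```python
generation_kwargs = {
    "do_sample": True,                       # sample instead of greedy/beam decoding
    "top_k": 0,                              # disable top-k filtering
    "top_p": 1.0,                            # disable nucleus filtering
    "min_length": -1,                        # don't force a minimum length
    "max_new_tokens": 64,                    # arbitrary cap on response length
    "pad_token_id": tokenizer.pad_token_id,  # BART has a real pad token
}
```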
A sample screenshot of the degenerate output was attached here (image not preserved).
Any suggestions on how to fix this?