I'm using PPO with BART (via some slight changes I made to `ppo_trainer` to adapt it for seq2seq modeling). The general idea I followed:
1. Sample outputs from the model and from the reference model: `y` and `y_b`.
2. Use `y` and `y_b` (along with the gold data) to compute some kind of reward signal.
3. Feed `y` into `ppo_trainer`'s `step` function, where the sampled response is passed through both the model and the reference model to get log-probs; the rest of the computation stays as-is (KL divergence, etc.). A rough sketch of this loop is given after the list.
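Concretely, a minimal sketch of that loop, assuming a trl-style `PPOTrainer` (with my seq2seq patches); `reward_fn` and `dataloader` are placeholders for my own code, not trl built-ins:

```python
import torch

for batch in dataloader:
    query_tensors = batch["input_ids"]

    # Sample y from the policy and y_b from the frozen reference model.
    response_tensors = ppo_trainer.model.generate(query_tensors, **generation_kwargs)
    ref_response_tensors = ppo_trainer.ref_model.generate(query_tensors, **generation_kwargs)

    # Turn y, y_b, and the gold targets into one scalar reward per sample.
    rewards = [
        torch.tensor(reward_fn(y, y_b, gold))
        for y, y_b, gold in zip(response_tensors, ref_response_tensors, batch["labels"])
    ]

    # step() re-scores y under the policy and the reference model to get
    # log-probs, then runs the PPO update (KL penalty etc. as usual).
    stats = ppo_trainer.step(list(query_tensors), list(response_tensors), rewards)
```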
Problem: the (BART-based) model starts generating weird/nonsensical text after only a few training samples have been visited.
Based on some related issues and the solutions proposed there, I settled on a fixed set of generation kwargs. Here are the generation kwargs used for sampling:
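(The block below is a representative configuration, not necessarily the exact values used; `tokenizer` is assumed to be the BART tokenizer.)

```python
generation_kwargs = {
    "do_sample": True,                       # sample instead of greedy/beam decoding
    "top_k": 0,                              # disable top-k filtering
    "top_p": 1.0,                            # disable nucleus filtering
    "min_length": -1,                        # don't force a minimum length
    "max_new_tokens": 64,                    # arbitrary cap on response length
    "pad_token_id": tokenizer.pad_token_id,  # BART has a real pad token
}
```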
A sample screenshot of the degenerate output was attached here (image not preserved).
Any suggestions on how to fix this?