mingkaid / rl-prompt

Accompanying repo for the RLPrompt paper
MIT License

About the RL training #34

Closed · FayeXXX closed this issue 11 months ago

FayeXXX commented 11 months ago

I am interested in your code, and thank you for your nice work. I have several questions after running it:

  1. About the RL training process

The loss is around 7000 to 12000 and the curve doesn't converge. When I check the output prompts, I find that some tokens repeat several times in the generated prompts. When I use these prompts as input at test time, the results are terrible.

I guess there might be some mistake in the training process, but I have no idea how to fix it. I have tried changing the hyperparameters, but that didn't work.

  2. Warning: Empty candidate sentence detected; setting raw BERTscores to 0

I don't know why the BERTScores keep being set to 0. Since BERTScore is part of the reward, I wonder whether this is a reason why I can't get the expected results.

Your reply will be greatly appreciated. Thank you.

mingkaid commented 11 months ago

Hi there,

  1. Could you provide more information about the task you are working on? Is it one of our example tasks, or are you applying the method to your own task? In RL, the loss does not indicate convergence; what matters is whether the reward increases.
  2. Sometimes a model may simply generate nothing for a given prompt. If your reward function assigns appropriately low scores to such outputs, they should get filtered out during training (see the sketch below).
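
Here is a minimal sketch of that behavior, assuming the bert_score package; the threshold filtering at the end is purely illustrative and not part of this repo's code:

```python
# Illustrative only: shows why an empty generation triggers the
# "Empty candidate sentence detected" warning and receives a zero score.
from bert_score import score

candidates = ["the food was great", ""]   # second generation is empty
references = ["the food was terrible"] * 2

# bert_score prints the warning for the empty candidate and sets its
# raw score to 0 before returning.
P, R, F1 = score(candidates, references, lang="en")
rewards = F1.tolist()
print(rewards)   # the empty candidate's reward is 0

# Hypothetical filtering step (not from this repo): drop low-reward
# samples so they do not dominate the policy update.
MIN_REWARD = 0.1
kept = [(c, r) for c, r in zip(candidates, rewards) if r >= MIN_REWARD]
print(kept)
```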

I am closing this issue now because it is a research discussion. Feel free to follow up if you'd like to discuss further.

FayeXXX commented 9 months ago

Thank you for your reply. I am working on TST with Yelp and my own dataset; first I am trying to reproduce the results in your paper. In my experiments the reward increases, but the results are still bad. I notice that in your paper you shape the reward from [0, 1] to [-20, 80], but in module_helpers.py, lines 34-37, the defaults are:

    reward_shaping_old_min: float = 0
    reward_shaping_old_max: float = 100
    reward_shaping_new_min: float = -10
    reward_shaping_new_max: float = 10

Could that inconsistent scale be hurting the final result?
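
For concreteness, here is a minimal sketch of the affine rescaling those four parameters define (shape_reward is a hypothetical helper for illustration, not the repo's actual implementation):

```python
def shape_reward(reward: float,
                 old_min: float, old_max: float,
                 new_min: float, new_max: float) -> float:
    """Linearly map a reward from [old_min, old_max] to [new_min, new_max]."""
    scale = (new_max - new_min) / (old_max - old_min)
    return (reward - old_min) * scale + new_min

# Shaping described in the paper, as quoted above: [0, 1] -> [-20, 80]
print(shape_reward(0.5, 0, 1, -20, 80))    # 30.0
# Defaults in module_helpers.py, as quoted above: [0, 100] -> [-10, 10]
print(shape_reward(50, 0, 100, -10, 10))   # 0.0
```

With the paper's range a mid-scale reward of 0.5 maps to 30, while with the config defaults a mid-scale reward of 50 maps to 0, so the two settings scale rewards quite differently.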

FayeXXX commented 9 months ago

By the way, I'm wondering which hyperparameters I should tune to get better results. Could you please share a list of parameters worth tuning?