mingkaid / rl-prompt

Accompanying repo for the RLPrompt paper
MIT License

Why does this method need so many steps? #39

Closed A11en0 closed 8 months ago

A11en0 commented 8 months ago

Greetings, and thank you for your excellent work! I read the paper thoroughly and tried to reproduce the results from the source code. In doing so, I found that the method needs 12,000 steps (equivalent to 375 epochs) to converge. Why does the optimization take so long, and is there a faster alternative for completing this optimization loop?

A11en0 commented 8 months ago

Another question: can I use early stopping in the iterative reward loop?

MM-IR commented 8 months ago
  1. This is a general property of RL: it takes many steps to converge.

  2. Faster alternatives: if you find a better approach (e.g., a more sample-efficient RL algorithm), it would work here as well.

  3. You can, though I'm not sure exactly what you mean by early stopping. You can stop at any point once you find the training reward is good enough; by then the prompt will usually have reached some level of competency.
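To make point 3 concrete, here is a minimal sketch of a patience-based plateau check on the training reward. This is a generic, hypothetical helper, not part of the rl-prompt codebase: the function name, `patience`, and `min_delta` are all assumptions for illustration.

```python
# Hypothetical early-stopping check on the logged training reward.
# Not from the rl-prompt repo -- a generic patience-based plateau detector.

def should_stop(rewards, patience=500, min_delta=0.01):
    """Return True if the best reward has not improved by at least
    `min_delta` within the last `patience` steps."""
    if len(rewards) <= patience:
        return False  # not enough history yet to judge a plateau
    best_recent = max(rewards[-patience:])
    best_before = max(rewards[:-patience])
    return best_recent < best_before + min_delta
```

In a training loop you would call this after each step on the running reward history and break out when it returns True, keeping the best prompt found so far, since the reward curve in RL is noisy and the final step is not necessarily the best one.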

I am closing this now!