I saw you mentioned a prompt-dependent value function at https://github.com/kvablack/ddpo-pytorch/issues/7#issuecomment-1712920565. By chance, I happen to be using DDPO for related optimizations. Consider the ideal situation, where there is only one prompt and its corresponding reward function. Even then, I found that in the early stages of training the reward mean fluctuates heavily, even if I increase the training batch size or reduce the learning rate, although the reward mean does rise overall by the end. Are there any optimization techniques to make single-prompt optimization more stable? Any suggestions or insights would be greatly appreciated.
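For context, I understand the prompt-dependent value function to be something like the running per-prompt reward baseline sketched below (the class name and details are my own guess, not the actual ddpo-pytorch implementation): each batch of rewards is whitened against the running statistics for the single prompt before being used as advantages.

```python
import numpy as np

class RunningRewardBaseline:
    """Running mean/std of rewards for a single prompt, used to
    normalize raw rewards into advantages (my own sketch)."""

    def __init__(self, eps: float = 1e-6):
        self.rewards = []  # all rewards seen so far for this prompt
        self.eps = eps

    def update(self, rewards: np.ndarray) -> np.ndarray:
        # Accumulate the new batch, then whiten it against the
        # running statistics over everything seen so far.
        self.rewards.extend(rewards.tolist())
        mean = np.mean(self.rewards)
        std = np.std(self.rewards) + self.eps
        return (rewards - mean) / std

# Usage per training batch:
# advantages = baseline.update(batch_rewards)
```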