kvablack / ddpo-pytorch

DDPO for finetuning diffusion models, implemented in PyTorch with LoRA support
MIT License

prompt-dependent value function optimization #15

Open hkunzhe opened 1 year ago

hkunzhe commented 1 year ago

I saw you mentioned a prompt-dependent value function at https://github.com/kvablack/ddpo-pytorch/issues/7#issuecomment-1712920565. By chance, I happen to be using DDPO for a related optimization. Consider the ideal situation, where there is only one prompt and its corresponding reward function. I still find that in the early stages of training the reward mean fluctuates a lot, even if I increase the training batch size or reduce the learning rate, although the reward mean does rise overall by the end. Are there any optimization techniques to make training on a single prompt more stable? Any suggestions or insights would be greatly appreciated.
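For context on what "prompt-dependent" baselining looks like in practice, here is a minimal sketch of per-prompt reward normalization (the repo exposes something along these lines via its per-prompt stat tracking option; the class name, buffer sizes, and fallback behavior below are my own assumptions, not the exact implementation). Each prompt keeps a running buffer of its recent rewards, and new rewards are converted to advantages against that prompt's own mean/std, which tends to reduce the variance of the policy-gradient signal compared to a single global baseline:

```python
import numpy as np
from collections import defaultdict, deque

class PerPromptStatTracker:
    """Track a running buffer of rewards per prompt and normalize new rewards
    against that prompt's own mean/std to produce lower-variance advantages.
    (Hypothetical sketch; names and defaults are assumptions.)"""

    def __init__(self, buffer_size=32, min_count=16):
        self.buffer_size = buffer_size
        self.min_count = min_count
        self.stats = defaultdict(lambda: deque(maxlen=buffer_size))

    def update(self, prompts, rewards):
        prompts = np.array(prompts)
        rewards = np.array(rewards, dtype=np.float64)
        advantages = np.empty_like(rewards)
        for prompt in np.unique(prompts):
            mask = prompts == prompt
            self.stats[prompt].extend(rewards[mask].tolist())
            buf = np.array(self.stats[prompt])
            if len(buf) < self.min_count:
                # Until this prompt's buffer fills, fall back to batch-level stats.
                mean, std = rewards.mean(), rewards.std() + 1e-6
            else:
                mean, std = buf.mean(), buf.std() + 1e-6
            advantages[mask] = (rewards[mask] - mean) / std
        return advantages

# Usage sketch: feed advantages (not raw rewards) into the PPO-style loss.
# tracker = PerPromptStatTracker(buffer_size=64)
# advantages = tracker.update(prompts, rewards)
```

With a single prompt this reduces to normalizing by a running mean/std of recent rewards, so early-training fluctuation mostly reflects the noise of the reward itself; larger reward buffers and more samples per gradient step are the usual levers for smoothing it.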