WJ-Fifth opened 1 year ago
Deeply sorry for such a late reply. Indeed, the actor-critic in this version is an 'on-policy' one: it updates the policy based on trajectories sampled from the current policy weights. Since there is no longer any supervision from the ground-truth data, if finetuning runs too long, the policy gradually 'forgets' what it learned from the data and focuses purely on the RL reward. The reward, however, does not include terms to maintain motion quality (it only measures beat alignment). That is why the results degrade when finetuning for more iterations.
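To make the failure mode concrete, here is a toy sketch (not the repo's actual training code; the targets, the 1-D Gaussian policy, and the `sup_weight` knob are all made up for illustration). The reward only measures "beat alignment" (distance to a hypothetical beat target), so a pure on-policy update drifts the policy away from the data behaviour; an optional imitation term pulls it back:

```python
import numpy as np

def finetune(steps, sup_weight, lr=0.1, seed=0):
    """Toy on-policy finetuning of a 1-D Gaussian policy (mean mu).
    reward = beat alignment only; sup_weight weights an imitation
    term toward the data action. Purely illustrative."""
    rng = np.random.default_rng(seed)
    beat_target, data_action = 2.0, 0.0   # hypothetical targets
    mu, sigma = 0.0, 0.5                  # policy starts at the data behaviour
    for _ in range(steps):
        a = mu + sigma * rng.standard_normal()     # sample from current policy
        reward = -(a - beat_target) ** 2           # RL reward: beat alignment only
        grad_logp = (a - mu) / sigma ** 2          # d log N(a|mu,sigma) / d mu
        rl_grad = reward * grad_logp               # REINFORCE-style policy gradient
        sup_grad = -2.0 * (mu - data_action)       # gradient of -(mu - data_action)^2
        mu += lr * (rl_grad + sup_weight * sup_grad)
    return mu

# Without supervision, mu drifts all the way to the beat target;
# with the imitation term it settles between the two objectives.
mu_rl_only = finetune(steps=500, sup_weight=0.0)
mu_regularized = finetune(steps=500, sup_weight=1.0)
```

This mirrors the issue at hand: once the supervised signal is gone, nothing in the objective anchors the policy to the original motion data, so longer finetuning only amplifies the drift.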
Hope it clarifies.
Best
You have built a very good model!
I also achieved very good results when working with your model, but a few points are still unclear to me. Did you experience gradient explosion when implementing the Actor-Critic learning module? My model converged in the first epoch and did show some improvement over GPT. However, during subsequent iterations, L_AC increased significantly and would not converge further, and the visualization results also became very strange.
Looking forward to your reply!