Closed dimeldo closed 8 years ago
I'm closing this. But thanks again for the link! I've read the paper multiple times and this is definitely something I want to implement soon.
Based on the paper, the network must be trained first w/ something like we have here (this repo) and then RL is used to fine-tuned for long term rewards.
I tried to implement this using dpnn's ReinforceCategorical
module and the VRClassReward
criterion. Had some trouble setting up the parameters and inputs/outputs of Seq2Seq layers correctly. If somebody has done this successfully, I'd love to see their code. Else, if somebody simply wants to collaborate, let me know! Thanks.
Very interesting! Thanks for sharing.