Nice, I finally found a project that updates the policy using the reward from the discriminator, matching the algorithm in the GAIL paper. Many other libraries just use the reward from the environment. I was wondering why they do that, and whether optimizing the policy with a reward that is decoupled from the discriminator can really maximize the GAIL objective.
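For concreteness, here is a minimal sketch of what a discriminator-derived reward looks like, assuming a PyTorch discriminator that outputs a logit for "came from the expert" (all names here are hypothetical, not this project's API; the label convention and reward form vary between implementations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Hypothetical GAIL discriminator over (state, action) pairs."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # logit; sigmoid(logit) = P(expert)
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))

def gail_reward(disc: Discriminator, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
    # The environment reward is never consulted: the policy is trained purely
    # on this discriminator-derived signal. Computing it under no_grad is the
    # usual practice, since the RL update (e.g. TRPO/PPO) treats the reward as
    # a scalar and the policy gradient flows through the log-prob term instead.
    with torch.no_grad():
        logits = disc(obs, act)
        # -log(1 - D) written as softplus(logit) for numerical stability;
        # use log D = -softplus(-logit) if your label convention is flipped.
        return F.softplus(logits)
```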