jonathan-laurent / AlphaZero.jl

A generic, simple, and fast implementation of DeepMind's AlphaZero algorithm.
https://jonathan-laurent.github.io/AlphaZero.jl/stable/
MIT License

fix: Ternary Statistics computation #182

Closed AndrewSpano closed 1 year ago

AndrewSpano commented 1 year ago

This PR resolves #177 by changing the value of gamma passed to the rewards_and_redundancy() function for environments that have ternary rewards:

https://github.com/jonathan-laurent/AlphaZero.jl/blob/66eaed8e4d8f60f8d535d949e5447b5c5f821ee8/src/simulations.jl#L292
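For intuition on why this matters: with ternary rewards the outcome of a game is a single +1, 0, or -1 at the end, so any gamma < 1 deflates the reported result. A minimal sketch of the effect (illustrative names only, not the library's API):

    # Discounted return of a trace of per-step rewards (illustrative helper).
    discounted_return(rewards, gamma) =
        sum(gamma^(t - 1) * r for (t, r) in enumerate(rewards))

    rewards = [0.0, 0.0, 0.0, 1.0]    # a win reached after four moves
    discounted_return(rewards, 0.99)  # ≈ 0.970: the win is under-reported
    discounted_return(rewards, 1.0)   # = 1.0: the ternary outcome is preserved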

Specifically, the changes that have been made are:

  1. In the run() function of benchmark.jl, the following line was added to check whether the current environment has ternary rewards; if so, gamma == 1 is used for the report.Evaluation rewards:

    gamma = env.params.ternary_rewards ? 1. : env.params.self_play.mcts.gamma

  2. In training.jl, the evaluation functions (including compare_networks()) now take an extra argument eval_gamma that is used to compute the rewards in rewards_and_redundancy(). This value is computed in learning_step!(), before compare_networks() is invoked (a rough sketch of this plumbing follows the list):

      eval_gamma = env.params.ternary_rewards ? 1. : env.params.self_play.mcts.gamma
      eval_report =
        compare_networks(env.gspec, env.curnn, env.bestnn, ap, handler, eval_gamma)
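A rough sketch of how the extra argument might be threaded down to rewards_and_redundancy() (simplified body with a hypothetical game-playing helper, not the actual code in training.jl):

    function compare_networks(gspec, curnn, bestnn, params, handler, eval_gamma)
        # Play evaluation games between the current and the best network
        # (play_evaluation_games is a hypothetical stand-in for the real logic).
        samples = play_evaluation_games(gspec, curnn, bestnn, params, handler)
        # Use eval_gamma (forced to 1 for ternary rewards) instead of the
        # self-play gamma when summarizing the evaluation games.
        rewards, redundancy = rewards_and_redundancy(samples, gamma=eval_gamma)
        # ... assemble and return the evaluation report ...
    end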
jonathan-laurent commented 1 year ago

This PR includes an extra commit from another PR (GPU implementation of Connect Four). Otherwise, it looks good! I was thinking of a simpler approach, such as simply printing an error when the user specifies ternary_rewards with gamma != 1, but this approach is interesting too.
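For reference, that simpler alternative could look roughly like the following parameter check (a hypothetical helper, not existing library code):

    # Reject inconsistent parameters up front instead of overriding gamma later.
    function check_params(params)
        if params.ternary_rewards && params.self_play.mcts.gamma != 1.0
            error("ternary_rewards=true requires self_play.mcts.gamma == 1")
        end
        return nothing
    end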

AndrewSpano commented 1 year ago

Will create another PR that contains only the changes of the latest commit, closing this one.