kobanium / TamaGo

Computer Go engine using Monte-Carlo Tree Search, written in Python3.
Apache License 2.0

Is it better to use mixed value approximation? #69

Closed. CGLemon closed this issue 11 months ago.

CGLemon commented 1 year ago

In the paper (Appendix D), DeepMind used the mixed value approximation instead of the simple one. It seems that your implementation uses the simple one. In my experience, the simple one works on 9x9, but it breaks down on 19x19. So maybe the mixed value approximation would be the better choice?

    def calculate_completed_q_value(self) -> np.array:

        # ... ('policy' and 'q_value' are prepared in the lines elided here) ...

        sum_prob = np.sum(policy)
        v_pi = np.sum(policy * q_value)

        # Visited children keep their search Q value; unvisited children fall
        # back to the policy-weighted average (the "simple" approximation).
        return np.where(self.children_visits[:self.num_children] > 0, q_value, v_pi / sum_prob)
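For reference, a minimal standalone sketch of this simple approximation might look like the following. The function name and the explicit masking of unvisited children are my own assumptions about what the elided lines do, not TamaGo's actual code.

    import numpy as np

    def completed_q_simple(policy: np.ndarray,
                           q_value: np.ndarray,
                           visits: np.ndarray) -> np.ndarray:
        # Mask of children that have been searched at least once.
        visited = visits > 0
        # Policy-weighted average value over the visited children only
        # (assumed here; the elided lines may handle this differently).
        sum_prob = np.sum(policy[visited])
        v_pi = np.sum(policy[visited] * q_value[visited])
        fallback = v_pi / sum_prob if sum_prob > 0.0 else 0.0
        # Visited children keep their search Q; unvisited ones use the fallback.
        return np.where(visited, q_value, fallback)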
kobanium commented 1 year ago

This is because I didn't understand how to calculate the v_mix value. So TamaGo should use the mixed value approximation. Although I want to change from the simple value to the mixed value approximation, I'm too busy to do it right now. I'll change it when I have enough time.

By the way, in my experiment on Ray, reinforcement learning using the simple value worked well on 19x19 (16 visits/move). So I'm curious why your experiment on 19x19 failed.

CGLemon commented 1 year ago

Oh... maybe your implementation is different from mine. Does Ray rescale the Q value in the Gumbel process?

CGLemon commented 1 year ago

I forgot to explain the v_mix value. The formula is very simple:

    # Policy-weighted average of the children's Q values.
    sum_prob = np.sum(policy)
    v_pi = np.sum(policy * q_value)
    rhs = v_pi / sum_prob

    # Raw value network output of the parent node.
    lhs = parent_nn_value
    # Total number of child visits.
    factor = np.sum(self.children_visits)

    # Interpolate between the raw network value and the search estimate.
    v_mix = (1 * lhs + factor * rhs) / (1 + factor)
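For completeness, here is a standalone sketch of that v_mix formula as described in Appendix D of the paper, where the policy-weighted sums are restricted to visited actions. The function name and argument names are illustrative, not from either engine.

    import numpy as np

    def mixed_value(parent_nn_value: float,
                    policy: np.ndarray,
                    q_value: np.ndarray,
                    visits: np.ndarray) -> float:
        # Children that have been searched at least once; the paper restricts
        # the policy-weighted sums to these visited actions.
        visited = visits > 0
        total_visits = np.sum(visits)
        if not np.any(visited):
            # Nothing searched yet: fall back to the raw network value.
            return float(parent_nn_value)
        sum_prob = np.sum(policy[visited])
        v_pi = np.sum(policy[visited] * q_value[visited])
        rhs = v_pi / sum_prob
        # More child visits shift the weight from the raw network value
        # toward the search-based estimate.
        return (parent_nn_value + total_visits * rhs) / (1.0 + total_visits)

The completed Q values would then use v_mix as the fill-in value for unvisited children, in place of the simple policy-weighted average shown earlier.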
kobanium commented 1 year ago

Thanks for the snippet! Certainly, it is easy to implement.

I don't rescale the Q-value because the value network's output range is from 0.0 to 1.0. I think I shouldn't rescale the Q-value; the targets of the reinforcement learning process are very sensitive to it.
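A minimal sketch of one common form of Q rescaling, assuming "rescale" here refers to MuZero-style min-max normalization over the searched children; this interpretation and the function name are illustrative, not something stated in the thread.

    import numpy as np

    def min_max_rescale_q(q_value: np.ndarray, visits: np.ndarray) -> np.ndarray:
        # Normalize Q over the searched children to [0, 1]. With a value network
        # that already outputs values in [0, 1], this stretches small differences,
        # which relates to the "too sharp" effect mentioned below.
        visited = visits > 0
        if np.count_nonzero(visited) < 2:
            return q_value
        q_min = np.min(q_value[visited])
        q_max = np.max(q_value[visited])
        if q_max <= q_min:
            return q_value
        return (q_value - q_min) / (q_max - q_min)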

CGLemon commented 1 year ago

It seems that the rescaling is not necessary for AlphaZero. What's worse, it may make the policy too sharp. I fixed this issue in my main run, and the result shows the new weights are better than before.