In practice, the Q value is an average of the value network's output, so your understanding is right. On the other hand, the paper has both $q(a)$ and $\hat{q}(a)$. I think $q(a)$ is the exact evaluation value (in other words, the theoretical value) and $\hat{q}(a)$ is an approximation of $q(a)$.
The paper says

$CompletedQ(a) = q(a)$ if $N(a) > 0$

$CompletedQ(a) = v_{\pi} = \sum_{a'} \pi(a') q(a')$ if $N(a) = 0$

I think this is correct. It's hard to understand because the implementation doesn't use the completed Q-value directly, but I think the equation you showed is $\sigma$, as follows:

$\sigma(\hat{q}(a)) = (c_{visit} + \max_b N(b)) \cdot c_{scale} \cdot \hat{q}(a)$

$\sigma(\hat{q}(a))$ is used for the improved policy, described as $\pi'$. The completed Q-value is used as $\hat{q}(a)$ in this formula.
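If it helps, here is a minimal sketch of how these two formulas fit together. The names (`q_values`, `visit_counts`, `prior_probs`, `c_visit`, `c_scale`) are my own, and the way $v_{\pi}$ is estimated from the visited children only is an assumption, not the exact reference implementation.

```python
import numpy as np

def completed_q(q_values, visit_counts, prior_probs):
    """CompletedQ: keep q(a) for visited actions, fall back to
    v_pi = sum_a' pi(a') q(a') for unvisited actions."""
    visited = visit_counts > 0
    # Assumption: estimate v_pi from the visited children only and
    # renormalize their prior mass.
    mass = prior_probs[visited].sum()
    v_pi = (prior_probs[visited] * q_values[visited]).sum() / max(mass, 1e-8)
    return np.where(visited, q_values, v_pi)

def sigma(q_hat, visit_counts, c_visit=50.0, c_scale=1.0):
    """sigma(q_hat(a)) = (c_visit + max_b N(b)) * c_scale * q_hat(a)."""
    return (c_visit + visit_counts.max()) * c_scale * q_hat

def improved_policy(logits, q_values, visit_counts, prior_probs):
    """pi' = softmax(logits + sigma(CompletedQ)), the improved policy target."""
    comp_q = completed_q(q_values, visit_counts, prior_probs)
    x = logits + sigma(comp_q, visit_counts)
    x = x - x.max()              # subtract the max for numerical stability
    p = np.exp(x)
    return p / p.sum()
```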
Thanks! The Gumbel self-play pipeline looks to be working. I hope to start the 19x19 self-play before next year.
I am trying to improve the target distribution with a few tricks.
$CompletedQ(a) = q(a) + 0.1 \cdot \tanh(ScoreLead(a)/20)$
It looks like this can improve the policy in the early steps, because the score lead is more sensitive there.
$dist = softmax(logits(\pi) + \sigma(CompletedQ))$
The $dist$ is the original target distribution from the paper. Then we cut off some bad moves: set a move's probability to zero if it is lower than $1/intersections^2$.
$dist(a) = 0 \quad \text{if } dist(a) < 1/intersections^2$
It looks like this also improves the policy, and there is no significant negative effect.
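For reference, a rough sketch of the two tricks above combined. The helper names (`score_lead`, `num_intersections`) and the final renormalization after the cutoff are my assumptions, not an exact implementation.

```python
import numpy as np

def policy_target(logits, q_values, score_lead, visit_counts,
                  num_intersections, c_visit=50.0, c_scale=1.0):
    """Sketch: mix the score lead into the completed Q, build the target
    dist = softmax(logits + sigma(CompletedQ)), then zero out moves whose
    probability is below 1/intersections^2."""
    # CompletedQ(a) = q(a) + 0.1 * tanh(ScoreLead(a) / 20)
    comp_q = q_values + 0.1 * np.tanh(score_lead / 20.0)

    # sigma(q) = (c_visit + max_b N(b)) * c_scale * q
    sig = (c_visit + visit_counts.max()) * c_scale * comp_q

    x = logits + sig
    x = x - x.max()                          # numerical stability
    ex = np.exp(x)
    dist = ex / ex.sum()

    # Cut off bad moves: dist(a) = 0 if dist(a) < 1/intersections^2.
    dist[dist < 1.0 / num_intersections**2] = 0.0
    # Assumption: renormalize the remaining mass back to 1.
    return dist / dist.sum()
```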
Does Ray apply any special method to improve the target distribution? Thanks!
Great ideas! I did not use any special tricks. I thought mixing the score prediction and the value prediction would be difficult, because the completed Q-value is used for the improved policy target. I'm not sure that these tricks won't make the training process unstable, but I think there is probably no problem.
After fixing a few bugs, the current 19x19 self-play training looks to be working. This trick, 'mix score lead with q value', may make the training unstable in the early steps, but I am not sure whether it will reduce the final strength. I will check this result after buying a new computer; I have no remaining GPU power now (T.T).
At the last CGF Open, you showed that Ray may play weird opening moves. Did you find the problem? Thanks!
Learning score prediction is very difficult. So your result is as I expected. But I cannot predict what the final result of that approach will be.
The 14th UEC Cup version of Ray was trained on an additional 700,000 games. The opening moves are still strange, but they might have been reduced. I think the biggest reason is the lack of visits in the self-play games: 50 visits/move is too small; it needs to be at least 200. I need additional GPU resources ;-)
I will write up the Gumbel learning method in my log. I think some students may be interested in it, so a detailed description of Gumbel may be good. In order to show the power of Gumbel, I would like to include some of Ray's results. As far as I know, Ray is the first successful strong computer Go program based on Gumbel. The students should be interested in it, and playing against Ray can help them understand the advantages and drawbacks of Gumbel.
You may think Sayuri is another Gumbel-based engine. However, Sayuri is not pure Gumbel; I mix the Gumbel and PUCT algorithms. Besides, she is still in progress. Ray's result is more important, and it is complete.
Do you have any plan to release Ray's model, data, or development log? Thanks!
I don't have any plan to release Ray with Deep Learning. There are many difficulties in releasing Ray. First, Ray's environmental dependencies are quite strong, and it works almost exclusively on Ubuntu. Second, the self-play data is very large: even compressed, it is about 700 GB for 3,000,000 games. Third, I am developing TamaGo as a replica of Gumbel AlphaZero. Fourth, Ray's reinforcement learning is still in progress.
On the other hand, I understand that there are such requests, and I will consider them positively.
Yuki Kobayashi,
After reading the paper and source code, I am still a bit confused.
1. What's the Q value?
With the Gumbel-Top-k trick, we select the best child $A_{t+1}$ from $\mathrm{argtop}(g + logits, m)$. $A_{t+1}$ is $\mathrm{argmax}_a \big(g(a) + logits(a) + \sigma(q(a))\big)$ [Algorithm 2]. What is the Q value here? Is it just an average win rate? (A small sketch of how I read this step is included after question 2.)
2. What's the completed Q?
The paper says $CompletedQ = q(a)$ or $v_{\pi}$ [Formula (10)]. But it seems that the source implementation is $CompletedQ = (c_{visit} + \max_{b} N(b)) \cdot c_{scale} \cdot q(a)$ [Formula (8)]. Which is the correct completed Q formula?
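For question 1, here is a tiny sketch of how I read the selection: the Gumbel-Top-k trick picks $m$ candidates, then the argmax adds $\sigma$ of the Q-value. This skips sequential halving, and the names (`sigma_q`, `root_selection`) are mine, just for illustration.

```python
import numpy as np

def sample_gumbel(shape, rng):
    """Standard Gumbel noise g(a) = -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = rng.uniform(low=1e-10, high=1.0, size=shape)
    return -np.log(-np.log(u))

def root_selection(logits, sigma_q, m, rng):
    """Pick m root candidates with the Gumbel-Top-k trick, argtop(g + logits, m),
    then return A_{t+1} = argmax over them of g(a) + logits(a) + sigma_q(a)."""
    g = sample_gumbel(logits.shape, rng)
    candidates = np.argsort(g + logits)[::-1][:m]   # argtop(g + logits, m)
    scores = g[candidates] + logits[candidates] + sigma_q[candidates]
    return candidates[np.argmax(scores)]

# Usage example:
# rng = np.random.default_rng(0)
# action = root_selection(logits, sigma_q, m=16, rng=rng)
```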
Thanks!