Closed by Nightbringers 10 months ago
Only Elo differences are informative. In the Gumbel MuZero paper, the Elo is anchored to have Pachi at 1000 Elo.
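Since only differences matter, the anchor is arbitrary: the expected score between two agents depends only on the gap between their ratings, not on their absolute values. A minimal sketch of the standard Elo expected-score formula (not from the paper, just the usual Elo model):

```python
def elo_expected_score(elo_a: float, elo_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

# Shifting both ratings by the same constant leaves the prediction unchanged,
# which is why the Pachi-at-1000 anchor is just a convention.
print(elo_expected_score(1200, 1000))  # ~0.76
print(elo_expected_score(2200, 2000))  # same value
```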
If the agent is not fully trained, it will probably keep getting stronger. In practice it will depend on the quality of the neural network's approximations.
It will depend on the search algorithms used by the agents. Take two agents and check the behavior.
See Figure 2 from the Gumbel MuZero paper for the effect of "n", the number of simulations. Maybe start with a small number of simulations and increase it after the agent stops improving.
If two agents both use the Gumbel MuZero search algorithm, and I train with 800 simulations but evaluate with 200 simulations, is that OK?
In Figure 2, simulations=32 has almost the same Elo as simulations=200, so the number of simulations does not seem to make much difference to the final result. So if I train with 400 simulations, is there no need to increase to 800 or 1600?
You have to try yourself. You may see some differences.
If you want to see evaluations with different numbers of simulations, see Figure 9. https://openreview.net/pdf?id=bERaNdoegnO
I trained a new MuZero model on data generated by a previous model, and this greatly improved the model in a short time. But when I then trained it on its own self-play data, it got weaker. Why? Is this normal?
That does not sound normal. You have access to the data and training code, so you can investigate. For example, check that self-play training improves the model when training from scratch.
My MuZero model seems difficult to train from scratch; the dynamics network doesn't seem to converge at all. What could be causing this? My AlphaZero seems to work fine.
And about the broadcast net: do you think adding two broadcast nets (one in the value head and another in the policy head) would improve the network?
MuZero training requires a careful implementation. The correct targets need to be provided, even after the episode ends. Look at existing MuZero implementations.
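For illustration, here is a minimal sketch (my own, not from the paper's code) of n-step value targets that remain well defined past the end of an episode: after the terminal step, rewards and the bootstrap value are treated as zero, so every unroll step still has a target.

```python
import numpy as np

def n_step_value_targets(rewards, values, n, discount):
    """n-step returns: z_t = sum_{i<n} discount^i * r_{t+i} + discount^n * v_{t+n}.

    Indices past the episode end use zero reward and zero bootstrap value,
    so targets stay defined even when the unroll crosses the terminal step.
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        z = 0.0
        for i in range(n):
            if t + i < T:
                z += (discount ** i) * rewards[t + i]
        if t + n < T:
            z += (discount ** n) * values[t + n]
        targets[t] = z
    return targets

# 3-step episode with a reward only at the end; the unroll of length 5
# crosses the terminal step, yet every position gets a target of 1.0.
print(n_step_value_targets([0.0, 0.0, 1.0], [0.1, 0.5, 0.9], n=5, discount=1.0))
```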
About the broadcast net: do you think adding two broadcast nets (one in the value head and another in the policy head) would improve the network?
The broadcast net is a detail. You can try network ablations later.
a = jax.nn.one_hot(a, 361)            # action index -> one-hot over the 19*19 moves
a = a.reshape(a.shape[0], 19, 19, 1)  # reshape to a single 19x19 action plane
x = jnp.concatenate([s, a], axis=-1)  # append the action plane to the hidden state
About the input of the dynamics network: a is an action, and s is the hidden state produced by the representation function or by the previous dynamics step. Is this code OK?
The code makes sense. I will not be available to help you with your MuZero implementation. You can get feedback from running the code and visualizing the outputs.
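As a quick sanity check, the snippet above can be run with dummy shapes to confirm the concatenated input gains exactly one plane (the batch size and plane count here are just example values):

```python
import jax
import jax.numpy as jnp

batch, planes = 4, 8
s = jnp.zeros((batch, 19, 19, planes))  # dummy hidden state from the representation net
a = jnp.array([0, 42, 180, 360])        # dummy action indices in [0, 361)

a = jax.nn.one_hot(a, 361)              # (4, 361)
a = a.reshape(a.shape[0], 19, 19, 1)    # (4, 19, 19, 1) action plane
x = jnp.concatenate([s, a], axis=-1)    # (4, 19, 19, planes + 1)
print(x.shape)  # (4, 19, 19, 9)
```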
In the paper you make every 8th block a global broadcasting residual block. Have you ever tried other intervals, like every block, every 2nd, every 3rd, every 4th, and so on?
Why is my search speed so much slower than yours? You use a network with 256 planes and 32 blocks with bottlenecks and broadcasting, achieving 10,000 inferences/second. I can only achieve 500 inferences/second with a model about the same size as yours, tested on a single 3090.
I do not know the results for other broadcasting placements.
Many things can be different. We used TPUs.
So the number of broadcasting blocks is worth experimenting with.
Can inference be accelerated with multiple GPUs when using mctx? During self-play there are multiple games running on multiple GPUs, but if I want to speed up the moves of a single game, how can I accelerate that with multiple GPUs?
I trained my AlphaZero with num_simulations=400, and it seems to have stopped getting stronger now, but it has not achieved superhuman performance. Is this normal?
In the AlphaZero paper the Elo on Go exceeds 5000, but in the Gumbel paper it is below 3000. Why?
If an agent is trained with num_simulations=800 and I then continue training with num_simulations=400, what will happen? Will it keep getting stronger or will it get weaker?
Is evaluation with different num_simulations equivalent? In my setup, evaluation speed affects training speed. So I want to know: if agent1 > agent2 at 100 num_simulations, does that mean agent1 > agent2 at 200 num_simulations? And at 300? And at 400?
How many num_simulations do you recommend for training Go? Is there a big difference between 400, 800, and 1600 simulations, or will the agents end up almost equally strong eventually? Or is 1600 > 800 > 400?