google-deepmind / mctx

Monte Carlo tree search in JAX
Apache License 2.0
2.33k stars · 188 forks

Few questions about training and num_simulations #83

Closed Nightbringers closed 10 months ago

Nightbringers commented 10 months ago

In the AlphaZero paper, the Elo for Go exceeds 5000, but in the Gumbel paper, the Elo for Go is below 3000. Why?

If an agent is trained with num_simulations=800 and I then continue training with num_simulations=400, what will happen? Will it keep getting stronger, or will it get weaker?

Are evaluations with different num_simulations equivalent? In my training, the speed of evaluation affects the speed of training, so I want to know: if agent1 > agent2 at 100 num_simulations, does that mean agent1 > agent2 at 200 num_simulations? And at 300? And at 400?

How many num_simulations do you recommend for training Go? Is there a big difference between 400, 800, and 1600 simulations, or will the agents end up almost equally strong eventually? Or is 1600 simulations > 800 simulations > 400 simulations?

fidlej commented 10 months ago
  1. Only Elo differences are informative. In the Gumbel MuZero paper, the Elo is anchored to have Pachi at 1000 Elo.

  2. If the agent is not fully trained, it will probably keep getting stronger. In practice, it will depend on the quality of the approximations by the neural network.

  3. It will depend on the search algorithms used by the agents. Take two agents and check the behavior.

  4. See Figure 2 of the Gumbel MuZero paper for the effect of "n", the number of simulations. Maybe start with a small number of simulations and increase it after the agent stops improving.
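Point 4 could be automated with a simple schedule. This helper is purely illustrative (the function name, patience window, and cap are my own assumptions, not from the paper or mctx):

```python
def next_num_simulations(current, elo_history, patience=5, factor=2, cap=1600):
    """Double the simulation budget once evaluation strength plateaus.

    Hypothetical heuristic: if the best Elo in the last `patience`
    evaluations does not beat the best Elo seen before that window,
    the agent has stopped improving, so increase num_simulations.
    """
    if len(elo_history) <= patience:
        return current  # not enough history to judge a plateau
    window, earlier = elo_history[-patience:], elo_history[:-patience]
    if max(window) <= max(earlier):
        return min(current * factor, cap)
    return current

# Plateaued run: Elo stuck around 1200, so the budget doubles.
print(next_num_simulations(32, [1000, 1100, 1200, 1200, 1190, 1195, 1198, 1200]))
# Still-improving run: budget stays the same.
print(next_num_simulations(32, [1000, 1100, 1200, 1300, 1400, 1500]))
```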

Nightbringers commented 10 months ago
  1. If two agents both use the Gumbel MuZero search algorithm, and I train with 800 simulations but evaluate with 200 simulations, is that OK?

  2. In Figure 2, simulations=32 has almost the same Elo as simulations=200, so the number of simulations seems not to make much difference to the final result. So if I train with 400 simulations, there's no need to increase to 800 or 1600?

fidlej commented 10 months ago

You have to try yourself. You may see some differences.

fidlej commented 9 months ago

If you want to see evaluations with different numbers of simulations, see Figure 9. https://openreview.net/pdf?id=bERaNdoegnO

Nightbringers commented 9 months ago

I trained a new MuZero model on data generated by a previous model, which greatly improved the model's strength in a short time. But when I then trained it on self-play data, it got weaker. Why? Is this normal?

fidlej commented 9 months ago

That does not sound normal. You have access to the data and training code, so you can investigate. For example, check that self-play training improves the model when training from scratch.

Nightbringers commented 9 months ago

My MuZero model seems difficult to train from scratch; the dynamics network doesn't seem to converge at all. What could be causing this? AlphaZero seems to work fine.

Also, about the broadcast net: do you think adding two broadcast nets (one in the value head and another in the policy head) would improve the network?

fidlej commented 9 months ago

MuZero training requires a careful implementation. The correct targets need to be provided, even after the episode ends. Look at existing MuZero implementations.
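To illustrate what "targets after the episode ends" means: a common MuZero convention (my sketch, not this repo's code) treats post-terminal steps as an absorbing state with zero reward and zero bootstrap value, so every unroll step still has a well-defined target:

```python
# Toy episode: ends at step 2 with reward 1. Assumed convention: after
# the terminal step, rewards and bootstrap values are zero (absorbing
# state), so unroll targets remain defined past the end of the episode.
rewards = [0.0, 0.0, 1.0]
episode_len = len(rewards)
discount, td_steps, num_unroll = 0.99, 3, 5

def value_target(t):
    # n-step return; steps past the terminal state contribute nothing.
    target = 0.0
    for k in range(td_steps):
        if t + k < episode_len:
            target += (discount ** k) * rewards[t + k]
    # The bootstrap value at t + td_steps is also 0 past the end
    # (a real implementation would use the network's value there).
    return target

# Steps 3 and 4 lie past the terminal state but still get a target (0.0).
targets = [value_target(t) for t in range(num_unroll)]
```

If these post-terminal targets are missing or wrong, the dynamics network gets inconsistent training signal, which matches the "doesn't converge" symptom.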

Nightbringers commented 9 months ago

About the broadcast net: do you think adding two broadcast nets (one in the value head and another in the policy head) would improve the network?

fidlej commented 9 months ago

The broadcast net is a detail. You can try network ablations later.

Nightbringers commented 9 months ago
    import jax
    import jax.numpy as jnp

    # a: integer actions, shape (batch,); s: hidden state, shape (batch, 19, 19, C)
    a = jax.nn.one_hot(a, 361)            # (batch, 361)
    a = a.reshape(a.shape[0], 19, 19, 1)  # one action plane per board
    x = jnp.concatenate([s, a], axis=-1)  # (batch, 19, 19, C + 1)

About the input to the dynamics network: a is an action and s is the hidden state produced by the representation function or the previous dynamics step. Is this code OK?

fidlej commented 9 months ago

The code makes sense. I will not be available to help you with your MuZero implementation. You can get feedback from running the code and visualizing the outputs.
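For example, a quick shape check along those lines (toy sizes; the variable names follow the snippet above but the values are assumptions):

```python
import jax
import jax.numpy as jnp

# Verify the dynamics-network input has the expected board layout.
batch, channels = 2, 8
s = jnp.zeros((batch, 19, 19, channels))  # hidden state from representation net
actions = jnp.array([0, 360])             # integer Go actions in [0, 361)
a = jax.nn.one_hot(actions, 361).reshape(batch, 19, 19, 1)
x = jnp.concatenate([s, a], axis=-1)
print(x.shape)  # one extra action plane appended to the state channels
```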

Nightbringers commented 9 months ago

In the paper you use a global broadcasting residual block every 8th block. Have you ever tried other intervals, like every block, every 2nd, every 3rd, every 4th, and so on?

Why is my search speed so much slower than yours? You use a network with 256 planes and 32 blocks with bottlenecks and broadcasting, achieving 10,000 inferences/second. I can only achieve 500 inferences/second with a model of about the same size, tested on a single 3090.

fidlej commented 9 months ago

I do not know the results for other broadcasting intervals.

Many things can be different. We used TPUs.

Nightbringers commented 9 months ago

So the number of broadcasting blocks is worth experimenting with.

Can inference be accelerated with multiple GPUs when using mctx? During self-play there are multiple games across multiple GPUs, but if I want to speed up the moves of a single game, how can I accelerate it with multiple GPUs?
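mctx does not shard the tree search itself, but the expensive part is usually the batched network evaluation, and that batch can be split across devices with `jax.pmap`. A toy sketch (`apply_net` is a made-up stand-in for a real network's apply function, not an mctx API):

```python
import jax
import jax.numpy as jnp

def apply_net(params, states):
    # Toy "network": a single linear layer with tanh.
    return jnp.tanh(states @ params)

n_dev = jax.local_device_count()
params = jnp.ones((4, 1))
# Leading axis is the device axis: each device evaluates its own slice.
batch = jnp.ones((n_dev, 8, 4))

# Replicate params (in_axes=None), shard the batch (in_axes=0).
parallel_apply = jax.pmap(apply_net, in_axes=(None, 0))
values = parallel_apply(params, batch)  # shape (n_dev, 8, 1)
```

Whether this speeds up a single game depends on the batch of leaf states per search step being large enough to keep several GPUs busy; with batch-1 search, the per-step batch may be too small to benefit.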

Nightbringers commented 9 months ago

I trained my AlphaZero with 400 num_simulations and it seems to have stopped getting stronger now, but it hasn't achieved superhuman performance. Is this normal?