jonathan-laurent / AlphaZero.jl

A generic, simple and fast implementation of Deepmind's AlphaZero algorithm.
https://jonathan-laurent.github.io/AlphaZero.jl/stable/
MIT License

Hardware sizing with regards to problem complexity #165

Open smart-fr opened 1 year ago

smart-fr commented 1 year ago

Hi,

Do you have a rule of thumb for determining what hardware would be required to train an agent, given the size and complexity of a game model? In terms of CUDA cores, GPU memory, CPU memory, Mflops, or whatever unit could help configure the hardware before starting to dummy_run a game?

jonathan-laurent commented 1 year ago

There is no perfect answer here. Also, note that AlphaZero was not designed to work optimally with every possible cluster configuration out-of-the-box. You may need to do some tweaking to achieve the performance you want on your hardware.

One of the main factors determining how much compute you will need is the branching factor of your game (along with the average number of moves per game). For example, connect four has a maximum branching factor of 7 and a game usually lasts about 30 moves. Connect four is about the hardest problem you can solve easily on commodity hardware (one gaming laptop with a decent GPU).
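For intuition on how branching factor and game length multiply, here is a quick bound using connect four's numbers from above (a loose upper bound on the game tree, not the number of reachable positions):

```julia
# Loose upper bound on connect four's game tree: branching^depth.
b, d = 7, 30
tree_bound = big(b)^d   # BigInt to avoid overflow; on the order of 2e25
println(tree_bound)
```

Even a modest increase in either number makes this bound explode, which is why branching factor dominates the compute estimate.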

AlphaZero being sample-inefficient, the amount of required compute can scale very quickly with the complexity of your game. Depending on your hardware, you can invest this compute differently, for example in larger networks or in more MCTS simulations per move.

What the best tradeoff is depends on your available hardware, your specific use case, how costly it is to simulate your environment, and so on.

Finally, the best way to make AlphaZero suitable for challenging games without spending too much compute is to initialize the policy with a decent heuristic (possibly learned from human data with supervised learning). This has the practical effect of considerably reducing your branching factor, since only actions that are not clearly stupid will be considered most of the time.
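A minimal sketch of what such heuristic pruning could look like (hypothetical code: `heuristic_score` is an assumed user-supplied function, not part of AlphaZero.jl):

```julia
# Keep only the top-k actions according to a user-supplied heuristic,
# shrinking the effective branching factor seen by MCTS.
# `heuristic_score(state, a)` is assumed: higher means more promising.
function pruned_actions(heuristic_score, state, actions; keep = 10)
    scores = [heuristic_score(state, a) for a in actions]
    order = sortperm(scores; rev = true)            # best actions first
    return actions[order[1:min(keep, length(actions))]]
end
```

The same idea applies whether the heuristic is hand-written or a policy network trained by supervised learning.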

smart-fr commented 1 year ago

Thank you for your reply. My game has a huge branching factor. 😨 Filtering the legal actions mask using a decent heuristic is definitely on my list.

Re: compute investment strategy, if I want to explore using larger networks vs. more MCTS simulations, what are the main parameters I should play around with that don't require a deep understanding of all the under-the-hood mechanisms? I guess in params.jl these may be the num_filters, num_blocks, and conv_kernel_size arguments of NetLib.ResNetHP(), the num_iters_per_turn argument of MctsParams(), and the num_iters argument of Params()?

Would you recommend some readings about this question?
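For reference, a sketch of where those knobs live, following the pattern of AlphaZero.jl's connect-four example (the values below are made up, and the many other required hyperparameters are omitted):

```julia
# Hypothetical values; other hyperparameters omitted for brevity.

# "Larger network": widen (num_filters) and/or deepen (num_blocks) the ResNet.
netparams = NetLib.ResNetHP(
    num_filters = 128,
    num_blocks = 7,
    conv_kernel_size = (3, 3))

# "More search": raise the number of MCTS simulations per move.
mcts = MctsParams(
    num_iters_per_turn = 600)

# num_iters in Params is the number of training iterations
# (self-play + learning cycles), not a search parameter.
```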

fabricerosay commented 1 year ago

Also, if your branching factor is huge you will be penalized, because as currently coded AlphaZero.jl stores all possible moves. For example, using AlphaZero.jl for chess would require storing more than 1800 moves, policy entries, etc., whereas in a given position you have at most around 250 possible moves. So you waste a lot of memory, which I think prevents training on such games without a huge amount of RAM. (I tried it on the game Ataxx; it is very slow because you cannot play that many games in parallel.) It is quite easy to fix, e.g. by storing the move or a move id in the actions and retaining only valid actions instead of masking.
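A minimal sketch of the fix described above (hypothetical types, not AlphaZero.jl's actual internals):

```julia
# Dense representation: one slot per action in the global action space,
# wasteful when only a few hundred of ~1800 actions are ever legal.
struct DenseNode
    prior::Vector{Float32}   # length = size of the full action space
    valid::Vector{Bool}      # legality mask
end

# Sparse representation: store only the legal actions with their ids.
struct SparseNode
    action_ids::Vector{Int32}  # ids of the legal actions only
    prior::Vector{Float32}     # same length as action_ids
end

# Convert a dense policy + mask into the sparse form,
# renormalizing the prior over the legal actions.
function sparsify(prior::Vector{Float32}, valid::Vector{Bool})
    ids = Int32.(findall(valid))
    p = prior[ids]
    return SparseNode(ids, p ./ sum(p))
end
```

With ~250 legal moves out of 1800, this cuts per-node memory by roughly 7x in the chess example above.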

jonathan-laurent commented 1 year ago

@fabricerosay You are perfectly right. The reason I made this implementation choice initially is that any problem with a branching factor large enough for this to be problematic is probably not learnable from scratch using a reasonable amount of compute. This no longer holds when initializing the policy from supervised learning though, so I may indeed want to lift this restriction.

fabricerosay commented 1 year ago

I started working on AlphaZero again: a new implementation more in line with AlphaGPU but not wholly on GPU (I dropped struct nodes etc. for an SoA implementation), adding an NN cache. On connect four I saw a huge performance gain: 4096 games, 600 rollouts with a 128x5 ResNet in under 5 minutes.

smart-fr commented 1 year ago

> I started working on AlphaZero again: a new implementation more in line with AlphaGPU but not wholly on GPU (I dropped struct nodes etc. for an SoA implementation), adding an NN cache. On connect four I saw a huge performance gain: 4096 games, 600 rollouts with a 128x5 ResNet in under 5 minutes.

> @fabricerosay You are perfectly right. The reason I made this implementation choice initially is that any problem with a branching factor large enough for this to be problematic is probably not learnable from scratch using a reasonable amount of compute. This no longer holds when initializing the policy from supervised learning though, so I may indeed want to lift this restriction.

Using a heuristic to artificially prevent (mask) the dumbest actions after GI.play!(), I could train an agent within an acceptable time on my PC. I also tried to run the training on a multi-GPU VM, with almost no gain, since AlphaZero.jl seems to use only one GPU. Is that by design? Or should I tweak some settings?

smart-fr commented 1 year ago

> I started working on AlphaZero again: a new implementation more in line with AlphaGPU but not wholly on GPU (I dropped struct nodes etc. for an SoA implementation), adding an NN cache. On connect four I saw a huge performance gain: 4096 games, 600 rollouts with a 128x5 ResNet in under 5 minutes.

Interesting. Does your system offer an API similar to AlphaZero.jl's GameInterface?

fabricerosay commented 1 year ago

> Interesting. Does your system offer an API similar to AlphaZero.jl's GameInterface?

No, it is different, very experimental, and miles away from AlphaZero.jl in terms of code quality. It is not as generic, but it is probably faster. If you were to use it, you would have to dig into the ugly, uncommented code. Very amateurish work, from the amateur that I am.

smart-fr commented 1 year ago

Does AlphaZero.jl take advantage of multiple GPUs on a single machine, or is a cluster of single-GPU machines the only way to parallelize GPU computing? If both ways are possible:

jonathan-laurent commented 1 year ago

AlphaZero.jl cannot leverage multi-GPU machines out-of-the-box but making it do so would probably only require a small change.
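For anyone exploring such a change, a rough sketch of how worker processes could be pinned to different GPUs using the Distributed standard library and CUDA.jl (this is not something AlphaZero.jl does today, and the wiring into its self-play loop is left out):

```julia
# Sketch: bind one Julia worker process per visible GPU.
using Distributed, CUDA

addprocs(length(CUDA.devices()))
@everywhere using CUDA
for (i, w) in enumerate(workers())
    # Device ordinals start at 0; each worker gets its own GPU.
    remotecall_wait(CUDA.device!, w, i - 1)
end
# Each worker can now run its share of self-play games on its own GPU.
```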

smart-fr commented 1 year ago

It would be super helpful for exploring the framework and its possibilities to have a list of rules and constraints linking the parameters, the output indicators, and the system's hardware characteristics. For example (I don't know whether these are true):

  1. Memory footprint per MCTS node x num_games x Average num of turns per game ≈ MCTS memory footprint per worker
  2. MCTS memory footprint per worker x num_workers SHOULD PREFERABLY BE < Available system RAM (to prevent swapping during training)
  3. Some_function(Number of network parameters) x batch_size MUST BE < Available GPU RAM (to prevent OOM errors during training)

etc.
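As one hypothetical instantiation of rules 1 and 2, a back-of-envelope calculation (every number below is made up, including the per-node footprint):

```julia
# Back-of-envelope MCTS memory estimate (all numbers assumed).
node_bytes     = 2_000    # memory per MCTS node
nodes_per_turn = 600      # MCTS simulations per move
avg_turns      = 60       # average game length
num_workers    = 128      # parallel self-play workers

per_worker = node_bytes * nodes_per_turn * avg_turns  # 72_000_000 bytes
total      = per_worker * num_workers                 # 9_216_000_000 bytes

# Rule 2: this total should stay below available system RAM.
println(round(total / 2^30; digits = 1), " GiB")
```

Even if the individual constants are wrong, the multiplicative structure is the useful part: halving any one factor halves the total.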

If everyone could contribute their observations, it would be a useful start.

jonathan-laurent commented 1 year ago

You are perfectly right: this would be useful. Also, it would be great if someone could contribute such a section to the documentation.

More generally, I regularly think about what a smarter framework could look like: one that performs as much autotuning as possible given one's configuration, runs hyperparameter sanity checks, and even suggests relevant hyperparameter variants. This is an open research question though, and in any case I am skeptical that an algorithm as complex and computationally demanding as AlphaZero can ever be used as a black box.

smart-fr commented 1 year ago

I understand the complexity of the question of a self-adapting framework, and the value of any solution which would get us closer to this goal. From my amateur point of view, this is simply way beyond my power.

But believe it or not, I was able to create a fairly good agent for my game, in nominal conditions (16x16 board), without any deep knowledge of the under-the-hood mechanics, "just" by coding my game's rules according to the GameInterface and with a little bit of parameter tweaking (in particular, skipping benchmark play altogether and reducing the batch sizes and the memory buffer size).

I wouldn't call this "black-box" usage, but it demonstrates the great versatility of the framework you created following DeepMind's guidelines.