Yes, you probably want smaller networks (e.g. fewer filters, fewer layers) and smaller batch sizes.
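For instance, a minimal sketch of a slimmed-down network, keeping the field names from the stock connect-four `params.jl` (the values here are illustrative, not tuned recommendations):

```julia
Network = NetLib.ResNet

netparams = NetLib.ResNetHP(
  num_filters=32,             # stock value: 128
  num_blocks=3,               # stock value: 5; fewer blocks means fewer layers
  conv_kernel_size=(3, 3),
  num_policy_head_filters=32,
  num_value_head_filters=32,  # assuming the stock head sizes
  batch_norm_momentum=0.1)
```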
In `./games/connect-four/params.jl` I halved `num_filters` in `NetLib.ResNetHP` to start with. I didn't find any reference to layers, and when I halved every parameter with `batch_size` in its name, it crashed. Even if I only modify `num_filters`, it doesn't run. What would be an example of a working set of parameters for smaller GPUs? (I have an RTX 3050.)
Did you have a look at #174? Reducing the `mem_buffer_size` might work, even though it's not clear why. If that doesn't help, you could share your `params.jl` file.
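To make the `mem_buffer_size` suggestion concrete, something like this in `params.jl` (the numbers are arbitrary examples; shrink further if needed):

```julia
# The stock schedule grows the replay buffer from 400_000 to 1_000_000
# samples; a flat, much smaller buffer reduces memory pressure.
mem_buffer_size = PLSchedule([0], [100_000])
```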
Yes, I had a look at #174 but I can't seem to make any modifications that work. Right now I have the current repo's `games/connect-four/params.jl` except for this diff:
```diff
--- a/games/connect-four/params.jl
+++ b/games/connect-four/params.jl
@@ -5,7 +5,7 @@
 Network = NetLib.ResNet
 
 netparams = NetLib.ResNetHP(
-  num_filters=128,
+  num_filters=32, #128,
   num_blocks=5,
   conv_kernel_size=(3, 3),
   num_policy_head_filters=32,
@@ -66,8 +66,9 @@ params = Params(
   use_symmetries=true,
   memory_analysis=nothing,
   mem_buffer_size=PLSchedule(
-    [      0,        15],
-    [400_000, 1_000_000]))
+#   [      0,        15],
+#   [400_000, 1_000_000]))
+    [0]. [80_000]))
 
 #####
 ##### Evaluation benchmark
@@ -93,7 +94,8 @@ benchmark_sim = SimParams(
   arena.sim;
   num_games=256,
   num_workers=256,
-  batch_size=256,
+  #batch_size=256,
+  batch_size=16,
   alternate_colors=false)
 
 benchmark = [
```
(It crashes...)
From your stack trace, AlphaZero.jl crashes during the gradient-update phase, not during self-play. So my guess is that you should also lower the `batch_size` in `LearningParams`.
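As a rough sketch of where that lives (field names assumed from the stock connect-four `params.jl`; the values are guesses to fit a small GPU, not tested recommendations):

```julia
# Hypothetical reduced learning parameters; keep your other fields unchanged.
learning = LearningParams(
  use_gpu=true,
  batch_size=256,                   # e.g. a quarter of the stock value
  loss_computation_batch_size=512)  # shrink this too if your file sets it
```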
I'm not sure whether it makes any difference in Julia or whether it should cause a crash during the gradient-update phase, but in your `params.jl`, a comma seems to have been replaced by a period in the definition of `mem_buffer_size`: `[0]. [80_000]`.
Thanks, I replaced the period with a comma (old eyes...). When I run this, I get:
```
[ Info: Using the Flux implementation of AlphaZero.NetLib.
Loading environment from: sessions/connect-four
[ Info: Using modified parameters
ERROR: AssertionError: same_json(Network.hyperparams(env.bestnn), e.netparams)
Stacktrace:
  [1] Session(e::Experiment; dir::Nothing, autosave::Bool, nostdout::Bool, save_intermediate::Bool)
    @ AlphaZero.UserInterface ~/git/AlphaZero.jl/src/ui/session.jl:288
  [2] Session
    @ ~/git/AlphaZero.jl/src/ui/session.jl:273 [inlined]
  [3] train(e::Experiment; args::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AlphaZero.Scripts ~/git/AlphaZero.jl/src/scripts/scripts.jl:26
  [4] train
    @ ~/git/AlphaZero.jl/src/scripts/scripts.jl:26 [inlined]
  [5] #train#15
    @ ~/git/AlphaZero.jl/src/scripts/scripts.jl:28 [inlined]
  [6] train(s::String)
    @ AlphaZero.Scripts ~/git/AlphaZero.jl/src/scripts/scripts.jl:28
  [7] top-level scope
    @ none:1
```
Delete your `sessions/connect-four` folder and restart.
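(The assertion in the trace compares the saved session's network hyperparameters against your new `netparams`, so a stale session directory trips it after you change the network.) If you prefer to do it from the Julia REPL, something like this should work:

```julia
# Remove the stale session so training restarts from the new hyperparameters.
rm("sessions/connect-four"; recursive=true, force=true)
```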
I had missed the `batch_size` in `LearningParams`; I quartered that. Now running with both `batch_size`s quartered, and with `mem_buffer_size=PLSchedule([0], [80_000])`. It keeps running now..! I also decided to delete `sessions/connect-four` and see what happens. Thanks for all of your help so far!
Update: It's still going, now on iteration 4 (won 32% on iteration 3). Update: It finished after a few days, and is playable!
I followed the README:
This led to:
Is there some way to tweak the parameters so that I can avoid running out of memory?