infinitycr opened this issue 6 years ago
msg count is the number of states sent, not the number of games. E.g., at the beginning of training a single game can be 600+ moves.
Thanks for your reply. 90000 moves, not 90000 games.
@qucheng Now we have 118000 moves, but still no updated model. We set q_max_size=300 and q_min_size=200. As you said, it should start to update after 50*300=15000 games. We use a 64x5 ResNet on a Tesla P100, and 118000/600 ≈ 200. That means we only got about 200 games in a month!
Training even 64x5 is going to be a tricky proposition with 1 GPU (no matter how beefy) ☹️ That said, the speed does seem abnormally slow. You could try running nvidia-smi to get a sense of GPU utilization (i.e. is the machine actually producing selfplays)
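(Not ELF-specific, but a minimal sketch for polling GPU utilization over time, assuming nvidia-smi is on the PATH; the query flags are standard nvidia-smi options:)

```python
import subprocess
import time

# Poll nvidia-smi every 10 seconds to see whether the selfplay client is
# actually keeping the GPU busy, or is mostly idle / stalled elsewhere.
while True:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader"],
        universal_newlines=True)
    print(time.strftime("%H:%M:%S"), out.strip())
    time.sleep(10)
```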
@jma127
Thanks for your reply.
I have tried running nvidia-smi, and the GPU utilization is about 80%.
The information in log.log is as follows:
=== Record Stats (0) ====
B/W/A: 65174/62126/127300 (51.1972%). B #Resign: 51746 (40.6489%), W #Resign: 52969 (41.6096%), #NoResign: 22585 (17.7416%)
Dynamic resign threshold: 0.01
Move: [0, 100): 0, [100, 200): 0, [200, 300): 0, [300, up): 127300
=== End Record Stats ====
What does this mean? 127300 moves, or 127300 games?
How many games are needed before save_0.bin is updated at least once?
Most importantly, if we want save_0.bin to be updated as fast as possible with one GPU, which parameters in start_server.sh and start_client.sh are the most important, and what values should they be set to?
This seems to be 127300 games. B/W/A is black wins / white wins / all.
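(As a rough sanity check of the numbers in this thread, assuming num_reader is 50 as mentioned above; this is just the arithmetic, not ELF's actual queueing logic:)

```python
# Assumption: num_reader = 50 (the value discussed in this thread).
num_reader = 50
q_max_size = 300
games_before_training = num_reader * q_max_size   # 50 * 300 = 15000

# Totals taken from the "Record Stats" block above.
total_games = 127300
black_wins = 65174
print(games_before_training)                       # 15000
print(total_games >= games_before_training)        # True
print(round(100.0 * black_wins / total_games, 4))  # 51.1972 -> the B/W/A percentage
```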
@qucheng We set q_max_size=300. As you said, training should start after 50*300=15000 games, so these 127300 games should be sufficient. Could you please explain which settings may cause this problem? It is worth mentioning that we run the server-side and client-side scripts on the same machine. Does this cause problems? Here is the server's script file:
save=./myserver game=elfgames.go.game model=df_kl model_file=elfgames.go.df_model3 \
stdbuf -o 0 -e 0 python -u ./train.py \
--mode train --batchsize 2048 \
--num_games 64 --keys_in_reply V \
--T 1 --use_data_parallel \
--num_minibatch 1000 --num_episode 1000000 \
--mcts_threads 8 --mcts_rollout_per_thread 100 \
--keep_prev_selfplay --keep_prev_selfplay \
--use_mcts --use_mcts_ai2 \
--mcts_persistent_tree --mcts_use_prior \
--mcts_virtual_loss 5 --mcts_epsilon 0.25 \
--mcts_alpha 0.03 --mcts_puct 0.85 \
--resign_thres 0.01 --gpu 0 \
--server_id myserver --eval_num_games 400 \
--eval_winrate_thres 0.55 --port 1234 \
--q_min_size 200 --q_max_size 300 \
--save_first \
--num_block 5 --dim 64 \
--weight_decay 0.0002 --opt_method sgd \
--bn_momentum=0 --num_cooldown=50 \
--expected_num_client 496 \
--selfplay_init_num 0 --selfplay_update_num 0 \
--eval_num_games 0 --selfplay_async \
--lr 0.01 --momentum 0.9 1>> log.log 2>&1 &
Here is the client's script file:
root=./myserver game=elfgames.go.game model=df_pred model_file=elfgames.go.df_model3 \
stdbuf -o 0 -e 0 python ./selfplay.py \
--T 1 --batchsize 128 \
--dim0 64 --dim1 64 --gpu 0 \
--keys_in_reply V rv --mcts_alpha 0.03 \
--mcts_epsilon 0.25 --mcts_persistent_tree \
--mcts_puct 0.85 --mcts_rollout_per_thread 100 \
--mcts_threads 8 --mcts_use_prior \
--mcts_virtual_loss 5 --mode selfplay \
--num_block0 5 --num_block1 5 \
--num_games 32 --ply_pass_enabled 36 \
--policy_distri_cutoff 20 --policy_distri_training_for_all \
--port 1234 \
--no_check_loaded_options0 --no_check_loaded_options1 \
--replace_prefix0 resnet.module,resnet \
--replace_prefix1 resnet.module,resnet \
--resign_thres 0.0 --selfplay_timeout_usec 10 \
--server_id myserver --use_mcts \
--use_fp160 --use_fp161 \
--use_mcts_ai2 --verbose > selfplay.log 2>&1 &
Thank you!
Hi @infinitycr, we recommend that you try the latest code in master (maybe use q_min_size 1), and dump the full server logs here if possible. Thanks!
@jma127 OK, we will try what you said and report back as soon as we have a detailed log file. Thanks!
@jma127 When we use the latest code in master, the following problem occurs after we run ./start_server.sh:
Traceback (most recent call last):
File "./train.py", line 53, in <module>
GC.GC.setInitialVersion(model_ver)
AttributeError: '_elfgames_go.GameContext' object has no attribute 'setInitialVersion'
Thu Jul 26 12:53:39 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
Destroying Reader ...
Related to https://github.com/pytorch/ELF/issues/71, we are pushing a fix shortly. Apologies for the inconvenience!
OK, we are testing with the previous code first.
@infinitycr , I believe #72 should fix the issue with master. Please let me know otherwise 😄
@jma127 OK, I will let you know as soon as I have new information about it.
@jma127 We run start_server.sh and start_client.sh simultaneously on one machine; the server is running, but there is a problem with the client.
Fri Jul 27 09:33:19 2018 Get actionable request: black_ver = 0, white_ver = -1, #addrs_to_reply: 32
[2018-07-27 09:33:19.850] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] Loading model from ./myserver/save-0.bin
[2018-07-27 09:33:19.850] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] replace_prefix for state dict: [['resnet.module', 'resnet']]
GuardedRecords::updateState[Fri Jul 27 09:33:29 2018] #states: 1[13:2:0:0,] 13
Traceback (most recent call last):
File "./selfplay.py", line 151, in game_start
args, root, ver, actor_name)
File "./selfplay.py", line 82, in reload
reload_model(model_loader, params, mi, actor_name, args)
File "./selfplay.py", line 63, in reload_model
model = model_loader.load_model(params)
File "/home/carc/new_test/ELF/src_py/rlpytorch/model_loader.py", line 165, in load_model
check_loaded_options=self.options.check_loaded_options)
File "/home/carc/new_test/ELF/src_py/rlpytorch/model_base.py", line 147, in load
self.load_state_dict(sd)
File "/home/carc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var".
Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked".
register actor_black for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7f3b2e2e09e8>
register actor_white for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7f3b2e2e0978>
Root: "./myserver"
In game start
No previous model loaded, loading from ./myserver
Warning! Same model, skip loading ./myserver/save-0.bin
Traceback (most recent call last):
File "./selfplay.py", line 202, in <module>
main()
File "./selfplay.py", line 196, in main
GC.run()
File "/home/carc/new_test/ELF/src_py/elf/utils_elf.py", line 435, in run
self._call(smem, *args, **kwargs)
File "/home/carc/new_test/ELF/src_py/elf/utils_elf.py", line 398, in _call
reply = self._cb[idx](picked, *args, **kwargs)
File "./selfplay.py", line 131, in <lambda>
lambda batch, e=e, stat=stat: actor(batch, e, stat))
File "./selfplay.py", line 126, in actor
reply = e.actor(batch)
File "/home/carc/new_test/ELF/src_py/rlpytorch/trainer/trainer.py", line 95, in actor
m = self.mi[self.actor_name]
File "/home/carc/new_test/ELF/src_py/rlpytorch/model_interface.py", line 254, in __getitem__
return self.models[key]
KeyError: 'actor_black'
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Here are log.log and selfplay.log (from start_server.sh and start_client.sh):
log.log
selfplay.log
Here are the script files: start_client.txt start_selfplay.txt start_server.txt
I found something strange in selfplay.log. Our gcc version is 7.3.0, but the log shows 7.2.0:
Python version: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
PyTorch version: 2018.05.07
CUDA version 9.0.176
Hi @infinitycr , sorry for the issues. Please also try #76 (explanation: there is an additional requirement based on new PyTorch and parallelization of the initial convolution).
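(For readers hitting the same Missing/Unexpected key error above: the "init_conv.module.*" keys are the usual nn.DataParallel ".module" prefix mismatch, and the extra num_batches_tracked key comes from BatchNorm in newer PyTorch versions; #76 is the proper fix. Below is only a minimal illustration of the prefix renaming, assuming the checkpoint is a plain state_dict; the paths are examples and this is not ELF's own --replace_prefix code:)

```python
from collections import OrderedDict

import torch

# Assumption: save-0.bin holds a plain state_dict (illustrative path only).
# Wrapping a submodule in nn.DataParallel inserts ".module" into parameter
# names, e.g. "init_conv.module.0.weight" instead of "init_conv.0.weight".
sd = torch.load("./myserver/save-0.bin", map_location="cpu")

fixed = OrderedDict()
for key, value in sd.items():
    # Drop the ".module" segment so the keys match the non-parallel model.
    fixed[key.replace("init_conv.module.", "init_conv.")] = value

torch.save(fixed, "./myserver/save-0-fixed.bin")
```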
@jma127 Following your suggestion, I tried the latest version of ELF, but I ran into another problem. When I run start_client.sh, it stops at the beginning of selfplay. The information in selfplay.log is as follows:
[2018-09-04 20:10:26.554] [elfgames::go::train::GuardedRecords19] [info] GuardedRecords::DumpAndClear[Tue Sep 4 20:10:26 2018], #records: 0, #states: 0[]
[2018-09-04 20:11:32.684] [elfgames::go::train::ThreadedWriterCtrl-13] [info] Tue Sep 4 20:11:32 2018, WriterCtrl: no message, seq=5, since_last_sec=0
[2018-09-04 20:11:32.705] [elfgames::go::train::ThreadedWriterCtrl-13] [info] Sleep for 10 sec ..
[2018-09-04 20:11:42.706] [elfgames::go::train::ThreadedWriterCtrl-13] [info] Tue Sep 4 20:11:42 2018 In reply func: Message got. since_last_sec=10, seq=5, {"request":{"client_ctrl":{"async":true,"black_resign_thres":0.009999999776482582,"client_type":1,"never_resign_prob":0.10000000149011612,"num_game_thread_used":-1,"player_swap":false,"white_resign_thres":0.009999999776482582},"vers":{"black_ver":0,"mcts_opt":{"alg_opt":{"c_puct":0.8500000238418579,"root_unexplored_q_zero":false,"unexplored_q_zero":false,"use_prior":true},"log_prefix":"","max_num_moves":0,"num_rollouts_per_batch":1,"num_rollouts_per_thread":200,"num_threads":8,"persistent_tree":true,"pick_method":"most_visited","root_alpha":0.029999999329447746,"root_epsilon":0.25,"seed":0,"verbose":false,"verbose_time":false,"virtual_loss":5},"white_ver":-1}},"seq":5}
[2018-09-04 20:11:44.377] [elfgames::go::train::GuardedRecords19] [info] GuardedRecords::DumpAndClear[Tue Sep 4 20:11:44 2018], #records: 0, #states: 0[]
Thank you!
Hi @infinitycr , I have the same problem with
KeyError: 'actor_black'
How did you fix it? I am on the latest version of ELF and still hit this problem.
After modifying q_min_size=200, q_max_size=300, and num_reader=50, we have trained 90000 games, but save_0.bin is still not updated. We want to know why it is not updated. We also found some logging information:
This logging information comes from start_server.sh. What does '5 valid selfplays' mean? Does it mean that only 5 of the 4000 games performed by the client are valid selfplay games?