pytorch / ELF

ELF: a platform for game research with AlphaGoZero/AlphaZero reimplementation

Failed to update the save_0.bin #68

Open infinitycr opened 6 years ago

infinitycr commented 6 years ago

After modifying q_min_size=200, q_max_size=300, and num_reader=50, we have trained 90000 games, but save_0.bin has not been updated. We want to know why it is not updated. We also found some logging information:

[2018-06-06 13:14:16.977] [elfgames::go::TrainCtrl-0] [info] received 3000 records from local-Workstation-1db3-cbb4-5d10-b748, with 32 state updates, 0 records, 0 valid selfplays, and 0 evals

Wed Jun 6 13:14:16 2018, last_identity: local--Workstation-1db3-cbb4-5d10-b748, #msg: 0 #client: 8, Msg count: 3000, avg msg size: 686653, failed count: 0

[2018-06-06 21:30:26.797] [elfgames::go::TrainCtrl-0] [info] received 4000 records from local-699e-247b-e4e-6760, with 32 state updates, 5 records, 5 valid selfplays, and 0 evals

Wed Jun 6 21:30:26 2018, last_identity: local-699e-247b-e4e-6760, #msg: 0 #client: 8, Msg count: 4000, avg msg size: 682586, failed count: 0

This logging information comes from start_server.sh. What is the meaning of '5 valid selfplays'? Does that mean there are only 5 valid selfplay games among the 4000 games performed by the client?

qucheng commented 6 years ago

Msg count is the number of states sent, not the number of games. E.g., at the beginning of training a game can be 600+ moves.
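
So a rough conversion from the Msg count in the log to a game count is just a division; a throwaway sketch (assuming the ~600 moves-per-game early-training figure above, which shrinks once games start ending by resignation):

def estimate_games(num_states, avg_moves_per_game=600):
    # Convert a count of sent states/messages into an approximate game count.
    # 600 moves per game is only the rough early-training figure quoted above.
    return num_states / avg_moves_per_game

if __name__ == "__main__":
    for states in (90000, 118000):
        print("{} states ~= {:.0f} games".format(states, estimate_games(states)))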

infinitycr commented 6 years ago

Thanks for your reply. So it is 90000 moves, not 90000 games.

infinitycr commented 6 years ago

@qucheng Now we have 118000 moves, but still no updated model. We set q_max_size=300 and q_min_size=200. As you said, updates should start after 50*300=15000 games. We use a 64x5 resnet on one Tesla P100, and 118000/600 ≈ 200. That means we only got about 200 games after a month!

jma127 commented 6 years ago

Training even 64x5 is going to be a tricky proposition with 1 GPU (no matter how beefy) ☹️ That said, the speed does seem abnormally slow. You could try running nvidia-smi to get a sense of GPU utilization (i.e., whether the machine is actually producing selfplays).
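
For example, utilization can also be polled from Python; this is only a small sketch that shells out to nvidia-smi (it assumes nvidia-smi is on PATH and is not part of ELF):

import subprocess
import time

def gpu_utilization():
    # Query per-GPU utilization (%) via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return [int(x) for x in out.strip().splitlines()]

if __name__ == "__main__":
    # Print utilization once per second; stop with Ctrl-C.
    while True:
        print(gpu_utilization())
        time.sleep(1)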

infinitycr commented 6 years ago

@jma127 Thanks for your reply. I have tried running nvidia-smi, and the GPU utilization is about 80%. The information in log.log is as follows:

=== Record Stats (0) ====
B/W/A: 65174/62126/127300 (51.1972%). B #Resign: 51746 (40.6489%), W #Resign: 52969 (41.6096%), #NoResign: 22585 (17.7416%)
Dynamic resign threshold: 0.01
Move: [0, 100): 0, [100, 200): 0, [200, 300): 0, [300, up): 127300
=== End Record Stats ====

What does this mean: 127300 moves, or 127300 games? How many games are needed at minimum before save_0.bin is updated? Most importantly, if we want save_0.bin to update as fast as possible with one GPU, which parameters in start_server.sh and start_client.sh are the most important, and to what values should they be adjusted?

qucheng commented 6 years ago

This seems to be 127300 games. B/W/A is black wins / white wins / all.
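
As a quick sanity check on that stats line (a throwaway parse, not ELF code), black wins plus white wins does add up to the total:

import re

# The "B/W/A" field from the record-stats block above.
line = "B/W/A: 65174/62126/127300 (51.1972%)"
black, white, total = map(int, re.search(r"B/W/A: (\d+)/(\d+)/(\d+)", line).groups())
assert black + white == total                            # 65174 + 62126 == 127300
print("black win rate: {:.4%}".format(black / total))    # ~51.1972%, matching the log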

infinitycr commented 6 years ago

@qucheng We set q_max_size=300. As you said, training starts after 50*300=15000 games, so these 127300 games should be sufficient. Could you please explain which settings might cause this problem? It is worth mentioning that we are running the server-side and client-side scripts on the same machine. Could that cause problems? Here is the server's script file:

save=./myserver game=elfgames.go.game model=df_kl model_file=elfgames.go.df_model3 \
    stdbuf -o 0 -e 0 python -u ./train.py \
    --mode train    --batchsize 2048 \
    --num_games 64    --keys_in_reply V \
    --T 1    --use_data_parallel \
    --num_minibatch 1000    --num_episode 1000000 \
    --mcts_threads 8    --mcts_rollout_per_thread 100 \
    --keep_prev_selfplay \
    --use_mcts     --use_mcts_ai2 \
    --mcts_persistent_tree    --mcts_use_prior \
    --mcts_virtual_loss 5     --mcts_epsilon 0.25 \
    --mcts_alpha 0.03     --mcts_puct 0.85 \
    --resign_thres 0.01    --gpu 0 \
    --server_id myserver     --eval_num_games 400 \
    --eval_winrate_thres 0.55     --port 1234 \
    --q_min_size 200     --q_max_size 300 \
    --save_first     \
    --num_block 5     --dim 64 \
    --weight_decay 0.0002    --opt_method sgd \
    --bn_momentum=0 --num_cooldown=50 \
    --expected_num_client 496 \
    --selfplay_init_num 0 --selfplay_update_num 0 \
    --eval_num_games 0 --selfplay_async \
    --lr 0.01    --momentum 0.9     1>> log.log 2>&1 &

Here is the client's script file:

root=./myserver game=elfgames.go.game model=df_pred model_file=elfgames.go.df_model3 \
stdbuf -o 0 -e 0 python ./selfplay.py \
    --T 1    --batchsize 128 \
    --dim0 64    --dim1 64    --gpu 0 \
    --keys_in_reply V rv    --mcts_alpha 0.03 \
    --mcts_epsilon 0.25    --mcts_persistent_tree \
    --mcts_puct 0.85    --mcts_rollout_per_thread 100 \
    --mcts_threads 8    --mcts_use_prior \
    --mcts_virtual_loss 5   --mode selfplay \
    --num_block0 5    --num_block1 5 \
    --num_games 32    --ply_pass_enabled 36 \
    --policy_distri_cutoff 20    --policy_distri_training_for_all \
    --port 1234 \
    --no_check_loaded_options0    --no_check_loaded_options1 \
    --replace_prefix0 resnet.module,resnet \
    --replace_prefix1 resnet.module,resnet \
    --resign_thres 0.0    --selfplay_timeout_usec 10 \
    --server_id myserver    --use_mcts \
    --use_fp160 --use_fp161 \
    --use_mcts_ai2 --verbose > selfplay.log 2>&1 &

Thank you!

jma127 commented 6 years ago

Hi @infinitycr, we recommend that you try the latest code in master (maybe use q_min_size 1), and dump the full server logs here if possible. Thanks!

infinitycr commented 6 years ago

@jma127 OK, we will try what you said and let you know as soon as we have a detailed log file. Thanks!

infinitycr commented 6 years ago

@jma127 When we use the latest code in master, the following problem occurred after we ran ./start_server.sh:

Traceback (most recent call last):
  File "./train.py", line 53, in <module>
    GC.GC.setInitialVersion(model_ver)
AttributeError: '_elfgames_go.GameContext' object has no attribute 'setInitialVersion'
Thu Jul 26 12:53:39 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
Destroying Reader ... 

jma127 commented 6 years ago

Related to https://github.com/pytorch/ELF/issues/71, we are pushing a fix shortly. Apologies for the inconvenience!

infinitycr commented 6 years ago

OK, we are testing with the previous code first.

jma127 commented 6 years ago

@infinitycr , I believe #72 should fix the issue with master. Please let me know otherwise 😄

infinitycr commented 6 years ago

@jma127 OK, I will let you know as soon as I have new information about it.

infinitycr commented 6 years ago

@jma127 We ran start_server.sh and start_client.sh simultaneously on one machine; the server is running, but there is a problem with the client:

Fri Jul 27 09:33:19 2018 Get actionable request: black_ver = 0, white_ver = -1, #addrs_to_reply: 32
[2018-07-27 09:33:19.850] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] Loading model from ./myserver/save-0.bin
[2018-07-27 09:33:19.850] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] replace_prefix for state dict: [['resnet.module', 'resnet']]
GuardedRecords::updateState[Fri Jul 27 09:33:29 2018] #states: 1[13:2:0:0,]  13
Traceback (most recent call last):
  File "./selfplay.py", line 151, in game_start
    args, root, ver, actor_name)
  File "./selfplay.py", line 82, in reload
    reload_model(model_loader, params, mi, actor_name, args)
  File "./selfplay.py", line 63, in reload_model
    model = model_loader.load_model(params)
  File "/home/carc/new_test/ELF/src_py/rlpytorch/model_loader.py", line 165, in load_model
    check_loaded_options=self.options.check_loaded_options)
  File "/home/carc/new_test/ELF/src_py/rlpytorch/model_base.py", line 147, in load
    self.load_state_dict(sd)
  File "/home/carc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
    Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var". 
    Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked". 
register actor_black for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7f3b2e2e09e8>
register actor_white for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7f3b2e2e0978>
Root: "./myserver"
In game start
No previous model loaded, loading from ./myserver
Warning! Same model, skip loading ./myserver/save-0.bin
Traceback (most recent call last):
  File "./selfplay.py", line 202, in <module>
    main()
  File "./selfplay.py", line 196, in main
    GC.run()
  File "/home/carc/new_test/ELF/src_py/elf/utils_elf.py", line 435, in run
    self._call(smem, *args, **kwargs)
  File "/home/carc/new_test/ELF/src_py/elf/utils_elf.py", line 398, in _call
    reply = self._cb[idx](picked, *args, **kwargs)
  File "./selfplay.py", line 131, in <lambda>
    lambda batch, e=e, stat=stat: actor(batch, e, stat))
  File "./selfplay.py", line 126, in actor
    reply = e.actor(batch)
  File "/home/carc/new_test/ELF/src_py/rlpytorch/trainer/trainer.py", line 95, in actor
    m = self.mi[self.actor_name]
  File "/home/carc/new_test/ELF/src_py/rlpytorch/model_interface.py", line 254, in __getitem__
    return self.models[key]
KeyError: 'actor_black'
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument

Here are log.log and selfplay.log (from start_server.sh and start_client.sh): log.log selfplay.log

Here are the script files: start_client.txt start_selfplay.txt start_server.txt

I noticed something strange in selfplay.log: our gcc version is 7.3.0, but the log shows 7.2.0.

Python version: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0]
PyTorch version: 2018.05.07
CUDA version 9.0.176

jma127 commented 6 years ago

Hi @infinitycr, sorry for the issues. Please also try #76 (explanation: there is an additional requirement based on the new PyTorch version and the parallelization of the initial convolution).
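
For anyone else hitting the Missing key(s)/Unexpected key(s) error above: the extra .module segment is what nn.DataParallel inserts when a wrapped submodule is saved. A generic workaround in the spirit of the --replace_prefix options from the client script is to rename the checkpoint keys before calling load_state_dict; this is only an illustrative sketch, not the actual fix in #76:

def replace_prefix(state_dict, old, new):
    # Return a copy of state_dict with every key prefix `old` renamed to `new`,
    # e.g. "init_conv.module.0.weight" -> "init_conv.0.weight".
    return {(new + k[len(old):]) if k.startswith(old) else k: v
            for k, v in state_dict.items()}

# Tiny demo using key names from the traceback above (values are placeholders):
sd = {"init_conv.module.0.weight": None, "init_conv.module.1.running_mean": None}
print(replace_prefix(sd, "init_conv.module", "init_conv"))
# In practice you would load the checkpoint with torch.load(...) and pass the
# remapped dict to model.load_state_dict(); the exact layout of ELF's
# save-*.bin files may nest the state dict, so adapt accordingly.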

infinitycr commented 6 years ago

@jma127 Following your suggestion, I have tried the latest version of ELF, but I ran into another problem. When I run start_client.sh, it stops at the beginning of selfplay. The information in selfplay.log is as follows:

[2018-09-04 20:10:26.554] [elfgames::go::train::GuardedRecords19] [info] GuardedRecords::DumpAndClear[Tue Sep  4 20:10:26 2018], #records: 0, #states: 0[]  
[2018-09-04 20:11:32.684] [elfgames::go::train::ThreadedWriterCtrl-13] [info] Tue Sep  4 20:11:32 2018, WriterCtrl: no message, seq=5, since_last_sec=0
[2018-09-04 20:11:32.705] [elfgames::go::train::ThreadedWriterCtrl-13] [info] Sleep for 10 sec .. 
[2018-09-04 20:11:42.706] [elfgames::go::train::ThreadedWriterCtrl-13] [info] Tue Sep  4 20:11:42 2018 In reply func: Message got. since_last_sec=10, seq=5, {"request":{"client_ctrl":{"async":true,"black_resign_thres":0.009999999776482582,"client_type":1,"never_resign_prob":0.10000000149011612,"num_game_thread_used":-1,"player_swap":false,"white_resign_thres":0.009999999776482582},"vers":{"black_ver":0,"mcts_opt":{"alg_opt":{"c_puct":0.8500000238418579,"root_unexplored_q_zero":false,"unexplored_q_zero":false,"use_prior":true},"log_prefix":"","max_num_moves":0,"num_rollouts_per_batch":1,"num_rollouts_per_thread":200,"num_threads":8,"persistent_tree":true,"pick_method":"most_visited","root_alpha":0.029999999329447746,"root_epsilon":0.25,"seed":0,"verbose":false,"verbose_time":false,"virtual_loss":5},"white_ver":-1}},"seq":5}
[2018-09-04 20:11:44.377] [elfgames::go::train::GuardedRecords19] [info] GuardedRecords::DumpAndClear[Tue Sep  4 20:11:44 2018], #records: 0, #states: 0[] 

Thank you!

alatyshe commented 6 years ago

Hi @infinitycr, I have the same problem with KeyError: 'actor_black'. How did you fix it? I am on the latest version of ELF and still hit this problem.