Open xphoniex opened 6 years ago
Can you please help with the sharedmem error? @jma127
Ok. I will take a look at this issue. Did you change BOARDSIZE to be 9?
q_min: the minimal size of the replay buffer before training starts.
q_max: the maximal size of the replay buffer (old experience will be discarded).
Note that both numbers are multiplied by 50 (num_readers), since there are 50 buffers for the sake of concurrency.
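To make the arithmetic concrete, here is a minimal sketch (the struct and its field names are hypothetical, not ELF's actual types) of how the effective thresholds follow from the per-queue sizes:

#include <cstdio>

// Hypothetical illustration of the note above: each of the num_reader
// queues keeps its own [q_min, q_max] window, so the totals across the
// whole replay buffer scale with the reader count.
struct ReplayBufferSpec {
  int q_min;       // per-queue minimum before training starts
  int q_max;       // per-queue maximum; older games are discarded
  int num_reader;  // number of concurrent buffers (50 by default)

  int effective_min() const { return q_min * num_reader; }
  int effective_max() const { return q_max * num_reader; }
};

int main() {
  // The settings used later in this thread, --q_min_size 1
  // --q_max_size 2 --num_reader 2, give totals of 2 and 4.
  ReplayBufferSpec spec{1, 2, 2};
  std::printf("min=%d max=%d\n", spec.effective_min(), spec.effective_max());
  return 0;
}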
@yuandong-tian I inserted #define BOARD9x9 1 at line 28 in src_cpp/elfgames/go/base/board.h and self-play is showing me 9x9 boards. I tried with 19x19 and I'm still getting the sharedmem error.
I have changed the queue parameters on the server:
--q_min_size 1 --q_max_size 2 --num_reader 2
This should make the server start training after 4 games, right?
On the client I have added:
--suicide_after_n_games 16
which makes the client quit after 16 games. Occasionally I don't even hit the sharedmem error on the server side and have to run the client one more time. Why is this?
As explained in the previous comment, this will set the replay buffer size to min=50 and max=100.
@qucheng Are you saying min/max is hard-coded somewhere in the code? Because I hit the sharedmem error after 6-8 games (usually). Regardless, my issue is not the minimum # of games required; it's the fact that my server hits the sharedmem error and won't train.
Can you try setting batchsize to 1? You might also try increasing #games a little. Batchsize needs to be much smaller than the number of games due to concurrency (ELF gathers different self-play games as they become available).
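As a rough illustration of that concurrency constraint (a toy model, not ELF's actual scheduler): with G games running, at most G positions are pending at any instant, so a collector waiting for a full batch of size B can only ever be satisfied when B <= G.

#include <cstdio>

// Toy model of the rule of thumb above: a batch of size `batchsize`
// can only fill if at least that many self-play games can contribute
// a sample at the same time.
static bool batch_can_fill(int num_concurrent_games, int batchsize) {
  return batchsize <= num_concurrent_games;
}

int main() {
  std::printf("%d\n", batch_can_fill(64, 1));  // 1: batchsize << #games, safe
  std::printf("%d\n", batch_can_fill(1, 64));  // 0: the collector starves
  return 0;
}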
I set batchsize to 1 and q_min to 5 (on both client and server), which is what I assume you meant by increasing #games (as opposed to num_games, which increases the # of threads). Same problem. log.log:
Python version: 3.6.5 (default, May 3 2018, 10:08:28)
[GCC 5.4.0 20160609]
PyTorch version: 0.4.1
CUDA version 9.2.148
Conda env:
[2018-09-18 00:56:24.983] [rlpytorch.model_loader.load_env] [info] Loading env
<module 'elfgames.go.game' from '/home/user/ELF/src_py/elfgames/go/game.py'> elfgames.go.game
<module 'elfgames.go.df_model3' from '/home/user/ELF/src_py/elfgames/go/df_model3.py'> elfgames.go.df_model3
[2018-09-18 00:56:25.088] [rlpytorch.model_loader.load_env] [info] Parsed options: {'T': 1,
'actor_only': False,
'adam_eps': 0.001,
'additional_labels': [],
'backprop': True,
'batchsize': 1,
'batchsize2': -1,
'black_use_policy_network_only': False,
'bn': True,
'bn_eps': 1e-05,
'bn_momentum': 0.0,
'cheat_eval_new_model_wins_half': False,
'cheat_selfplay_random_result': False,
'check_loaded_options': True,
'client_max_delay_sec': 1200,
'comment': '',
'data_aug': -1,
'dim': 224,
'dist_rank': -1,
'dist_url': '',
'dist_world_size': -1,
'dump_record_prefix': '',
'epsilon': 0.0,
'eval_model_pair': '',
'eval_num_games': 0,
'eval_old_model': -1,
'eval_winrate_thres': 0.55,
'expected_num_clients': 1,
'following_pass': False,
'freq_update': 1,
'gpu': 0,
'keep_prev_selfplay': True,
'keys_in_reply': ['V'],
'latest_symlink': 'latest',
'leaky_relu': False,
'list_files': [],
'load': '',
'load_model_sleep_interval': 0.0,
'loglevel': 'info',
'lr': 0.01,
'mcts_alpha': 0.03,
'mcts_epsilon': 0.25,
'mcts_persistent_tree': True,
'mcts_pick_method': 'most_visited',
'mcts_puct': 0.85,
'mcts_rollout_per_batch': 1,
'mcts_rollout_per_thread': 1,
'mcts_root_unexplored_q_zero': False,
'mcts_threads': 4,
'mcts_unexplored_q_zero': False,
'mcts_use_prior': True,
'mcts_verbose': False,
'mcts_verbose_time': False,
'mcts_virtual_loss': 5,
'mode': 'train',
'momentum': 0.9,
'move_cutoff': -1,
'num_block': 20,
'num_cooldown': 50,
'num_episode': 10,
'num_future_actions': 1,
'num_games': 1,
'num_games_per_thread': -1,
'num_minibatch': 2,
'num_reader': 2,
'num_reset_ranking': 5000,
'omit_keys': [],
'onload': [],
'opt_method': 'sgd',
'parameter_print': True,
'parsed_args': ['./train.py',
'--mode',
'train',
'--num_reader',
'2',
'--batchsize',
'1',
'--num_games',
'1',
'--keys_in_reply',
'V',
'--T',
'1',
'--use_data_parallel',
'--num_minibatch',
'2',
'--num_episode',
'10',
'--mcts_threads',
'4',
'--mcts_rollout_per_thread',
'1',
'--keep_prev_selfplay',
'--keep_prev_selfplay',
'--use_mcts',
'--use_mcts_ai2',
'--mcts_persistent_tree',
'--mcts_use_prior',
'--mcts_virtual_loss',
'5',
'--mcts_epsilon',
'0.25',
'--mcts_alpha',
'0.03',
'--mcts_puct',
'0.85',
'--resign_thres',
'0.01',
'--gpu',
'0',
'--server_id',
'myserver',
'--eval_num_games',
'1',
'--eval_winrate_thres',
'0.55',
'--port',
'1234',
'--q_min_size',
'5',
'--q_max_size',
'20',
'--save_first',
'--num_block',
'20',
'--dim',
'224',
'--weight_decay',
'0.0002',
'--opt_method',
'sgd',
'--bn_momentum=0',
'--num_cooldown=50',
'--expected_num_client',
'1',
'--selfplay_init_num',
'0',
'--selfplay_update_num',
'0',
'--eval_num_games',
'0',
'--selfplay_async',
'--lr',
'0.01',
'--momentum',
'0.9'],
'ply_pass_enabled': 0,
'policy_distri_cutoff': 0,
'policy_distri_training_for_all': False,
'port': 1234,
'preload_sgf': '',
'preload_sgf_move_to': -1,
'print_result': False,
'q_max_size': 20,
'q_min_size': 5,
'ratio_pre_moves': 0,
'record_dir': './record',
'replace_prefix': [],
'resign_thres': 0.01,
'sample_nodes': ['pi,a'],
'sample_policy': 'epsilon-greedy',
'save_dir': './myserver',
'save_first': True,
'save_prefix': 'save',
'selfplay_async': True,
'selfplay_init_num': 0,
'selfplay_timeout_usec': 0,
'selfplay_update_num': 0,
'server_addr': '',
'server_id': 'myserver',
'start_ratio_pre_moves': 0.5,
'store_greedy': False,
'suicide_after_n_games': -1,
'tqdm': False,
'trainer_stats': '',
'use_data_parallel': True,
'use_data_parallel_distributed': False,
'use_df_feature': False,
'use_fp16': False,
'use_mcts': True,
'use_mcts_ai2': True,
'verbose': False,
'weight_decay': 0.0002,
'white_mcts_rollout_per_batch': -1,
'white_mcts_rollout_per_thread': -1,
'white_puct': -1.0,
'white_use_policy_network_only': False}
Stats: Name is not known!
[2018-09-18 00:56:25.091] [rlpytorch.model_loader.load_env] [info] Finished loading env
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] JobId: local
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] #Game: 1
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] T: 1
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] [#th=4][rl=1][per=1][eps=0.25][alpha=0.03][prior=1][c_puct=0.85][uqz=0][r_uqz=0]
[2018-09-18 00:56:25.091] [elfgames::go::train::TrainCtrl-11] [info] Finished initializing replay_buffer #Queue: 2, spec: ReaderQueue: Queue [min=5][max=20], Length: 0, 0, Total: 0, MinSizeSatisfied: 0
[2018-09-18 00:56:25.111] [elfgames::go::train::DataOnlineLoader-17] [info] ZMQVer: 4.2.3 Reader[db=data-1537215985.db] [local] Connect to [::1]:1234, ipv6: True, verbose: False
[2018-09-18 00:56:25.111] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:25 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
**** Options ****
Seed: 0
Time signature: 180918-005625
Client max delay in sec: 1200
#FutureActions: 1
#GamePerThread: -1
mode: train
Selfplay init min #games: 0, update #games: 0, async: True
UseMCTS: True
Data Aug: -1
Start_ratio_pre_moves: 0.5
ratio_pre_moves: 0
MoveCutOff: -1
Use DF feature: False
PolicyDistriCutOff: 0
Expected #client: 1
Server_addr: [::1], server_id: myserver, port: 1234
#Reader: 2, Qmin_sz: 5, Qmax_sz: 20
Verbose: False
Policy distri training for all moves: False
Min Ply from which pass is enabled: 0
Reset move ranking after 5000 actions
Resign Threshold: 0.01, Dynamic Resign Threshold, resign_prob_never: 0.1, target_fp_rate: 0.05, bounded within [1e-09, 0.5]
Komi: 3.5
*****************
Version: a39e7dcdd12208a2e068d80f352948407176b219_unstaged
Mode: train
Num Actions: 82
train: {'input': ['s', 'offline_a', 'winner', 'mcts_scores', 'move_idx', 'selfplay_ver'], 'reply': None}
SharedMem: "train", keys: ['s', 'offline_a', 'move_idx', 'winner', 'mcts_scores', 'selfplay_ver']
s float [1, 18, 9, 9]
offline_a int64_t [1, 1]
move_idx int32_t [1]
winner float [1]
mcts_scores float [1, 82]
selfplay_ver int64_t [1]
s float [1, 18, 9, 9]
offline_a int64_t [1, 1]
move_idx int32_t [1]
winner float [1]
mcts_scores float [1, 82]
selfplay_ver int64_t [1]
train_ctrl: {'input': ['selfplay_ver'], 'reply': None, 'batchsize': 1}
SharedMem: "train_ctrl", keys: ['selfplay_ver']
selfplay_ver int64_t [1]
selfplay_ver int64_t [1]
[2018-09-18 00:56:27.860] [elfgames::go::train::ThreadedCtrl-13] [info] Setting init version: 0
[2018-09-18 00:56:27.860] [elfgames::go::train::EvalSubCtrl-15] [info] Set new baseline model, ver: 0
[2018-09-18 00:56:27.860] [elfgames::go::train::SelfPlaySubCtrl-14] [info] SelfPlay: -1 -> 0
Root: "./myserver"
Keep prev_selfplay: True
Save first:
Save to ./myserver
Filename = ./myserver/save-0.bin
About to wait for sufficient selfplay
[2018-09-18 00:56:28.018] [elfgames::go::train::ThreadedCtrl-13] [info] Tue Sep 18 00:56:28 2018, Sufficient sample for model 0
[2018-09-18 00:56:35.111] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:35 2018 Ctrl from local-user-A-e738-d962-2541-f1b9[1]: 1537215985
[2018-09-18 00:56:35.112] [elfgames::go::train::TrainCtrl-11] [info] New allocated: local-user-A-e738-d962-2541-f1b9, Clients[1][#max_eval=-1][#max_th=1][#client_delay=1200], SelfplayOnly[1/100%], EvalThenSelfplay[0/0%]
[2018-09-18 00:56:35.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:35 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
[2018-09-18 00:56:45.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:45 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
[2018-09-18 00:56:55.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:55 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
[2018-09-18 00:57:05.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:05 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
[2018-09-18 00:57:15.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:15 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
[2018-09-18 00:57:25.113] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:25 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
[2018-09-18 00:57:35.113] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:35 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ...
[2018-09-18 00:57:45.113] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:45 2018, Reader: no message, Stats: 1/0/0, wait for 10 sec ...
[2018-09-18 00:57:55.177] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:55 2018, Reader: no message, Stats: 2/0/0, wait for 10 sec ...
[2018-09-18 00:58:05.200] [elf::distributed::Reader-21] [info] Tue Sep 18 00:58:05 2018, Reader: no message, Stats: 3/0/0, wait for 10 sec ...
[2018-09-18 00:58:15.200] [elf::distributed::Reader-21] [info] Tue Sep 18 00:58:15 2018, Reader: no message, Stats: 3/0/0, wait for 10 sec ...
[2018-09-18 00:58:25.218] [elf::distributed::Reader-21] [info] Tue Sep 18 00:58:25 2018, Reader: no message, Stats: 4/0/0, wait for 10 sec ...
[2018-09-18 00:58:28.028] [elf::base::SharedMem-87] [info] Error: active_batch_size = 0, max_batch_size: 1, min_batch_size: 1, #msg count: 0
python3: /home/user/ELF/src_cpp/elf/base/sharedmem.h:156: void elf::SharedMem::waitBatchFillMem(elf::Server*): Assertion `false' failed.
The sharedmem error is caused by ELF/src_cpp/elf/comm/broadcast.h, line 122:

if ((int)(message.data.size() + data_count) > opt.batchsize) {
  unpop_msg(message);
  break;
}

Data is being sent in batches of size 64 (I still don't know where the 64 comes from), which is bigger than our opt.batchsize of 1. Setting the batchsize on the server to 64 solved the issue. I don't think it should break here if data_count == 0.
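One way to express that suggestion as a patch (an untested sketch of the idea, not a change that has been merged into ELF):

// broadcast.h, line 122 (sketch): only defer the message when the batch
// already holds some data. If data_count == 0, an oversized message is
// accepted rather than leaving the batch empty forever.
if (data_count > 0 &&
    (int)(message.data.size() + data_count) > opt.batchsize) {
  unpop_msg(message);
  break;
}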
I ran into another issue: the client can't replace the model with the new one:
[2018-09-20 20:02:57.430] [elfgames::go::common::DispatcherCallback-12] [info] Thu Sep 20 20:02:57 2018 Received actionable request: black_ver = 1, white_ver = -1, #addrs_to_reply: 1
In game start
[2018-09-20 20:02:57.875] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] Loading model from ./myserver/save-1.bin
[2018-09-20 20:02:57.875] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] replace_prefix for state dict: [['resnet.module', 'resnet'], ['init_conv.module', 'init_conv']]
[2018-09-20 20:02:58.023] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] Finished loading model from ./myserver/save-1.bin
/pytorch/aten/src/THC/THCTensorRandom.cuh:185: void sampleMultinomialOnce(long *, long, int, T *, T *, int, int) [with T = __half, AccT = float]: block: [0,0,0], thread: [64,0,0] Assertion `THCNumerics<T>::ge(val, zero)` failed.
(the same sampleMultinomialOnce assertion failure repeats for threads [65,0,0] through [81,0,0] and [0,0,0] through [63,0,0])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "./selfplay.py", line 202, in <module>
    main()
  File "./selfplay.py", line 196, in main
    GC.run()
  File "/home/user/ELF/src_py/elf/utils_elf.py", line 435, in run
    self._call(smem, *args, **kwargs)
  File "/home/user/ELF/src_py/elf/utils_elf.py", line 398, in _call
    reply = self._cb[idx](picked, *args, **kwargs)
  File "./selfplay.py", line 131, in <lambda>
    lambda batch, e=e, stat=stat: actor(batch, e, stat))
  File "./selfplay.py", line 126, in actor
    reply = e.actor(batch)
  File "/home/user/ELF/src_py/rlpytorch/trainer/trainer.py", line 101, in actor
    reply_msg = self.sampler.sample(state_curr)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sampler.py", line 56, in sample
    actions[a_node] = sampler(state_curr, self.options, node=pi_node)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sample_methods.py", line 125, in sample_multinomial
    return sample_eps_with_check(probs, args.epsilon, greedy=greedy)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sample_methods.py", line 74, in sample_eps_with_check
    actions = sample_with_check(probs, greedy=greedy)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sample_methods.py", line 43, in sample_with_check
    cond1 = (actions < 0).sum()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generated/../THCReduceAll.cuh:317
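For what it's worth, the failed assertion THCNumerics<T>::ge(val, zero) means every probability handed to the multinomial sampler must be non-negative, and the T = __half template argument in the log indicates the tensor was in half precision, where over/underflow can easily produce NaN or negative entries. A host-side restatement of the invariant being violated (illustrative only, not PyTorch code):

#include <cmath>
#include <vector>

// The GPU kernel asserts, per entry, what this function checks on the
// host: a categorical distribution must contain no NaN or negative
// probabilities and must have positive total mass.
static bool valid_probs(const std::vector<float>& probs) {
  float total = 0.0f;
  for (float p : probs) {
    if (std::isnan(p) || p < 0.0f) return false;  // the failing assertion
    total += p;
  }
  return total > 0.0f;
}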
A polite reminder that this issue is still open. Please read my previous comment. @qucheng @yuandong-tian
I'm trying to train on a 9x9 board using just a few self-plays, to see the whole process (self-play -> training the NN on self-play results -> self-playing with the new NN) on my own machine. Can someone please explain what these parameters do?

edit: apparently num_reader is the number of queues we share with the trainer, and qsize is the min/max number of games for training to get started.

Also, what is the difference between start_client.sh and start_server.sh, and why are some parameters repeated in start_server.sh, like eval_num_games?

My client simply doesn't stop self-playing, and the server has an error.