Closed. z619850002 closed this issue 6 years ago.
Which GPU are you running it on?
There are 2 TITAN X GPUs on my server: NVIDIA Corporation GP102 [TITAN X].
Did you specify `--gpu 0`?
Yeah, I just downloaded the source code and built it without any modification except the server address.
I also logged `self.option` in `load_model`. When I start the server, the info is as below:

```
Namespace(bn=True, bn_eps=1e-05, bn_momentum=0.0, check_loaded_options=True, dim=224, dist_rank=-1, dist_url='', dist_world_size=-1, gpu=0, leaky_relu=False, load='', load_model_sleep_interval=0.0, num_block=20, omit_keys=[], onload=[], replace_prefix=[], use_data_parallel=True, use_data_parallel_distributed=False, use_fp16=False)
```

But when I start the client, `gpu` turns into -1:

```
Namespace(bn=True, bn_eps=1e-05, bn_momentum=0.1, check_loaded_options=False, dim=224, dist_rank=-1, dist_url='', dist_world_size=-1, gpu=-1, leaky_relu=False, load='./myserver/save-0.bin', load_model_sleep_interval=0.0, num_block=20, omit_keys=[], onload=[], replace_prefix=['resnet.module,resnet', 'init_conv.module,init_conv'], use_data_parallel=False, use_data_parallel_distributed=False, use_fp16=True)
```

This may be the reason, in my opinion.
Cannot reproduce on my end. Can you check why `--gpu 0` is not passed through? The default is -1, which will cause a problem.
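For reference, a minimal sketch of the mechanism being described. The flag name `--gpu` and its default of -1 match the `Namespace` dumps above; the fail-fast guard is my own addition, not ELF code:

```python
import argparse

import torch

parser = argparse.ArgumentParser()
# Default of -1 matches the Namespace dumps above: if --gpu is never
# passed through to the client, args.gpu stays -1.
parser.add_argument("--gpu", type=int, default=-1)
args = parser.parse_args()

# Hypothetical guard: fail fast instead of letting a negative index
# reach Module.cuda(), which raises "Device index must not be negative".
if args.gpu < 0:
    raise SystemExit("--gpu was not passed through; refusing to call .cuda(-1)")

device = torch.device("cuda", args.gpu)
print(f"using device {device}")
```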
I run the client and the server on one computer, so there may be some conflicts. After I added an argument like `-gpu0=1` to the client code, the error disappeared. But the program still cannot run normally and always makes the computer get stuck, even when I reduce the batch size and some other parameters. So is it that the server program and the client can't run on the same computer?
It should be able to run on the same GPU. What is the error?
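One way to rule out contention when both processes share a machine with two GPUs is to pin each process to its own card via `CUDA_VISIBLE_DEVICES`. A general sketch, not ELF-specific; the variable must be set before CUDA is initialized:

```python
import os

# Pin this process to the second physical GPU; inside the process it is
# then visible as device 0, so --gpu 0 still works. Set this before the
# first CUDA call (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())      # 1: only the pinned card is visible
print(torch.cuda.get_device_name(0))  # e.g. "TITAN X (Pascal)"
```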
I think maybe when I run both programs on one computer, it runs out of memory and the I/O gets stuck. It may be a hardware issue.
Thank you very much for your help @qucheng
You can reduce the number of games to reduce memory usage. Hope it helps.
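If the memory hypothesis is right, it can be checked directly from inside the process. A sketch using PyTorch's allocator counters (available in the 0.4.x series the reporter is running):

```python
import torch

def report(gpu=0):
    """Print current and peak GPU memory allocated by this process.

    Call periodically (e.g. after each batch) to see whether usage
    keeps climbing toward the card's capacity.
    """
    alloc = torch.cuda.memory_allocated(gpu) / 1024**2
    peak = torch.cuda.max_memory_allocated(gpu) / 1024**2
    print(f"gpu {gpu}: {alloc:.0f} MiB allocated, {peak:.0f} MiB peak")
```

`nvidia-smi` gives the same picture from outside the process, including the server's and client's footprints side by side.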
I have already cloned the repository and finished all the build steps following the guide. `make test` also passed. But when I start the server and the client, the client outputs the error message below:

```
No previous model loaded, loading from ./myserver
Traceback (most recent call last):
  File "./selfplay.py", line 151, in game_start
    args, root, ver, actor_name)
  File "./selfplay.py", line 82, in reload
    reload_model(model_loader, params, mi, actor_name, args)
  File "./selfplay.py", line 63, in reload_model
    model = model_loader.load_model(params)
  File "/home/zhangtianjun/ElfFramework/ELF/src_py/rlpytorch/model_loader.py", line 144, in load_model
    model = self.model_class(self.option_map_for_model, params)
  File "/home/zhangtianjun/ElfFramework/ELF/src_py/elf/options/import_options.py", line 33, in __call__
    return fn(self, option_map, *args, **kwargs)
  File "/home/zhangtianjun/ElfFramework/ELF/src_py/elfgames/go/df_model3.py", line 201, in __init__
    self.init_conv.cuda(self.options.gpu)
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 191, in _apply
    param.data = fn(param.data)
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: Device index must not be negative
```
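The failing frame is `self.init_conv.cuda(self.options.gpu)` with `gpu=-1`, and the error can be reproduced in isolation; on my understanding the device index is validated when the device object is constructed, so a GPU is not even needed to trigger it:

```python
import torch

# Constructing a CUDA device with a negative index fails the same way
# the client does when --gpu is left at its default of -1:
torch.device("cuda", -1)
# RuntimeError: Device index must not be negative
```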
Here are the details of my environment:

- Python version: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
- PyTorch version: 0.4.1
- CUDA version: 9.0.176
- OS: Ubuntu 18.04.1 LTS
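For anyone filing a similar report, the same details can be collected with a short snippet:

```python
import platform

import torch

print("Python:", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```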