Closed. z619850002 closed this issue 6 years ago.
Which GPU are you running it on?
There are 2 TITAN X GPUs on my server: NVIDIA Corporation GP102 [TITAN X].
Did you specify `--gpu 0`?
Yeah, I just downloaded the source code and built it without any modification except the server address.
I also logged `self.option` in `load_model`. When I start the server, the info is as below:

```
Namespace(bn=True, bn_eps=1e-05, bn_momentum=0.0, check_loaded_options=True, dim=224, dist_rank=-1, dist_url='', dist_world_size=-1, gpu=0, leaky_relu=False, load='', load_model_sleep_interval=0.0, num_block=20, omit_keys=[], onload=[], replace_prefix=[], use_data_parallel=True, use_data_parallel_distributed=False, use_fp16=False)
```

But when I start the client, `gpu` turns into -1:

```
Namespace(bn=True, bn_eps=1e-05, bn_momentum=0.1, check_loaded_options=False, dim=224, dist_rank=-1, dist_url='', dist_world_size=-1, gpu=-1, leaky_relu=False, load='./myserver/save-0.bin', load_model_sleep_interval=0.0, num_block=20, omit_keys=[], onload=[], replace_prefix=['resnet.module,resnet', 'init_conv.module,init_conv'], use_data_parallel=False, use_data_parallel_distributed=False, use_fp16=True)
```

This may be the reason, in my opinion.
Cannot reproduce on my end. Can you check why `--gpu 0` is not passed through? The default is -1, which will cause a problem.
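For reference, a minimal sketch of the mechanism being described. The flag name `--gpu` and its default of -1 match the `Namespace` dumps above; the fail-fast guard is my own addition, not ELF code:

```python
import argparse

import torch

parser = argparse.ArgumentParser()
# Default of -1 matches the Namespace dumps above: if --gpu is never
# passed through to the client, args.gpu stays -1.
parser.add_argument("--gpu", type=int, default=-1)
args = parser.parse_args()

# Hypothetical guard: fail fast instead of letting a negative index
# reach Module.cuda(), which raises "Device index must not be negative".
if args.gpu < 0:
    raise SystemExit("--gpu was not passed through; refusing to call .cuda(-1)")

device = torch.device("cuda", args.gpu)
print(f"using device {device}")
```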
I run the client and the server on one computer, so there may be some conflicts. After I added an argument like `-gpu0=1` to the client code, the error disappeared. But the program still cannot run normally and always makes the computer get stuck, even when I reduce the batch size and some other parameters. So is it that the server program and the client can't run on the same computer?
It should be able to run on the same GPU. What is the error?
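One way to rule out contention when both processes share a machine with two GPUs is to pin each process to its own card via `CUDA_VISIBLE_DEVICES`. A general sketch, not ELF-specific; the variable must be set before CUDA is initialized:

```python
import os

# Pin this process to the second physical GPU; inside the process it is
# then visible as device 0, so --gpu 0 still works. Set this before the
# first CUDA call (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())      # 1: only the pinned card is visible
print(torch.cuda.get_device_name(0))  # e.g. "TITAN X (Pascal)"
```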
I think maybe when I run both programs on one computer, it runs out of memory and the I/O gets stuck. It may be a hardware issue.
Thank you very much for your help @qucheng
You can reduce the number of games to reduce memory usage. Hope it helps.
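If the memory hypothesis is right, it can be checked directly from inside the process. A sketch using PyTorch's allocator counters (available in the 0.4.x series the reporter is running):

```python
import torch

def report(gpu=0):
    """Print current and peak GPU memory allocated by this process.

    Call periodically (e.g. after each batch) to see whether usage
    keeps climbing toward the card's capacity.
    """
    alloc = torch.cuda.memory_allocated(gpu) / 1024**2
    peak = torch.cuda.max_memory_allocated(gpu) / 1024**2
    print(f"gpu {gpu}: {alloc:.0f} MiB allocated, {peak:.0f} MiB peak")
```

`nvidia-smi` gives the same picture from outside the process, including the server's and client's footprints side by side.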
I have already cloned the repository and finished all the build steps following the guide. `make test` also passed. But when I start the server and the client, the client outputs the error message below:

```
No previous model loaded, loading from ./myserver
Traceback (most recent call last):
  File "./selfplay.py", line 151, in game_start
    args, root, ver, actor_name)
  File "./selfplay.py", line 82, in reload
    reload_model(model_loader, params, mi, actor_name, args)
  File "./selfplay.py", line 63, in reload_model
    model = model_loader.load_model(params)
  File "/home/zhangtianjun/ElfFramework/ELF/src_py/rlpytorch/model_loader.py", line 144, in load_model
    model = self.model_class(self.option_map_for_model, params)
  File "/home/zhangtianjun/ElfFramework/ELF/src_py/elf/options/import_options.py", line 33, in __call__
    return fn(self, option_map, *args, **kwargs)
  File "/home/zhangtianjun/ElfFramework/ELF/src_py/elfgames/go/df_model3.py", line 201, in __init__
    self.init_conv.cuda(self.options.gpu)
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 191, in _apply
    param.data = fn(param.data)
  File "/home/zhangtianjun/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: Device index must not be negative
```
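The failing frame is `self.init_conv.cuda(self.options.gpu)` with `gpu=-1`, and the error can be reproduced in isolation; on my understanding the device index is validated when the device object is constructed, so a GPU is not even needed to trigger it:

```python
import torch

# Constructing a CUDA device with a negative index fails the same way
# the client does when --gpu is left at its default of -1:
torch.device("cuda", -1)
# RuntimeError: Device index must not be negative
```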
Here are the details of my environment:

- Python version: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
- PyTorch version: 0.4.1
- CUDA version: 9.0.176
- OS: Ubuntu 18.04.1 LTS
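For anyone filing a similar report, the same details can be collected with a short snippet:

```python
import platform

import torch

print("Python:", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```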