leoxiaobin / deep-high-resolution-net.pytorch

The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"
https://jingdongwang2017.github.io/Projects/HRNet/PoseEstimation.html
MIT License
4.31k stars 908 forks

writer error #98

Open ooFormalinoo opened 5 years ago

ooFormalinoo commented 5 years ago

There are some errors when I am training. It seems to be because of tensorboardX. However, I have installed the latest tensorboardX with the command pip install tensorboardX.

=> init weights from normal distribution
=> loading pretrained model models/pytorch/imagenet/hrnet_w32-36af842e.pth
Traceback (most recent call last):
  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 111, in main
    writer_dict['writer'].add_graph(model, (dump_input, ))
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/tensorboardX/writer.py", line 738, in add_graph
    self._get_file_writer().add_graph(graph(model, input_to_model, verbose, **kwargs))
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/tensorboardX/pytorch_graph.py", line 240, in graph
    trace = torch.jit.trace(model, args)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 772, in trace
    check_tolerance, _force_outplace, _module_class)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 898, in trace_module
    module = make_module(mod, _module_class, _compilation_unit)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 669, in make_module
    return _module_class(mod, _compilation_unit=_compilation_unit)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/home/kjq/.conda/envs/hrnet/lib/python3.7/site-packages/torch/jit/__init__.py", line 1855, in __init__
    assert(isinstance(orig, torch.nn.Module))
AssertionError

ahanahaner commented 5 years ago

I have the same problem as you. Have you solved it?

ooFormalinoo commented 5 years ago

> I have the same problem as you. Have you solved it?

I solved it by changing the tensorboardX version. It seems to work with PyTorch 1.0 and tensorboardX 1.6.

ahanahaner commented 5 years ago

Thank you very much for your reply.

YibinXie commented 4 years ago

I encountered this problem when I ran this code. I then switched torch from version 1.2.0 to 1.0.0 and torchvision from 0.4.0 to 0.2.0, keeping tensorboardX at version 1.6.0 as the author requires, and the problem was solved. So I think it is caused by incompatible versions of tensorboardX and PyTorch.
writer_dict['writer'].add_graph(model, (dump_input, ))
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/blob/015c946696abe02749f90b33a27f63652ea06a16/tools/train.py#L111
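For anyone who cannot change versions, one possible workaround is to treat graph logging as optional, since it only feeds the TensorBoard visualization. The following is a minimal sketch, not the repository's code: the helper name try_add_graph is hypothetical, and it assumes only that the writer object has an add_graph method like the one called in tools/train.py.

```python
import logging

logger = logging.getLogger(__name__)


def try_add_graph(writer, model, dummy_input):
    """Log the model graph if possible; return True on success.

    Hypothetical helper: a failure inside the tracer is logged and
    skipped instead of aborting the whole training script.
    """
    try:
        writer.add_graph(model, (dummy_input,))
        return True
    except Exception as exc:  # tracer failures vary by version, e.g. AssertionError
        logger.warning("Skipping add_graph: %r", exc)
        return False
```

In tools/train.py this would stand in for the direct call, e.g. try_add_graph(writer_dict['writer'], model, dump_input). It hides the symptom only; pinning compatible versions is still the fix reported in this thread.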

Muhtasham commented 4 years ago

@YibinXie97 I did as you said, but I am getting the error below. Can you please help? Should I comment out writer_dict['writer'].add_graph(model, (dump_input, ))?

  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 115, in main
    model = torch.nn.DataParallel(model, device_ids=cfg.GPUS).cuda()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 131, in __init__
    _check_balance(self.device_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 18, in _check_balance
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 18, in <listcomp>
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 301, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

I'm working on Google Colab.

U-C-J commented 4 years ago

Hello, I hit the same error message. I tried many things and ran into other bugs as well. I finally got it running by doing the following. I am new to Linux, so if you find anything wrong, please let me know.

  1. Check the CUDA version and the environment variable settings. Open lib/nms/linux_setup.py and check that CUDA and nvcc are found correctly: (1) os.environ['CUDAHOME'] must be set, and (2) nvcc = pjoin(home, 'bin', 'nvcc') means that nvcc should be located at $CUDAHOME/bin/nvcc.

  2. When you run the testing and training code provided in the tutorial, make sure you change the GPU setting in the yaml file from GPUS: (0,1,2,3) to your own GPU device numbers. I have two GPUs, so I changed it to GPUS: (0,1). This removes the error AssertionError: Invalid device id.

  3. Use the following versions: pytorch==1.0.1 torchvision==0.2.2 tensorboardX==1.6 Pillow==5.2.0
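The pins in step 3 can be compared against the current environment before training starts. This is a rough sketch under a few assumptions: it uses importlib.metadata, which needs Python 3.8 or newer (older interpreters would need the importlib_metadata backport), it looks PyTorch up under its PyPI distribution name "torch", and the helper names are made up for illustration.

```python
from importlib import metadata  # Python 3.8+

# Pins copied from step 3 above; "torch" is the PyPI name for PyTorch.
PINS = {
    "torch": "1.0.1",
    "torchvision": "0.2.2",
    "tensorboardX": "1.6",
    "Pillow": "5.2.0",
}


def _installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pin):
    """True if `installed` equals `pin` or extends it (e.g. 1.0.1.post2 for 1.0.1)."""
    if installed is None:
        return False
    parts = installed.split("+")[0].split(".")  # drop local tags such as +cu100
    wanted = pin.split(".")
    return parts[: len(wanted)] == wanted


def find_mismatches(pins=PINS, lookup=_installed_version):
    """Return {package: (installed, pinned)} for every package that is off-pin."""
    mismatches = {}
    for name, pin in pins.items():
        installed = lookup(name)
        if not matches_pin(installed, pin):
            mismatches[name] = (installed, pin)
    return mismatches


if __name__ == "__main__":
    for name, (installed, pin) in find_mismatches().items():
        print(f"{name}: installed {installed}, expected {pin}")
```

The comparison is a component-wise prefix match, so a build such as 1.0.1.post2 counts as satisfying the 1.0.1 pin, while 1.2.0 does not.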

Now I have tried the testing and training code for MPII, and it seems to run correctly.
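Steps 1 and 2 can also be checked up front. The sketch below is illustrative only: both function names are hypothetical, the CUDAHOME lookup mirrors the two lines quoted in step 1, and in practice device_count would come from torch.cuda.device_count().

```python
import os


def expected_nvcc_path(environ=os.environ):
    """Return the nvcc path implied by CUDAHOME, or None if CUDAHOME is unset."""
    home = environ.get("CUDAHOME")
    if not home:
        return None
    return os.path.join(home, "bin", "nvcc")


def invalid_gpu_ids(requested, device_count):
    """Return the requested device ids that do not exist on this machine."""
    return [i for i in requested if not 0 <= i < device_count]
```

On a machine with two GPUs, invalid_gpu_ids((0, 1, 2, 3), 2) returns [2, 3], which are the ids that would trigger AssertionError: Invalid device id with the default GPUS: (0,1,2,3) setting.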