jkulhanek / visual-navigation-agent-pytorch

Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning implemented in PyTorch
MIT License
63 stars 18 forks

program stop at load_state_dict() #2

Open GELIELEO opened 4 years ago

GELIELEO commented 4 years ago

I tried to run the program and it stopped at the function _sync_network(). I found that it actually hangs at the load_state_dict() call inside _sync_network(), but I cannot solve it. There is no error message.

jkulhanek commented 4 years ago

What OS, Python, and torch versions do you use? Can you dump a stack trace? Do you run it in debug mode (in that case I would need a more precise description of your configuration)?

GELIELEO commented 4 years ago

@jkulhanek Thank you for your reply. I tried PyTorch 1.1 and 1.3 on Linux with Python 3.6.9, and there is no error or warning message. I used the code from load_state_dict to test load_state_dict and it works.
Also, I don't know what debug mode is. I just use the command 'python3 train.py' to run the program.

jkulhanek commented 4 years ago

OK, can you try replacing the code that spawns multiple processes with a single function call? You can do that by changing line 229 in the train.py file. This way we can check whether the problem is with multiprocessing.
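For readers following along, the idea can be sketched as below. This is not the code from train.py: `Worker` and `launch` are hypothetical names, and the only real API used is `multiprocessing.Process` with its `start()`, `join()`, and `run()` methods. Calling `run()` directly executes the worker body in the current process, so any exception or hang shows up in an ordinary single-process stack trace.

```python
import multiprocessing as mp


class Worker(mp.Process):
    """Hypothetical stand-in for one training worker (not the repo's class)."""

    def __init__(self, rank):
        super().__init__()
        self.rank = rank
        self.result = None

    def run(self):
        # Placeholder for the real training loop.
        self.result = self.rank * 2


def launch(workers, in_process):
    """Run workers either in this process or in child processes."""
    if in_process:
        # Plain method calls: no child process is created, so a hang here
        # cannot be caused by the multiprocessing machinery.
        for w in workers:
            w.run()
    else:
        # Normal path: each start() creates a child process that calls run().
        # Attributes set inside run() then live in the child, not the parent.
        for w in workers:
            w.start()
        for w in workers:
            w.join()
```

If the in-process variant runs to completion while the multi-process variant hangs, the problem most likely lies in how the processes are started or how they share state.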

GELIELEO commented 4 years ago

@jkulhanek I tried thread.run() to launch the program and it succeeded. So I tried this program with torch 0.4.1: the policy_network succeeded in loading the state_dict, but the program stopped at (policy, value) = policy_network(...). All the problems I found involve policy_network, but there is still no error message :(

jkulhanek commented 4 years ago

OK, then it might be a problem with multiprocessing causing a deadlock. Can you please follow the instructions here: http://code.activestate.com/recipes/577334-how-to-debug-deadlocked-multi-threaded-programs/ to dump the stack trace and upload the result here?
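A minimal sketch of the same idea, using only the standard library, is shown here. It is not the linked recipe itself; `format_all_stacks` is a hypothetical helper name. It relies on `sys._current_frames()`, which returns the current frame of every thread in the calling process only, so it has to run inside the process that hangs.

```python
import sys
import threading
import traceback


def format_all_stacks():
    """Return a text dump of the current stack of every thread in this process."""
    # Map thread ids to names so the dump is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(
            "ThreadID: {} ({})".format(thread_id, names.get(thread_id, "unknown"))
        )
        # format_stack returns one string per frame, outermost frame first.
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)
```

Calling this periodically from a background thread and writing the result to a file gives output similar to what the recipe produces. The standard `faulthandler` module (`faulthandler.dump_traceback_later`) is another option for a process that hangs without any error message.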

GELIELEO commented 4 years ago

@jkulhanek It seems to be stuck waiting on a child PID (os.waitpid):

ThreadID: 140068893906688

File: "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
  self._bootstrap_inner()
File: "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
  self.run()
File: "/home/usr/ws/visual-navigation-agent-pytorch/stacktracer.py", line 64, in run
  self.stacktraces()
File: "/home/usr/ws/visual-navigation-agent-pytorch/stacktracer.py", line 78, in stacktraces
  fout.write(stacktraces())
File: "/home/usr/ws/visual-navigation-agent-pytorch/stacktracer.py", line 26, in stacktraces
  for filename, lineno, name, line in traceback.extract_stack(stack):

ThreadID: 140068973619008

File: "/usr/lib/python3.6/multiprocessing/util.py", line 319, in _exit_function
  p.join()
File: "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
  res = self._popen.wait(timeout)
File: "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
  return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File: "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
  pid, sts = os.waitpid(self.pid, flag)

GELIELEO commented 4 years ago

@jkulhanek When I release the code mp.set_start_method("spawn"), the program runs, but it cannot be stopped when I press Ctrl+C. I also think this operation may have negative impact on the calculation result.

GELIELEO commented 4 years ago

@jkulhanek Any ideas, please?

jkulhanek commented 4 years ago

What is the problem? What do you mean by "have negative impact on the calculation result"? The reason you are not able to stop the program is that it uses multiprocessing, and I do not propagate the signal to the other processes. It can be solved by catching signals and killing the processes, but that was not a concern at the time of writing the code.
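One way this could look is sketched below. It is an illustration under stated assumptions, not code from this repository: `idle_worker`, `terminate_all`, and `supervise` are hypothetical names. The children ignore SIGINT so that only the parent reacts to Ctrl+C, and the parent then terminates every child that is still alive.

```python
import multiprocessing as mp
import signal
import time


def idle_worker():
    """Hypothetical worker body standing in for a training loop."""
    # Ignore Ctrl+C in the child; the parent decides when to stop it.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    while True:
        time.sleep(0.1)


def terminate_all(processes, timeout=5.0):
    """Terminate every live child, then wait up to `timeout` seconds for each."""
    live = [p for p in processes if p.is_alive()]
    for p in live:
        p.terminate()  # sends SIGTERM on POSIX systems
    for p in live:
        p.join(timeout)


def supervise(processes):
    """Wait for already-started children; on Ctrl+C, shut them all down."""
    try:
        for p in processes:
            p.join()
    except KeyboardInterrupt:
        terminate_all(processes)
```

In a launcher script, the processes would be created and started under an `if __name__ == "__main__":` guard and then passed to `supervise`. Note that `terminate()` does not let a child run cleanup code, so any state that must be saved should be written before shutdown.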