jkulhanek / robot-visual-navigation

Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning: Official Implementation
MIT License

train problem #4

Closed. rokhmatf closed this issue 2 years ago.

rokhmatf commented 2 years ago

After I successfully installed the DeepMind Lab package, dmhouse, and deep-rl-pytorch, I ran the command python train.py dmhouse.

However, I ran into the following error and I don't know how to handle it:

Traceback (most recent call last):
  File "train.py", line 41, in <module>
    import trainer as _  # noqa:F401
  File "/home/rokhmat/tesis/robot-visual-navigation/python/trainer.py", line 6, in <module>
    from deep_rl.common.env import RewardCollector, TransposeImage
  File "/home/rokhmat/tesis/robot-visual-navigation/deep-rl-pytorch/deep_rl/common/env.py", line 170, in <module>
    class VecTransposeImage(gym.vector.vector_env.VectorEnvWrapper):
AttributeError: module 'gym.vector.vector_env' has no attribute 'VectorEnvWrapper'
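
As a quick check (just a sketch, assuming only that gym is importable), you can print the installed gym version and see whether it still exposes the class that deep_rl's VecTransposeImage wrapper expects:

import gym
import gym.vector.vector_env as vector_env

# Newer gym releases reorganized the vector API, so this attribute may be missing.
print("gym version:", gym.__version__)
print("has VectorEnvWrapper:", hasattr(vector_env, "VectorEnvWrapper"))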

On another PC, in a fresh virtual environment, I installed an older version of gym and ran the same command (python train.py dmhouse), but I got a different error, shown below:

(tesis) rokhmat@b401:~/tesis/robot-visual-navigation/python$ python train.py dmhouse
Registering trainer turtlebot
Registering trainer turtlebot-end
Registering trainer turtlebot-noprior
Registering trainer turtlebot-unreal
Registering trainer turtlebot-unreal-noprior
Registering trainer turtlebot-a2c
Registering trainer turtlebot-a2c-noprior
Registering trainer dmhouse-a2c
Registering trainer dmhouse-unreal
Registering trainer dmhouse-ppo
Registering trainer dmhouse-ppo-a2cvn
Registering trainer turtlebot-ppo-a2cvn
Registering trainer dmhouse-ppo-unreal
Registering trainer dmhouse-dqn
Registering agent dmhouse-dqn
Registering trainer dmhouse
Registering agent turtlebot-noprior
Registering agent turtlebot-end
Registering agent turtlebot
Registering agent dmhouse
Registering agent dmhouse-a2c
Registering agent dmhouse-ppo
Registering agent dmhouse-unreal
Registering agent turtlebot-a2c
Registering agent turtlebot-unreal
Registering agent turtlebot-a2c-noprior
Registering agent turtlebot-unreal-noprior
starting dmhouse
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
setting iterationsWithSameMap to 50
================================================================
Total params: 4,892,691
Trainable params: 4,892,691
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.07
Params size (MB): 18.66
================================================================
Using CPU only
ERROR: Received the following error from Worker-2: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-2.
ERROR: Received the following error from Worker-1: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-1.
ERROR: Received the following error from Worker-0: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-0.
ERROR: Received the following error from Worker-5: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-5.
ERROR: Received the following error from Worker-9: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-9.
ERROR: Received the following error from Worker-12: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-12.
ERROR: Received the following error from Worker-13: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-13.
ERROR: Received the following error from Worker-10: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-10.
ERROR: Received the following error from Worker-4: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-4.
ERROR: Received the following error from Worker-11: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-11.
ERROR: Received the following error from Worker-14: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-14.
ERROR: Received the following error from Worker-3: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-3.
ERROR: Received the following error from Worker-15: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-15.
ERROR: Received the following error from Worker-7: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-7.
ERROR: Received the following error from Worker-6: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-6.
ERROR: Received the following error from Worker-8: ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
ERROR: Shutting down Worker-8.
ERROR: Raising the last exception back to the main process.
Traceback (most recent call last):
  File "train.py", line 49, in <module>
    trainer.run()
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/core.py", line 219, in run
    return self.trainer.run(self.process)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/common/train_wrappers.py", line 47, in run
    ret = super().run(*args, **kwargs)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/core.py", line 207, in run
    return self.trainer.run(process, **kwargs)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/core.py", line 207, in run
    return self.trainer.run(process, **kwargs)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/common/train_wrappers.py", line 126, in run
    return super().run(_late_process, **kwargs)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/core.py", line 207, in run
    return self.trainer.run(process, **kwargs)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/core.py", line 239, in run
    super().run(process, **kwargs)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/core.py", line 180, in run
    self.model = self._initialize(**self._model_kwargs)
  File "/home/rokhmat/tesis/deep-rl-pytorch/deep_rl/actor_critic/unreal/unreal.py", line 260, in _initialize
    self.rollouts = RolloutStorage(self.env.reset(), self._initial_states(self.num_processes))
  File "/home/rokhmat/anaconda3/envs/tesis/lib/python3.8/site-packages/gym/vector/vector_env.py", line 80, in reset
    return self.reset_wait(seed=seed, return_info=return_info, options=options)
  File "/home/rokhmat/anaconda3/envs/tesis/lib/python3.8/site-packages/gym/vector/async_vector_env.py", line 308, in reset_wait
    self._raise_if_errors(successes)
  File "/home/rokhmat/anaconda3/envs/tesis/lib/python3.8/site-packages/gym/vector/async_vector_env.py", line 627, in _raise_if_errors
    raise exctype(value)
ValueError: could not broadcast input array from shape (21168,) into shape (28224,)
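
For what it's worth, the two sizes in the error factor cleanly into 84x84 frames with 3 versus 4 channels:

# Purely arithmetic check of the two shapes from the error message.
assert 84 * 84 * 3 == 21168  # size produced by the workers
assert 84 * 84 * 4 == 28224  # size expected by the rollout storage

so this looks like an observation-format mismatch (for example RGB versus RGB-D frames) between the environment and the model, possibly another symptom of the dependency-version mix rather than a bug in the training code itself.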

Do you have any suggestions for resolving this error? Thanks for your attention; I'm looking forward to your reply.

jkulhanek commented 2 years ago

Can you please:

  1. pull the new version of the repo
  2. start from a fresh venv
  3. install all dependencies by running "pip install -r requirements.txt" (do not install DeepMind Lab!)
  4. download the checkpoints and check whether evaluation works correctly
  5. try running the training

The instructions are in the new README. If you run into issues with any step, please post the error here.

rokhmatf commented 2 years ago

I have followed the steps you suggested, as described in the new README. I ran this experiment on Ubuntu 20.04, using Anaconda for the virtual environment.

I successfully ran the evaluation command for the DMHouse simulator, python evaluate_dmhouse.py dmhouse --num-episodes 100, as shown below:

(robotvn) rokhmat@b401:~/tesis/robot-visual-navigation/python$ python evaluate_dmhouse.py dmhouse --num-episodes 100
/home/rokhmat/anaconda3/envs/robotvn/lib/python3.8/site-packages/deep_rl/common/util.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
  from collections import OrderedDict, Callable
Registering trainer turtlebot
Registering trainer turtlebot-noprior
Registering trainer turtlebot-unreal
Registering trainer turtlebot-unreal-noprior
Registering trainer turtlebot-a2c
Registering trainer turtlebot-a2c-noprior
Registering trainer dmhouse-a2c
Registering trainer dmhouse-unreal
Registering trainer dmhouse-ppo
Registering trainer dmhouse-ppo-a2cvn
Registering trainer turtlebot-ppo-a2cvn
Registering trainer dmhouse-ppo-unreal
Registering trainer dmhouse-dqn
Registering agent dmhouse-dqn
Registering trainer dmhouse
Registering agent turtlebot-noprior
Registering agent turtlebot
Registering agent dmhouse
Registering agent dmhouse-a2c
Registering agent dmhouse-ppo
Registering agent dmhouse-unreal
Registering agent turtlebot-a2c
Registering agent turtlebot-unreal
Registering agent turtlebot-a2c-noprior
Registering agent turtlebot-unreal-noprior
Registering agent random
Registering agent random-end
Registering agent turtleroom-constant-stochastic
Registering agent shortest-path
/home/rokhmat/anaconda3/envs/robotvn/lib/python3.8/site-packages/dmhouse/__init__.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
 |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
success rate: 10.0000%
avg. episode steps: 516.7000
avg. distance travelled: 104.4753

and I also successfully ran the evaluation command on the real-world dataset, python evaluate_turtlebot.py turtlebot --num-episodes 100, as shown below:

(robotvn) rokhmat@b401:~/tesis/robot-visual-navigation/python$ python evaluate_turtlebot.py turtlebot --num-episodes 100
/home/rokhmat/anaconda3/envs/robotvn/lib/python3.8/site-packages/deep_rl/common/util.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
  from collections import OrderedDict, Callable
Registering trainer turtlebot
Registering trainer turtlebot-noprior
Registering trainer turtlebot-unreal
Registering trainer turtlebot-unreal-noprior
Registering trainer turtlebot-a2c
Registering trainer turtlebot-a2c-noprior
Registering trainer dmhouse-a2c
Registering trainer dmhouse-unreal
Registering trainer dmhouse-ppo
Registering trainer dmhouse-ppo-a2cvn
Registering trainer turtlebot-ppo-a2cvn
Registering trainer dmhouse-ppo-unreal
Registering trainer dmhouse-dqn
Registering agent dmhouse-dqn
Registering trainer dmhouse
Registering agent turtlebot-noprior
Registering agent turtlebot
Registering agent dmhouse
Registering agent dmhouse-a2c
Registering agent dmhouse-ppo
Registering agent dmhouse-unreal
Registering agent turtlebot-a2c
Registering agent turtlebot-unreal
Registering agent turtlebot-a2c-noprior
Registering agent turtlebot-unreal-noprior
Registering agent random
Registering agent random-end
Registering agent turtleroom-constant-stochastic
Registering agent shortest-path
 |████████████████████████████████████████████████████████████████████████████████████████████████████| 100.0% 
success rate: 98.0000%
avg. episode steps: 12.6327
avg. goal distance: 0.1366

However, I get an error when running the playground notebook provided in the new repository, even though evaluate_dmhouse.py and evaluate_turtlebot.py run successfully (see screenshot: Screenshot from 2022-03-26 12-00-01).

For the training part I don't get an error, but when the python train.py dmhouse command is executed, the output gets stuck at "Using CPU only" (see screenshot: Screenshot from 2022-03-26 10-58-27).

Has it actually entered the training process?

jkulhanek commented 2 years ago

The Turtlebot results look plausible; the DMHouse result does not. You should get a 100% success rate. Did you really download the pre-trained dmhouse models?

As for why the notebook did not work locally for you: you did not add the python directory to the PYTHONPATH. Try running it in Google Colab.
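
If you do want to run the notebook locally, a minimal workaround (a sketch that assumes the repository is cloned at the path visible in your logs) is to prepend the python directory to sys.path at the top of the notebook:

import sys
from pathlib import Path

# Adjust this to wherever the repository is actually cloned.
repo_python_dir = Path.home() / "tesis" / "robot-visual-navigation" / "python"
sys.path.insert(0, str(repo_python_dir))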

Finally, yes, the training has started. I would recommend running it on a GPU, though; training will be much faster there. It takes longer for the first logs to appear because the replay buffer has to be filled first, logging happens only every 10 episodes, and the first episodes are much longer.

Training logs should look like this (see the attached image: output).

The following plot shows the episode length during training (see the attached screenshot: Screenshot 2022-03-26 100518).

As you can see, the first episodes are longer.

rokhmatf commented 2 years ago

Yes, I have downloaded the pre-trained dmhouse model. The first time I ran the command python evaluate_dmhouse.py dmhouse --num-episodes 100, I got a 100% success rate, as in the following image (Screenshot from 2022-03-26 11-00-21).

But I don't know why I got a 10% success rate the second time, and a 22% success rate the third time. Maybe I will try re-downloading the pre-trained dmhouse model.

Thank you very much for sending sample images from successful experiments. I have tried running on the GPU by changing 'cpu' to 'cuda' on lines 493 and 496 of trainer.py:

https://github.com/jkulhanek/robot-visual-navigation/blob/a8c8a73cc349fc66eddc802263d7f47633afe880/python/trainer.py#L493-L496

Then I ran python train.py dmhouse --allow-gpu, which produced the following terminal output:

(robotvn) rokhmat@b401:~/tesis/robot-visual-navigation/python$ python train.py dmhouse --allow-gpu
/home/rokhmat/anaconda3/envs/robotvn/lib/python3.8/site-packages/deep_rl/common/util.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
  from collections import OrderedDict, Callable
Registering trainer turtlebot
Registering trainer turtlebot-noprior
Registering trainer turtlebot-unreal
Registering trainer turtlebot-unreal-noprior
Registering trainer turtlebot-a2c
Registering trainer turtlebot-a2c-noprior
Registering trainer dmhouse-a2c
Registering trainer dmhouse-unreal
Registering trainer dmhouse-ppo
Registering trainer dmhouse-ppo-a2cvn
Registering trainer turtlebot-ppo-a2cvn
Registering trainer dmhouse-ppo-unreal
Registering trainer dmhouse-dqn
Registering agent dmhouse-dqn
Registering trainer dmhouse
Registering agent turtlebot-noprior
Registering agent turtlebot
Registering agent dmhouse
Registering agent dmhouse-a2c
Registering agent dmhouse-ppo
Registering agent dmhouse-unreal
Registering agent turtlebot-a2c
Registering agent turtlebot-unreal
Registering agent turtlebot-a2c-noprior
Registering agent turtlebot-unreal-noprior
starting dmhouse
/home/rokhmat/anaconda3/envs/robotvn/lib/python3.8/site-packages/dmhouse/__init__.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
================================================================
Total params: 4,879,379
Trainable params: 4,879,379
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.07
Params size (MB): 18.61
================================================================
Using single GPU

Did I do it right? And if so, do I just have to wait for the training process to finish to get results like in the example you gave?

jkulhanek commented 2 years ago

Hmm, I wonder why you get a lower success rate on consecutive runs. Perhaps the checkpoint got overwritten somewhere?

You are correct about the training; just wait for the results. It shouldn't take that long for the first logs to appear.

jkulhanek commented 2 years ago

Also, for GPU training you do not need to change the code. The line you changed is only used for inference, and changing it breaks it (at the point where the model is loaded, it is still on the CPU).
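
For reference, the usual PyTorch pattern (a generic sketch, not the repository's exact code) is to load the checkpoint onto the CPU first and only then move the model to the device you want to run inference on:

import torch
import torch.nn as nn

# Toy model standing in for the navigation policy; the checkpoint filename is hypothetical.
model = nn.Linear(4, 2)
state_dict = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.to("cuda" if torch.cuda.is_available() else "cpu")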

jkulhanek commented 2 years ago

I tried running evaluation multiple times and I get similar results each time.

rokhmatf commented 2 years ago

After I re-downloaded the trained dmhouse model and ran the evaluation on dmhouse again, I got the same result over and over, with a 100% success rate. I guess the problem was that, after I first ran the evaluation successfully with a 100% success rate, I deleted the __pycache__ folder.

I have also run the playground notebook successfully on Google Colab.

Thanks for the answer; I will try running the training process on the GPU. I'm curious: after the training process is complete, can the video be shown like in the playground notebook?

jkulhanek commented 2 years ago

You should be able to render the video even without training; I published pre-trained models.

rokhmatf commented 2 years ago

I will create a new issue related to rendering the video, because I'm still confused about how to use it.

Regarding GPU usage: in the wandb.ai monitoring I saw that the GPU count is 1. Can it use more than one GPU?

jkulhanek commented 2 years ago

Does the training work for you? Can I close this issue?

rokhmatf commented 2 years ago

Yes, the training has worked. I will ask my other question in a separate issue. Thank you very much for your help.