vincentberaud / Minecraft-Reinforcement-Learning

Deep Recurrent Q-Learning vs Deep Q Learning on a simple Partially Observable Markov Decision Process with Minecraft
49 stars 6 forks

Hello vincentberaud, we are facing an error after running your code. Could you please help us? Thanks #1

Closed adil25 closed 5 years ago

adil25 commented 5 years ago

As researchers, we would like to extend this work further, but we are facing an error after running your code on our GPU. Could you please help us overcome it? We would be thankful. Below is the output of your program with the error:

    [2018-10-07 10:56:26,665] Making new env: MinecraftBasic-v0

    2018-10-07 10:57:26.056640: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    #######################################
    % Win : 0.2%
    % Nothing : 0.0%
    % Loss : 0.0%
    Nb J before win: 79.0
    /usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py:2909: RuntimeWarning: Mean of empty slice.
      out=out, **kwargs)
    /usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
      ret = ret.dtype.type(ret / rcount)
    Nb J before die: nan
    Total Steps: 79
    I: 0
    Epsilon: 1

LAST EPISODE MOVES


Process finished with exit code 135 (interrupted by signal 7: SIGEMT)

adil25 commented 5 years ago

Does the problem exist in these lines of code:

    s = env.reset()
    env.render(mode="human")

What do you think?

ClementRomac commented 5 years ago

Hi Adil25,

We're glad to hear our project could help you :)

There's no clear Python error... Could you please tell me more about your environment (OS, Python version, TensorFlow / Gym / Gym-Minecraft version...) so I can run more tests?

What makes you think the issue comes from env.render(mode="human")? Did you try removing it? Does the "Process finished" error disappear?

I remember we had trouble with that method too; you can also try replacing it with env.render(mode="rgb_array") and just obtaining the image.
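
For reference, here's a minimal sketch of that workaround (assuming the package imports as gym_minecraft and that the Malmo client is already running, as set up in the installation notebook):

```python
import gym
import gym_minecraft  # registers MinecraftBasic-v0 (assumed module name)

env = gym.make("MinecraftBasic-v0")
# env.init(...) may be needed here depending on your gym-minecraft version / setup.

s = env.reset()
done = False
while not done:
    a = env.action_space.sample()            # random action, just to step the env
    s, reward, done, info = env.step(a)
    frame = env.render(mode="rgb_array")     # returns the current frame as a numpy array
    # ...instead of env.render(mode="human"), which opens a window and caused trouble for us too
env.close()
```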

adil25 commented 5 years ago

@ClementRomac Thanks for your positive response. We finally managed to overcome the bugs, and your program is now running on our PC. As you know, it can be difficult to understand code written by other programmers, so at some points we are struggling to follow how it works. Could you please share some documentation with us, or explain your work in a few lines, so that we can understand the output and the real objective of this program? We will also ask you questions over time whenever we fail to understand a technical point. By the way, we are using: Ubuntu 16.04.3, gym 0.10.5, gym-minecraft 0.0.2, Python 3.5 and TensorFlow 1.10.1. Thanks, and we are looking forward to hearing from you soon. We think Minecraft is a good project to work on.

ClementRomac commented 5 years ago

@adil25 I'm sorry, I don't have any proper documentation to share with you, so here's a short explanation of our work:

We wanted to test Deep Q-Learning methods and thought Minecraft could be a challenging environment. We therefore used the Malmo project and gym-minecraft as a wrapper to interact with this environment through an OpenAI Gym-like API (the installation of the whole environment is covered in the "Gym-Minecraft-Installation.ipynb" notebook).

The implementation of the Deep Q-Learning algorithms is gathered in "DRQN_minecraft.ipynb". We chose "MinecraftBasic-v0" (https://github.com/tambetm/gym-minecraft) as the first environment. The env is configured with an XML file in which you can change settings such as the rewards. The first part of the notebook loads the env (with a custom resolution and a reduced list of allowed actions to get faster convergence) and parses the XML configuration file to get the rewards.
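
As an illustration, that first part does something along these lines (this is a sketch, not the exact notebook code: the init() keyword names follow gym-minecraft's README and may differ between versions, and the mission file name and XML attributes are placeholders):

```python
import xml.etree.ElementTree as ET

import gym
import gym_minecraft

env = gym.make("MinecraftBasic-v0")
# Custom resolution and a reduced set of allowed actions for faster convergence.
env.init(videoResolution=[84, 84],
         allowDiscreteMovement=["move", "turn"])

# Parse the mission XML to recover the reward settings.
# "basic.xml" and the "reward"/"type" attributes are placeholders for the actual mission file contents.
tree = ET.parse("basic.xml")
rewards = {node.get("type", node.tag): float(node.get("reward"))
           for node in tree.iter() if node.get("reward") is not None}
print(rewards)
```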

We then defined three networks:

  1. A FeedForward NN
  2. A Convolutional NN
  3. A Convolutional + Recurrent NN (with LSTM blocks); an overview can be found in "LSTM_Architecture.png". As we expected, this NN gave us the best results, because Minecraft is a Partially Observable Markov Decision Process (POMDP). A minimal sketch of this architecture follows the list.
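
Here is a rough TF 1.x sketch of that third network (illustrative layer sizes and names, not our exact implementation):

```python
import tensorflow as tf

n_actions, h_size, trace_length = 4, 256, 8            # illustrative values

# The input holds batch_size * trace_length consecutive frames.
frames = tf.placeholder(tf.float32, [None, 84, 84, 3])
batch_size = tf.placeholder(tf.int32, [])

# Convolutional feature extractor.
conv = tf.layers.conv2d(frames, 32, 8, strides=4, activation=tf.nn.relu)
conv = tf.layers.conv2d(conv, 64, 4, strides=2, activation=tf.nn.relu)
features = tf.layers.dense(tf.layers.flatten(conv), h_size, activation=tf.nn.relu)

# Reshape to [batch, time, features] and run an LSTM over each trace of frames,
# so the agent can integrate information over time (Minecraft is a POMDP).
rnn_in = tf.reshape(features, [batch_size, trace_length, h_size])
cell = tf.nn.rnn_cell.LSTMCell(h_size)
rnn_out, _ = tf.nn.dynamic_rnn(cell, rnn_in, dtype=tf.float32)

# One Q-value per action for every frame in every trace.
Q_all = tf.layers.dense(tf.reshape(rnn_out, [-1, h_size]), n_actions)
```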

The three NNs use the classical loss function tf.reduce_mean(tf.square(self.nextQ_scaled - self.Q)), with Q being the Q-value computed by the network and nextQ_scaled the scaled result of the Bellman equation (https://arxiv.org/pdf/1312.5602v1.pdf). Note that we scale the Q-values for stability.
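
Continuing the sketch above, the loss and training op amount to roughly this (the Bellman target is computed outside the graph, scaled, and fed in; the Adam learning rate is a placeholder value):

```python
# Q-value of the action actually taken in each sampled transition.
actions = tf.placeholder(tf.int32, [None])
Q = tf.reduce_sum(Q_all * tf.one_hot(actions, n_actions), axis=1)

# Scaled Bellman target: scale(r + gamma * max_a' Q_target(s', a')).
nextQ_scaled = tf.placeholder(tf.float32, [None])

loss = tf.reduce_mean(tf.square(nextQ_scaled - Q))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```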

We then choose one of these NNs and create the experience buffer to store the state/action/reward/next_state tuples (https://arxiv.org/pdf/1312.5602v1.pdf). This allows the network to train on diversified mini-batches. The implementation of the experience buffer for the Conv+RNN is a bit more complicated, as it has to guarantee that mini-batches contain consecutive frames.
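
A minimal version of the plain (non-recurrent) buffer could look like the sketch below (hypothetical class and variable names); the recurrent variant instead stores whole episodes and samples traces of consecutive frames from them:

```python
import random
import numpy as np

class ExperienceBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples random mini-batches."""

    def __init__(self, buffer_size=50000):
        self.buffer = []
        self.buffer_size = buffer_size

    def add(self, experience):
        # Drop the oldest experience once the buffer is full.
        if len(self.buffer) >= self.buffer_size:
            self.buffer.pop(0)
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones
```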

We then implement a version of Double Deep Q-Learning (https://arxiv.org/pdf/1509.06461.pdf) with a method to update the target network. This code comes from Arthur Juliani's work.
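
Conceptually, that boils down to something like the following sketch (in the spirit of Arthur Juliani's code; the names and the variable-ordering assumption are illustrative):

```python
import numpy as np
import tensorflow as tf

def make_target_update_ops(tau=0.001):
    """Soft update: slowly move the target network's variables towards the main network's."""
    tvars = tf.trainable_variables()
    half = len(tvars) // 2   # assumes the main network's variables were created first, then the target's
    return [target.assign(main.value() * tau + target.value() * (1.0 - tau))
            for main, target in zip(tvars[:half], tvars[half:])]

def double_dqn_targets(rewards, dones, q_main_next, q_target_next, gamma=0.99):
    """Double DQN: the main network chooses a', the target network evaluates it."""
    best_actions = np.argmax(q_main_next, axis=1)
    q_eval = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * q_eval * (1.0 - dones)
```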

The next part contains the settings for the NN's training and for the epsilon-greedy strategy.
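
For instance (the values below are placeholders, not necessarily the ones used in the notebook):

```python
# Training settings (illustrative values).
batch_size = 32            # transitions (or traces) per update
gamma = 0.99               # discount factor
pre_train_episodes = 100   # episodes of purely random play before learning starts

# Epsilon-greedy schedule: anneal epsilon linearly from start_eps down to end_eps.
start_eps, end_eps = 1.0, 0.1
annealing_steps = 10000
eps_decay = (start_eps - end_eps) / annealing_steps
```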

Finally comes the training part. It can be spread over multiple runs by setting load_model=True. The agent plays randomly to discover the env until pre_train_episodes is reached, then uses the epsilon-greedy strategy to trade off exploration and exploitation. We store some information about the states/actions/rewards and print it every 500 episodes.
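
Put together, the training loop has roughly this shape (a skeleton built on the sketches above; q_values_for is a hypothetical helper that runs the chosen network on the current state):

```python
buffer = ExperienceBuffer()
epsilon = start_eps

for episode in range(10000):                  # total number of episodes is illustrative
    s = env.reset()
    done = False
    while not done:
        # Play randomly until pre_train_episodes is reached, then follow epsilon-greedy.
        if episode < pre_train_episodes or np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(q_values_for(s)))
        s1, r, done, _ = env.step(a)
        buffer.add((s, a, r, s1, float(done)))
        s = s1
        if episode >= pre_train_episodes and epsilon > end_eps:
            epsilon -= eps_decay
        # ...sample a mini-batch from the buffer and run train_op here (omitted).
    if episode % 500 == 0:
        print("Episode", episode, "- epsilon:", round(epsilon, 3))
```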

We also added a Testing part at the end of the notebook to run a trained model on an episode.
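
Testing then amounts to restoring the saved weights and running a greedy episode, roughly like this (the checkpoint path and q_values_for helper are illustrative):

```python
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("./checkpoints"))
    s, done, total_reward = env.reset(), False, 0.0
    while not done:
        a = int(np.argmax(q_values_for(s)))   # greedy action, epsilon = 0
        s, r, done, _ = env.step(a)
        total_reward += r
    print("Episode reward:", total_reward)
```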

I'd be happy to answer any questions you have! Hope this helps :)

adil25 commented 5 years ago

Yes, it helped. Thanks a lot.
