thomashopkins32 / Minecraft-Virtual-Intelligence


Map out plan to use ICM as the curiosity module #8

Open · thomashopkins32 opened 4 months ago

thomashopkins32 commented 4 months ago

Paper here: https://arxiv.org/pdf/1705.05363.pdf

Discussion: This is the first paper I have seen that could be a viable way to use curiosity in a game as large as Minecraft. The modules are already broken down into a structure that is amenable to this "inverse dynamics model".

thomashopkins32 commented 4 months ago

There are three learning problems to solve:

1. Learning the optimal policy (the standard reinforcement learning problem).
2. Learning the inverse dynamics model, which predicts the action $a_t$ from $\phi(s_t)$ and $\phi(s_{t+1})$.
3. Learning the forward dynamics model, which predicts $\phi(s_{t+1})$ from $\phi(s_t)$ and $a_t$.

Here $\phi$ is the feature space, rather than the raw pixel input.
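
For reference, the overall objective from the ICM paper (using the paper's notation, where $L_I$ and $L_F$ are the inverse and forward model losses and $\beta$, $\lambda$, $\eta$ are weighting hyperparameters) is:

$$
\min_{\theta_P, \theta_I, \theta_F} \left[ -\lambda\, \mathbb{E}_{\pi(s_t; \theta_P)}\!\left[ \textstyle\sum_t r_t \right] + (1 - \beta)\, L_I + \beta\, L_F \right]
$$

with the curiosity reward given by the forward model's prediction error in feature space:

$$
r_t^i = \frac{\eta}{2} \left\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\rVert_2^2
$$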

thomashopkins32 commented 4 months ago

We have already merged the code for solving the optimal policy learning problem. We are using PPO.

We still need to build a learning algorithm for the inverse dynamics model and the forward dynamics model.

Since a PPOTrajectory already uses this information, I think we can re-use a lot of what has been built out for PPO. This probably means separating the running of the algorithm from PPO since we will now have to coordinate three different learning algorithms. The neural network can stay mostly the same, although it will require two new heads: inverse dynamics and forward dynamics. The inverse dynamics head will take in the output of the VisualPerception module run on both $s_t$ and $s_{t+1}$. I am not sure just yet if they will share parameters or not, but we should be able to process both of these in a single batch. The forward dynamics model will take as input $s_t$ and $a_t$, where $s_t$ will be encoded through the VisualPerception module and $a_t$ will be the full 10-action vector.
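
As a rough sketch of what the two new heads could look like (the `VisualPerception` name and the 10-dimensional action vector come from the discussion above; the feature size, hidden sizes, and everything else here are placeholder assumptions, not the repo's actual API):

```python
import torch
import torch.nn as nn

FEATURE_DIM = 256  # assumed size of the VisualPerception embedding
ACTION_DIM = 10    # the full 10-action vector


class InverseDynamicsHead(nn.Module):
    """Predicts a_t from the encoded features of s_t and s_{t+1}."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * FEATURE_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, phi_t: torch.Tensor, phi_next: torch.Tensor) -> torch.Tensor:
        # Concatenate the two feature vectors and predict the action taken between them.
        return self.net(torch.cat([phi_t, phi_next], dim=-1))


class ForwardDynamicsHead(nn.Module):
    """Predicts phi(s_{t+1}) from phi(s_t) and a_t."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM + ACTION_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, FEATURE_DIM),
        )

    def forward(self, phi_t: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Concatenate the current features with the action and predict the next features.
        return self.net(torch.cat([phi_t, action], dim=-1))
```

Processing $s_t$ and $s_{t+1}$ in a single batch could be as simple as stacking them along the batch dimension before the VisualPerception forward pass and splitting the output in two afterwards.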

I won't have a good idea of how this will look in the code until I start implementing some of it. It will probably take me a few iterations to get right. It might also help to prototype the algorithm on a much simpler problem first, so that I become familiar with how the different learning algorithms fit together.

thomashopkins32 commented 3 months ago

There are a few options for using ICM as I see it now:

  1. Use ICM completely offline
    • Fill the trajectory buffer until full
    • Run ICM to compute the curiosity reward
    • Train the ICM forward and inverse dynamics models
  2. Use ICM completely online
    • Run ICM at each environment step
    • Train ICM at each environment step
  3. Use ICM online but train offline
    • Run ICM at each environment step to compute the reward
    • Train the ICM forward and inverse dynamics models after the trajectory buffer is full

With PPO formulated as it is now, options 2 and 3 do not make sense. We wait until the trajectory buffer is full before updating the policy, so there is no benefit to computing the curiosity reward any earlier. This means we should use option 1: compute the rewards and train the ICM module once the trajectory buffer is full, as sketched below.
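
A rough sketch of what option 1's control flow could look like once the buffer is full (the buffer fields, `encoder`, and the loss choices are placeholders rather than the repo's actual API; the inverse loss should match the action space, and the ICM paper's weighting factors are omitted here):

```python
import torch.nn.functional as F

ETA = 0.01  # assumed scaling factor on the curiosity reward


def icm_offline_update(buffer, encoder, inverse_head, forward_head, icm_optimizer):
    """Option 1: run once per full trajectory buffer, before the PPO update."""
    phi_t = encoder(buffer.observations)
    phi_next = encoder(buffer.next_observations)

    # Curiosity reward = forward-model prediction error in feature space.
    pred_phi_next = forward_head(phi_t, buffer.actions)
    curiosity = 0.5 * ETA * (pred_phi_next - phi_next).pow(2).sum(dim=-1)
    buffer.rewards = buffer.rewards + curiosity.detach()

    # Train the two ICM heads on the same batch.
    pred_actions = inverse_head(phi_t, phi_next)
    inverse_loss = F.mse_loss(pred_actions, buffer.actions)
    # Detaching the forward target is one common choice; worth checking against
    # the paper / reference implementations before settling on it.
    forward_loss = F.mse_loss(pred_phi_next, phi_next.detach())

    icm_optimizer.zero_grad()
    (inverse_loss + forward_loss).backward()
    icm_optimizer.step()
```

After this runs, the PPO update would proceed exactly as it does now, just on the curiosity-augmented rewards.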

Furthermore, I think the trajectory buffer should be fairly small: only large enough to store a few minutes of gameplay at most. This keeps the updates frequent enough that the agent will actually learn to explore its environment using the curiosity reward. For example, if we can process game states at 60 frames per second and we want the models to update every minute, then we should train every $60 \times 60 = 3600$ frames.
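
For concreteness, the buffer-size arithmetic above in code form (the numbers are the illustrative ones from this comment, not settled values):

```python
frames_per_second = 60           # assumed processing rate
seconds_between_updates = 60     # update roughly once per minute
trajectory_buffer_size = frames_per_second * seconds_between_updates  # 3600 frames
```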

If we ever move away from policy gradient methods back towards TD-learning, then we can revisit options 2 & 3. Something like TD(0) could benefit a lot from having the curiosity reward computed at each step. I'm not sure that this will be necessary as long as we keep the trajectory buffer sufficiently small.
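
For illustration only, this is roughly where a per-step curiosity bonus would enter a TD(0) update; nothing here is existing repo code:

```python
def td0_update(values, state, next_state, r_extrinsic, r_curiosity, gamma=0.99, lr=0.1):
    """One tabular TD(0) step with the curiosity bonus folded into the reward.

    Because the bonus appears directly in the bootstrap target, it needs to be
    available at every step rather than once per full trajectory buffer.
    """
    target = (r_extrinsic + r_curiosity) + gamma * values[next_state]
    values[state] += lr * (target - values[state])
    return values


values = {"s0": 0.0, "s1": 0.0}
values = td0_update(values, "s0", "s1", r_extrinsic=0.0, r_curiosity=0.5)
```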

I do not have any ideas for how this will work once we add model-based planning methods, although we will figure that out when the time comes.

thomashopkins32 commented 3 months ago

Using ICM offline is how we will handle it for now. Work on this ticket is blocked until #12 is closed.