This sounds like an interesting project! Maybe the render function of your environment is returning a np.float32 array with elements in [0, 1] while it should be returning a np.uint8 array with elements in [0, 255]?
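For example, a conversion along these lines in your preprocessing would do it (a rough sketch; to_uint8_image is just an illustrative helper, not part of the codebase):

```python
import numpy as np

def to_uint8_image(frame):
  """Convert a float image in [0, 1] to the uint8 [0, 255] format."""
  frame = np.asarray(frame)
  if np.issubdtype(frame.dtype, np.floating):
    frame = (frame * 255).astype(np.uint8)
  return frame
```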
Only the image decoder and the reward state component contribute to training the model. You do not need any other state components. However, if you have ground-truth information about the environment state available, adding it could help with interpreting what the agent learns. The PlaNet code will try to predict these additional state components from a copy of the learned latent space (without contributing gradients back to the rest of the model) and add scalar and image summaries about the predictions to TensorBoard.
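Roughly, the mechanism looks like this (an illustrative TF1-style sketch, not the exact code from the repository):

```python
import tensorflow as tf

def state_component_heads(features, components):
  """Predict extra state components from a gradient-stopped copy of the features."""
  # stop_gradient keeps these diagnostic heads from training the rest of the model.
  frozen = tf.stop_gradient(features)
  predictions = {}
  for name, size in components.items():  # e.g. {'velocity': 3}
    predictions[name] = tf.layers.dense(frozen, size, name='head_' + name)
  return predictions
```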
Indeed, my _preproc_obs method returned floats whereas it should return uint8, and my action space should be from -1 to 1, not from 0 to 1 ;) Thanks for the hint, it runs now! I'll close this issue, but I have one more, unrelated question: before I go further, I would like to understand your code better (like I said, I'll probably try to extend it with other planners; I don't want to implement it from scratch myself). Do you have any resources that describe this code's architecture, some documentation? It's hard to figure out what is going on from the raw code 😮
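For reference, the action-space fix boiled down to something like this (a minimal sketch with made-up names, not my actual code):

```python
import gym
import numpy as np

class RescaleAction(gym.ActionWrapper):
  """Expose a [-1, 1] action space and map actions back to the env's [0, 1]."""

  def __init__(self, env):
    super(RescaleAction, self).__init__(env)
    self.action_space = gym.spaces.Box(
        -1.0, 1.0, env.action_space.shape, dtype=np.float32)

  def action(self, action):
    # Linearly map an agent action in [-1, 1] to the wrapped env's [0, 1].
    return (np.asarray(action) + 1.0) / 2.0
```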
EDIT: @danijar, to be more specific: I'm most confused by those Experiments and Runs in train.py. Could you elaborate on what gets executed in parallel (I've already seen that environments are batched, but what else? What are those experiments and runs that workers execute?) and how the file-based messaging interface works?
Great to hear. You don't need to worry about the Experiment and Run classes in training/running.py. They are just for running multiple (independent) experiments on a cluster with a limited number of machines. This can be useful for hyperparameter search but has nothing to do with parallel training. Please see https://github.com/google-research/planet/issues/3#issuecomment-471316616 for how to ignore this code.
Okay, now I understand it better! Thank you :) So what is indeed parallelised per experiment is the execution of environments (those are batched) and ...? There are multiple processes: is the training of parameters distributed too, or does it run in one process while the other processes are used only for data gathering? Are test runs executed in parallel too, or in sequence? Please point me to the place in the code with some hint of how it works and I should catch up a lot faster than on my own :)
There is no parallelization besides TensorFlow's thread pool. The data collection can be parallelized but I'm not using this so far and it's not a tested feature.
Hi again!
TL;DR: I know it's quite long; the specific question I have is at the end. Here is an introduction to what I'm trying to accomplish and where I am now.
I'm trying to add a new environment to your code: Sokoban. I know PlaNet was originally designed for continuous control tasks, but what I'm interested in is the planning network. I then want to use it to train a TD-Search algorithm on imagined episodes, with the latent state as input. My research investigates whether other planning algorithms (like TD-Search, AlphaZero, etc.) are able to (1) use a learned model and (2) use its abstract state representation as high-level features that make learning easier for them (similar to what World Models proposed with the decomposition of representation learning into Vision and Memory; I tried their architecture (here is my implementation), but it didn't work for Sokoban, so I decided to drop it and try PlaNet).
I think I'm on the right track; you can see it here: the Sokoban wrapper that makes the action space "continuous" and resizes the observation. It doesn't have to work great, I just want to run your code with this environment (the rough idea is sketched below).
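The idea, roughly (a minimal sketch with made-up names, not my actual wrapper; observation resizing omitted): the agent outputs one score per discrete action and the wrapper executes the argmax.

```python
import gym
import numpy as np

class ContinuousSokoban(gym.Wrapper):
  """Present a discrete env through a Box action space of per-action scores."""

  def __init__(self, env):
    super(ContinuousSokoban, self).__init__(env)
    self.action_space = gym.spaces.Box(
        -1.0, 1.0, (env.action_space.n,), dtype=np.float32)

  def step(self, action):
    # Execute the discrete action with the highest score.
    return self.env.step(int(np.argmax(action)))
```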
Sokoban task and its factory function:
I can run it with this command:
But it crashes in the first epoch (phase train) with this error:
Could you give me any ideas on where I should start debugging? I have to say that I don't really get your code. I don't understand all of those experiments and runs, and even the model definition looks scary ;) I haven't worked with such a big Python codebase yet.
One specific question that I have right now, though: I don't understand what those state_components are used for (I know they are some components of the observation that dm_control returns) and what I should put there in the Sokoban factory function. I see that there are some heads created for each state component in define_model.py:55. I don't get it: the paper says nothing about state components, and PlaNet is supposed to work on images, so what are those heads for?
Greetings, Piotr