The test set is by default just 5 episodes under random actions. There is also an option to collect more test episodes on the fly, alongside collecting training episodes. More generally, the plots you are looking at are video predictions for sequence chunks randomly drawn from the data set, so it's expected to also contain episodes from early during training. To judge the planning performance, search for the simulation GIF.
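For intuition, here is a minimal sketch of drawing such random sequence chunks from a set of stored episodes (illustrative only; the field names and chunk length are assumptions, not PlaNet's actual data pipeline). Because chunks are drawn uniformly from the whole data set, early low-quality episodes keep showing up in the prediction plots:

import numpy as np

def sample_chunk(episodes, chunk_length=50, rng=np.random):
    # Pick a random stored episode; each episode is assumed to be a dict
    # of aligned arrays, e.g. {'image': (T, H, W, C), 'action': (T, A)}.
    episode = episodes[rng.randint(len(episodes))]
    total = len(episode['image'])
    # Assumes every episode is at least chunk_length steps long.
    start = rng.randint(total - chunk_length + 1)
    return {key: value[start:start + chunk_length]
            for key, value in episode.items()}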
Hi @danijar
My "train" agents look great, e.g.
But the test agents don't look like they're doing very well
Is this expected? Could you please summarize the difference between train and test?
Hi Sam: May I ask which function enables you to produce those GIFs? Also, does "epoch" in the code mean "episode" in the paper? Thanks a lot!
Hi @mathkobe, when your agent starts training (see this command), a log directory will be created at the path specified by the --logdir argument. These GIFs are created automatically by the PlaNet code. The best way to view them is to start TensorBoard in a separate terminal and point it at the log directory given in the training command above, e.g. tensorboard --logdir my_logs. In a web browser, you then navigate to http://localhost:6006 to view the results.
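If you'd rather inspect the event files directly than go through TensorBoard, a sketch along these lines should list the image/GIF summary tags (assuming TF 1.x, which PlaNet targets; the helper name is mine):

import glob
import tensorflow as tf  # TF 1.x; in TF 2 use tf.compat.v1.train.summary_iterator

def list_image_tags(logdir):
    # Walk all event files under the log directory and collect the tags
    # of image summaries (which is how animated GIFs are typically encoded).
    tags = set()
    for path in glob.glob(logdir + '/**/events.out.tfevents.*', recursive=True):
        for event in tf.train.summary_iterator(path):
            for value in event.summary.value:
                if value.HasField('image'):
                    tags.add(value.tag)
    return sorted(tags)

print('\n'.join(list_image_tags('my_logs')))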
I don't think "epoch" and "episode" are the same in the paper/code. If you search the code for the two, you'll see them both being used. In general episode has to do with how long the agent gets to run before being terminated and epoch has to do with iterating over the collected data for network updates.
I don't think "epoch" and "episode" are the same in the paper/code. If you search the code for the two, you'll see them both being used. In general episode has to do with how long the agent gets to run before being terminated and epoch has to do with iterating over the collected data for network updates.
Thank you so much for your valuable suggestions. I am trying to reproduce the results of figure 4 in the paper. May I ask what hyperparameters you are using for the cartpole balance task? I ran for one day on a single 1080 Ti and the score printed in the testing phase is only around 400. I'm new to RL, so these questions may be stupid. Thank you again for your help!
Danijar's code is still in flux. He helped me with this issue. Try running cartpole balance with this command:
python -m planet.scripts.train --logdir ./logs_patch/cartpole_balance/ --config default --params '{future_rnn: True, overshooting: 0, global_divergence_scale: 0.0, reward_scale: 1.0, tasks: [cartpole_balance]}'
Regarding figure 4, those plots are not generated (as far as I can tell) by the PlaNet code, although you'd be able to get the PlaNet results from the log files. The results for the other comparison algorithms were probably generated by other code solving the same environments.
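For example, a rough sketch of pulling the score curve out of the event files to make a figure-4-style plot (the scalar tag here is a guess; check the actual tag names in TensorBoard first):

import glob
import matplotlib.pyplot as plt
import tensorflow as tf  # TF 1.x; in TF 2 use tf.compat.v1.train.summary_iterator

def load_scalar(logdir, tag):
    # Collect (step, value) pairs for one scalar tag across all event files.
    points = []
    for path in glob.glob(logdir + '/**/events.out.tfevents.*', recursive=True):
        for event in tf.train.summary_iterator(path):
            for value in event.summary.value:
                if value.tag == tag and value.HasField('simple_value'):
                    points.append((event.step, value.simple_value))
    return sorted(points)

points = load_scalar('./logs_patch/cartpole_balance/', 'trainer/score')  # tag name is a guess
if points:
    steps, scores = zip(*points)
    plt.plot(steps, scores)
    plt.xlabel('environment step')
    plt.ylabel('score')
    plt.show()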
Yes, we'll update the paper and code a bit for the camera-ready version in the next two weeks. The agent learns a bit faster now than the figures we have currently in the paper.
Hi, thanks a lot for your reply. May I ask what test_steps should be for the cartpole_balance task? The default value is 100. Should I increase it?
I ran your command exactly, without changing anything in the code. I have run 400 epochs in total. The score is around 200-400 during the first 80 epochs, the max score is 822 near epoch 200 (test-phase score), and it drops quickly back to 500+ afterwards. According to the result in figure 4 (cartpole_balance task), it should stay stable around 800 even accounting for variance. I think I may be misunderstanding the "Score" in the test phase?
Hi, I'm not sure what you mean by epochs. Running the code for 5M steps, which corresponds to 1000 episodes, should reproduce the results in the paper. This is with the updated hyperparameters for our camera-ready paper.