tianheyu927 / mopo

Code for MOPO: Model-based Offline Policy Optimization
MIT License

Why is MOPO given access to a terminal function in rollout generation? #4

Open IcarusWizard opened 3 years ago

IcarusWizard commented 3 years ago

Hi, thanks for sharing your great work. However, I am confused about the rollout generation process.

As I see in the code, the agent can access a pre-defined terminal function to cut off unrealistic states. Does this assumption generally hold for broad cases of offline RL? To my understanding, in the offline setting the agent should only have access to a fixed dataset and nothing else. It feels a little like cheating to me, especially since, in the paper, the authors argue that one of the differences between MOPO and MOReL is that MOPO's soft penalty, rather than a hard termination, allows the agent to take risky actions.
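For context, this is roughly the pattern I mean. A minimal sketch of model rollouts that are truncated by a pre-defined termination rule; `dynamics_model`, `policy`, and `termination_fn` are hypothetical stand-ins, not the exact API in this repo:

```python
def generate_rollouts(dynamics_model, policy, termination_fn, start_states, horizon=5):
    """Roll the learned model forward, truncating branches the terminal rule flags as done."""
    states = start_states                                        # (batch, obs_dim) array
    transitions = []
    for _ in range(horizon):
        actions = policy(states)                                 # actions from the current policy
        next_states, rewards = dynamics_model(states, actions)   # learned dynamics prediction
        dones = termination_fn(states, actions, next_states)     # hand-specified rule -> bool array
        transitions.append((states, actions, rewards, next_states, dones))
        if dones.all():                                          # every branch has terminated
            break
        states = next_states[~dones]                             # continue only the live branches
    return transitions
```

The dynamics and rewards come from the learned model, but the `termination_fn` here is the hand-written, environment-specific rule that my question is about.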

Besides, if MOPO really needs the terminal function, why not learn one with a neural net? I have already seen many model-based works on Atari games that use a learned terminal function.
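Concretely, the terminal function could be treated as a binary classifier trained on the offline dataset's done flags. A rough sketch of the idea (written in PyTorch just for brevity, not the repo's TensorFlow stack; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class TerminationClassifier(nn.Module):
    """Predicts p(done | s, a, s') from offline transitions."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                     # logit for "episode terminates"
        )

    def forward(self, obs, act, next_obs):
        x = torch.cat([obs, act, next_obs], dim=-1)
        return self.net(x).squeeze(-1)

# Training would minimize binary cross-entropy against the dataset's done flags, e.g.:
# loss = nn.functional.binary_cross_entropy_with_logits(model(s, a, s2), done)
```

Done labels are usually very imbalanced in offline datasets, so some reweighting would likely be needed, but in principle this removes the dependence on a hand-specified rule.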

vermouth1992 commented 3 years ago

Actually, being given the terminal function is reasonable in practice for model-based RL. We can even assume access to a known reward function when doing model-based learning. The reward and terminal functions are essentially defined by human experts, whereas the transition dynamics are governed by nature. Thus, reverse-engineering the reward and terminal functions from data is not necessary in practice.
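For instance, the termination rules used in MBPO-style codebases are just a few lines of numpy per environment, written by whoever designed the task. A rough Hopper-like sketch (treat the exact thresholds as illustrative; the repo's static termination functions are the authoritative version):

```python
import numpy as np

def hopper_termination_fn(obs, act, next_obs):
    """Illustrative Hopper-style health check: terminate when the torso falls or tilts too far."""
    height = next_obs[:, 0]                                   # torso height
    angle = next_obs[:, 1]                                    # torso pitch angle
    healthy = (
        np.isfinite(next_obs).all(axis=-1)
        & (np.abs(next_obs[:, 1:]) < 100).all(axis=-1)        # state magnitudes stay bounded
        & (height > 0.7)
        & (np.abs(angle) < 0.2)
    )
    return ~healthy                                           # done = not healthy
```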

What worries me the most is that the numbers reported in the paper after acceptance (the latest version) are different from those reported in the first arXiv version, from when the paper was under review.

IcarusWizard commented 3 years ago

I agree with your point. In real-world scenarios, the reward function and the terminal function are available in most cases (MDP settings and some POMDP settings). I guess future work can take this into account.

About the results, I guess they just redid the experiments in a more standardized way. However, it does take a bit of luck to get a good result even with the correct hyperparameters.