tianheyu927 / mopo

Code for MOPO: Model-based Offline Policy Optimization
MIT License

Questions about the number in the paper #5

Open MSRA-COLT opened 4 years ago

MSRA-COLT commented 4 years ago

Hi, I really appreciate your open-source code. My question is about how the performance numbers reported in the paper were obtained.

For example, in Table 1, do you use the maximum evaluation return observed during training, or the evaluation return at the last iteration? The return of the policy varies widely across iterations.
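For concreteness, here is a minimal sketch (my own, not taken from the MOPO codebase) of the reporting conventions the question distinguishes, given per-epoch evaluation returns logged during training:

```python
import numpy as np

def summarize(eval_returns):
    """eval_returns: 1-D array of average evaluation return per training epoch."""
    eval_returns = np.asarray(eval_returns)
    return {
        "last_epoch": eval_returns[-1],             # return at the final epoch
        "max_over_epochs": eval_returns.max(),      # best epoch seen during training
        "mean_last_10": eval_returns[-10:].mean(),  # a common variance-reducing alternative
    }
```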


Thanks, Yue

weihongwei0586 commented 3 years ago


I have the same problem. When I run the demo on a dataset other than mixed, the results have large variance.

HYDesmondLiu commented 2 years ago

Also, which version of D4RL were you using (and which in COMBO)? I ask because the buffer quality differs quite a bit between v0 and v2 (you can refer to the TD3+BC paper for details).
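As a quick way to see the difference yourself, a minimal sketch (hypothetical, not from the MOPO repo) that loads the same task from two D4RL buffer versions and compares basic reward statistics:

```python
import gym
import d4rl  # registers the offline RL environments with gym

for version in ("v0", "v2"):
    env = gym.make(f"halfcheetah-medium-{version}")
    data = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, terminals
    print(version, data["rewards"].mean(), data["rewards"].std())
```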

typoverflow commented 2 years ago

@HYDesmondLiu The config file in this repo says the '-v0' datasets were used for MOPO. But I'm still curious about the dataset version used in COMBO; has COMBO's source code even been released? I am also having trouble stabilizing MOPO's performance: the variance in performance across epochs is quite large.

HYDesmondLiu commented 2 years ago

@typoverflow AFAIK, the COMBO source code has not been shared. As I recall, they use the D4RL v2 buffers, and the performance between v0 and v2 is quite different; you can easily spot the gap. "Some" DRL methods are notorious for being hard to reproduce. You could refer to this paper and other related research for more information.