quangr / jax-rl

JAX version of the PPO algorithm in MuJoCo environments, achieving SOTA (Tianshou benchmark)

reproduce ppo benchmark #2

Closed quangr closed 1 year ago

quangr commented 1 year ago

I can't find a way to make my PPO comparable to the Tianshou benchmark, especially in the HalfCheetah env, where we can't achieve even half of their score.

Benchmark:

Tianshou: Hopper-v3: 2609.3 ± 700.8, HalfCheetah-v3: 5783.9 ± 1244.0

Mine: Hopper-v3: 1683 ± 307, HalfCheetah-v3: 1926 ± 254

Where does it go wrong?

So far I have tested the following assumptions:

Result: adding masking to the PPO step and using value bootstrapping did not improve things much (see the sketch below).
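For reference, here is a minimal sketch of what I mean by masking plus value bootstrapping: GAE computed with done-masks and a bootstrap value at a truncated rollout end. This is my own illustration, not the repo's actual code; all names (`compute_gae`, `last_value`, etc.) are illustrative.

```python
import jax
import jax.numpy as jnp

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over a rollout of length T.

    rewards, values, dones: [T] arrays; last_value: bootstrap V(s_T) used when
    the rollout is truncated rather than terminated.
    """
    # V(s_{t+1}) for each step, with the bootstrap value appended at the end
    values_next = jnp.concatenate([values[1:], jnp.array([last_value])])
    mask = 1.0 - dones  # zeros out V(s_{t+1}) after terminal steps
    deltas = rewards + gamma * values_next * mask - values

    def backward_step(carry, x):
        delta, m = x
        adv = delta + gamma * lam * m * carry
        return adv, adv

    # Accumulate advantages backwards in time
    init = jnp.zeros((), dtype=deltas.dtype)
    _, adv_rev = jax.lax.scan(backward_step, init, (deltas[::-1], mask[::-1]))
    advantages = adv_rev[::-1]
    returns = advantages + values
    return advantages, returns
```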

Result: changing to a different version doesn't help.

Result: setting the learning rate to a constant, or setting the total steps to 3M, did not improve much (see the optax sketch below).
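As a sketch of the two settings I tried, here is how a constant vs. linearly annealed learning rate can be configured with optax. The values (3e-4, 2048 steps per update, 3M total steps) are assumptions for illustration, not necessarily what the repo or Tianshou use.

```python
import optax

total_steps = 3_000_000
num_updates = total_steps // 2048  # assuming 2048 env steps per update

# Linearly annealed learning rate over the whole run
lr_schedule = optax.linear_schedule(init_value=3e-4, end_value=0.0,
                                    transition_steps=num_updates)
# Constant alternative tried above:
# lr_schedule = 3e-4

optimizer = optax.chain(
    optax.clip_by_global_norm(0.5),  # gradient clipping, as is common for PPO
    optax.adam(learning_rate=lr_schedule),
)
```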

Result: copying the remap method from Tianshou still doesn't work.

Result: when using the exact data from Tianshou, the losses produced by both implementations are the same.

Result: I don't know how to test this.

It turns out that we need an observation normalizer; a rough sketch of the idea is below.
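A minimal sketch of a running observation normalizer, similar in spirit to Tianshou's running mean/std normalization. This is my own illustration under assumed names (`RunningStats`, `update_stats`, `normalize`), not the repo's code.

```python
from typing import NamedTuple
import jax.numpy as jnp

class RunningStats(NamedTuple):
    mean: jnp.ndarray   # per-dimension running mean of observations
    var: jnp.ndarray    # per-dimension running variance
    count: float        # number of observations seen so far

def update_stats(stats: RunningStats, batch: jnp.ndarray) -> RunningStats:
    """Update running statistics from a batch of observations [B, obs_dim]."""
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    batch_count = batch.shape[0]

    delta = batch_mean - stats.mean
    total = stats.count + batch_count
    new_mean = stats.mean + delta * batch_count / total
    m_a = stats.var * stats.count
    m_b = batch_var * batch_count
    new_var = (m_a + m_b + delta ** 2 * stats.count * batch_count / total) / total
    return RunningStats(mean=new_mean, var=new_var, count=total)

def normalize(stats: RunningStats, obs: jnp.ndarray,
              eps: float = 1e-8, clip: float = 10.0) -> jnp.ndarray:
    """Standardize observations with the running stats and clip outliers."""
    return jnp.clip((obs - stats.mean) / jnp.sqrt(stats.var + eps), -clip, clip)
```

The idea is to update the statistics with every collected batch and feed only normalized observations to the policy and value networks, which is what the Tianshou MuJoCo PPO benchmark does and what was missing here.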