quangr / jax-rl

JAX version of the PPO algorithm in MuJoCo environments, achieving SOTA (Tianshou benchmark)

reproduce ppo benchmark #2

Closed quangr closed 1 year ago

quangr commented 1 year ago

I can't find a way to make my PPO comparable to the Tianshou benchmark, especially in the HalfCheetah env, where we can't achieve even half of their score.

Benchmark:

Tianshou: Hopper-v3: 2609.3 ± 700.8, HalfCheetah-v3: 5783.9 ± 1244.0

Mine: Hopper-v3: 1683 ± 307, HalfCheetah-v3: 1926 ± 254

Where does it go wrong?

So far I have tested the following assumptions:

Result: adding masking to the PPO step and using value bootstrapping did not improve things much (see the sketch below).
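For reference, here is a minimal sketch of what I mean by masking plus value bootstrapping: GAE computed with done-masks and a bootstrap value at a truncated rollout end. This is my own illustration, not the repo's actual code; all names (`compute_gae`, `last_value`, etc.) are illustrative.

```python
import jax
import jax.numpy as jnp

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over a rollout of length T.

    rewards, values, dones: [T] arrays; last_value: bootstrap V(s_T) used when
    the rollout is truncated rather than terminated.
    """
    # V(s_{t+1}) for each step, with the bootstrap value appended at the end
    values_next = jnp.concatenate([values[1:], jnp.array([last_value])])
    mask = 1.0 - dones  # zeros out V(s_{t+1}) after terminal steps
    deltas = rewards + gamma * values_next * mask - values

    def backward_step(carry, x):
        delta, m = x
        adv = delta + gamma * lam * m * carry
        return adv, adv

    # Accumulate advantages backwards in time
    init = jnp.zeros((), dtype=deltas.dtype)
    _, adv_rev = jax.lax.scan(backward_step, init, (deltas[::-1], mask[::-1]))
    advantages = adv_rev[::-1]
    returns = advantages + values
    return advantages, returns
```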

Result: changing to a different version doesn't help.

Result: setting the learning rate to a constant, or setting the total steps to 3M, did not improve much (see the optax sketch below).
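As a sketch of the two settings I tried, here is how a constant vs. linearly annealed learning rate can be configured with optax. The values (3e-4, 2048 steps per update, 3M total steps) are assumptions for illustration, not necessarily what the repo or Tianshou use.

```python
import optax

total_steps = 3_000_000
num_updates = total_steps // 2048  # assuming 2048 env steps per update

# Linearly annealed learning rate over the whole run
lr_schedule = optax.linear_schedule(init_value=3e-4, end_value=0.0,
                                    transition_steps=num_updates)
# Constant alternative tried above:
# lr_schedule = 3e-4

optimizer = optax.chain(
    optax.clip_by_global_norm(0.5),  # gradient clipping, as is common for PPO
    optax.adam(learning_rate=lr_schedule),
)
```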

Result: copying the remap method from Tianshou still doesn't work.

Result: when using the exact data from Tianshou, the losses produced by both implementations are the same.

Result: I don't know how to test this.

It turns out that we need an observation normalizer; a rough sketch of the idea is below.
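A minimal sketch of a running observation normalizer, similar in spirit to Tianshou's running mean/std normalization. This is my own illustration under assumed names (`RunningStats`, `update_stats`, `normalize`), not the repo's code.

```python
from typing import NamedTuple
import jax.numpy as jnp

class RunningStats(NamedTuple):
    mean: jnp.ndarray   # per-dimension running mean of observations
    var: jnp.ndarray    # per-dimension running variance
    count: float        # number of observations seen so far

def update_stats(stats: RunningStats, batch: jnp.ndarray) -> RunningStats:
    """Update running statistics from a batch of observations [B, obs_dim]."""
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    batch_count = batch.shape[0]

    delta = batch_mean - stats.mean
    total = stats.count + batch_count
    new_mean = stats.mean + delta * batch_count / total
    m_a = stats.var * stats.count
    m_b = batch_var * batch_count
    new_var = (m_a + m_b + delta ** 2 * stats.count * batch_count / total) / total
    return RunningStats(mean=new_mean, var=new_var, count=total)

def normalize(stats: RunningStats, obs: jnp.ndarray,
              eps: float = 1e-8, clip: float = 10.0) -> jnp.ndarray:
    """Standardize observations with the running stats and clip outliers."""
    return jnp.clip((obs - stats.mean) / jnp.sqrt(stats.var + eps), -clip, clip)
```

The idea is to update the statistics with every collected batch and feed only normalized observations to the policy and value networks, which is what the Tianshou MuJoCo PPO benchmark does and what was missing here.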