Closed twni2016 closed 2 years ago
Now I cannot reproduce the results given the same seed. I have confirmed that the nondeterminism comes from the PyTorch side, not from NumPy or Gym.
Closing this PR as it will be merged into main via https://github.com/twni2016/pomdp-baselines/pull/13
This PR is not intended to be merged; it is a showcase of supporting pixel observations with a discrete action space, e.g. Atari games.
We use the delayed-catch environment, introduced by IMPALA+SR (https://arxiv.org/abs/2102.12425), as a sanity check. The environment has only a terminal reward and requires long-term memory. Its image observations are 1x7x7, it has 3 discrete actions, and its horizon is roughly runs*7. We replace the MLP encoder used for vector observations with a simple image encoder for image observations.
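A minimal sketch of what such an image encoder could look like for 1x7x7 observations (the architecture and names below are illustrative assumptions, not the PR's actual code):

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Hypothetical CNN encoder replacing the MLP encoder for 1x7x7 images."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3),   # 7x7 -> 5x5
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3),  # 5x5 -> 3x3
            nn.ReLU(),
            nn.Flatten(),                      # -> 32 * 3 * 3 = 288
        )
        self.fc = nn.Linear(32 * 3 * 3, embed_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, 1, 7, 7) -> (batch, embed_dim)
        return self.fc(self.conv(obs))

enc = ImageEncoder()
x = torch.zeros(4, 1, 7, 7)
out = enc(x)
print(out.shape)  # torch.Size([4, 64])
```

The encoder's output embedding then plugs into the rest of the agent exactly where the MLP encoder's output would go.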
We try delayed-catch with 5, 10, 20, and 40 runs; the more runs, the harder the problem. Below are the learning curves for 10, 20, and 40 runs for IMPALA and IMPALA+SR (their Fig. 7b).
Our running command:
where we found that a fixed temperature works much better than auto-tuning it with a target entropy in this task. (It is still a bit strange why the former works but the latter does not; with auto-tuning, the actor gradient eventually vanishes to zero.)
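To make the two temperature schemes concrete, here is a rough sketch for a discrete-action SAC-style agent (one common variant of the auto-tuning loss; the exact loss and variable names are assumptions, not this repo's implementation):

```python
import torch
import torch.nn.functional as F

# Toy categorical policy over 3 discrete actions for a batch of 8 states.
logits = torch.randn(8, 3)
probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)
entropy = -(probs * log_probs).sum(dim=-1)  # per-state policy entropy

# Scheme 1: fixed temperature, a constant hyperparameter.
alpha_fixed = 0.1

# Scheme 2: auto-tuned temperature, learned to push entropy toward a target
# (here a fraction of the max entropy log|A|; one common choice, assumed).
log_alpha = torch.zeros(1, requires_grad=True)
target_entropy = 0.98 * torch.log(torch.tensor(3.0))
alpha_loss = (log_alpha.exp() * (entropy - target_entropy).detach()).mean()
# If the policy entropy settles near the target, alpha can shrink, and with
# it the entropy bonus in the actor loss, so the actor gradient can decay.
print(entropy.shape, float(alpha_loss))
```

This is only meant to illustrate the mechanism; in the actual runs we simply pass a fixed alpha via the config instead of enabling the auto-tuning branch.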
Different fixed alpha values:
With alpha=0.1: