twni2016 / pomdp-baselines

Simple (but often Strong) Baselines for POMDPs in PyTorch, ICML 2022
https://sites.google.com/view/pomdp-baselines
MIT License

Pixel observation with recurrent SAC-Discrete #2

Closed twni2016 closed 2 years ago

twni2016 commented 2 years ago

This PR is not intended to be merged; it serves as a showcase of supporting pixel observations with a discrete action space, e.g., Atari games.

We take the delayed-catch environment, introduced by IMPALA+SR (https://arxiv.org/abs/2102.12425), as a sanity check. The environment has only a terminal reward and requires long-term memory. It has an image size of 1x7x7, a discrete action space of 3, and a horizon of ~runs*7. We use a simple image encoder for the image observation to replace the MLP encoder used for vector observations.
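For illustration, a minimal sketch of such a CNN encoder for the 1x7x7 observations is below. The class name, layer sizes, and embedding dimension are assumptions, not the repo's actual code; the point is only that a small conv stack replaces the MLP encoder before the recurrent policy.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Hypothetical sketch of a simple image encoder for 1x7x7 observations."""

    def __init__(self, obs_shape=(1, 7, 7), embed_dim=64):
        super().__init__()
        c, h, w = obs_shape
        self.conv = nn.Sequential(
            nn.Conv2d(c, 16, kernel_size=3, stride=1, padding=1),   # 16 x 7 x 7
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),  # 32 x 7 x 7
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy forward pass.
        with torch.no_grad():
            flat_dim = self.conv(torch.zeros(1, c, h, w)).shape[1]
        # Project to the same embedding size the MLP encoder would produce.
        self.fc = nn.Linear(flat_dim, embed_dim)

    def forward(self, obs):
        # obs: (batch, 1, 7, 7) float tensor
        return self.fc(self.conv(obs))
```

The resulting embedding is then fed to the recurrent SAC-Discrete agent in place of the MLP-encoded vector observation.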

We try delayed-catch with 5, 10, 20, and 40 runs; the more runs, the harder the problem. Below are the learning curves of 10, 20, and 40 runs for IMPALA and IMPALA+SR (their Fig. 7b).

[Screenshot: IMPALA and IMPALA+SR learning curves on delayed-catch with 10, 20, 40 runs (their Fig. 7b)]

Our running command:

```bash
# We sweep over the following range
python3 policies/main.py --cfg configs/pomdp/catch/rnn.yml --noautomatic_entropy_tuning --entropy_alpha [0.1,0.01,0.001]
```

where we found that a fixed temperature works much better than auto-tuning it with a target entropy in this task. (It is still a bit unclear why the former works while the latter does not; with auto-tuning, the actor gradient eventually goes to zero.)
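To make the role of the fixed temperature concrete, here is an illustrative sketch of the SAC-Discrete actor loss with a constant alpha. The function name and tensor shapes are assumptions for illustration, not the repo's exact implementation.

```python
import torch
import torch.nn.functional as F

def sacd_actor_loss(logits, q1, q2, alpha=0.1):
    """Illustrative SAC-Discrete actor loss with a fixed temperature alpha.

    logits: (batch, num_actions) policy logits from the recurrent actor
    q1, q2: (batch, num_actions) critic estimates
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    min_q = torch.min(q1, q2)
    # Expectation over the discrete action distribution. With a fixed alpha the
    # entropy bonus stays constant, whereas auto-tuning toward a target entropy
    # can drive the temperature (and hence the actor gradient) toward zero.
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
```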

[Screenshots: learning curves comparing fixed temperature and auto-tuned temperature]

Different fixed alpha values:

[Screenshot: learning curves for different fixed alpha values]

With alpha=0.1:

[Screenshot: learning curve with alpha=0.1]
twni2016 commented 2 years ago

Currently I cannot reproduce the results given the same seed. I have confirmed that the nondeterminism comes from the PyTorch side, not from numpy or gym.
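For reference, a minimal sketch of the PyTorch-side determinism settings one would typically check in this situation (this is a generic checklist, not the fix used in the repo):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int):
    # Seed the Python, NumPy, and PyTorch RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN autotuning and nondeterministic kernels are common PyTorch-side
    # sources of run-to-run variation even with identical seeds.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```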

twni2016 commented 2 years ago

Closing this PR, as it will be merged into main via https://github.com/twni2016/pomdp-baselines/pull/13