Closed twni2016 closed 2 years ago
Now I cannot reproduce the results given the same seed. I have confirmed that the nondeterminism comes from the PyTorch side, not from NumPy or Gym.
Closing this PR as it will be merged into main via https://github.com/twni2016/pomdp-baselines/pull/13
This PR is not intended to be merged; it is a showcase of supporting pixel observations with a discrete action space, e.g. Atari games.
We use the delayed-catch environment, introduced by IMPALA+SR (https://arxiv.org/abs/2102.12425), as a sanity check. The environment has only a terminal reward and requires long-term memory. Its image observations are 1x7x7, it has 3 discrete actions, and its horizon is roughly runs*7. We replace the MLP encoder used for vector observations with a simple image encoder for image observations.
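A minimal sketch of what such an image encoder could look like for 1x7x7 observations (the architecture and names below are illustrative assumptions, not the PR's actual code):

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Hypothetical CNN encoder replacing the MLP encoder for 1x7x7 images."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3),   # 7x7 -> 5x5
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3),  # 5x5 -> 3x3
            nn.ReLU(),
            nn.Flatten(),                      # -> 32 * 3 * 3 = 288
        )
        self.fc = nn.Linear(32 * 3 * 3, embed_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, 1, 7, 7) -> (batch, embed_dim)
        return self.fc(self.conv(obs))

enc = ImageEncoder()
x = torch.zeros(4, 1, 7, 7)
out = enc(x)
print(out.shape)  # torch.Size([4, 64])
```

The encoder's output embedding then plugs into the rest of the agent exactly where the MLP encoder's output would go.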
We try delayed-catch with 5, 10, 20, and 40 runs; the more runs, the harder the problem. Below are the learning curves for 10, 20, and 40 runs for IMPALA and IMPALA+SR (their Fig. 7b).
Our running command:
where we found that a fixed temperature works much better than auto-tuning it with a target entropy in this task. (It is still a bit strange why the former works but the latter does not; with auto-tuning, the actor gradient eventually vanishes to zero.)
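To make the two temperature schemes concrete, here is a rough sketch for a discrete-action SAC-style agent (one common variant of the auto-tuning loss; the exact loss and variable names are assumptions, not this repo's implementation):

```python
import torch
import torch.nn.functional as F

# Toy categorical policy over 3 discrete actions for a batch of 8 states.
logits = torch.randn(8, 3)
probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)
entropy = -(probs * log_probs).sum(dim=-1)  # per-state policy entropy

# Scheme 1: fixed temperature, a constant hyperparameter.
alpha_fixed = 0.1

# Scheme 2: auto-tuned temperature, learned to push entropy toward a target
# (here a fraction of the max entropy log|A|; one common choice, assumed).
log_alpha = torch.zeros(1, requires_grad=True)
target_entropy = 0.98 * torch.log(torch.tensor(3.0))
alpha_loss = (log_alpha.exp() * (entropy - target_entropy).detach()).mean()
# If the policy entropy settles near the target, alpha can shrink, and with
# it the entropy bonus in the actor loss, so the actor gradient can decay.
print(entropy.shape, float(alpha_loss))
```

This is only meant to illustrate the mechanism; in the actual runs we simply pass a fixed alpha via the config instead of enabling the auto-tuning branch.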
Different fixed alpha values:
With alpha=0.1: