rail-berkeley / rlkit

Collection of reinforcement learning algorithms
MIT License

Difficulty Reproducing HalfCheetah-v2 SAC Results #128

Open xanderdunn opened 3 years ago

xanderdunn commented 3 years ago

Huge thanks for providing this implementation, it's very high quality.

I'm having difficulty reproducing the results of the original SAC paper using the provided examples/sac.py script.

The paper reports a mean return of 15,000 in 3M steps (blue and orange lines are SAC):

[Screenshot: HalfCheetah-v2 learning curves from the SAC paper]

My runs on the unmodified examples/sac.py script appear to be considerably less sample efficient:

[Screenshot: learning curves from my unmodified examples/sac.py runs]

My runs pretty consistently reach an average return of 13,000 at 10M steps. They may eventually get to 15,000 if left to run for millions of steps further, but they are requiring more than 3x the number of steps to reach 13k, versus the paper's 15k at 3M steps.

I have found that results can vary greatly from run to run; notice the pink line in the chart above, which does poorly. Does the paper run many seeds and report the best? I didn't see this mentioned in its Experiments section.

It appears to me that the hyperparameters shown in the paper match those used in the script, which I have not modified:

[Screenshot: hyperparameter table from the SAC paper]

Am I interpreting the "num total steps" and "Returns Mean" correctly? Do you know what might cause this difference in sample efficiency and final return?
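
For reference, here is roughly what the variant dict in the unmodified examples/sac.py contains (paraphrased from the script, so exact values may differ slightly from the current repo):

# Sketch of the settings in examples/sac.py; values are approximate, not authoritative.
variant = dict(
    algorithm="SAC",
    layer_size=256,
    replay_buffer_size=int(1e6),
    algorithm_kwargs=dict(
        num_epochs=3000,
        num_eval_steps_per_epoch=5000,
        num_trains_per_train_loop=1000,
        num_expl_steps_per_train_loop=1000,
        min_num_steps_before_training=1000,
        max_path_length=1000,
        batch_size=256,
    ),
    trainer_kwargs=dict(
        discount=0.99,
        soft_target_tau=5e-3,
        target_update_period=1,
        policy_lr=3e-4,
        qf_lr=3e-4,
        reward_scale=1,
        use_automatic_entropy_tuning=True,
    ),
)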

vitchyr commented 3 years ago

Hi, thanks for pointing this out. One possible cause for this difference is that this implementation alternates between sampling entire trajectories and taking gradient steps, whereas the original SAC paper alternates between one environment step and one gradient step. It's hard to compare the two exactly, but I'm guessing that a small change like increasing num_trains_per_train_loop would compensate for this difference.

Other possible causes are differences in network initialization or very minor differences in the Adam optimizer implementation (I've seen people discuss the latter, though I don't particularly suspect it).

xanderdunn commented 3 years ago

@vitchyr Thanks very much, I will try increasing num_trains_per_train_loop.
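
Concretely, I plan to bump that key in the variant dict of examples/sac.py, e.g. (3000 is just one candidate value, not a recommendation from the repo):

# More gradient steps per collected batch of environment steps.
variant["algorithm_kwargs"]["num_trains_per_train_loop"] = 3000  # default is 1000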

I don't see mention in the SAC paper of how the network's weights were initialized. I might look at the official implementation to see if it differs.
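
One cheap check I may try first: re-initialize rlkit's networks with Glorot/Xavier-uniform weights and zero biases, roughly matching the TensorFlow/Keras defaults that softlearning presumably inherits, and see whether the learning curves change. A sketch, assuming the networks built in examples/sac.py are ordinary torch.nn modules (the variable names qf1, qf2, target_qf1, target_qf2, and policy are my guess at that script's locals):

import torch.nn as nn

def reinit_glorot(module):
    # Re-initialize every Linear layer with Xavier-uniform weights and zero
    # biases, approximating the tf.keras Dense defaults.
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)

# Assumed names of the networks constructed in examples/sac.py:
for net in (qf1, qf2, target_qf1, target_qf2, policy):
    reinit_glorot(net)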

xanderdunn commented 3 years ago

What values of num_trains_per_train_loop would you recommend trying? With values 1000-3000 I'm not seeing a large difference in sample efficiency:

[Screenshot: learning curves for num_trains_per_train_loop values 1000-3000]

Light blue is the default 1000 and the others are 2000 or 3000. The best I'm seeing by step 3M is a mean return of 10.2k, vs. the paper's 15k.

vitchyr commented 3 years ago

Thanks for trying that. My main suspicion, then, is that the difference between batch data collection and interleaved data collection is causing the gap. If you want to investigate this, replace the exploration path collector with a step collector and replace the batch RL algorithm with an online RL algorithm. It might take a few more edits to get it to run, but these components should be fairly plug-and-play.
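
Something along these lines should be close (a sketch only; the class names and constructor arguments below are from memory of the repo, so double-check them), swapping the exploration-side collector so that individual steps, rather than whole paths, feed the replay buffer:

from rlkit.samplers.data_collector import MdpStepCollector
from rlkit.torch.torch_rl_algorithm import TorchOnlineRLAlgorithm

# Collect exploration data one environment step at a time.
expl_step_collector = MdpStepCollector(expl_env, policy)

# Interleave environment steps with gradient steps, as in the original paper.
# trainer, expl_env, eval_env, eval_path_collector, replay_buffer, and variant
# refer to the objects already built in examples/sac.py.
algorithm = TorchOnlineRLAlgorithm(
    trainer=trainer,
    exploration_env=expl_env,
    evaluation_env=eval_env,
    exploration_data_collector=expl_step_collector,
    evaluation_data_collector=eval_path_collector,
    replay_buffer=replay_buffer,
    **variant["algorithm_kwargs"],
)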

xanderdunn commented 3 years ago

Thanks again for your help @vitchyr.

It looks like this issue in the softlearning repo is related: rail-berkeley/softlearning#75

However, I managed to get the same experiment running in softlearning and found that the results matched those in the paper. Running this:

softlearning run_example_local examples.development \
    --algorithm SAC \
    --universe gym \
    --domain HalfCheetah \
    --task v2 \
    --exp-name my-sac-experiment-1 \
    --checkpoint-frequency 1000

I got these results on four different seeds:

[Screenshot: softlearning HalfCheetah-v2 learning curves, four seeds]

These results match the paper's reported result of ~15,000 mean return within the first 3M timesteps; the evaluation mean return was >15k for all runs. Note that each of these runs took 10.7 hours.

Compare to rlkit runs with four different values of num_trains_per_train_loop:

[Screenshot: rlkit learning curves for four values of num_trains_per_train_loop]

Mean return over the first 3M timesteps ranges from 6,200 to 11,000. Due to the high values of num_trains_per_train_loop, these runs also took longer to compute: the best-performing one, with num_trains_per_train_loop==5000, took 14 hours on the same hardware.

rlkit has more RL algorithms implemented and is better maintained, but for now I will continue with the TensorFlow implementation (softlearning), since the paper's baseline result is immediately reproducible there. Sample and computational efficiency are important for our work.

ZhenhuiTang commented 1 year ago

Hi, where can I see the results when I run "python3 examples/ddpg.py"? I could not find the 'output' file.