hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] SAC target q nets may be updated many times in each round? #900

Open xuanqing94 opened 4 years ago

xuanqing94 commented 4 years ago

This line of code: https://github.com/hill-a/stable-baselines/blob/d288b6d911094268b17abb61dbe9ac3ca05e1ce3/stable_baselines/sac/sac.py#L464

Since step iterates over range(total_timesteps) and grad_step is nested inside it over range(self.gradient_steps), doesn't that mean the target networks are updated multiple times per environment step whenever self.gradient_steps is not 1?
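For reference, a simplified, runnable sketch of the loop structure in question (toy values; the variable names follow the SB code, but the loop bodies are paraphrased placeholders, not the actual implementation):

total_timesteps = 10
gradient_steps = 3
target_update_interval = 4

for step in range(total_timesteps):
    # ... collect one environment transition and store it in the replay buffer ...
    for grad_step in range(gradient_steps):
        # ... one gradient update of the critics / actor ...
        if (step + grad_step) % target_update_interval == 0:
            print("target update at step", step, "grad_step", grad_step)

With gradient_steps > 1, the same (step + grad_step) value recurs across consecutive outer steps, so the target networks can be updated several times per environment step.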

Miffyli commented 4 years ago

I have not refreshed my memory of the SAC internals, but if the goal is to update the target network every target_update_interval updates to the main network, then multiple updates per environment step are to be expected. However, there is a related mistake here: with gradient_steps > 1, the timing of the updates gets thrown off. See the example below; the zeros should be equally far apart from each other.

@araffin Can you confirm and check this?

for i in range(10):        # i ~ environment step
    for j in range(3):     # j ~ gradient step (gradient_steps = 3)
        print((i + j) % 4) # 0 would trigger a target update (interval = 4)

0
1
2
1
2
3
2
3
0
3
0
1
0
1
2
1
2
3
2
3
0
3
0
1
0
1
2
1
2
3
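
For comparison, a sketch of one possible fix (not the current SB code): checking a single running counter of gradient updates keeps the zeros evenly spaced.

n_updates = 0
for i in range(10):           # environment steps
    for j in range(3):        # gradient steps per environment step
        print(n_updates % 4)  # prints 0, 1, 2, 3, 0, 1, 2, 3, ... zeros every 4 updates
        n_updates += 1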
araffin commented 4 years ago

Can you confirm and check this?

In short: yes, it is a bug, but luckily target_update_interval=1, so it won't change any results :sweat_smile:.

Looking at the official implementation, there apparently is something wrong there as well:

https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py#L294

https://github.com/rail-berkeley/softlearning/blob/daf076b5042625704f63373b145509f716e67819/softlearning/algorithms/rl_algorithm.py#L345

They check iteration % self._target_update_interval == 0, but iteration is the same for several gradient steps...

@hartikainen could you comment on that?

We may also need to fix that in SB3; the implementation there has the same behavior as the original one and follows what is done in TD3: the target networks are updated after each gradient step.

hartikainen commented 4 years ago

Hey @araffin. The iteration is intentionally constant for all the gradient steps in the softlearning code. Basically, the logic currently checks whether we should train on this iteration and, if so, applies self._n_train_repeat gradient steps. However, I think the way these values are configured in softlearning is really confusing, so you all might want to think of a better way to configure them 🙂 For SAC, the exact gradient-step interval doesn't really matter: it's pretty much the same to apply x/y gradient steps after each collected sample or Nx gradient steps after every Ny samples, as long as the ratio between gradient steps and environment samples stays the same and the gradient interval is somewhat reasonable (e.g. sampling a full 1000-step episode should work for sure).
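As a concrete (hypothetical) illustration in terms of SB-style hyperparameters, these two settings keep the same 1:1 ratio of gradient steps to collected samples, which is what matters here:

# Hypothetical settings, only meant to illustrate the ratio argument:
config_a = dict(train_freq=1, gradient_steps=1)        # 1 gradient step after every sample
config_b = dict(train_freq=1000, gradient_steps=1000)  # 1000 gradient steps after every 1000 samples

for cfg in (config_a, config_b):
    print(cfg["gradient_steps"] / cfg["train_freq"])   # both print 1.0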

araffin commented 4 years ago

Thanks @hartikainen for your answer.

to apply x/y gradient steps after each collected sample or Nx gradient steps after every Ny samples

Yes, I have also experienced that. My question was more about the target network update, though. In DQN, it is independent of the gradient update; in TD3, there is one soft update per gradient step. For SAC, it is currently the same as in TD3, but the current code in softlearning seems to be closer to DQN... (it does not change anything at the moment though, because target_update_interval=1.)
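To make the distinction concrete, a toy runnable sketch of the two schedules being compared (the "networks" here are just floats, and the parameter names are illustrative, not the actual implementations):

source, target = 1.0, 0.0
tau = 0.005  # Polyak averaging coefficient

# TD3-style: one soft update per gradient step
for grad_step in range(3):
    # ... gradient update of `source` would happen here ...
    target = (1 - tau) * target + tau * source  # target tracks every gradient step

# DQN-style: periodic hard copy, tied to environment steps rather than gradient steps
target_update_freq = 500
for env_step in range(1000):
    if env_step % target_update_freq == 0:
        target = source  # full copy every target_update_freq environment steps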

hartikainen commented 4 years ago

Ah I see, yeah. The target update, in my opinion, should be dependent on the gradient updates, so it's probably preferable to do what TD3 does.