sfujim / TD3

Author's PyTorch implementation of TD3 for OpenAI gym tasks
MIT License

Would it be possible to use a single network? #9

Closed · kayuksel closed this issue 5 years ago

kayuksel commented 5 years ago

Hello, thank you for this great work. I have a few questions, as I am a newbie in Reinforcement Learning.

Would it be possible to use a single network with multiple heads rather than two separate networks? I am actually training such a network successfully; I am just not giving the actor its reward based on the critic yet.

What is the advantage of using the critic reward rather than the real reward for the actor? When I do multi-tasking by using the critic loss as a regularization term, it already improves the actor.

However, I understand that it is advisable to make one policy update for every two Q-value updates via delayed updates. What if I call backward() only on the relevant loss at each step according to this schedule?
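
To be concrete, here is a minimal sketch of the schedule I have in mind (toy networks and random stand-in batches, just to show the update timing, not the real training loop):

```python
import torch
import torch.nn as nn

# Toy actor/critic and dummy data, only to illustrate the delayed schedule;
# TD3 itself uses larger networks, target networks, and a replay buffer.
state_dim, action_dim, policy_delay = 3, 1, 2
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

for step in range(10):
    state = torch.randn(64, state_dim)   # stand-in for a sampled batch
    target_q = torch.randn(64, 1)        # stand-in for the TD target

    # Critic (Q-value) update on every step.
    q = critic(torch.cat([state, actor(state).detach()], dim=1))
    critic_loss = nn.functional.mse_loss(q, target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor (policy) update only every `policy_delay` steps.
    if step % policy_delay == 0:
        actor_loss = -critic(torch.cat([state, actor(state)], dim=1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
```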

Note: while writing this, I have actually just started using the minimum of the real reward and the critic's estimate in the actor loss, and I will see whether that improves the results.

Although I am concurrently optimizing the critic loss (MSE), I am afraid that the critic might still over-estimate, even with the real reward as a worst case. Is that why having two separate networks is necessary?

The reason a single network would be desirable is resources such as memory and, more importantly, that learning the two tasks together might help both converge (as it does in my case). So even if there are two separate networks, maybe they could share some layers?

sfujim commented 5 years ago

Hello, I'm glad you've enjoyed my work! Some answers to your questions:

Would it be possible to use a single network with multiple heads rather than two separate networks?

Yes, it would be possible to use a multi-headed network; however, the benefit will be reduced, as the two estimates will be more correlated (they share the same feature representation). I don't recommend this.
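
To make the distinction concrete, here is a rough sketch of the two options (toy layer sizes, not the exact architecture used in this repo):

```python
import torch
import torch.nn as nn

class TwinCritic(nn.Module):
    """Two fully separate Q-networks, in the spirit of TD3 (toy sizes)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                                nn.ReLU(), nn.Linear(64, 1))
        self.q2 = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                                nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)
        return self.q1(sa), self.q2(sa)

class MultiHeadCritic(nn.Module):
    """Shared trunk with two Q-heads: cheaper, but the two estimates
    share a feature representation and are therefore more correlated."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                                   nn.ReLU())
        self.head1 = nn.Linear(64, 1)
        self.head2 = nn.Linear(64, 1)

    def forward(self, state, action):
        h = self.trunk(torch.cat([state, action], dim=1))
        return self.head1(h), self.head2(h)
```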

What is the advantage of using the critic reward rather than the real reward for the actor?

I assume you mean the Q-network vs. Monte Carlo returns? You can find the benefits and trade-offs of temporal-difference learning in Sutton & Barto, or in any standard RL course's lecture notes.
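
Very roughly, the difference in the learning target looks like this (toy numbers and illustrative names, not code from this repo):

```python
# Monte Carlo vs. one-step TD targets for a transition (s_t, a_t, r_t, s_{t+1}).
gamma = 0.99

def mc_return(rewards_from_t):
    """Monte Carlo: discounted sum of the actual rewards observed
    until the episode ended; no bootstrapping from the critic."""
    g = 0.0
    for r in reversed(rewards_from_t):
        g = r + gamma * g
    return g

def td_target(r_t, next_q_estimate, done):
    """Temporal difference: one real reward, then bootstrap from the
    critic's own estimate of the next state-action value."""
    return r_t + gamma * (0.0 if done else next_q_estimate)

print(mc_return([1.0, 0.0, 2.0]))                       # 1 + 0.99*0 + 0.99**2 * 2
print(td_target(1.0, next_q_estimate=1.5, done=False))  # 1 + 0.99*1.5
```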

I am afraid that the critic might still over-estimate, even with the real reward as a worst case. Is that why having two separate networks is necessary?

Yes, combating overestimation bias is the primary goal of this work.
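
Concretely, the clipped double-Q target takes the element-wise minimum of the two critics' estimates, so a single critic overestimating a state-action pair is not enough to inflate the learning target. A rough sketch (illustrative tensors; target policy smoothing omitted):

```python
import torch

gamma = 0.99

# Stand-ins for a sampled batch and the two target critics' outputs at (s', pi'(s')).
reward = torch.randn(64, 1)
not_done = torch.ones(64, 1)
target_q1 = torch.randn(64, 1)
target_q2 = torch.randn(64, 1)

# Clipped double Q-learning: use the minimum of the two estimates as the target.
target_q = reward + not_done * gamma * torch.min(target_q1, target_q2)
```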