Sec 35.3.5 (DDPG)
Currently the text just says: "The DDPG algorithm of [Lil+16], which stands for "deep deterministic policy gradient", uses the DQN method (Section 35.2.6) to update Q that is represented by deep neural networks." We expand on this a little, since DDPG is quite widely used.
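To make this concrete, here is a minimal sketch (not the book's code) of one DDPG update step in PyTorch: the critic is trained with a DQN-style TD target, except that the max over actions is replaced by the target actor's deterministic action, and the actor is improved by ascending Q(s, mu(s)), i.e., the deterministic policy gradient. The helper name ddpg_update, the hidden-layer sizes, learning rates, gamma, tau, and the random mini-batch are all illustrative assumptions.

    import copy
    import torch
    import torch.nn as nn

    obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005   # illustrative sizes and constants

    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, act_dim), nn.Tanh())
    critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))
    actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def ddpg_update(s, a, r, s2, done):
        # Critic: DQN-style TD target, but max_a' Q(s', a') is replaced by the
        # target critic evaluated at the target actor's action mu_targ(s').
        with torch.no_grad():
            q_next = critic_targ(torch.cat([s2, actor_targ(s2)], dim=-1)).squeeze(-1)
            q_targ = r + gamma * (1.0 - done) * q_next
        q = critic(torch.cat([s, a], dim=-1)).squeeze(-1)
        critic_loss = ((q - q_targ) ** 2).mean()
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: deterministic policy gradient, i.e., ascend Q(s, mu(s)).
        actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Target networks slowly track the online networks (Polyak averaging).
        with torch.no_grad():
            for net, targ in [(actor, actor_targ), (critic, critic_targ)]:
                for p, p_t in zip(net.parameters(), targ.parameters()):
                    p_t.mul_(1 - tau).add_(tau * p)

    # Illustrative call on a random mini-batch of transitions (s, a, r, s', done).
    B = 32
    ddpg_update(torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
                torch.randn(B), torch.randn(B, obs_dim), torch.zeros(B))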
Sec 35.4 (MBRL)
Significant expansion and restructuring of the content.
Sec 35.5.3 (deadly triad)
Added a brief discussion of gradient TD methods and target networks, which can help stabilize off-policy learning.
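For reference, the following is a minimal sketch (not the book's code) of the TDC / GTD(0) update for linear value estimation V(s) approximated as theta @ phi(s). The helper name tdc_update, the step sizes alpha and beta, and the random features are illustrative assumptions. The second weight vector w is trained to predict the TD error from the features; the correction term built from w turns the usual semi-gradient TD step into (approximately) a stochastic gradient of the projected Bellman error, which is what makes the update stable under off-policy sampling.

    import numpy as np

    def tdc_update(theta, w, phi, phi_next, r, gamma=0.99, alpha=0.01, beta=0.1):
        """One TDC / GTD(0) step for V(s) ~= theta @ phi(s) with linear features."""
        delta = r + gamma * theta @ phi_next - theta @ phi              # TD error
        theta_new = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
        w_new = w + beta * (delta - w @ phi) * phi                      # w predicts delta from phi
        return theta_new, w_new

    # Illustrative call with random features.
    d = 8
    theta, w = np.zeros(d), np.zeros(d)
    theta, w = tdc_update(theta, w, np.random.rand(d), np.random.rand(d), r=1.0)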
Sec 35.5.4 (new: off-policy in practice)
Added a very short new section listing common off-policy methods, for ease of reference.
Sec 35.6 (control as inference)
Added clearer subsection headings, to give the section more structure.
Moved the subsection on 'imitation learning' into its own Sec 35.7.
Sec 35.7 (new: imitation learning)
This now contains the content that used to be in Sec 35.6.3.
Sec 35.8 (new: "Other topics in RL")
Added brief discussions of various topics, such as general value functions (GVFs), temporal abstraction (options), partial observability, reward functions (including shaping and hacking), and offline RL.
To avoid too much divergence from the original text, I have rolled back these changes.
This new content will be added to a new tutorial on RL that I am writing.