nosyndicate closed this issue 6 years ago.
Hi, it seems to me that the policy does not use a target network, even though the paper states that it does: https://github.com/haarnoja/softqlearning/blob/aca29d2aee66c44ee052a298f049a22fa14792a5/softqlearning/algos/softqlearning.py#L380
Am I missing something here?
Good catch! Indeed, we don't have a target policy; that is a mistake in our paper. A target network is only needed to stabilize TD learning, and learning the policy has no such need. We'll fix this in the next version.
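To illustrate the distinction, here is a minimal sketch under simplifying assumptions: a tabular, discrete-action version of soft Q-learning, not the repository's continuous-action implementation. The target Q-table appears only inside the soft TD target, where it stabilizes bootstrapping, while the Boltzmann policy is read directly from the online Q-table with no target copy. The function names, the Polyak coefficient `tau`, the temperature `alpha`, and the other hyperparameters are hypothetical choices for illustration.

```python
import math


def soft_value(q_row, alpha):
    """Soft state value V(s) = alpha * log(sum_a exp(Q(s, a) / alpha)), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))


def td_target(reward, next_state, done, q_target, gamma, alpha):
    """Soft TD target; bootstraps from the *target* Q-table to stabilize TD learning."""
    if done:
        return reward
    return reward + gamma * soft_value(q_target[next_state], alpha)


def polyak_update(q_target, q_online, tau):
    """Let the target Q-table slowly track the online one: target <- tau*online + (1-tau)*target."""
    for s, row in enumerate(q_online):
        for a, q in enumerate(row):
            q_target[s][a] = tau * q + (1.0 - tau) * q_target[s][a]


def boltzmann_policy(q_row, alpha):
    """pi(a|s) proportional to exp(Q(s, a) / alpha), read from the *online* Q-table (no target policy)."""
    m = max(q_row)
    weights = [math.exp((q - m) / alpha) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def train_step(q_online, q_target, transition, lr=0.1, gamma=0.99, alpha=1.0, tau=0.01):
    """One tabular soft Q-learning update; only the Q-function keeps a target copy."""
    s, a, r, s_next, done = transition
    y = td_target(r, s_next, done, q_target, gamma, alpha)
    q_online[s][a] += lr * (y - q_online[s][a])
    polyak_update(q_target, q_online, tau)


# Usage: two states, two actions, all values initialised to zero.
q_online = [[0.0, 0.0], [0.0, 0.0]]
q_target = [row[:] for row in q_online]
train_step(q_online, q_target, (0, 1, 1.0, 1, True))
print(q_online[0][1], q_target[0][1])  # online moves to 0.1, target only to ~0.001
print(boltzmann_policy(q_online[0], alpha=1.0))
```

The policy always reflects the latest online Q-values; only the regression target lags behind, which is the one place where a moving target would otherwise destabilize learning.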