Hi, in your implementation SAC is used, but V is estimated from the Q-function when updating the critic and calculating the target Q, instead of using a separate value network as in the original SAC paper. Would you please explain this or give some references? Thanks
Since we're using a discrete action space, our Q-function outputs a value for every possible action. As such, we can compute V exactly by marginalizing Q under the policy, V(s) = Σ_a π(a|s) [Q(s,a) − α log π(a|s)], rather than estimating V with a separate network. This is the formulation used for discrete SAC; see Christodoulou, "Soft Actor-Critic for Discrete Action Settings" (arXiv:1910.07207).
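A minimal sketch of that marginalization, assuming you already have the per-action Q-values and the categorical policy's log-probabilities for one state (function and variable names here are illustrative, not from the repo):

```python
import numpy as np

def soft_state_value(q_values, log_probs, alpha=0.2):
    """Compute V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s)).

    q_values:  array of Q(s, a) for every discrete action a
    log_probs: array of log pi(a|s) for every discrete action a
    alpha:     entropy temperature
    """
    probs = np.exp(log_probs)
    return np.sum(probs * (q_values - alpha * log_probs))

# Example: uniform policy over 2 actions, both Q-values equal to 1.0.
q = np.array([1.0, 1.0])
logp = np.log(np.array([0.5, 0.5]))
v = soft_state_value(q, logp, alpha=0.2)  # 1.0 + 0.2 * log(2)
```

The same expression, applied to the target Q-network's outputs at the next state, gives the target used in the critic update, so no separate V-network is needed.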