thiagopbueno / model-aware-policy-optimization

MAPO: Model-Aware Policy Optimization algorithm
GNU General Public License v3.0

Add bounded output layer in critic network (parametrized via config flag) for problems with non-positive reward #79

Open thiagopbueno opened 5 years ago

thiagopbueno commented 5 years ago

Make sure we constrain the set of function approximators for the Q-function in problems where the expected sum of rewards is bounded (e.g., never positive).
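A rough sketch of what such an output layer could look like, assuming the bound is "never positive" and using a negated softplus on the critic's final pre-activation. The names `critic_head` and the `bounded_output` config flag are hypothetical, chosen for illustration only, and plain Python stands in for the actual network code:

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def critic_head(pre_activation: float, config: dict) -> float:
    """Map the critic's final pre-activation to a Q-value.

    `bounded_output` is a hypothetical config flag. When it is set, the
    output is -softplus(z), which can never be positive. Otherwise the
    pre-activation is returned unchanged (an unconstrained Q-network).
    """
    if config.get("bounded_output", False):
        return -softplus(pre_activation)
    return pre_activation


if __name__ == "__main__":
    for z in (-3.0, 0.0, 3.0):
        bounded = critic_head(z, {"bounded_output": True})
        unbounded = critic_head(z, {"bounded_output": False})
        print(f"z={z:+.1f}  bounded={bounded:.4f}  unbounded={unbounded:.4f}")
```

Negated softplus is only one option: it keeps gradients nonzero everywhere, whereas clipping the output at zero would not. Other bounds (for example a known lower limit on the return) would need a different squashing function.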

0xangelo commented 5 years ago

I'm wondering if this is really necessary, as I've noticed improvement using general Q-networks in TD3. In TensorBoard, one can even see the initial period in which the networks' predictions start becoming all negative.