ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

unexpected negative rewards in bench monitor #184

Closed mattroos closed 5 years ago

mattroos commented 5 years ago

Apologies if this is an issue with baselines rather than this repo; I'm not sure at this point.

I built a custom environment that emits only positive rewards and caps the total reward per episode. Yet the monitor files occasionally contain negative reward values, or values greater than the cap. The over-cap values might be explained by division by the running standard deviation, as mentioned in issue #88, but I can't fathom why negative values should appear, and I haven't been able to track them down in the code.
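For context, here is a minimal sketch of how that scaling can push values past the cap; it assumes baselines' VecNormalize-style behavior of dividing rewards by a running standard deviation of returns, and the array names and numbers are purely illustrative:

```python
import numpy as np

# Hypothetical raw step rewards: all positive and well under a cap of 1.0.
raw_rewards = np.random.uniform(0.0, 0.1, size=1000)

# VecNormalize-style scaling divides rewards by the running std of the
# discounted returns.  When that std is below 1, the scaled values can
# exceed the raw cap even though no raw reward ever does.
ret_std = raw_rewards.std()  # stand-in for the tracked running return std
scaled = raw_rewards / (ret_std + 1e-8)

print(raw_rewards.max())  # <= 0.1 by construction
print(scaled.max())       # can be several times larger than the raw cap
```

Scaling alone, though, cannot flip a sign: dividing a positive reward by a positive std never makes it negative, which is what makes the negative entries puzzling.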

I also see values larger than the cap in the max rewards that main.py reports to stdout, but I do not observe negative values there.

Any thoughts/suggestions?

mattroos commented 5 years ago

As I should have guessed, this was a bug in my own code. I had a = b where I needed a = np.copy(b), so in-place updates to one array leaked into the other, with ripple effects from there. Oops.
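For anyone who hits the same symptom, a minimal illustration of that aliasing bug; the array names here are hypothetical:

```python
import numpy as np

rewards = np.array([1.0, 2.0, 3.0])

# Bug: `alias = rewards` binds a second name to the SAME underlying
# array, so in-place edits through one name show up through the other.
alias = rewards
alias -= 5.0
print(rewards)  # [-4. -3. -2.]  <- the "positive only" rewards went negative

# Fix: copy the data so the two arrays are independent.
rewards = np.array([1.0, 2.0, 3.0])
safe = np.copy(rewards)
safe -= 5.0
print(rewards)  # [1. 2. 3.]  <- unchanged
```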