Batch size needs to be large enough to get a good distribution of different state/action transitions. If we're using random starting positions, this means it should probably be large enough to contain all the steps of multiple games (I had some success with 10). "Replay priority" would also help with this.
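For reference, a minimal sketch of the kind of uniform-sampling replay buffer this note assumes (class and field names here are placeholders, not the project's actual code); prioritized replay would swap the uniform `random.sample()` for sampling weighted by TD error:

```python
# Minimal uniform-sampling replay buffer sketch (placeholder names).
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling; batch_size should be large enough that one batch
        # spans transitions from several different games/episodes.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```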
For best performance, there still needs to be a little bit of randomness. Acting purely greedily causes the robots to get stuck on things (as evidenced by the checkpoint videos vs. watching training). A final epsilon of 0.08-0.1 seemed best for the basic "Push good ball" env. Maybe, instead of a purely random exploration action, we could use a weighted selection policy for the "final policy", sampling actions in proportion to how highly each is rated (instead of just "top pick" vs. "pure random")?
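One common way to do that weighted selection is Boltzmann (softmax) sampling over the Q-values; a minimal sketch under that assumption (function and parameter names here are illustrative, not the project's API):

```python
# Weighted action selection: sample actions in proportion to exp(Q / T)
# instead of choosing either the top-rated or a uniformly random action.
import numpy as np

def select_action(q_values: np.ndarray, temperature: float = 1.0) -> int:
    """Low temperature -> nearly greedy; high temperature -> nearly uniform."""
    scaled = q_values / temperature
    scaled -= scaled.max()              # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

# Example: the slightly better action is picked more often, but not always.
# select_action(np.array([1.0, 1.2, 0.3]), temperature=0.5)
```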
[ ] Hyperparameter tweaking
[x] Addition of "target value network"
[ ] Replay priority (prioritized experience replay)
[ ] Move to Dueling DQN (see the sketch after this list)
[ ] Research other improvements/extensions that have been made to DQN over the years (the one added so far, the target network, is about as basic as it gets)
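For the dueling change, the idea is to split the Q head into a state-value stream and an advantage stream and recombine them. A minimal sketch, assuming a PyTorch model (layer sizes and names are placeholders, not the project's actual network):

```python
# Dueling Q-network head sketch: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        v = self.value(h)
        a = self.advantage(h)
        # Subtract the mean advantage so V and A stay identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)
```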