This PR reduces the VRAM usage of the training loop (fixes #16).
Allows a 4x increase in batch size with the same number of samples on a 1080 Ti. Instead of one GPU taking up 4x as much memory as all the other GPUs, it now takes up only an additional ~200 MB.
It moves the inference images to the CPU as soon as possible, which lets PyTorch immediately free the underlying GPU memory.
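A minimal sketch of that pattern (the `generator` and `latents` names here are placeholders, not this repo's API):

```python
import torch

@torch.no_grad()
def sample_images(generator, latents):
    images = generator(latents)  # output tensor lives on the GPU
    images = images.cpu()        # host copy; drops the last GPU reference
    # The GPU tensor's refcount hits zero here, so PyTorch's caching
    # allocator can reuse that memory for the next training step.
    return images
```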
It also zeros the gradients from the previous backprop so that their memory doesn't have to coexist with the second forward pass used for inference.
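The ordering looks roughly like this (a sketch, not the repo's exact training loop; note that on PyTorch 1.7+, `zero_grad(set_to_none=True)` actually releases the gradient buffers rather than just zero-filling them):

```python
import torch

def train_step_then_sample(generator, optimizer, loss, sample_latents):
    # Usual update for this step.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Release the gradient buffers *before* the sampling forward pass, so
    # the old gradients and the inference activations never coexist in VRAM.
    optimizer.zero_grad(set_to_none=True)

    with torch.no_grad():
        return generator(sample_latents).cpu()
```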
It ensures the gradients of the generator aren't accidentally stored in the accumulated EMA weights (they shouldn't be in the current codebase, but this is an extra sanity check and documents the desired behavior).
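A common way to guarantee this is to perform the EMA update inside `torch.no_grad()` on detached parameters (a sketch of the general pattern, not necessarily this repo's exact helper):

```python
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    # In-place, autograd-free update: the EMA weights never enter the
    # computation graph, so no gradient history can attach to them.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p.detach(), alpha=1.0 - decay)
```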
Overall, this lets me fit a batch size of 4 on a 1080 Ti, whereas before I could only fit a batch size of 1. This all works with the default number of samples (16).