This PR modifies the FFCV `Loader` and `EpochIterator` to create one set of CUDA streams and reuse them across epochs, instead of creating new CUDA streams every epoch.
The current approach of creating new CUDA streams every epoch can cause memory allocation to grow when a GPU transform is used: each epoch, the GPU transform allocates new memory on the newly created streams. This doesn't produce errors, since allocations from prior epochs can be reused, but it makes tracking GPU memory usage harder and can mask real out-of-memory errors.
I don't think this will cause any issues with distributed training, but I am unable to test that.
If desired, I can add a flag (similar to `recompile`) that recreates the CUDA streams every epoch to match the current behavior.
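The intended change can be sketched roughly as follows. Note that `Loader`, `epoch_streams`, `stream_factory`, and `recreate_streams` are illustrative names for this sketch, not the actual FFCV API; in the real code the streams would come from `torch.cuda.Stream`, but a factory is injected here so the sketch runs without a GPU:

```python
class Loader:
    """Sketch of stream reuse: the loader owns one set of streams
    and hands the same set to every epoch's iterator."""

    def __init__(self, num_streams, stream_factory, recreate_streams=False):
        # stream_factory stands in for torch.cuda.Stream so this runs
        # without a GPU; recreate_streams mimics the proposed
        # recompile-style flag that restores the old per-epoch behavior.
        self.num_streams = num_streams
        self.stream_factory = stream_factory
        self.recreate_streams = recreate_streams
        self._streams = None

    def epoch_streams(self):
        # Old behavior: build a fresh set of streams every epoch.
        # New behavior: build the streams once, then reuse them.
        if self.recreate_streams or self._streams is None:
            self._streams = [self.stream_factory()
                             for _ in range(self.num_streams)]
        return self._streams
```

With `recreate_streams=False`, every epoch sees the identical stream objects, so a GPU transform's per-stream allocations happen once rather than once per epoch.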