I just pushed an update onto the master.
By moving the batch_pointer write operation into the feed operation, the reads/writes for each batch have been halved, and a similar improvement should be seen in batch-processing times, since reads/writes to GPU memory can be a significant bottleneck depending on hardware.
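Roughly, the change is along these lines (a minimal sketch, not the exact code in the repo; the names and the toy session here are illustrative, assuming TF 1.x):

```python
import tensorflow as tf

# batch_pointer lives on the graph as a variable so it gets checkpointed.
batch_pointer = tf.Variable(0, trainable=False, name="batch_pointer")
batch_pointer_in = tf.placeholder(tf.int32, shape=[], name="batch_pointer_in")
set_batch_pointer = tf.assign(batch_pointer, batch_pointer_in)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Before: a separate run call just to write the pointer, i.e. one extra
    # device round trip every batch.
    # sess.run(set_batch_pointer, feed_dict={batch_pointer_in: 7})
    # sess.run(train_op, feed_dict=feed)

    # After: the pointer value rides along in the same feed_dict, so the
    # write happens inside the one run call for the batch.
    feed = {batch_pointer_in: 7}  # plus the usual input/target tensors
    sess.run(set_batch_pointer, feed_dict=feed)  # train_op would be fetched here too
```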
I'm using a MacBook with a shared GPU. As a result my batch times tend to vary a bit.
With the new changes, training slows down over time. It starts at normal speed, but after roughly 1,000 batches the time per batch more than doubles, and eventually a batch takes almost 50 times as long; the slowdown seems to grow geometrically at some point.
My batch times went from 0.7 seconds each to 30 seconds each in under 10,000 batches.
I have read that this usually means the graph needs to be finalized (or something along those lines), but I'm still trying to figure out how that applies here, so maybe someone else will beat me to it.
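For reference, one way to check whether ops are being added to the graph during training (shown here on a toy TF 1.x graph, not this project's model) is to finalize the graph right before the loop, so TensorFlow errors out the moment anything tries to add an op:

```python
import tensorflow as tf

# A toy graph standing in for the real model.
x = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.graph.finalize()  # graph becomes read-only from here on
    for b in range(1000):
        # If anything inside the loop tries to add an op (e.g. building a
        # fresh assign/summary op each batch), this raises a RuntimeError
        # instead of silently growing the graph and slowing training down.
        sess.run(train_op, feed_dict={x: [[1.0]]})
```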
If I stop the training and resume, the time will go back to around 0.7 seconds per batch.
Edit: I just removed the graph code and tested again, and something is still causing the slowdown. I reverted to the original code and it doesn't appear to have the same problem, so it's definitely not the graph, but something is definitely causing it.