gooofy closed this 5 years ago
ok, it looks like this issue is unrelated to my gradient checkpointing changes. It seems `_valid_batch_iter` can return empty batches. I have added a check against that and will see if training is stable now.
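For anyone hitting the same thing, here is a rough sketch of the kind of guard I mean. It is not the actual patch: `skip_empty_batches` is a hypothetical helper name, and I am assuming each batch is a plain sequence whose `len()` gives the number of examples.

```python
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def skip_empty_batches(batches: Iterable[Sequence[T]]) -> Iterator[Sequence[T]]:
    """Yield only the batches that contain at least one example.

    Hypothetical helper: wraps any batch iterator and drops empty batches
    so the training step never sees a zero-length input.
    """
    for batch in batches:
        if len(batch) == 0:
            # Assumption: an empty batch carries no information, so skip it.
            continue
        yield batch


# Usage: wrap whatever iterator produces the batches.
raw_batches = [[1, 2], [], [3], []]
clean_batches = list(skip_empty_batches(raw_batches))
```

Filtering in a wrapper keeps the batch iterator itself untouched, which makes it easy to remove again once the root cause is understood.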
no worries! I am planning to focus on pytorch myself for my ML work anyway :)
I have not done any serious benchmarking. I did play around with the small model a bit and found that with checkpointing I can increase the batch size significantly, but training still runs slower than it does without checkpointing.
however, checkpointing does enable training a 345M model on my 1080ti, which I never managed to do without it.
Thanks @gooofy 👍
I finally got to use this and it works great: very small performance overhead and much larger models possible. Thank you @gooofy
thanks for your feedback :) - I wasn't aware that you have automated tests in place, very cool! I have moved my tf-related changes to a separate branch now and will focus on pytorch. I have also added the missing argument, so tests should run cleanly.
however, please be aware that it looks like my changes have introduced a new bug which I am trying to hunt down right now:
so it is probably a good idea to delay the merge until I have figured out what is going wrong there.