Hi, I love your model and have used it to interpolate many videos on Colab! The best rate I've gotten on a T4 when interpolating 2x is about 14 new frames per second, which isn't bad, but I was wondering whether there could be any benefit from running the model on TPUs (via PyTorch's XLA library)?
It only took me about 30 minutes, as a complete novice to XLA, to get everything running on a single TPU core (really just importing `torch_xla.core.xla_model as xm` and calling `device = xm.xla_device()`), but the speed is terrible: about 8 seconds to generate a single new frame on the demo video! I tried to think of ways to parallelize the work across frames: each TPU core could take its own frame pair and push the result to a global queue, with each output tagged by its position so the frames can still be encoded in order. But at 8 seconds per frame, even the 8 cores of a Colab TPU running in parallel would only manage roughly one frame per second, so there's really no speed advantage over the T4.
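For reference, this is roughly all I changed to get it onto one core (a minimal sketch; `load_model`, `frame0`, and `frame1` are stand-ins for however your repo actually loads the network and frame tensors):

```python
import torch
import torch_xla.core.xla_model as xm

# Pick up the single TPU core as a torch device.
device = xm.xla_device()

# Hypothetical names: the real script loads the interpolation model
# and the two neighboring frames elsewhere.
model = load_model().to(device).eval()
a = frame0.to(device)
b = frame1.to(device)

with torch.no_grad():
    mid = model(a, b)   # interpolated middle frame
    xm.mark_step()      # ask XLA to compile and execute the pending graph

mid = mid.cpu()         # bring the result back to the host for encoding
```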
Feel free to close this if TPUs just don't make sense for this workload. I was hoping some of the functions could be optimized for TPUs through XLA, and that there might be a way to take advantage of parallelization, but I'm not sure.
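In case it's ever worth revisiting, this is roughly the per-core split I was imagining, using `xmp.spawn` to give each core a strided slice of frame pairs and a position-tagged queue for reassembly. Very much a sketch under assumptions: `load_model` and `load_frames` are made-up helpers, and I haven't verified the queue plumbing on Colab.

```python
import torch
import torch.multiprocessing as mp
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _worker(index, frames, results):
    # Each spawned process owns one TPU core.
    device = xm.xla_device()
    model = load_model().to(device).eval()      # hypothetical model loader
    world_size = xm.xrt_world_size()

    # Strided split: core k handles pairs k, k + world_size, ...
    for i in range(index, len(frames) - 1, world_size):
        a = frames[i].to(device)
        b = frames[i + 1].to(device)
        with torch.no_grad():
            mid = model(a, b)
            xm.mark_step()
        # Tag the result with its position so the encoder can restore order.
        results.put((i, mid.cpu()))

if __name__ == '__main__':
    frames = load_frames('demo.mp4')            # hypothetical helper
    results = mp.Manager().Queue()
    # 8 processes = one per core on a Colab TPU v2/v3.
    xmp.spawn(_worker, args=(frames, results), nprocs=8, start_method='fork')
```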