Closed by ahmadsharif1 3 weeks ago
Creating a CUDA context is slow and takes about 400 MB of VRAM on the GPU.
This PR ensures we reuse PyTorch's existing CUDA context when creating decoders.
Thank you @fmassa for pointing out this issue and helping to resolve it.
Benchmark results show a decent speed-up, especially for short videos:
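For context, the idea can be sketched with the CUDA driver API. The driver-API calls below (`cuDevicePrimaryCtxRetain`, `cuCtxCreate`) are real, but the surrounding decoder helper is a hypothetical illustration, not this PR's exact code:

```cpp
// Sketch: share PyTorch's CUDA context instead of creating a new one.
// PyTorch uses the device's *primary* context, so retaining it reuses
// the context PyTorch already paid for (~400 MB of VRAM and slow init),
// rather than creating a second one per decoder.
#include <cuda.h>

CUcontext getContextForDecoder(CUdevice device) {
  CUcontext ctx = nullptr;
  // Before: a fresh context per decoder -- slow, ~400 MB of VRAM each.
  //   cuCtxCreate(&ctx, 0, device);
  // After: retain the primary context that PyTorch already created.
  cuDevicePrimaryCtxRetain(&ctx, device);
  return ctx;  // caller must balance with cuDevicePrimaryCtxRelease()
}
```

If PyTorch has not yet touched the device, `cuDevicePrimaryCtxRetain` still works: it creates the primary context, which PyTorch will then share rather than duplicate.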
Before:
```
python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4
[------------------ Decode+Resize Time -----------------]
                    |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none |  2.0

Times are in seconds (s).
```
After:
```
python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4
[------------------ Decode+Resize Time -----------------]
                    |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none |  1.3

Times are in seconds (s).
```
This makes decoding single videos, even without resize, competitive with CPU:
```
[------------------ Decode+Resize Time -----------------]
                    |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none |  1.5
       D=cpu R=none |  2.8

Times are in seconds (s).
```