pytorch / torchcodec

PyTorch video decoding
BSD 3-Clause "New" or "Revised" License

Reuse existing cuda context if possible when creating decoders #263

Closed ahmadsharif1 closed 3 weeks ago

ahmadsharif1 commented 3 weeks ago

Creating a CUDA context is slow and consumes about 400 MB of VRAM on the GPU.

This PR ensures we reuse PyTorch's existing CUDA context when creating decoders, instead of creating a new one.

Thank you @fmassa for pointing out this issue and helping to resolve it.
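The essence of the fix is to treat the CUDA context as a shared, process-wide resource rather than creating one per decoder. A minimal sketch of that reuse pattern is below; `get_cuda_context` and the stand-in object are hypothetical illustrations, not the actual torchcodec internals, which obtain the context from PyTorch itself.

```python
import functools


@functools.lru_cache(maxsize=None)
def get_cuda_context(device_index: int):
    """Return one shared context object per device index.

    In the real fix, the expensive step below would be replaced by
    looking up the CUDA primary context that PyTorch has already
    created, avoiding the ~400 MB VRAM and the startup latency of a
    fresh context. Here we use a plain object as a stand-in.
    """
    expensive_context = object()  # stand-in for CUDA context creation
    return expensive_context


# Every decoder targeting the same device sees the same context.
ctx_a = get_cuda_context(0)
ctx_b = get_cuda_context(0)
assert ctx_a is ctx_b
```

Because `functools.lru_cache` memoizes by argument, the costly creation path runs at most once per device; subsequent decoders pay only a dictionary lookup.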

Benchmark results show a decent speed-up, especially for short videos:

Before:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 2.0               

Times are in seconds (s).

After:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 1.3               

Times are in seconds (s).
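Taken together, the before/after tables work out to roughly a 1.5x end-to-end speedup on this clip. Since context creation is a fixed one-time cost, the relative benefit is largest for short videos, where it dominates total decode time:

```python
before_s = 2.0  # decode time with a fresh CUDA context per decoder
after_s = 1.3   # decode time reusing PyTorch's existing context

speedup = before_s / after_s
saved_s = before_s - after_s
print(f"speedup: {speedup:.2f}x, fixed cost removed: {saved_s:.1f} s")
```

The ~0.7 s saved is independent of video length, so on a much longer video the two configurations would converge.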

This makes GPU decoding of a single video competitive with the CPU, even without resizing:

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 1.5               
      D=cpu R=none   |                 2.8               

Times are in seconds (s).