pytorch / torchcodec

PyTorch video decoding
BSD 3-Clause "New" or "Revised" License

Add batch decoding support to CUDA #319

Closed ahmadsharif1 closed 1 week ago

ahmadsharif1 commented 1 week ago
  1. Allocate the batch tensor on the requested device. When a CUDA device is passed in, the tensor is now allocated there.
  2. Pass a view of the batch tensor to the color conversion function convertAVFrameToDecodedOutputOnCuda().
  3. Add a test that checks frame contents.
  4. Add a TODO to eventually merge preAllocatedOutputTensor into RawDecodedOutput, because it doesn't make sense to pass in two output data pointers.
  5. Add a device to the VideoDecoder class.
  6. Update the sampler benchmark to take device and video arguments from the command line.
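
The idea behind items 1 and 2 (allocate the batch once, then let the per-frame conversion write into a view of it) can be sketched in plain Python. This is only an illustration, not the torchcodec implementation: a `bytearray` and `memoryview` stand in for the batch tensor and its views, and `convert_frame_into`, `decode_batch`, and the frame dimensions are hypothetical names and values chosen for the example.

```python
HEIGHT, WIDTH, CHANNELS = 2, 3, 3  # tiny illustrative frame size (assumption)
FRAME_BYTES = HEIGHT * WIDTH * CHANNELS


def convert_frame_into(dst: memoryview, fill_value: int) -> None:
    """Hypothetical stand-in for the color conversion step.

    Writes one frame's worth of bytes directly into the caller's buffer
    instead of allocating and returning a new one.
    """
    dst[:] = bytes([fill_value]) * len(dst)


def decode_batch(num_frames: int) -> bytearray:
    # Allocate the whole batch once, up front.
    batch = bytearray(num_frames * FRAME_BYTES)
    whole = memoryview(batch)
    for i in range(num_frames):
        # Slicing a memoryview yields a view onto the same memory, not a copy.
        frame_view = whole[i * FRAME_BYTES:(i + 1) * FRAME_BYTES]
        convert_frame_into(frame_view, fill_value=i + 1)
    return batch


batch = decode_batch(3)
print(len(batch), batch[0], batch[FRAME_BYTES], batch[-1])
```

Because each frame is written in place, no per-frame output buffer has to be copied into the batch afterwards, which is the point of passing the view down to the conversion function.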

Sampler benchmark results:

CPU:
python benchmarks/samplers/benchmark_samplers.py --device=cpu
----------
num_clips = 1
clips_at_random_indices     med = 23.16ms +- 16.18  med fps = 431.8
clips_at_regular_indices    med = 5.67ms +- 0.43  med fps = 1764.3
clips_at_random_timestamps  med = 22.54ms +- 16.21  med fps = 443.7
clips_at_regular_timestamps med = 7.46ms +- 5.66  med fps = 1339.7
----------
num_clips = 50
clips_at_random_indices     med = 2400.86ms +- 803.05  med fps = 208.3
clips_at_regular_indices    med = 1343.50ms +- 288.18  med fps = 372.2
clips_at_random_timestamps  med = 1170.24ms +- 727.77  med fps = 427.3
clips_at_regular_timestamps med = 950.92ms +- 294.30  med fps = 515.3

CUDA:
python benchmarks/samplers/benchmark_samplers.py --device=cuda:0
----------
num_clips = 1
[AVHWDeviceContext @ 0x8793680] Using current CUDA context.
clips_at_random_indices     med = 245.46ms +- 116.64  med fps = 40.7
clips_at_regular_indices    med = 284.49ms +- 39.86  med fps = 35.2
clips_at_random_timestamps  med = 264.93ms +- 115.74  med fps = 37.7
clips_at_regular_timestamps med = 283.26ms +- 9.99  med fps = 35.3
----------
num_clips = 50
[AVHWDeviceContext @ 0x8d0d680] Using current CUDA context.
clips_at_random_indices     med = 308.00ms +- 104.52  med fps = 1623.4
clips_at_regular_indices    med = 286.54ms +- 12.69  med fps = 1744.9
clips_at_random_timestamps  med = 368.12ms +- 105.73  med fps = 1358.3
clips_at_regular_timestamps med = 285.32ms +- 13.19  med fps = 1717.4
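
A note on reading the `med fps` column: the reported numbers are consistent with fps = frames decoded per run / median time. The sketch below back-computes the implied frame count from a few rows above. The formula and the resulting 10 frames per clip are inferred from the numbers, not stated by the benchmark, and `implied_frames` is a hypothetical helper.

```python
def implied_frames(med_ms: float, med_fps: float) -> float:
    """Frames decoded per run, assuming med_fps = frames / median seconds."""
    return med_fps * med_ms / 1000.0


# (device, num_clips) -> (median ms, median fps) for clips_at_random_indices,
# copied from the results above.
rows = {
    ("cpu", 1): (23.16, 431.8),
    ("cpu", 50): (2400.86, 208.3),
    ("cuda", 1): (245.46, 40.7),
    ("cuda", 50): (308.00, 1623.4),
}

frames = {key: round(implied_frames(ms, fps)) for key, (ms, fps) in rows.items()}
for (device, num_clips), n in frames.items():
    print(f"{device:4s} num_clips={num_clips:2d} -> {n} frames per run")
```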

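CUDA is only worth it when decoding a lot of frames. With num_clips = 1 it is roughly 10x to 50x slower than CPU, while with num_clips = 50 it is roughly 3x to 8x faster, so it wins on throughput. It could potentially also pay off for higher resolution videos.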

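Also, interestingly, the variability on CUDA is quite low at num_clips = 50: the spread is smaller than on CPU for every sampler, and the regular samplers vary by less than 5% of the median.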
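
Both observations can be checked against the num_clips = 50 rows. The sketch below only does arithmetic on the figures reported above. The `+-` value is treated as an unspecified spread measure, since the output does not say whether it is a standard deviation.

```python
# (median ms, reported +- ms) at num_clips = 50, copied from the results above.
cpu_50 = {
    "clips_at_random_indices": (2400.86, 803.05),
    "clips_at_regular_indices": (1343.50, 288.18),
    "clips_at_random_timestamps": (1170.24, 727.77),
    "clips_at_regular_timestamps": (950.92, 294.30),
}
cuda_50 = {
    "clips_at_random_indices": (308.00, 104.52),
    "clips_at_regular_indices": (286.54, 12.69),
    "clips_at_random_timestamps": (368.12, 105.73),
    "clips_at_regular_timestamps": (285.32, 13.19),
}

# Ratio of CPU median time to CUDA median time; > 1 means CUDA is faster.
speedup = {name: cpu_50[name][0] / cuda_50[name][0] for name in cpu_50}

# Spread as a fraction of the median, per device.
rel_cpu = {name: spread / med for name, (med, spread) in cpu_50.items()}
rel_cuda = {name: spread / med for name, (med, spread) in cuda_50.items()}

for name in cpu_50:
    print(
        f"{name:28s} speedup {speedup[name]:.1f}x  "
        f"spread cpu {rel_cpu[name]:.0%}  cuda {rel_cuda[name]:.0%}"
    )
```

Note that for the random samplers the relative spread on CUDA is still substantial; the low variability is clearest for the regular samplers.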