pytorch / torchcodec

PyTorch video decoding
BSD 3-Clause "New" or "Revised" License

Add the ability to benchmark throughput using multiple threads #359

Closed: ahmadsharif1 closed this 1 week ago

ahmadsharif1 commented 2 weeks ago

The new batch mode measures throughput by running 40 copies of each decode experiment concurrently across 8 threads.
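
Below is a minimal sketch, not the PR's implementation, of how such batch-mode throughput measurement can be structured: submit a fixed number of copies of a decode task to a thread pool and time the whole batch. `decode_one_video` and the two constants are hypothetical placeholders standing in for whatever experiment is being measured (e.g. "10 next()" on a decoder); the 40/8 values follow the description above.

```python
from concurrent.futures import ThreadPoolExecutor
import time

NUM_COPIES = 40   # copies of the decode task per batch (per the description above)
NUM_THREADS = 8   # worker threads sharing the batch

def decode_one_video(path):
    # Placeholder: construct a decoder for `path` and run the experiment
    # being measured (e.g. decode the next 10 frames).
    ...

def benchmark_batch(path):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        futures = [pool.submit(decode_one_video, path) for _ in range(NUM_COPIES)]
        for f in futures:
            f.result()  # re-raise any exception from a worker thread
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{NUM_COPIES} copies on {NUM_THREADS} threads: {elapsed_ms:.1f} ms")

benchmark_batch("test/resources/nasa_13013.mp4")
```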

Tested:

video=/home/ahmads/personal/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 (h264, 480x270, 13.013 s, 29.97 fps), 1 thread:

| Decoder | uniform 10 seek()+next() | batch uniform 10 seek()+next() | random 10 seek()+next() | batch random 10 seek()+next() | 1 next() | batch 1 next() | 10 next() | batch 10 next() | 100 next() | batch 100 next() | create()+next() |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TorchCodecPublic | 67.0 | 841.9 | 60.5 | 743.9 | 21.4 | 219.4 | 24.1 | 276.5 | 69.9 | 812.5 | |
| TorchCodecCore | | | | | | | | | | | 18.5 |

Times are in milliseconds (ms).
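
To read the batch columns against the single-task columns (assuming each batch cell times 40 copies on 8 threads, as described above): "uniform 10 seek()+next()" takes 67.0 ms for one copy, so 40 copies would take roughly 2680 ms serially, while the batch finishes in 841.9 ms, about a 3.2x throughput gain from multithreading.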
scotts commented 1 week ago

I assume the removal of the other decoders is temporary while you're getting everything working?

For the chart generated by generate_readme_*.py, I think we should be selective about what we add to it: no more than four experiments per row. This is in contrast to the output from benchmark_decoders.py, which can include many experiments. I see benchmark_decoders.py as a perf development tool, and generate_readme_*.py as our external showcase.