More benchmark refactoring, more benchmarks.

This PR makes the following changes:

Renames the existing Decord decoder kind to DecordAccurate, because the API calls it makes use accurate seeking.
Adds a new Decord decoder kind, DecordAccurateBatch, which uses the batch APIs. We believe the batch APIs also seek accurately.
Adds a Decord benchmark kind to the README graph.
Renames the existing TorchCodecCore decoder kind to TorchCodecCoreNonBatch.
Adds the decoder kind TorchCodecCore. Although it reuses the name of a previous decoder kind, it now uses the best core API for each scenario, so we can compare it directly to TorchCodecPublic. Any systematic difference is likely caused by the logic in VideoDecoder itself.
Removes all of the fine-grained calls to timeit inside of the experiments. If we want that data, we should create separate experiments for it. In general, if we run something for N iterations and time how long all N iterations take, we can't also time each individual iteration, because the cost of the fine-grained timers would be added to the overall time. The two are mutually exclusive: either we time each iteration, or we time the whole batch.
Refactors benchmark_decoders.py so that we have a registry of decoder kinds, and we consult that registry to know what to run. This eliminates a lot of the bespoke logic. Adding a new decoder kind is now easy: just add a new entry to the registry, and the rest of the code works. As a bonus, this unifies how options for decoder kinds are specified and added.
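
To illustrate the last two items, here is a minimal sketch of a registry-driven benchmark loop that uses a single coarse timer per experiment. It is not the code in benchmark_decoders.py: the names DecoderKind, DECODER_REGISTRY, FakeDecoder, time_batch, and run_all are hypothetical, and the fake decoder stands in for the real ones so that the sketch runs anywhere.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


class FakeDecoder:
    """Hypothetical stand-in for a real decoder; it only counts calls."""

    def __init__(self, **options):
        self.options = options
        self.frames_decoded = 0

    def next(self):
        self.frames_decoded += 1
        return self.frames_decoded


@dataclass
class DecoderKind:
    """One registry entry: how to build a decoder, plus its default options."""

    factory: Callable[..., FakeDecoder]
    default_options: Dict[str, object] = field(default_factory=dict)


# Hypothetical registry. Adding a decoder kind means adding one entry here.
DECODER_REGISTRY: Dict[str, DecoderKind] = {
    "FakeA": DecoderKind(FakeDecoder),
    "FakeB": DecoderKind(FakeDecoder, {"num_threads": 1}),
}


def time_batch(decoder, num_iterations):
    """Time all iterations with one timer; there are no timers inside the loop."""
    start = time.perf_counter()
    for _ in range(num_iterations):
        decoder.next()
    return time.perf_counter() - start


def run_all(num_iterations=100, overrides: Optional[dict] = None):
    """Run every registered kind; options are merged the same way for all kinds."""
    overrides = overrides or {}
    results = {}
    for name, kind in DECODER_REGISTRY.items():
        options = {**kind.default_options, **overrides.get(name, {})}
        decoder = kind.factory(**options)
        results[name] = (time_batch(decoder, num_iterations), decoder)
    return results
```

Calling run_all(num_iterations=10, overrides={"FakeA": {"num_threads": 4}}) returns one (elapsed seconds, decoder) pair per registry entry. The loop body never needs to change when a new kind is registered, which is the property the refactoring is after.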

The following results were produced with:

python benchmarks/decoders/benchmark_decoders.py --bm_video_speed_min_run_seconds=20

The output below combines four separate runs of the command above. The header is shown once, and each block of rows is one run:

[--------- video=/home/scottas/github/torchcodec/benchmarks/decoders/../../test/resources/nasa_13013.mp4 h264 480x270, 13.013s 29.97002997002997fps ---------]
                                         |  uniform 10 seek()+next()  |  random 10 seek()+next()  |  1 next()  |  10 next()  |  100 next()  |  create()+next()
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------
      DecordAccurate                     |            53.1            |           119.0           |    12.8    |     16.1    |     68.6     |                 
      DecordAccurateBatch                |            52.5            |           117.6           |    13.0    |     16.1    |     47.9     |                 
      TorchAudio                         |           468.6            |           524.2           |     8.0    |     14.0    |     69.0     |                 
      TorchVision[backend=video_reader]  |           343.2            |           331.4           |    12.7    |     15.9    |     44.1     |              
      TorchCodecCoreNonBatch             |            47.4            |           109.7           |     9.6    |     12.6    |     42.5     |                 
      TorchCodecCoreBatch                |            49.8            |            44.1           |    11.7    |     14.6    |     61.9     |                 
      TorchCodecCore:                    |            49.2            |            44.1           |     9.5    |     12.7    |     43.2     |        9.5      
      TorchCodecCore:num_threads=1       |           111.9            |           102.8           |     6.9    |     11.9    |     53.9     |                 
      TorchCodecPublic                   |            50.5            |            44.3           |    11.6    |     14.9    |     45.3     |                 

      DecordAccurate                     |            53.4            |           119.2           |    12.9    |     16.7    |     68.0     |                 
      DecordAccurateBatch                |            52.8            |           119.1           |    13.1    |     16.2    |     48.0     |                 
      TorchAudio                         |           472.7            |           519.2           |     8.0    |     14.0    |     72.3     |                 
      TorchVision[backend=video_reader]  |           343.8            |           328.3           |    12.7    |     15.9    |     44.2     |                 
      TorchCodecCoreNonBatch             |            47.5            |           109.8           |     9.5    |     12.7    |     46.1     |                 
      TorchCodecCoreBatch                |            50.2            |            44.6           |    11.6    |     14.6    |     61.4     |                 
      TorchCodecCore:                    |            49.4            |            43.8           |     9.5    |     12.6    |     46.4     |        9.5      
      TorchCodecCore:num_threads=1       |           111.1            |           101.8           |     6.9    |     11.8    |     69.1     |                 
      TorchCodecPublic                   |            49.3            |            44.0           |    11.7    |     14.8    |     48.8     |                 

      DecordAccurate                     |            52.6            |           117.8           |    12.9    |     16.1    |     68.0     |                 
      DecordAccurateBatch                |            52.5            |           120.1           |    13.0    |     16.1    |     48.1     |                 
      TorchAudio                         |           470.9            |           520.7           |     7.9    |     15.4    |     77.1     |                 
      TorchVision[backend=video_reader]  |           351.0            |           329.0           |    12.7    |     16.1    |     44.3     |                 
      TorchCodecCoreNonBatch             |            47.3            |           109.3           |     9.5    |     12.6    |     49.7     |                 
      TorchCodecCoreBatch                |            49.7            |            44.0           |    11.6    |     14.8    |     61.6     |                 
      TorchCodecCore:                    |            50.2            |            43.8           |     9.5    |     12.7    |     49.8     |        9.5      
      TorchCodecCore:num_threads=1       |           111.9            |           102.1           |     6.9    |     11.8    |     61.4     |                 
      TorchCodecPublic                   |            49.6            |            44.4           |    11.8    |     15.0    |     53.8     | 

      DecordAccurate                     |            52.7            |           117.7           |    12.9    |     16.2    |     68.3     |                 
      DecordAccurateBatch                |            52.2            |           117.8           |    13.0    |     16.1    |     48.0     |                 
      TorchAudio                         |           468.1            |           515.4           |     7.9    |     15.3    |     84.1     |                 
      TorchVision[backend=video_reader]  |           348.0            |           330.3           |    12.7    |     16.0    |     50.7     |                 
      TorchCodecCoreNonBatch             |            47.9            |           109.9           |     9.6    |     12.6    |     57.6     |                 
      TorchCodecCoreBatch                |            49.7            |            44.2           |    11.6    |     14.8    |     61.6     |                 
      TorchCodecCore:                    |            50.1            |            44.7           |     9.6    |     12.7    |     57.8     |        9.5      
      TorchCodecCore:num_threads=1       |           111.5            |           102.5           |     6.9    |     11.8    |     69.3     |                 
      TorchCodecPublic                   |            49.9            |            44.4           |    11.8    |     14.9    |     60.7     |

Some observations:

The sampler-inspired experiments (random and uniform) are remarkably consistent from run to run for every decoder.
1 next and 10 next are also remarkably consistent from run to run for every decoder.
100 next is consistent across runs for DecordAccurate, DecordAccurateBatch, TorchVision (apart from one run at 50.7), and TorchCodecCoreBatch.
100 next varies noticeably across runs for TorchAudio, TorchCodecCoreNonBatch, TorchCodecCore, TorchCodecCore:num_threads=1, and TorchCodecPublic.
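In the next() experiments, TorchCodecCore is consistently slightly faster than TorchCodecPublic, by about 2 to 4 in every run. In the seek()+next() experiments the two are roughly tied, with no consistent winner. This means we have an opportunity to shave off some time in the logic in the public API.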
While both TorchCodecCore and TorchCodecPublic display variation across runs, they notably always move together within a run. That is, if TorchCodecCore has a "good" run, then so does TorchCodecPublic. That means there may be something systematic going on that determines if a run is "good" or not. Maybe something to do with how the video gets laid out in memory?
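TorchVision is among the best performers in 100 next in every run: it is the fastest in the second and third runs, and within about 3 of the fastest in the first and fourth.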
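
The run-to-run observations above can be checked with a little arithmetic. The sketch below copies the "100 next()" column from the four runs in the table and computes, per decoder, the spread (max minus min) across runs, plus the per-run gap between TorchCodecPublic and TorchCodecCore. The helper names spread and per_run_gap are illustrative only, and the trailing colon on the "TorchCodecCore:" row label is dropped.

```python
# "100 next()" column, copied from the four runs in the table above.
HUNDRED_NEXT = {
    "DecordAccurate": [68.6, 68.0, 68.0, 68.3],
    "DecordAccurateBatch": [47.9, 48.0, 48.1, 48.0],
    "TorchAudio": [69.0, 72.3, 77.1, 84.1],
    "TorchVision[backend=video_reader]": [44.1, 44.2, 44.3, 50.7],
    "TorchCodecCoreNonBatch": [42.5, 46.1, 49.7, 57.6],
    "TorchCodecCoreBatch": [61.9, 61.4, 61.6, 61.6],
    "TorchCodecCore": [43.2, 46.4, 49.8, 57.8],
    "TorchCodecCore:num_threads=1": [53.9, 69.1, 61.4, 69.3],
    "TorchCodecPublic": [45.3, 48.8, 53.8, 60.7],
}


def spread(values):
    """Max minus min across runs, rounded to the table's one decimal place."""
    return round(max(values) - min(values), 1)


def per_run_gap(slower, faster):
    """Difference between two decoders in each run, in run order."""
    return [round(s - f, 1) for s, f in zip(slower, faster)]


spreads = {name: spread(values) for name, values in HUNDRED_NEXT.items()}
gaps = per_run_gap(HUNDRED_NEXT["TorchCodecPublic"], HUNDRED_NEXT["TorchCodecCore"])

# Print decoders from most stable to least stable across the four runs.
for name, value in sorted(spreads.items(), key=lambda item: item[1]):
    print(f"{name:36s} {value:5.1f}")
print("TorchCodecPublic - TorchCodecCore per run:", gaps)
```

With these numbers, the Decord kinds and TorchCodecCoreBatch have a spread below 1, TorchVision's spread of 6.6 comes entirely from its fourth run, and the remaining decoders vary by roughly 15. The gap between TorchCodecPublic and TorchCodecCore stays positive in all four runs even as both move by much more than the gap itself.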
